<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: files-to-prompt</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/files-to-prompt.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-04-07T17:45:11+00:00</updated><author><name>Simon Willison</name></author><entry><title>Long context support in LLM 0.24 using fragments and template plugins</title><link href="https://simonwillison.net/2025/Apr/7/long-context-llm/#atom-tag" rel="alternate"/><published>2025-04-07T17:45:11+00:00</published><updated>2025-04-07T17:45:11+00:00</updated><id>https://simonwillison.net/2025/Apr/7/long-context-llm/#atom-tag</id><summary type="html">
    &lt;p&gt;LLM 0.24 is &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-24"&gt;now available&lt;/a&gt; with new features to help take advantage of the increasingly long input context supported by modern LLMs.&lt;/p&gt;
&lt;p&gt;(&lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is my command-line tool and &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;Python library&lt;/a&gt; for interacting with LLMs, supported by &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;20+ plugins&lt;/a&gt; adding support for both local and remote models from a bunch of different providers.)&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#trying-it-out"&gt;Trying it out&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#improving-llm-s-support-for-long-context-models"&gt;Improving LLM's support for long context models&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#asking-questions-of-llm-s-documentation"&gt;Asking questions of LLM's documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#publishing-sharing-and-reusing-templates"&gt;Publishing, sharing and reusing templates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#template-and-fragment-loader-plugins"&gt;Template and fragment loader plugins so far&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Apr/7/long-context-llm/#everything-else-in-llm-0-24"&gt;Everything else in LLM 0.24&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h4 id="trying-it-out"&gt;Trying it out&lt;/h4&gt;

&lt;p&gt;To install LLM with &lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt; (there are &lt;a href="https://llm.datasette.io/en/stable/setup.html#installation"&gt;several other options&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;uv tool install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You'll need to either provide &lt;a href="https://platform.openai.com/api-keys"&gt;an OpenAI API key&lt;/a&gt; or install &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;a plugin&lt;/a&gt; to use local models or models from other providers:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm keys &lt;span class="pl-c1"&gt;set&lt;/span&gt; openai
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Paste OpenAI API key here&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To &lt;a href="https://llm.datasette.io/en/stable/setup.html#upgrading-to-the-latest-version"&gt;upgrade LLM&lt;/a&gt; from a previous version:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install -U llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The biggest new feature is &lt;strong&gt;&lt;a href="https://llm.datasette.io/en/stable/fragments.html"&gt;fragments&lt;/a&gt;&lt;/strong&gt;. You can now use &lt;code&gt;-f filename&lt;/code&gt; or &lt;code&gt;-f url&lt;/code&gt; to add one or more fragments to your prompt, which means you can do things like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f https://simonwillison.net/2025/Apr/5/llama-4-notes/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bullet point summary&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/7ffb19f647603558c575359e678185c0"&gt;the output&lt;/a&gt; from that prompt, exported using &lt;code&gt;llm logs -c --expand --usage&lt;/code&gt;. Token cost was 5,372 input, 374 output, which works out at 0.103 cents (around 1/10th of a cent) using the default GPT-4o mini model.&lt;/p&gt;
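&lt;p&gt;The arithmetic behind that figure is simple to check. Here's a quick sketch, assuming GPT-4o mini's advertised rates of $0.15 per million input tokens and $0.60 per million output tokens (consult OpenAI's pricing page for current numbers):&lt;/p&gt;

```python
def gpt_4o_mini_cost_cents(input_tokens, output_tokens):
    # Assumed rates: $0.15 per 1M input tokens, $0.60 per 1M output tokens
    dollars = (input_tokens * 0.15 + output_tokens * 0.60) / 1_000_000
    return dollars * 100  # convert dollars to cents

cost = gpt_4o_mini_cost_cents(5372, 374)  # about 0.103 cents for the prompt above
```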
&lt;p&gt;Plugins can implement custom fragment loaders with a prefix. The &lt;a href="https://github.com/simonw/llm-fragments-github"&gt;llm-fragments-github&lt;/a&gt; plugin adds a &lt;code&gt;github:&lt;/code&gt; prefix that can be used to load every text file in a GitHub repository as a list of fragments:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-fragments-github
llm -f github:simonw/s3-credentials &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Suggest new features for this tool&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/c9bbbc5f6560b01f4b7882ac0194fb25"&gt;the output&lt;/a&gt;. That took 49,856 input tokens for a total cost of 0.7843 cents - nearly a whole cent!&lt;/p&gt;
&lt;h4 id="improving-llm-s-support-for-long-context-models"&gt;Improving LLM's support for long context models&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://simonwillison.net/tags/long-context/"&gt;Long context&lt;/a&gt;&lt;/strong&gt; is one of the most exciting trends in LLMs over the past eighteen months. Saturday's &lt;a href="https://simonwillison.net/2025/Apr/5/llama-4-notes/"&gt;Llama 4 Scout release&lt;/a&gt; gave us the first model with a full 10 million token context. Google's &lt;a href="https://simonwillison.net/tags/gemini/"&gt;Gemini&lt;/a&gt; family has several 1-2 million token models, and the baseline for recent models from both OpenAI and Anthropic is 100 or 200 thousand.&lt;/p&gt;
&lt;p&gt;Two years ago most models capped out at 8,000 tokens of input. Long context opens up many interesting new ways to apply this class of technology.&lt;/p&gt;
&lt;p&gt;I've been using long context models via &lt;a href="https://simonwillison.net/tags/files-to-prompt/"&gt;my files-to-prompt tool&lt;/a&gt; to summarize large codebases, explain how they work and even to &lt;a href="https://simonwillison.net/2024/Sep/25/o1-preview-llm/"&gt;debug gnarly bugs&lt;/a&gt;. As demonstrated above, it's surprisingly inexpensive to drop tens of thousands of tokens into models like GPT-4o mini or most of the Google Gemini series, and the results are often very impressive.&lt;/p&gt;
&lt;p&gt;One of LLM's most useful features is that it &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;logs every prompt and response&lt;/a&gt; to a SQLite database. This is great for comparing the same prompt against different models and tracking experiments over time - my own database contains thousands of responses from hundreds of different models, accumulated over the past couple of years.&lt;/p&gt;
&lt;p&gt;This is where long context prompts were starting to be a problem. Since LLM stores the full prompt and response in the database, asking five questions of the same source code could result in five duplicate copies of that text in the database!&lt;/p&gt;
&lt;p&gt;The new &lt;strong&gt;fragments&lt;/strong&gt; feature targets this problem head on. Each fragment is stored once in a &lt;a href="https://llm.datasette.io/en/stable/logging.html#sql-schema"&gt;fragments&lt;/a&gt; table, then de-duplicated in the future using a SHA256 hash of its content.&lt;/p&gt;
&lt;p&gt;This saves on storage, and also enables features like &lt;code&gt;llm logs -f X&lt;/code&gt; for seeing all logged responses &lt;a href="https://llm.datasette.io/en/stable/logging.html#filtering-by-prompts-that-used-a-specific-fragment"&gt;that use a particular fragment&lt;/a&gt;.&lt;/p&gt;
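&lt;p&gt;The de-duplication mechanism is easy to picture. Here's an illustrative sketch of the idea - not LLM's actual implementation - keyed on the SHA256 hex digest of the fragment text:&lt;/p&gt;

```python
import hashlib

def fragment_key(content):
    # SHA256 hex digest of the UTF-8 encoded fragment text
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

fragments = {}  # stands in for the fragments table

def store_fragment(content):
    # Identical content always hashes to the same key, so it is stored only once
    key = fragment_key(content)
    fragments.setdefault(key, content)
    return key

first = store_fragment("def hello(): pass")
second = store_fragment("def hello(): pass")  # duplicate - no new row
```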
&lt;p&gt;Fragments can be specified in several different ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a path to a file&lt;/li&gt;
&lt;li&gt;a URL to data online&lt;/li&gt;
&lt;li&gt;an alias that's been set against a previous fragment (see &lt;a href="https://llm.datasette.io/en/latest/help.html#llm-fragments-set-help"&gt;llm fragments set&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;a hash ID of the content of a fragment&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;prefix:argument&lt;/code&gt; to specify fragments from a plugin&lt;/li&gt;
&lt;/ul&gt;
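&lt;p&gt;Roughly speaking, those forms could be told apart like this. This is an illustrative sketch of the distinctions above, not LLM's actual resolution logic - in particular, the real code consults the alias and hash records in the SQLite database:&lt;/p&gt;

```python
import re
from pathlib import Path

def classify_fragment_arg(arg):
    # Illustrative only: LLM's real resolution also checks its database
    if arg.startswith(("http://", "https://")):
        return "url"
    if re.fullmatch(r"[0-9a-f]{64}", arg):
        return "hash"    # SHA256 hex digest of previously stored content
    if Path(arg).exists():
        return "path"
    if ":" in arg:
        return "plugin"  # prefix:argument handled by a fragment loader plugin
    return "alias"       # assume a previously saved alias
```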
&lt;h4 id="asking-questions-of-llm-s-documentation"&gt;Asking questions of LLM's documentation&lt;/h4&gt;
&lt;p&gt;Wouldn't it be neat if LLM could answer questions about its own documentation?&lt;/p&gt;
&lt;p&gt;The new &lt;a href="https://github.com/simonw/llm-docs"&gt;llm-docs&lt;/a&gt; plugin (built with the new &lt;a href="https://llm.datasette.io/en/stable/plugins/plugin-hooks.html#register-fragment-loaders-register"&gt;register_fragment_loaders() plugin hook&lt;/a&gt;) enables exactly that:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-docs
llm -f docs: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;How do I embed a binary file?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/df115ffa63e101833a100f71555bf53d"&gt;output&lt;/a&gt; starts like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To embed a binary file using the LLM command-line interface, you can use the &lt;code&gt;llm embed&lt;/code&gt; command with the &lt;code&gt;--binary&lt;/code&gt; option. Here’s how you can do it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Make sure you have the appropriate embedding model installed that supports binary input.&lt;/li&gt;
&lt;li&gt;Use the following command syntax:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed -m &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;model_id&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; --binary -i &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;path_to_your_binary_file&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Replace &lt;code&gt;&amp;lt;model_id&amp;gt;&lt;/code&gt; with the identifier for the embedding model you want to use (e.g., &lt;code&gt;clip&lt;/code&gt; for the CLIP model) and &lt;code&gt;&amp;lt;path_to_your_binary_file&amp;gt;&lt;/code&gt; with the path to your actual binary file.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(74,570 input, 240 output = 1.1329 cents with GPT-4o mini)&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;-f docs:&lt;/code&gt; with just the prefix is the same as using &lt;code&gt;-f docs:llm&lt;/code&gt;. The plugin fetches the documentation for your current version of LLM from my new &lt;a href="https://github.com/simonw/docs-for-llms"&gt;simonw/docs-for-llms&lt;/a&gt; repo, which also provides packaged documentation files for my &lt;code&gt;datasette&lt;/code&gt;, &lt;code&gt;s3-credentials&lt;/code&gt;, &lt;code&gt;shot-scraper&lt;/code&gt; and &lt;code&gt;sqlite-utils&lt;/code&gt; projects.&lt;/p&gt;
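&lt;p&gt;The &lt;code&gt;register_fragment_loaders()&lt;/code&gt; pattern boils down to mapping a prefix to a function that turns the text after the colon into fragments. Here's a minimal simulation of that shape in plain Python - the registry and names are illustrative stand-ins, not the actual plugin API:&lt;/p&gt;

```python
# Illustrative stand-in for the plugin machinery's loader registry
FRAGMENT_LOADERS = {}

def register(prefix, loader):
    FRAGMENT_LOADERS[prefix] = loader

def docs_loader(argument):
    # "-f docs:" with no argument defaults to LLM's own documentation
    project = argument or "llm"
    return ["(packaged documentation for " + project + " goes here)"]

register("docs", docs_loader)

# Resolving "-f docs:datasette" is then a dictionary lookup plus a call:
prefix, _, argument = "docs:datasette".partition(":")
fragments = FRAGMENT_LOADERS[prefix](argument)
```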
&lt;p&gt;Datasette's documentation has got pretty long, so you might need to run that through a Gemini model instead (using the &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini plugin&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -f docs:datasette -m gemini-2.0-flash \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Build a render_cell plugin that detects and renders markdown&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/f08b3832c379ca89d8cff3c61ec5662e"&gt;Here's the output&lt;/a&gt;. 132,042 input, 1,129 output with Gemini 2.0 Flash = 1.3656 cents.&lt;/p&gt;
&lt;p&gt;You can browse the combined documentation files this uses in &lt;a href="https://github.com/simonw/docs-for-llms"&gt;docs-for-llms&lt;/a&gt;. They're built using GitHub Actions.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://llmstxt.org/"&gt;llms-txt&lt;/a&gt; is a project lead by Jeremy Howard that encourages projects to publish similar files to help LLMs ingest a succinct copy of their documentation.&lt;/p&gt;
&lt;h4 id="publishing-sharing-and-reusing-templates"&gt;Publishing, sharing and reusing templates&lt;/h4&gt;
&lt;p&gt;The new &lt;a href="https://llm.datasette.io/en/stable/plugins/plugin-hooks.html#register-template-loaders-register"&gt;register_template_loaders() plugin hook&lt;/a&gt; allows plugins to register &lt;code&gt;prefix:value&lt;/code&gt; custom template loaders, for use with the &lt;code&gt;llm -t&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-templates-github"&gt;llm-templates-github&lt;/a&gt; and &lt;a href="https://github.com/simonw/llm-templates-fabric"&gt;llm-templates-fabric&lt;/a&gt; are two new plugins that make use of that hook.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm-templates-github&lt;/code&gt; lets you share and use templates via a public GitHub repository. Here's how to run my &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;Pelican riding a bicycle&lt;/a&gt; benchmark against a specific model:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-templates-github
llm -t gh:simonw/pelican-svg -m o3-mini&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This executes &lt;a href="https://github.com/simonw/llm-templates/blob/main/pelican-svg.yaml"&gt;this pelican-svg.yaml&lt;/a&gt; template stored in my &lt;a href="https://github.com/simonw/llm-templates"&gt;simonw/llm-templates&lt;/a&gt; repository, using a new repository naming convention.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;llm -t gh:simonw/pelican-svg&lt;/code&gt; will load that &lt;code&gt;pelican-svg.yaml&lt;/code&gt; file from the &lt;code&gt;simonw/llm-templates&lt;/code&gt; repo. You can also use &lt;code&gt;llm -t gh:simonw/name-of-repo/name-of-template&lt;/code&gt; to load a template from a repository that doesn't follow that convention.&lt;/p&gt;
&lt;p&gt;To share your own templates, create a repository on GitHub under your user account called &lt;code&gt;llm-templates&lt;/code&gt; and start saving &lt;code&gt;.yaml&lt;/code&gt; files to it.&lt;/p&gt;
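&lt;p&gt;The naming convention described above could be resolved to a file location along these lines. This is a sketch of the convention, not the plugin's actual code - in particular the &lt;code&gt;main&lt;/code&gt; branch name is an assumption:&lt;/p&gt;

```python
def gh_template_url(spec):
    # Sketch only: the real plugin may fetch differently, and the
    # "main" branch name here is an assumption
    parts = spec.split("/")
    if len(parts) == 2:
        # gh:user/template implies the conventional llm-templates repo
        user, template = parts
        repo = "llm-templates"
    else:
        # gh:user/repo/template names the repository explicitly
        user, repo, template = parts[0], parts[1], "/".join(parts[2:])
    return (
        "https://raw.githubusercontent.com/"
        + user + "/" + repo + "/main/" + template + ".yaml"
    )
```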
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-templates-fabric"&gt;llm-templates-fabric&lt;/a&gt; provides a similar mechanism for loading templates from  Daniel Miessler's extensive &lt;a href="https://github.com/danielmiessler/fabric"&gt;fabric collection&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-templates-fabric
curl https://simonwillison.net/2025/Apr/6/only-miffy/ &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm -t f:extract_main_idea&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A conversation with Daniel was the inspiration for this new plugin hook.&lt;/p&gt;

&lt;h4 id="template-and-fragment-loader-plugins"&gt;Template and fragment loader plugins so far&lt;/h4&gt;
&lt;p&gt;Here's the list of fragment and template loader plugins so far from the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html#fragments-and-template-loaders"&gt;LLM plugins directory&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-templates-github"&gt;llm-templates-github&lt;/a&gt;&lt;/strong&gt; supports loading templates shared on GitHub, e.g. &lt;code&gt;llm -t gh:simonw/pelican-svg&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-templates-fabric"&gt;llm-templates-fabric&lt;/a&gt;&lt;/strong&gt; provides access to the &lt;a href="https://github.com/danielmiessler/fabric"&gt;Fabric&lt;/a&gt; collection of prompts: &lt;code&gt;cat setup.py | llm -t fabric:explain_code&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-fragments-github"&gt;llm-fragments-github&lt;/a&gt;&lt;/strong&gt; can load entire GitHub repositories in a single operation: &lt;code&gt;llm -f github:simonw/files-to-prompt 'explain this code'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/llm-hacker-news"&gt;llm-hacker-news&lt;/a&gt;&lt;/strong&gt; imports conversations from Hacker News as fragments: &lt;code&gt;llm -f hn:43615912 'summary with illustrative direct quotes'&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h4 id="everything-else-in-llm-0-24"&gt;Everything else in LLM 0.24&lt;/h4&gt;
&lt;p&gt;LLM 0.24 is a big release, spanning &lt;a href="https://github.com/simonw/llm/compare/0.23...0.24"&gt;51 commits&lt;/a&gt;. The &lt;a href="https://llm.datasette.io/en/stable/changelog.html#v0-24"&gt;release notes&lt;/a&gt; cover everything that's new in full - here are a few of my highlights:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The new &lt;a href="https://github.com/simonw/llm-openai-plugin"&gt;llm-openai plugin&lt;/a&gt; provides support for &lt;strong&gt;o1-pro&lt;/strong&gt; (which is not supported by the OpenAI mechanism used by LLM core). Future OpenAI features will migrate to this plugin instead of LLM core itself.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The problem with OpenAI models being handled by LLM core is that I have to release a whole new version of LLM every time OpenAI releases a new model or feature. Migrating this stuff out to a plugin means I can release a new version of that plugin independently of LLM itself - something I frequently do for &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; and others.&lt;/p&gt;
&lt;p&gt;The new &lt;code&gt;llm-openai&lt;/code&gt; plugin uses their Responses API, a new shape of API which &lt;a href="https://simonwillison.net/2025/Mar/11/responses-vs-chat-completions/"&gt;I covered last month&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm -t $URL&lt;/code&gt; option can now take a URL to a YAML template. &lt;a href="https://github.com/simonw/llm/issues/856"&gt;#856&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The new custom template loaders are fun, but being able to paste in a URL to a YAML file somewhere provides a simpler way to share templates.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Templates can now store default model options. &lt;a href="https://github.com/simonw/llm/issues/845"&gt;#845&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Attachments can now be stored in templates. &lt;a href="https://github.com/simonw/llm/issues/826"&gt;#826&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The quickest way to create your own template is with the &lt;code&gt;llm prompt ... --save name-of-template&lt;/code&gt; command. This now works with attachments, fragments and default model options, each of which is persisted in &lt;a href="https://llm.datasette.io/en/stable/templates.html#templates-as-yaml-files"&gt;the template YAML file&lt;/a&gt;.&lt;/p&gt;
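&lt;p&gt;A saved template is just a YAML file. Working from memory of the documented format, a template combining these features might look roughly like this - treat the exact key names as an approximation and check the templates documentation for the authoritative schema:&lt;/p&gt;

```yaml
# Approximate shape of a template YAML file (key names are from memory)
model: gpt-4o-mini
system: You are a helpful code reviewer
prompt: Review this code
options:
  temperature: 0.7
fragments:
- https://example.com/coding-standards.txt
```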
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;a href="https://llm.datasette.io/en/stable/usage.html#setting-default-options-for-models"&gt;llm models options&lt;/a&gt; family of commands for setting default options for particular models. &lt;a href="https://github.com/simonw/llm/issues/829"&gt;#829&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I built this when I &lt;a href="https://twitter.com/bindureddy/status/1900345517256958140"&gt;learned that&lt;/a&gt; Qwen's QwQ-32b model works best with temperature 0.7 and top p 0.95.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm prompt -d path-to-sqlite.db&lt;/code&gt; option can now be used to write logs to a custom SQLite database. &lt;a href="https://github.com/simonw/llm/issues/858"&gt;#858&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This proved extremely useful for testing fragments - it meant I could run a prompt and save the full response to a separate SQLite database which I could then upload to S3 and share &lt;a href="https://github.com/simonw/llm/issues/863#issuecomment-2781723372"&gt;as a link to Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm similar -p/--plain&lt;/code&gt; option providing more human-readable output than the default JSON. &lt;a href="https://github.com/simonw/llm/issues/853"&gt;#853&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'd like this to be the default output, but I'm holding off on changing that until LLM 1.0 since it's a breaking change for people building automations against the JSON from &lt;code&gt;llm similar&lt;/code&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Set the &lt;code&gt;LLM_RAISE_ERRORS=1&lt;/code&gt; environment variable to raise errors during prompts rather than suppressing them, which means you can run &lt;code&gt;python -i -m llm 'prompt'&lt;/code&gt; and then drop into a debugger on errors with &lt;code&gt;import pdb; pdb.pm()&lt;/code&gt;. &lt;a href="https://github.com/simonw/llm/issues/817"&gt;#817&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Really useful for debugging new model plugins.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llm prompt -q gpt -q 4o&lt;/code&gt; option - pass &lt;code&gt;-q searchterm&lt;/code&gt; one or more times to execute a prompt against the first model that matches all of those strings - useful if you can't remember the full model ID. &lt;a href="https://github.com/simonw/llm/issues/841"&gt;#841&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Pretty obscure, but I found myself needing it. Vendors love releasing models with names like &lt;code&gt;gemini-2.5-pro-exp-03-25&lt;/code&gt;; now I can run &lt;code&gt;llm -q gem -q 2.5 -q exp 'say hi'&lt;/code&gt; instead of looking up the full model ID.&lt;/p&gt;
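&lt;p&gt;The matching rule is simple substring containment - every &lt;code&gt;-q&lt;/code&gt; term has to appear somewhere in the model ID. A sketch of that behavior (the real implementation may differ in ordering and case handling):&lt;/p&gt;

```python
def first_matching_model(model_ids, terms):
    # Return the first model whose ID contains every search term
    for model_id in model_ids:
        if all(term in model_id for term in terms):
            return model_id
    return None

available = ["gpt-4o", "gpt-4o-mini", "gemini-2.0-flash", "gemini-2.5-pro-exp-03-25"]
match = first_matching_model(available, ["gem", "2.5", "exp"])
```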
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;OpenAI compatible models&lt;/a&gt; configured using &lt;code&gt;extra-openai-models.yaml&lt;/code&gt; now support &lt;code&gt;supports_schema: true&lt;/code&gt;, &lt;code&gt;vision: true&lt;/code&gt; and &lt;code&gt;audio: true&lt;/code&gt; options. Thanks &lt;a href="https://github.com/adaitche"&gt;@adaitche&lt;/a&gt; and &lt;a href="https://github.com/giuli007"&gt;@giuli007&lt;/a&gt;. &lt;a href="https://github.com/simonw/llm/pull/819"&gt;#819&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm/pull/843"&gt;#843&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don't use this feature myself, but it's clearly popular - this isn't the first time I've had PRs with improvements from the wider community.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="plugins"/><category term="projects"/><category term="ai"/><category term="annotated-release-notes"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="long-context"/><category term="files-to-prompt"/></entry><entry><title>Note on 26th March 2025</title><link href="https://simonwillison.net/2025/Mar/26/notes/#atom-tag" rel="alternate"/><published>2025-03-26T06:11:30+00:00</published><updated>2025-03-26T06:11:30+00:00</updated><id>https://simonwillison.net/2025/Mar/26/notes/#atom-tag</id><summary type="html">
    &lt;p&gt;I've added a new content type to my blog: &lt;strong&gt;notes&lt;/strong&gt;. These join my existing types: &lt;a href="https://simonwillison.net/search/?type=entry"&gt;entries&lt;/a&gt;, &lt;a href="https://simonwillison.net/search/?type=blogmark"&gt;bookmarks&lt;/a&gt; and &lt;a href="https://simonwillison.net/search/?type=quotation"&gt;quotations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A note is a little bit like a bookmark without a link. They're for short-form writing - thoughts or images that don't warrant a full entry with a title. The kind of things I used to post to Twitter, but that don't feel right to cross-post to multiple social networks (Mastodon and Bluesky, for example).&lt;/p&gt;
&lt;p&gt;I was partly inspired by Molly White's &lt;a href="https://www.mollywhite.net/micro"&gt;short thoughts, notes, links, and musings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I've been thinking about this for a while, but the amount of work involved in modifying all of the parts of my site that handle the three different content types was daunting. Then this evening I tried running my blog's source code (using &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; and &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;) through &lt;a href="https://simonwillison.net/2025/Mar/25/gemini/"&gt;the new Gemini 2.5 Pro&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -e py -c &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm -m gemini-2.5-pro-exp-03-25 -s \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;I want to add a new type of content called a Note,&lt;/span&gt;
&lt;span class="pl-s"&gt;  similar to quotation and bookmark and entry but it&lt;/span&gt;
&lt;span class="pl-s"&gt;  only has a markdown text body. Output all of the&lt;/span&gt;
&lt;span class="pl-s"&gt;  code I need to add for that feature and tell me&lt;/span&gt;
&lt;span class="pl-s"&gt;  which files to add  the code to.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Gemini gave me &lt;a href="https://gist.github.com/simonw/6d9fb3e33e7105d391a31367d6a235de#response"&gt;a detailed 13 step plan&lt;/a&gt; covering all of the tedious changes I'd been avoiding having to figure out!&lt;/p&gt;
&lt;p&gt;The code &lt;a href="https://github.com/simonw/simonwillisonblog/pull/527"&gt;is in this PR&lt;/a&gt;, which touched 18 different files. The whole project took around 45 minutes start to finish.&lt;/p&gt;
&lt;p&gt;(I used Claude to &lt;a href="https://claude.ai/share/17656d59-6f52-471e-8aeb-6abbe1464471"&gt;brainstorm names&lt;/a&gt; for the feature - I had it come up with possible nouns and then "rank those by least pretentious to most pretentious", and "notes" came out on top.)&lt;/p&gt;
&lt;p&gt;This is now far too long for a note and should really be upgraded to an entry, but I need to post a first note to make sure everything is working as it should.&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blogging"&gt;blogging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/molly-white"&gt;molly-white&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/site-upgrades"&gt;site-upgrades&lt;/a&gt;&lt;/p&gt;



</summary><category term="blogging"/><category term="projects"/><category term="molly-white"/><category term="ai-assisted-programming"/><category term="claude"/><category term="gemini"/><category term="files-to-prompt"/><category term="site-upgrades"/></entry><entry><title>Putting Gemini 2.5 Pro through its paces</title><link href="https://simonwillison.net/2025/Mar/25/gemini/#atom-tag" rel="alternate"/><published>2025-03-25T20:45:58+00:00</published><updated>2025-03-25T20:45:58+00:00</updated><id>https://simonwillison.net/2025/Mar/25/gemini/#atom-tag</id><summary type="html">
    &lt;p&gt;There's a new release from Google Gemini this morning: &lt;a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking"&gt;the first in the Gemini 2.5&lt;/a&gt; series. Google call it "a thinking model, designed to tackle increasingly complex problems". It's already sat at the top of the &lt;a href="https://lmarena.ai/?leaderboard"&gt;LM Arena leaderboard&lt;/a&gt;, and from initial impressions looks like it may deserve that top spot.&lt;/p&gt;
&lt;p&gt;I just released &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.16"&gt;llm-gemini 0.16&lt;/a&gt; adding support for the new model to my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool. Let's try it out.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#the-pelican-riding-a-bicycle"&gt;The pelican riding a bicycle&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#transcribing-audio"&gt;Transcribing audio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#bounding-boxes"&gt;Bounding boxes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#characteristics"&gt;More characteristics of the model&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#gemini-2-5-pro-is-a-very-strong-new-model"&gt;Gemini 2.5 Pro is a very strong new model&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/25/gemini/#update-it-s-very-good-at-code"&gt;Update: it's very good at code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="the-pelican-riding-a-bicycle"&gt;The pelican riding a bicycle&lt;/h4&gt;
&lt;p&gt;First up, my classic &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/"&gt;generate an SVG of a pelican riding a bicycle&lt;/a&gt; prompt.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Upgrade the plugin&lt;/span&gt;
llm install -U llm-gemini

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Now run the prompt:&lt;/span&gt;
llm -m gemini-2.5-pro-exp-03-25 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Generate an SVG of a pelican riding a bicycle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's pretty solid!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-2.5-pro-pelican.jpg" alt="The pelican is perched at an angle which suggests an effort at having it make contact with the pedals. It has a good large beak." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/c34f7f0c94afcbeab77e170511f6f51f"&gt;the full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This task is meant to be almost impossible: pelicans are the wrong shape to ride bicycles! Given that, I think this is a good attempt - I like it slightly better than my previous favourite &lt;a href="https://simonwillison.net/2025/Feb/24/claude-37-sonnet-and-claude-code/"&gt;Claude 3.7 Sonnet&lt;/a&gt;, which produced this a month ago:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/pelican-claude-3.7-sonnet.svg" alt="The bicycle has pokes but the pedal is in the wrong place, and the pelican perches uncomfortably on the seat with weird looking wings." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="transcribing-audio"&gt;Transcribing audio&lt;/h4&gt;
&lt;p&gt;I had &lt;a href="https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3"&gt;an MP3&lt;/a&gt; lying around from &lt;a href="https://simonwillison.net/2024/Oct/26/russian-spanish-pelican/"&gt;a previous experiment&lt;/a&gt; which mixes English and Spanish. I tried running it with the prompt &lt;code&gt;transcribe&lt;/code&gt; to see what would happen:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-2.5-pro-exp-03-25 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;transcribe&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -a https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I got back this, with timestamps interspersed with the text:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I need you [ 0m0s450ms ] to pretend [ 0m0s880ms ] to be [ 0m0s990ms ] a California [ 0m1s560ms ] brown [ 0m1s850ms ] pelican [ 0m2s320ms ] with [ 0m2s480ms ] a very [ 0m2s990ms ] thick [ 0m3s290ms ] Russian [ 0m3s710ms ] accent, [ 0m4s110ms ] but [ 0m4s540ms ] you [ 0m4s640ms ] talk [ 0m4s830ms ] to me [ 0m4s960ms ] exclusively [ 0m5s660ms ] in Spanish. [ 0m6s200ms ] Oye, [ 0m8s930ms ] camarada, [ 0m9s570ms ] aquí [ 0m10s240ms ] está [ 0m10s590ms ] tu [ 0m10s740ms ] pelícano [ 0m11s370ms ] californiano [ 0m12s320ms ] con [ 0m12s520ms ] acento [ 0m13s100ms ] ruso. [ 0m13s540ms ] Qué [ 0m14s230ms ] tal, [ 0m14s570ms ] tovarisch? [ 0m15s210ms ] Listo [ 0m15s960ms ] para [ 0m16s190ms ] charlar [ 0m16s640ms ] en [ 0m16s750ms ] español? [ 0m17s250ms ] How's [ 0m19s834ms ] your [ 0m19s944ms ] day [ 0m20s134ms ] today? [ 0m20s414ms ] Mi [ 0m22s654ms ] día [ 0m22s934ms ] ha [ 0m23s4ms ] sido [ 0m23s464ms ] volando [ 0m24s204ms ] sobre [ 0m24s594ms ] las [ 0m24s844ms ] olas, [ 0m25s334ms ] buscando [ 0m26s264ms ] peces [ 0m26s954ms ] y [ 0m27s84ms ] disfrutando [ 0m28s14ms ] del [ 0m28s244ms ] sol [ 0m28s664ms ] californiano. [ 0m29s444ms ] Y [ 0m30s314ms ] tú, [ 0m30s614ms ] amigo, ¿ [ 0m31s354ms ] cómo [ 0m31s634ms ] ha [ 0m31s664ms ] estado [ 0m31s984ms ] tu [ 0m32s134ms ] día? [ 0m32s424ms ]&lt;/p&gt;
&lt;/blockquote&gt;
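If you want to do anything with those inline markers, they're easy to parse. Here's a quick Python sketch (my own helper for illustration, not something built into LLM) that converts a marker like `0m0s450ms` into an integer millisecond offset:

```python
import re

# A sketch: convert an inline "0m0s450ms"-style marker from the
# transcript into an integer number of milliseconds.
def parse_marker(marker):
    minutes, seconds, millis = (
        int(g) for g in re.fullmatch(r"(\d+)m(\d+)s(\d+)ms", marker).groups()
    )
    return (minutes * 60 + seconds) * 1000 + millis

print(parse_marker("0m1s560ms"))   # 1560
print(parse_marker("0m29s444ms"))  # 29444
```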
&lt;p&gt;This inspired me to try again, this time including &lt;a href="https://simonwillison.net/2025/Feb/28/llm-schemas/"&gt;a JSON schema&lt;/a&gt; (using LLM's custom schema DSL):&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-2.5-pro-exp-03-25 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;transcribe&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -a https://static.simonwillison.net/static/2025/russian-pelican-in-spanish.mp3 \
  --schema-multi &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;timestamp str: mm:ss,text, language: two letter code&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I got an &lt;em&gt;excellent&lt;/em&gt; response from that:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;es&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Oye, camarada. Aquí está tu pelícano californiano con acento ruso.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:08&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;es&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;¿Qué tal, Tovarish? ¿Listo para charlar en español?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:13&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;en&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;How's your day today?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:19&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;es&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Mi día ha sido volando sobre las olas, buscando peces y disfrutando del sol californiano.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:22&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;es&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;¿Y tú, amigo, cómo ha estado tu día?&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:30&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I confirmed that the timestamps match the audio. This is fantastic.&lt;/p&gt;
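That manual check could also be automated. Here's a small sketch (my own code, using the timestamp values from the response above) that parses the `mm:ss` timestamps and confirms they increase monotonically:

```python
# Sanity-check sketch: parse the mm:ss timestamps from the structured
# response and confirm they increase monotonically.
def to_seconds(ts):
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)

timestamps = ["00:00", "00:08", "00:13", "00:19", "00:22", "00:30"]
seconds = [to_seconds(t) for t in timestamps]
assert seconds == sorted(seconds)
print(seconds)  # [0, 8, 13, 19, 22, 30]
```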
&lt;p&gt;Let's try that against a ten minute snippet of &lt;a href="https://realpython.com/podcasts/rpp/236/"&gt;a podcast episode&lt;/a&gt; I was on:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -m gemini-2.5-pro-exp-03-25 \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;transcribe, first speaker is Christopher, second is Simon&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
  -a ten-minutes-of-podcast.mp3 \
  --schema-multi &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;timestamp str: mm:ss, text, speaker_name&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Useful LLM trick: you can use &lt;code&gt;llm logs -c --data&lt;/code&gt; to get just the JSON data from the most recent prompt response, so I ran this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm logs -c --data &lt;span class="pl-k"&gt;|&lt;/span&gt; jq&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/23d4528e23e4897e0c43458e7c46c360"&gt;the full output JSON&lt;/a&gt;, which starts and ends like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"items"&lt;/span&gt;: [
    {
      &lt;span class="pl-ent"&gt;"speaker_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Christopher&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;on its own and and it has this sort of like a it's like a you know old tree in the forest, you know, kind of thing that you've built, so.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:00&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"speaker_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Simon&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;There's also like I feel like with online writing, never ever like stick something online just expect people to find it. You have to So one of the great things about having a blog is I can be in a conversation about something and somebody ask a question, I can say, oh, I wrote about that two and a half years ago and give people a link.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;00:06&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },

    {
      &lt;span class="pl-ent"&gt;"speaker_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Simon&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;So on that basis, Chat and I can't remember if the free version of Chat GPT has code interpreter.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;09:45&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"speaker_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Simon&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;I hope I think it does.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;09:50&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    {
      &lt;span class="pl-ent"&gt;"speaker_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Christopher&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Okay. So this is like the basic paid one, maybe the $20 month because I know there's like a $200 one that's a little steep for like a basic&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"timestamp"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;09:51&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    }
  ]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A spot check of the timestamps showed them in the right place. Gemini 2.5 supports long context prompts so it's possible this works well for much longer audio files - it would be interesting to dig deeper and try that out.&lt;/p&gt;
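Output in this shape is also convenient to post-process. Here's an illustrative sketch (the items are invented examples, but they use the same field names as the schema) that merges consecutive segments from the same speaker into single turns:

```python
# Sketch: merge consecutive transcript segments from the same speaker
# into single turns, keeping the first segment's timestamp.
def merge_turns(items):
    turns = []
    for item in items:
        if turns and turns[-1]["speaker_name"] == item["speaker_name"]:
            turns[-1]["text"] += " " + item["text"]
        else:
            turns.append(dict(item))
    return turns

items = [
    {"speaker_name": "Simon", "text": "Hello.", "timestamp": "00:00"},
    {"speaker_name": "Simon", "text": "Welcome back.", "timestamp": "00:02"},
    {"speaker_name": "Christopher", "text": "Thanks.", "timestamp": "00:05"},
]
merged = merge_turns(items)
print(len(merged))        # 2
print(merged[0]["text"])  # Hello. Welcome back.
```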
&lt;h4 id="bounding-boxes"&gt;Bounding boxes&lt;/h4&gt;
&lt;p&gt;One of my favourite features of previous Gemini models is their support for bounding boxes: you can prompt them to return boxes around objects in images.&lt;/p&gt;
&lt;p&gt;I built a separate tool for experimenting with this feature in August last year, which I described in &lt;a href="https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/"&gt;Building a tool showing how Gemini Pro can return bounding boxes for objects in images&lt;/a&gt;. I've now upgraded that tool to add support for the new model.&lt;/p&gt;
&lt;p&gt;You can access it at &lt;a href="https://tools.simonwillison.net/gemini-bbox"&gt;tools.simonwillison.net/gemini-bbox&lt;/a&gt; - you'll need to provide your own Gemini API key which is sent directly to their API from your browser (it won't be logged by an intermediary).&lt;/p&gt;
&lt;p&gt;I tried it out on &lt;a href="https://static.simonwillison.net/static/2025/challenging-pelicans.jpg"&gt;a challenging photograph&lt;/a&gt; of some pelicans... and it worked extremely well:&lt;/p&gt;
&lt;p&gt;My prompt was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Return bounding boxes around pelicans as JSON arrays [ymin, xmin, ymax, xmax]&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Gemini models are all trained to return bounding boxes scaled between 0 and 1,000. My tool knows how to convert those back to the same dimensions as the input image.&lt;/p&gt;
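The conversion itself is simple scaling math. Here's a sketch of what my tool does (`to_pixels` is an invented name for illustration; the Gemini boxes come back as `[ymin, xmin, ymax, xmax]` on a normalized scale):

```python
# Sketch: convert a normalized Gemini bounding box back to pixel
# coordinates, returning (left, top, right, bottom).
def to_pixels(box, width, height, scale=1000):
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin / scale * width),
        round(ymin / scale * height),
        round(xmax / scale * width),
        round(ymax / scale * height),
    )

print(to_pixels([250, 100, 750, 900], width=1600, height=1200))
# (160, 300, 1440, 900)
```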
&lt;p&gt;Here's what the visualized result looked like:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/gemini-pelicans-bbox.jpg" alt="Pelicans with bounding boxes around them" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It got almost all of them! I like how it didn't draw a box around the one egret that had made it into the photo.&lt;/p&gt;
&lt;h4 id="characteristics"&gt;More characteristics of the model&lt;/h4&gt;

&lt;p&gt;Here's the &lt;a href="https://ai.google.dev/gemini-api/docs/models#gemini-2.5-pro-exp-03-25"&gt;official model listing&lt;/a&gt; in the Gemini docs. Key details:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Input token limit: 1,000,000&lt;/li&gt;
  &lt;li&gt;Output token limit: 64,000 - this is a &lt;em&gt;huge&lt;/em&gt; upgrade, all of the other listed models have 8,192 for this (correction: &lt;a href="https://deepmind.google/technologies/gemini/flash-thinking/"&gt;Gemini 2.0 Flash Thinking&lt;/a&gt; also had a 64,000 output length)&lt;/li&gt;
  &lt;li&gt;Knowledge cut-off: January 2025 - an improvement on Gemini 2.0's August 2024&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="gemini-2-5-pro-is-a-very-strong-new-model"&gt;Gemini 2.5 Pro is a very strong new model&lt;/h4&gt;
&lt;p&gt;I've hardly scratched the surface when it comes to trying out Gemini 2.5 Pro so far. How's its creative writing? Factual knowledge about the world? Can it write great code in Python, JavaScript, Rust and more?&lt;/p&gt;
&lt;p&gt;The Gemini family of models have capabilities that set them apart from other models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long context length - Gemini 2.5 Pro supports up to 1 million tokens&lt;/li&gt;
&lt;li&gt;Audio input - something which few other models support, certainly not at this length and with this level of timestamp accuracy&lt;/li&gt;
&lt;li&gt;Accurate bounding box detection for image inputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My experiments so far with these capabilities indicate that Gemini 2.5 Pro really is a very strong new model. I'm looking forward to exploring more of what it can do.&lt;/p&gt;

&lt;h4 id="update-it-s-very-good-at-code"&gt;Update: it's very good at code&lt;/h4&gt;
&lt;p&gt;I spent this evening trying it out for coding tasks, and it's very, very impressive. I'm seeing results for Python that feel comparable to my previous favourite Claude 3.7 Sonnet, and appear to be benefitting from Gemini 2.5 Pro's default reasoning mode and long context.&lt;/p&gt;
&lt;p&gt;I've been wanting to add a new content type of "notes" to my blog for quite a while now, but I was put off by the anticipated tedium of all of the different places in the codebase that would need to be updated.&lt;/p&gt;
&lt;p&gt;That feature is now live. &lt;a href="https://simonwillison.net/2025/Mar/26/notes/"&gt;Here are my notes&lt;/a&gt; on creating that notes feature using Gemini 2.5 Pro. It crunched through my entire codebase and figured out all of the places I needed to change - 18 files in total, as you can see in &lt;a href="https://github.com/simonw/simonwillisonblog/pull/527/files"&gt;the resulting PR&lt;/a&gt;. The whole project took about 45 minutes from start to finish - averaging less than three minutes per file I had to modify.&lt;/p&gt;
&lt;p&gt;I've thrown a whole bunch of other coding challenges at it, and the bottleneck on evaluating them has become my own mental capacity to review the resulting code!&lt;/p&gt;
&lt;p&gt;Here's another, more complex example. This hasn't resulted in actual running code yet but it took a big bite out of an architectural design problem I've been stewing on for a very long time.&lt;/p&gt;

&lt;p&gt;My &lt;a href="https://llm.datasette.io/"&gt;LLM project&lt;/a&gt; needs support for tools - a way to teach different LLMs how to request tool execution, then have those tools (implemented in Python) run and return their results back to the models.&lt;/p&gt;

&lt;p&gt;Designing this is really hard, because I need to create an abstraction that works across multiple different model providers, each powered by a different plugin.&lt;/p&gt;

&lt;p&gt;Could Gemini 2.5 Pro help unblock me by proposing an architectural approach that might work?&lt;/p&gt;

&lt;p&gt;I started by combining the Python and Markdown files for my &lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt;, and &lt;a href="https://github.com/simonw/llm-anthropic"&gt;llm-anthropic&lt;/a&gt; repositories into one big document:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt sqlite-utils llm llm-gemini llm-anthropic -e md -e py -c&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Piping it through &lt;a href="https://github.com/simonw/ttok"&gt;ttok&lt;/a&gt; showed that to be 316,098 tokens (counted with the OpenAI tokenizer, but the Gemini tokenizer likely produces a similar number).&lt;/p&gt;
&lt;p&gt;Then I fed that all into Gemini 2.5 Pro with the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Provide a detailed design proposal for adding tool calling support to LLM&lt;/p&gt;
&lt;p&gt;Different model plugins will have different ways of executing tools, but LLM itself should provide both a Python and a CLI abstraction over these differences.&lt;/p&gt;
&lt;p&gt;Tool calling involves passing tool definitions to a model, the model then replies with tools it would like executed, the harness code then executes those and passes the result back to the model and continues in a loop until the model stops requesting tools. This should happen as part of the existing llm.Conversation mechanism.&lt;/p&gt;
&lt;p&gt;At the Python layer tools should be provided as Python functions that use type hints, which can then be converted into JSON schema using Pydantic - similar to how the existing schema= mechanism works.&lt;/p&gt;
&lt;p&gt;For the CLI option tools will be provided as an option passed to the command, similar to how the sqlite-utils convert CLI mechanism works.&lt;/p&gt;
&lt;/blockquote&gt;
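To illustrate the type-hints-to-schema idea from that prompt, here's a stdlib-only sketch (the real design would lean on Pydantic, and `tool_schema` and `lookup_weather` are invented names):

```python
import inspect

# Minimal mapping from Python annotations to JSON schema types -
# Pydantic handles far more cases; this is just the core idea.
TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    props = {
        name: {"type": TYPES[param.annotation]}
        for name, param in inspect.signature(fn).parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": props,
            "required": list(props),
        },
    }

def lookup_weather(city: str, celsius: bool) -> str:
    "Return the current weather for a city."
    raise NotImplementedError

print(tool_schema(lookup_weather)["parameters"]["properties"])
# {'city': {'type': 'string'}, 'celsius': {'type': 'boolean'}}
```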
&lt;p&gt;I've been thinking about this problem for over a year now. &lt;a href="https://gist.github.com/simonw/481f46a779f20d8650b4fc5854f070c5"&gt;Gemini 2.5 Pro's response&lt;/a&gt; isn't exactly what I'm going to do, but it did include a flurry of genuinely useful suggestions to help me craft my final approach.&lt;/p&gt;
&lt;p&gt;I hadn't thought about the need for asyncio support for tool functions at all, so this idea from Gemini 2.5 Pro was very welcome:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;AsyncConversation.prompt()&lt;/code&gt; method will mirror the synchronous API, accepting the &lt;code&gt;tools&lt;/code&gt; parameter and handling the tool calling loop using &lt;code&gt;async/await&lt;/code&gt; for tool execution if the tool functions themselves are async. If tool functions are synchronous, they will be run in a thread pool executor via &lt;code&gt;asyncio.to_thread&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
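That dispatch pattern is straightforward to sketch with the standard library. `call_tool` and the two example tools here are hypothetical, but the `asyncio.to_thread` mechanism is the one Gemini suggested:

```python
import asyncio
import inspect

# Sketch of the suggested dispatch: await async tool functions
# directly, run synchronous ones in a thread pool executor.
async def call_tool(tool, *args):
    if inspect.iscoroutinefunction(tool):
        return await tool(*args)
    return await asyncio.to_thread(tool, *args)

def sync_tool(x):
    return x * 2

async def async_tool(x):
    return x + 1

async def main():
    print(await call_tool(sync_tool, 21))   # 42
    print(await call_tool(async_tool, 41))  # 42

asyncio.run(main())
```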
&lt;p&gt;Watching Gemini crunch through some of the more complex problems I've been dragging my heels on reminded me of something Harper Reed wrote &lt;a href="https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/"&gt;about his workflow with LLMs for code&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;My hack to-do list is empty because I built everything. I keep thinking of new things and knocking them out while watching a movie or something.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've been stressing quite a bit about my backlog of incomplete projects recently. I don't think Gemini 2.5 Pro is quite going to work through all of them while I'm watching TV, but it does feel like it's going to help make a significant dent in them.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="vision-llms"/><category term="pelican-riding-a-bicycle"/><category term="llm-reasoning"/><category term="llm-release"/><category term="files-to-prompt"/><category term="chatbot-arena"/></entry><entry><title>microsoft/playwright-mcp</title><link href="https://simonwillison.net/2025/Mar/25/playwright-mcp/#atom-tag" rel="alternate"/><published>2025-03-25T01:40:05+00:00</published><updated>2025-03-25T01:40:05+00:00</updated><id>https://simonwillison.net/2025/Mar/25/playwright-mcp/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/microsoft/playwright-mcp"&gt;microsoft/playwright-mcp&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Playwright team at Microsoft have released an MCP (&lt;a href="https://github.com/microsoft/playwright-mcp"&gt;Model Context Protocol&lt;/a&gt;) server wrapping Playwright, and it's pretty fascinating.&lt;/p&gt;
&lt;p&gt;They implemented it on top of the Chrome accessibility tree, so MCP clients (such as the Claude Desktop app) can use it to drive an automated browser and use the accessibility tree to read and navigate pages that they visit.&lt;/p&gt;
&lt;p&gt;Trying it out is quite easy if you have Claude Desktop and Node.js installed already. Edit your &lt;code&gt;claude_desktop_config.json&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;code ~/Library/Application\ Support/Claude/claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And add this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"mcpServers"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"playwright"&lt;/span&gt;: {
      &lt;span class="pl-ent"&gt;"command"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;npx&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
      &lt;span class="pl-ent"&gt;"args"&lt;/span&gt;: [
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;@playwright/mcp@latest&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
      ]
    }
  }
}&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now when you launch Claude Desktop various new browser automation tools will be available to it, and you can tell Claude to navigate to a website and interact with it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Claude interface showing a conversation about Datasette. The interface shows Claude responding to a user (SW) after navigating to datasette.io. Claude's response includes page details (URL: https://datasette.io/, Title: Datasette: An open source multi-tool for exploring and publishing data) and a summary of what's visible on the site: a description of Datasette as an open-source tool for exploring and publishing data, the tagline &amp;quot;Find stories in data&amp;quot;, navigation options, and features including exploratory data analysis, instant data publishing, and rapid prototyping." src="https://static.simonwillison.net/static/2025/claude-playwright.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I ran the following to get a list of the available tools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/microsoft/playwright-mcp
cd playwright-mcp/src/tools
files-to-prompt . | llm -m claude-3.7-sonnet \
  'Output a detailed description of these tools'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://gist.github.com/simonw/69200999149221c549c1f62e7befa20f"&gt;full output is here&lt;/a&gt;, but here's the truncated tool list:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h4&gt;Navigation Tools (&lt;code&gt;common.ts&lt;/code&gt;)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;browser_navigate&lt;/strong&gt;: Navigate to a specific URL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_go_back&lt;/strong&gt;: Navigate back in browser history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_go_forward&lt;/strong&gt;: Navigate forward in browser history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_wait&lt;/strong&gt;: Wait for a specified time in seconds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_press_key&lt;/strong&gt;: Press a keyboard key&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_save_as_pdf&lt;/strong&gt;: Save current page as PDF&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_close&lt;/strong&gt;: Close the current page&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Screenshot and Mouse Tools (&lt;code&gt;screenshot.ts&lt;/code&gt;)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;browser_screenshot&lt;/strong&gt;: Take a screenshot of the current page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_move_mouse&lt;/strong&gt;: Move mouse to specific coordinates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_click&lt;/strong&gt; (coordinate-based): Click at specific x,y coordinates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_drag&lt;/strong&gt; (coordinate-based): Drag mouse from one position to another&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_type&lt;/strong&gt; (keyboard): Type text and optionally submit&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Accessibility Snapshot Tools (&lt;code&gt;snapshot.ts&lt;/code&gt;)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;browser_snapshot&lt;/strong&gt;: Capture accessibility structure of the page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_click&lt;/strong&gt; (element-based): Click on a specific element using accessibility reference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_drag&lt;/strong&gt; (element-based): Drag between two elements&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_hover&lt;/strong&gt;: Hover over an element&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;browser_type&lt;/strong&gt; (element-based): Type text into a specific element&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/playwright"&gt;playwright&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/model-context-protocol"&gt;model-context-protocol&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="playwright"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="llm-tool-use"/><category term="model-context-protocol"/><category term="files-to-prompt"/></entry><entry><title>Here's how I use LLMs to help me write code</title><link href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#atom-tag" rel="alternate"/><published>2025-03-11T14:09:03+00:00</published><updated>2025-03-11T14:09:03+00:00</updated><id>https://simonwillison.net/2025/Mar/11/using-llms-for-code/#atom-tag</id><summary type="html">
    &lt;p&gt;Online discussions about &lt;a href="https://simonwillison.net/tags/ai-assisted-programming/"&gt;using Large Language Models to help write code&lt;/a&gt; inevitably produce comments from developers whose experiences have been disappointing. They often ask what they're doing wrong - how come some people are reporting such great results when their own experiments have proved lacking?&lt;/p&gt;
&lt;p&gt;Using LLMs to write code is &lt;strong&gt;difficult&lt;/strong&gt; and &lt;strong&gt;unintuitive&lt;/strong&gt;. It takes significant effort to figure out the sharp and soft edges of using them in this way, and there's precious little guidance to help people figure out how best to apply them.&lt;/p&gt;
&lt;p&gt;If someone tells you that coding with LLMs is &lt;em&gt;easy&lt;/em&gt; they are (probably unintentionally) misleading you. They may well have stumbled on to patterns that work, but those patterns do not come naturally to everyone.&lt;/p&gt;
&lt;p&gt;I've been getting great results out of LLMs for code for over two years now. Here's my attempt at transferring some of that experience and intuition to you.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#set-reasonable-expectations"&gt;Set reasonable expectations&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#account-for-training-cut-off-dates"&gt;Account for training cut-off dates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#context-is-king"&gt;Context is king&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#ask-them-for-options"&gt;Ask them for options&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#tell-them-exactly-what-to-do"&gt;Tell them exactly what to do&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#you-have-to-test-what-it-writes-"&gt;You have to test what it writes!&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#remember-it-s-a-conversation"&gt;Remember it's a conversation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#use-tools-that-can-run-the-code-for-you"&gt;Use tools that can run the code for you&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#vibe-coding-is-a-great-way-to-learn"&gt;Vibe-coding is a great way to learn&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a-detailed-example"&gt;A detailed example using Claude Code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#be-ready-for-the-human-to-take-over"&gt;Be ready for the human to take over&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#the-biggest-advantage-is-speed-of-development"&gt;The biggest advantage is speed of development&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#llms-amplify-existing-expertise"&gt;LLMs amplify existing expertise&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#bonus-answering-questions-about-codebases"&gt;Bonus: answering questions about codebases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h4 id="set-reasonable-expectations"&gt;Set reasonable expectations&lt;/h4&gt;
&lt;p&gt;Ignore the "AGI" hype - LLMs are still fancy autocomplete. All they do is predict a sequence of tokens - but it turns out writing code is mostly about stringing tokens together in the right order, so they can be &lt;em&gt;extremely&lt;/em&gt; useful for this provided you point them in the right direction.&lt;/p&gt;
&lt;p&gt;If you assume that this technology will implement your project perfectly without you needing to exercise any of your own skill you'll quickly be disappointed.&lt;/p&gt;
&lt;p&gt;Instead, use them to &lt;em&gt;augment&lt;/em&gt; your abilities. My current favorite mental model is to think of them as an over-confident pair programming assistant who's lightning fast at looking things up, can churn out relevant examples at a moment's notice and can execute on tedious tasks without complaint.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-confident&lt;/strong&gt; is important. They'll absolutely make mistakes - sometimes subtle, sometimes huge. These mistakes can be &lt;a href="https://simonwillison.net/2025/Mar/2/kellan-elliott-mccrea/"&gt;deeply inhuman&lt;/a&gt; - if a human collaborator hallucinated a non-existent library or method you would instantly lose trust in them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't fall into the trap of anthropomorphizing LLMs and assuming that failures which would discredit a human should discredit the machine in the same way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with LLMs you'll often find things that they just cannot do. Make a note of these - they are useful lessons! They're also valuable examples to stash away for the future - a sign of a strong new model is when it produces usable results for a task that previous models had been unable to handle.&lt;/p&gt;
&lt;h4 id="account-for-training-cut-off-dates"&gt;Account for training cut-off dates&lt;/h4&gt;
&lt;p&gt;A crucial characteristic of any model is its &lt;strong&gt;training cut-off date&lt;/strong&gt;. This is the date at which the data they were trained on stopped being collected. For OpenAI's models this is usually October 2023 or May 2024. Other providers may have more recent dates.&lt;/p&gt;
&lt;p&gt;This is &lt;em&gt;extremely&lt;/em&gt; important for code, because it influences what libraries they will be familiar with. If the library you are using had a major breaking change since October 2023, some OpenAI models won't know about it!&lt;/p&gt;
&lt;p&gt;I gain enough value from LLMs that I now deliberately consider this when picking a library - I try to stick with libraries with good stability and that are popular enough that many examples of them will have made it into the training data. I like applying the principles of &lt;a href="https://boringtechnology.club/"&gt;boring technology&lt;/a&gt; - innovate on your project's unique selling points, stick with tried and tested solutions for everything else.&lt;/p&gt;
&lt;p&gt;LLMs can still help you work with libraries that exist outside their training data, but you need to put in more work - you'll need to feed them recent examples of how those libraries should be used as part of your prompt.&lt;/p&gt;
&lt;p&gt;This brings us to the most important thing to understand when working with LLMs:&lt;/p&gt;
&lt;h4 id="context-is-king"&gt;Context is king&lt;/h4&gt;
&lt;p&gt;Most of the craft of getting good results out of an LLM comes down to managing its context - the text that is part of your current conversation.&lt;/p&gt;
&lt;p&gt;This context isn't just the prompt that you have fed it: successful LLM interactions usually take the form of conversations, and the context consists of every message from you &lt;em&gt;and&lt;/em&gt; every reply from the LLM that exist in the current conversation thread.&lt;/p&gt;
&lt;p&gt;When you start a new conversation you reset that context back to zero. This is important to know, as often the fix for a conversation that has stopped being useful is to wipe the slate clean and start again.&lt;/p&gt;
&lt;p&gt;Some LLM coding tools go beyond just the conversation. Claude Projects for example allow you to pre-populate the context with quite a large amount of text - including a recent ability to &lt;a href="https://support.anthropic.com/en/articles/10167454-using-the-github-integration"&gt;import code directly from a GitHub&lt;/a&gt; repository which I'm using a &lt;em&gt;lot&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Tools like Cursor and VS Code Copilot include context from your current editor session and file layout automatically, and you can sometimes use mechanisms like &lt;a href="https://docs.cursor.com/context/@-symbols/overview"&gt;Cursor's @commands&lt;/a&gt; to pull in additional files or documentation.&lt;/p&gt;
&lt;p&gt;One of the reasons I mostly work directly with the &lt;a href="https://chatgpt.com/"&gt;ChatGPT&lt;/a&gt; and &lt;a href="https://claude.ai/"&gt;Claude&lt;/a&gt; web or app interfaces is that it makes it easier for me to understand exactly what is going into the context. LLM tools that obscure that context from me are &lt;em&gt;less&lt;/em&gt; effective.&lt;/p&gt;
&lt;p&gt;You can use the fact that previous replies are also part of the context to your advantage. For complex coding tasks try getting the LLM to write a simpler version first, check that it works and then iterate on building to the more sophisticated implementation.&lt;/p&gt;
&lt;p&gt;I often start a new chat by dumping in existing code to seed that context, then work with the LLM to modify it in some way.&lt;/p&gt;
&lt;p&gt;One of my favorite code prompting techniques is to drop in several full examples relating to something I want to build, then prompt the LLM to use them as inspiration for a new project. I wrote about that in detail when I &lt;a href="https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/"&gt;described my JavaScript OCR application&lt;/a&gt; that combines Tesseract.js and PDF.js - two libraries I had used in the past and for which I could provide working examples in the prompt.&lt;/p&gt;
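&lt;p&gt;A minimal sketch of that examples-first pattern - note the &lt;code&gt;build_prompt&lt;/code&gt; helper and the file names here are hypothetical illustrations, not the exact code from that project:&lt;/p&gt;

```python
def build_prompt(task, example_paths):
    # Concatenate complete working examples ahead of the actual request,
    # so the model can pattern-match against code that is known to work.
    parts = []
    for path in example_paths:
        parts.append("Example (" + path.name + "):\n" + path.read_text())
    parts.append(task)
    return "\n\n---\n\n".join(parts)
```

&lt;p&gt;The key design choice is that the examples are &lt;em&gt;complete&lt;/em&gt; working files, not snippets - the model does much better when it can see how the libraries fit together end-to-end.&lt;/p&gt;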
&lt;h4 id="ask-them-for-options"&gt;Ask them for options&lt;/h4&gt;
&lt;p&gt;Most of my projects start with some open questions: is the thing I'm trying to do possible? What are the potential ways I could implement it? Which of those options are the &lt;em&gt;best&lt;/em&gt;?&lt;/p&gt;
&lt;p&gt;I use LLMs as part of this initial research phase.&lt;/p&gt;
&lt;p&gt;I'll use prompts like "what are options for HTTP libraries in Rust? Include usage examples" - or "what are some useful drag-and-drop libraries in JavaScript? Build me an artifact demonstrating each one" (to Claude).&lt;/p&gt;
&lt;p&gt;The training cut-off is relevant here, since it means newer libraries won't be suggested. Usually that's OK - I don't want the latest, I want the most stable and the one that has been around for long enough for the bugs to be ironed out.&lt;/p&gt;
&lt;p&gt;If I'm going to use something more recent I'll do that research myself, outside of LLM world.&lt;/p&gt;
&lt;p&gt;The best way to start any project is with a prototype that proves that the key requirements of that project can be met. I often find that an LLM can get me to that working prototype within a few minutes of me sitting down with my laptop - or sometimes even while working on my phone.&lt;/p&gt;
&lt;h4 id="tell-them-exactly-what-to-do"&gt;Tell them exactly what to do&lt;/h4&gt;
&lt;p&gt;Once I've completed the initial research I change modes dramatically. For production code my LLM usage is much more authoritarian: I treat it like a digital intern, hired to type code for me based on my detailed instructions.&lt;/p&gt;
&lt;p&gt;Here's a recent example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write a Python function that uses asyncio httpx with this signature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;async def download_db(url, max_size_bytes=5 * 1025 * 1025): -&amp;gt; pathlib.Path
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given a URL, this downloads the database to a temp directory and returns a path to it. BUT it checks the content length header at the start of streaming back that data and, if it's more than the limit, raises an error. When the download finishes it uses &lt;code&gt;sqlite3.connect(...)&lt;/code&gt; and then runs a &lt;code&gt;PRAGMA quick_check&lt;/code&gt; to confirm the SQLite data is valid - raising an error if not. Finally, if the content length header lies to us -  if it says 2MB but we download 3MB - we get an error raised as soon as we notice that problem.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I could write this function myself, but it would take me the better part of fifteen minutes to look up all of the details and get the code working right. Claude knocked it out &lt;a href="https://gist.github.com/simonw/5aed8bd87016c77465c23e0dc4563ec9"&gt;in 15 seconds&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I find LLMs respond extremely well to function signatures like the one I use here. I get to act as the function designer, the LLM does the work of building the body to my specification.&lt;/p&gt;
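&lt;p&gt;For illustration, here's a hedged sketch of just the SQLite validation step from that spec. This is my reconstruction of what that piece might look like, not the code Claude actually produced (that's in the linked gist):&lt;/p&gt;

```python
import sqlite3


def validate_sqlite(path):
    # Open the downloaded file and run PRAGMA quick_check, raising
    # ValueError if the data is not a valid SQLite database.
    conn = sqlite3.connect(path)
    try:
        result = conn.execute("PRAGMA quick_check").fetchone()[0]
    except sqlite3.DatabaseError as ex:
        raise ValueError(f"not a SQLite database: {ex}")
    finally:
        conn.close()
    if result != "ok":
        raise ValueError(f"integrity check failed: {result}")
```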
&lt;p&gt;I'll often follow-up with "Now write me the tests using pytest". Again, I dictate my technology of choice - I want the LLM to save me the time of having to type out the code that's sitting in my head already.&lt;/p&gt;
&lt;p&gt;If your reaction to this is "surely typing out the code is faster than typing out an English instruction of it", all I can tell you is that it really isn't for me any more. Code needs to be correct. English has enormous room for shortcuts, and vagaries, and typos, and saying things like "use that popular HTTP library" if you can't remember the name off the top of your head.&lt;/p&gt;
&lt;p&gt;The good coding LLMs are excellent at filling in the gaps. They're also much less lazy than me - they'll remember to catch likely exceptions, add accurate docstrings, and annotate code with the relevant types.&lt;/p&gt;
&lt;h4 id="you-have-to-test-what-it-writes-"&gt;You have to test what it writes!&lt;/h4&gt;
&lt;p&gt;I wrote about this &lt;a href="https://simonwillison.net/2025/Mar/2/hallucinations-in-code/#qa"&gt;at length last week&lt;/a&gt;: the one thing you absolutely cannot outsource to the machine is testing that the code actually works.&lt;/p&gt;
&lt;p&gt;Your responsibility as a software developer is to deliver working systems. If you haven't seen it run, it's not a working system. You need to invest in strengthening those manual QA habits.&lt;/p&gt;
&lt;p&gt;This may not be glamorous but it's always been a critical part of shipping good code, with or without the involvement of LLMs.&lt;/p&gt;
&lt;h4 id="remember-it-s-a-conversation"&gt;Remember it's a conversation&lt;/h4&gt;
&lt;p&gt;If I don't like what an LLM has written, it will &lt;em&gt;never&lt;/em&gt; complain at being told to refactor it! "Break that repetitive code out into a function", "use string manipulation methods rather than a regular expression", or even "write that better!" - the code an LLM produces first time is rarely the final implementation, but it can re-type it dozens of times for you without ever getting frustrated or bored.&lt;/p&gt;
&lt;p&gt;Occasionally I'll get a great result from my first prompt - more frequently the more I practice - but I expect to need at least a few follow-ups.&lt;/p&gt;
&lt;p&gt;I often wonder if this is one of the key tricks that people are missing - a bad initial result isn't a failure, it's a starting point for pushing the model in the direction of the thing you actually want.&lt;/p&gt;
&lt;h4 id="use-tools-that-can-run-the-code-for-you"&gt;Use tools that can run the code for you&lt;/h4&gt;
&lt;p&gt;An increasing number of LLM coding tools now have the ability to &lt;em&gt;run that code&lt;/em&gt; for you. I'm slightly cautious about some of these since there's a possibility of the wrong command causing real damage, so I tend to stick to the ones that run code in a safe sandbox. My favorites right now are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT Code Interpreter&lt;/strong&gt;, where ChatGPT can write and then execute Python code directly in a Kubernetes sandbox VM managed by OpenAI. This is completely safe - it can't even make outbound network connections so really all that can happen is the temporary filesystem gets mangled and then reset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Artifacts&lt;/strong&gt;, where Claude can build you a full HTML+JavaScript+CSS web application that is displayed within the Claude interface. This web app is displayed in a &lt;em&gt;very&lt;/em&gt; locked down iframe sandbox, greatly restricting what it can do but preventing problems like accidental exfiltration of your private Claude data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT Canvas&lt;/strong&gt; is a newer ChatGPT feature with similar capabilities to Claude Artifacts. I have not explored this enough myself yet.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And if you're willing to live a little more dangerously:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.cursor.com/"&gt;Cursor&lt;/a&gt;&lt;/strong&gt; has an "Agent" feature that can do this, as does &lt;strong&gt;&lt;a href="https://codeium.com/windsurf"&gt;Windsurf&lt;/a&gt;&lt;/strong&gt; and a growing number of other editors. I haven't spent enough time with these to make recommendations yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://aider.chat/"&gt;Aider&lt;/a&gt;&lt;/strong&gt; is the leading open source implementation of these kinds of patterns, and is a great example of &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food"&gt;dogfooding&lt;/a&gt; - recent releases of Aider have been &lt;a href="https://aider.chat/HISTORY.html"&gt;80%+ written&lt;/a&gt; by Aider itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview"&gt;Claude Code&lt;/a&gt;&lt;/strong&gt; is Anthropic's new entrant into this space. I'll provide a detailed description of using that tool shortly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This run-the-code-in-a-loop pattern is so powerful that I chose my core LLM tools for coding based primarily on whether they can safely run and iterate on my code.&lt;/p&gt;
&lt;h4 id="vibe-coding-is-a-great-way-to-learn"&gt;Vibe-coding is a great way to learn&lt;/h4&gt;
&lt;p&gt;Andrej Karpathy &lt;a href="https://simonwillison.net/2025/Feb/6/andrej-karpathy/"&gt;coined the term&lt;/a&gt; vibe-coding just over a month ago, and it has stuck:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. [...] I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Andrej suggests this is "not too bad for throwaway weekend projects". It's also a &lt;em&gt;fantastic&lt;/em&gt; way to explore the capabilities of these models - and really fun.&lt;/p&gt;
&lt;p&gt;The best way to learn LLMs is to play with them. Throwing absurd ideas at them and vibe-coding until they almost sort-of work is a genuinely useful way to accelerate the rate at which you build intuition for what works and what doesn't.&lt;/p&gt;
&lt;p&gt;I've been vibe-coding since before Andrej gave it a name! My &lt;a href="https://github.com/simonw/tools"&gt;simonw/tools&lt;/a&gt; GitHub repository has 77 HTML+JavaScript apps and 6 Python apps, and every single one of them was built by prompting LLMs. I have learned &lt;em&gt;so much&lt;/em&gt; from building this collection, and I add to it at a rate of several new prototypes per week.&lt;/p&gt;
&lt;p&gt;You can try most of mine out directly on &lt;a href="https://tools.simonwillison.net/"&gt;tools.simonwillison.net&lt;/a&gt; - a GitHub Pages published version of the repo. I wrote more detailed notes on some of these back in October in &lt;a href="https://simonwillison.net/2024/Oct/21/claude-artifacts/"&gt;Everything I built with Claude Artifacts this week&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want to see the transcript of the chat used for each one it's almost always linked to in the commit history for that page - or visit the new &lt;a href="https://tools.simonwillison.net/colophon"&gt;colophon page&lt;/a&gt; for an index that includes all of those links.&lt;/p&gt;
&lt;h4 id="a-detailed-example"&gt;A detailed example using Claude Code&lt;/h4&gt;
&lt;p&gt;While I was writing this article I had the idea for that &lt;a href="https://tools.simonwillison.net/colophon"&gt;tools.simonwillison.net/colophon&lt;/a&gt; page - I wanted something I could link to that showed the commit history of each of my tools in a more obvious way than GitHub.&lt;/p&gt;
&lt;p&gt;I decided to use that as an opportunity to demonstrate my AI-assisted coding process.&lt;/p&gt;
&lt;p&gt;For this one I used &lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview"&gt;Claude Code&lt;/a&gt;, because I wanted it to be able to run Python code directly against my existing tools repository on my laptop.&lt;/p&gt;
&lt;p&gt;Running the &lt;code&gt;/cost&lt;/code&gt; command at the end of my session showed me this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; /cost 
  ⎿  Total cost: $0.61
     Total duration (API): 5m 31.2s
     Total duration (wall): 17m 18.7s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The initial project took me just over 17 minutes from start to finish, and cost me 61 cents in API calls to Anthropic.&lt;/p&gt;
&lt;p&gt;I used the authoritarian process where I told the model exactly what I wanted to build. Here's my sequence of prompts (&lt;a href="https://gist.github.com/simonw/323e1b00ee4f8453c7834a7560eeafc1"&gt;full transcript here&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I started by asking for an initial script to gather the data needed for the new page:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Almost all of the HTML files in this directory were created using Claude prompts, and the details of those prompts are linked in the commit messages. Build a Python script that checks the commit history for each HTML file in turn and extracts any URLs from those commit messages into a list. It should then output a JSON file with this structure: {"pages": {"name-of-file.html": ["url"], {"name-of-file-2.html": ["url1", "url2"], ... - as you can see, some files may have more than one URL in their commit history. The script should be called gather_links.py and it should save a JSON file called gathered_links.json&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really didn't think very hard about this first prompt - it was more of a stream of consciousness that I typed into the bot as I thought about the initial problem.&lt;/p&gt;
&lt;p&gt;I inspected the initial result and spotted some problems:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It looks like it just got the start of the URLs, it should be getting the whole URLs which might be to different websites - so just get anything that starts https:// and ends with whitespace or the end of the commit message&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I changed my mind - I wanted those full commit messages too:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Update the script - I want to capture the full commit messages AND the URLs - the new format should be {"pages": {"aria-live-regions.html": {"commits": [{"hash": hash, "message": message, "date": iso formatted date], "urls": [list of URLs like before]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Providing examples like this is a great shortcut to getting exactly what you want.&lt;/p&gt;
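&lt;p&gt;The URL rule from that earlier follow-up - anything starting &lt;code&gt;https://&lt;/code&gt; up to whitespace or the end of the message - boils down to a single regular expression. This is a hedged sketch of that one rule, not the actual &lt;code&gt;gather_links.py&lt;/code&gt;:&lt;/p&gt;

```python
import re


def extract_urls(commit_message):
    # Grab anything that starts https:// and runs until whitespace
    # or the end of the commit message.
    return re.findall(r"https://\S+", commit_message)
```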
&lt;p&gt;Note that at no point have I looked at the code it's written in &lt;a href="https://github.com/simonw/tools/blob/87e2577983f11fc9c7bf7b7a268cf2404a21e1c5/gather_links.py"&gt;gather_links.py&lt;/a&gt;! This is pure vibe-coding: I'm looking at what it's doing, but I've left the implementation details entirely up to the LLM.&lt;/p&gt;
&lt;p&gt;The JSON looked good to me, so I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is working great. Write me a new script called build_colophon.py which looks through that gathered JSON file and builds and saves an HTML page. The page should be mobile friendly and should list every page - with a link to that page - and for each one display the commit messages neatly (convert newlines to br and linkify URLs but no other formatting) - plus the commit message dates and links to the commits themselves which are in &lt;a href="https://github.com/simonw/tools"&gt;https://github.com/simonw/tools&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude knows how GitHub URLs work, so telling it to link to the commits and providing the repo name was enough for it to guess &lt;code&gt;https://github.com/simonw/tools/commit/fd9daf885c924ba277806b3440457d52b0ad90a8&lt;/code&gt; for those commit URLs.&lt;/p&gt;
&lt;p&gt;I tend to find Claude has good default taste when it comes to web page design - I said "the page should be mobile friendly" and left it at that.&lt;/p&gt;
&lt;p&gt;Claude churned away and built me a page that wasn't right, so I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it's not working right. ocr.html had a bunch of commits but in colophon.html there is only one link and heading for the first commit and the rest are shown within that same block - there should be separate HTML chunks with links and formatted dates for each of the other commits. Also the neatly formatted date should include the HH:MM as well as the date&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It fixed the bug all on its own, leaving just two changes I decided to make:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it's almost perfect, but each page should have the commits displayed in the opposite order - oldest first&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And then:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One last change - the pages are currently listed alphabetically, lets instead list them with the most recently modified at the top&lt;/p&gt;
&lt;/blockquote&gt;
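&lt;p&gt;Those two ordering requests amount to a pair of small sorts. Here's a hedged sketch, assuming the data shapes described in the prompts above rather than the actual &lt;code&gt;build_colophon.py&lt;/code&gt; internals:&lt;/p&gt;

```python
def order_colophon(pages):
    # pages maps a filename to a list of commit dicts with ISO "date" keys.
    # Within each page: commits oldest first.
    for commits in pages.values():
        commits.sort(key=lambda c: c["date"])
    # Across pages: most recently modified page at the top.
    # ISO 8601 dates sort correctly as plain strings.
    return sorted(
        pages.items(),
        key=lambda item: max(c["date"] for c in item[1]),
        reverse=True,
    )
```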
&lt;p&gt;And that was the entire project! Here's &lt;a href="https://github.com/simonw/tools/blob/1e04f12a1cacea8856946162457d0d77e60ee549/build_colophon.py"&gt;build_colophon.py&lt;/a&gt;, and the page it generated came out &lt;a href="https://static.simonwillison.net/static/2025/colophon.html"&gt;looking pretty good&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/colophon.jpg" alt="Tools Colophon. This page documents the creation of the tools on tools.simonwillison.net, including links to the Claude conversations used to build them. social-media-cropper.html b4a2bc December 10, 2024 20:35 Social media cropper  https://gist.github.com/simonw/12b8f88932a71450071190e1289a17e9 a10954 February 28, 2025 16:02 Support 2:1 and 14:10 ratios  https://gist.github.com/simonw/e23917eddcbb368c9b6180d581f8f40a" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;There was one remaining task: I needed to deploy the new colophon as part of my site, but I didn't want to check that new &lt;code&gt;colophon.html&lt;/code&gt; page into the repository itself. I wanted a custom GitHub Pages build process.&lt;/p&gt;
&lt;p&gt;I started a brand new Claude Code session (clearing the slate) to see if that was something Claude could figure out too:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This repo is deployed using GitHub Pages. I want to modify the build slightly - before deploying to GitHub Pages it should run python gather_links.py which will generate a file called gathered_links.json - then it should run python build_colophon.py which will use that file to create colophon.html - I want both gathered_links.json and colophon.html to be published to GitHub pages but I do not want them checked into the repo itself&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unlike the first time, this time I watched what it was doing very closely - I don't know how to customize GitHub Pages builds in this way and I wanted to both learn how to do it and stay cautious that it might hallucinate details and fail at the task.&lt;/p&gt;
&lt;p&gt;It told me it wanted to create this file and asked for my permission:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Deploy to GitHub Pages&lt;/span&gt;

&lt;span class="pl-ent"&gt;on&lt;/span&gt;:
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Run on push to main branch&lt;/span&gt;
  &lt;span class="pl-ent"&gt;push&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;branches&lt;/span&gt;: &lt;span class="pl-s"&gt;[ main ]&lt;/span&gt;
  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow manual trigger&lt;/span&gt;
  &lt;span class="pl-ent"&gt;workflow_dispatch&lt;/span&gt;:

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Sets permissions of the GITHUB_TOKEN&lt;/span&gt;
&lt;span class="pl-ent"&gt;permissions&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;contents&lt;/span&gt;: &lt;span class="pl-s"&gt;read&lt;/span&gt;
  &lt;span class="pl-ent"&gt;pages&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;
  &lt;span class="pl-ent"&gt;id-token&lt;/span&gt;: &lt;span class="pl-s"&gt;write&lt;/span&gt;

&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Allow only one concurrent deployment&lt;/span&gt;
&lt;span class="pl-ent"&gt;concurrency&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;group&lt;/span&gt;: &lt;span class="pl-s"&gt;pages&lt;/span&gt;
  &lt;span class="pl-ent"&gt;cancel-in-progress&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;

&lt;span class="pl-ent"&gt;jobs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;build&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Checkout&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;fetch-depth&lt;/span&gt;: &lt;span class="pl-c1"&gt;0&lt;/span&gt;  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Need full history for git log in gather_links.py&lt;/span&gt;

      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Setup Python&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="pl-ent"&gt;with&lt;/span&gt;:
          &lt;span class="pl-ent"&gt;python-version&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;3.12&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          python -m pip install --upgrade pip&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Generate links and build colophon&lt;/span&gt;
        &lt;span class="pl-ent"&gt;run&lt;/span&gt;: &lt;span class="pl-s"&gt;|&lt;/span&gt;
&lt;span class="pl-s"&gt;          python gather_links.py&lt;/span&gt;
&lt;span class="pl-s"&gt;          python build_colophon.py&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Setup Pages&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/configure-pages@v4&lt;/span&gt;

      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Build with Jekyll&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/jekyll-build-pages@v1&lt;/span&gt;

      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Upload artifact&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/upload-pages-artifact@v3&lt;/span&gt;

  &lt;span class="pl-ent"&gt;deploy&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;environment&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;github-pages&lt;/span&gt;
      &lt;span class="pl-ent"&gt;url&lt;/span&gt;: &lt;span class="pl-s"&gt;${{ steps.deployment.outputs.page_url }}&lt;/span&gt;
    &lt;span class="pl-ent"&gt;runs-on&lt;/span&gt;: &lt;span class="pl-s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="pl-ent"&gt;needs&lt;/span&gt;: &lt;span class="pl-s"&gt;build&lt;/span&gt;
    &lt;span class="pl-ent"&gt;steps&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;Deploy to GitHub Pages&lt;/span&gt;
        &lt;span class="pl-ent"&gt;id&lt;/span&gt;: &lt;span class="pl-s"&gt;deployment&lt;/span&gt;
        &lt;span class="pl-ent"&gt;uses&lt;/span&gt;: &lt;span class="pl-s"&gt;actions/deploy-pages@v4&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I guess that's right? I appreciated that comment about "Need full history for git log in gather_links.py" - that's something I could easily have missed.&lt;/p&gt;
&lt;p&gt;Then it said it wanted to add those files to &lt;code&gt;.gitignore&lt;/code&gt; - sounds like a good idea.&lt;/p&gt;
&lt;p&gt;It spat out this summary of its actions so far:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I've created a GitHub Actions workflow for GitHub Pages that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Creates a .github/workflows/pages.yml file that:
&lt;ul&gt;
&lt;li&gt;Runs on pushes to main branch&lt;/li&gt;
&lt;li&gt;Checkouts the full git history (needed for gather_links.py)&lt;/li&gt;
&lt;li&gt;Sets up Python 3.12&lt;/li&gt;
&lt;li&gt;Runs both scripts in sequence to generate the files&lt;/li&gt;
&lt;li&gt;Builds and deploys to GitHub Pages&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Created .gitignore file to prevent the generated files from being committed&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now when you push to main, GitHub Actions will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the gather_links.py script to create gathered_links.json&lt;/li&gt;
&lt;li&gt;Run build_colophon.py to create colophon.html&lt;/li&gt;
&lt;li&gt;Deploy both files (and all other files) to GitHub Pages&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I checked to see how much it had cost:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; /cost 
  ⎿  Total cost: $0.1788
     Total duration (API): 44.6s
     Total duration (wall): 10m 18.6s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So about 18 cents and 45 seconds of Claude API time. (I got distracted, hence the 10m of wall clock time.) Here's the &lt;a href="https://gist.github.com/simonw/a560b07eef577e6183021d1ccaae7e07"&gt;full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The code didn't look like it would irreversibly break anything, so I pushed it to GitHub to see what would happen.&lt;/p&gt;
&lt;p&gt;... and it worked! My new &lt;a href="https://tools.simonwillison.net/colophon"&gt;colophon page&lt;/a&gt; was live.&lt;/p&gt;
&lt;p&gt;There's a catch. I watched the &lt;a href="https://github.com/simonw/tools/actions"&gt;GitHub Actions&lt;/a&gt; interface while it was running and something didn't look right:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/github-actions-colophon.jpg" alt="GitHub Actions interface showing three completed actions. Test for Custom pages workflow for colophon,2 Deploy for that same name and another one called pages-build-deployment." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I was expecting that "Test" job, but why were there two separate deploys?&lt;/p&gt;
&lt;p&gt;I had a hunch that the previous, default Jekyll deploy was still configured and had run at the same time as the new one - and it was pure timing luck that the new workflow finished later and overwrote the result of the original.&lt;/p&gt;
&lt;p&gt;It was time to ditch the LLMs and read some documentation!&lt;/p&gt;
&lt;p&gt;I found this page on &lt;a href="https://docs.github.com/en/pages/getting-started-with-github-pages/using-custom-workflows-with-github-pages"&gt;Using custom workflows with GitHub Pages&lt;/a&gt; but it didn't tell me what I needed to know.&lt;/p&gt;
&lt;p&gt;On another hunch I checked the GitHub Pages settings interface for my repo and found this option:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/github-pages-settings.jpg" alt="GitHub Pages UI - shows your site is live at tools.simonwillison.net, deployed 7 minutes ago. - then under Buyld and deployment a source menu shows options for GitHub Actions or for Deploy from a branch (selected)" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;My repo was set to "Deploy from a branch", so I switched that over to "GitHub Actions".&lt;/p&gt;
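&lt;p&gt;If you stay on a custom workflow, GitHub's standard Pages starter workflows also guard against exactly this kind of race with a &lt;code&gt;concurrency&lt;/code&gt; block. Here's a sketch of the usual top-level addition to &lt;code&gt;pages.yml&lt;/code&gt; - standard GitHub Actions syntax, not something from my workflow:&lt;/p&gt;

```yaml
# Allow only one Pages deployment at a time. Runs queued behind an
# in-progress production deploy wait rather than cancelling it mid-flight.
concurrency:
  group: "pages"
  cancel-in-progress: false
```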
&lt;p&gt;I manually updated my &lt;code&gt;README.md&lt;/code&gt; to add a link to the new Colophon page in &lt;a href="https://github.com/simonw/tools/commit/4ee15aaad8e9a412505210a30f485528cb3c0390"&gt;this commit&lt;/a&gt;, which triggered another build.&lt;/p&gt;
&lt;p&gt;This time only two jobs ran, and the end result was the correctly deployed site:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2025/github-actions-colophon-2.jpg" alt="Only two in-progress workflows now, one is the Test one and the other is the Deploy to GitHub Pages one." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;(I later spotted another bug - some of the links inadvertently included &lt;code&gt;&amp;lt;br&amp;gt;&lt;/code&gt; tags in their &lt;code&gt;href=&lt;/code&gt;, which I &lt;a href="https://github.com/simonw/tools/commit/87e2577983f11fc9c7bf7b7a268cf2404a21e1c5"&gt;fixed&lt;/a&gt; with another &lt;a href="https://gist.github.com/simonw/d5ccbca1b530868980609222790a97cb"&gt;11 cent Claude Code session&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: I improved the colophon further by &lt;a href="https://simonwillison.net/2025/Mar/13/tools-colophon/"&gt;adding AI-generated descriptions of the tools&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="be-ready-for-the-human-to-take-over"&gt;Be ready for the human to take over&lt;/h4&gt;
&lt;p&gt;I got lucky with this example because it helped illustrate my final point: expect to need to take over.&lt;/p&gt;
&lt;p&gt;LLMs are no replacement for human intuition and experience. I've spent enough time with GitHub Actions that I know what kind of things to look for, and in this case it was faster for me to step in and finish the project rather than keep on trying to get there with prompts.&lt;/p&gt;
&lt;h4 id="the-biggest-advantage-is-speed-of-development"&gt;The biggest advantage is speed of development&lt;/h4&gt;
&lt;p&gt;My new &lt;a href="https://tools.simonwillison.net/colophon"&gt;colophon page&lt;/a&gt; took me just under half an hour from conception to finished, deployed feature.&lt;/p&gt;
&lt;p&gt;I'm certain it would have taken me significantly longer without LLM assistance - to the point that I probably wouldn't have bothered to build it at all.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This&lt;/em&gt; is why I care so much about the productivity boost I get from LLMs: it's not about getting work done faster, it's about being able to ship projects that I wouldn't have been able to justify spending time on at all.&lt;/p&gt;
&lt;p&gt;I wrote about this in March 2023: &lt;a href="https://simonwillison.net/2023/Mar/27/ai-enhanced-development/"&gt;AI-enhanced development makes me more ambitious with my projects&lt;/a&gt;. Two years later that effect shows no sign of wearing off.&lt;/p&gt;
&lt;p&gt;It's also a great way to accelerate learning new things - today that was how to customize my GitHub Pages builds using Actions, which is something I'll certainly use again in the future.&lt;/p&gt;
&lt;p&gt;The fact that LLMs let me execute my ideas faster means I can implement more of them, which means I can learn even more.&lt;/p&gt;
&lt;h4 id="llms-amplify-existing-expertise"&gt;LLMs amplify existing expertise&lt;/h4&gt;
&lt;p&gt;Could anyone else have done this project in the same way? Probably not! My prompting here leaned on 25+ years of professional coding experience, including my previous explorations of GitHub Actions, GitHub Pages, GitHub itself and the LLM tools I put into play.&lt;/p&gt;
&lt;p&gt;I also &lt;em&gt;knew&lt;/em&gt; that this was going to work. I've spent enough time working with these tools that I was confident that assembling a new HTML page with information pulled from my Git history was entirely within the capabilities of a good LLM.&lt;/p&gt;
&lt;p&gt;My prompts reflected that - there was nothing particularly novel here, so I dictated the design, tested the results as it was working and occasionally nudged it to fix a bug.&lt;/p&gt;
&lt;p&gt;If I were trying to build a Linux kernel driver - a field I know virtually nothing about - my process would be entirely different.&lt;/p&gt;
&lt;h4 id="bonus-answering-questions-about-codebases"&gt;Bonus: answering questions about codebases&lt;/h4&gt;
&lt;p&gt;If the idea of using LLMs to write code for you still feels deeply unappealing, there's another use-case for them which you may find more compelling.&lt;/p&gt;
&lt;p&gt;Good LLMs are &lt;em&gt;great&lt;/em&gt; at answering questions about code.&lt;/p&gt;
&lt;p&gt;This is also very low stakes: the worst that can happen is they might get something wrong, which may take you a tiny bit longer to figure out. It's still likely to save you time compared to digging through thousands of lines of code entirely by yourself.&lt;/p&gt;
&lt;p&gt;The trick here is to dump the code into a long context model and start asking questions. My current favorite for this is the catchily titled &lt;code&gt;gemini-2.0-pro-exp-02-05&lt;/code&gt;, a preview of Google's Gemini 2.0 Pro which is currently free to use via their API.&lt;/p&gt;
&lt;p&gt;I used this trick just &lt;a href="https://simonwillison.net/2025/Mar/6/monolith/"&gt;the other day&lt;/a&gt;. I was trying out a new-to-me tool called &lt;a href="https://github.com/Y2Z/monolith"&gt;monolith&lt;/a&gt;, a CLI tool written in Rust which downloads a web page and all of its dependent assets (CSS, images etc) and bundles them together into a single archived file.&lt;/p&gt;
&lt;p&gt;I was curious as to how it worked, so I cloned it into my temporary directory and ran these commands:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
git clone https://github.com/Y2Z/monolith
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; monolith

files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m gemini-2.0-pro-exp-02-05 \
  -s &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;architectural overview as markdown&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I'm using my own &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tool (built for me by Claude 3 Opus &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;last year&lt;/a&gt;) here to gather the contents of all of the files in the repo into a single stream. Then I pipe that into my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool and tell it (via the &lt;a href="https://github.com/simonw/llm-gemini"&gt;llm-gemini&lt;/a&gt; plugin) to prompt Gemini 2.0 Pro with a system prompt of "architectural overview as markdown".&lt;/p&gt;
&lt;p&gt;This gave me back a &lt;a href="https://gist.github.com/simonw/2c80749935ae3339d6f7175dc7cf325b"&gt;detailed document&lt;/a&gt; describing how the tool works - which source files do what and, crucially, which Rust crates it was using. I learned that it used &lt;code&gt;reqwest&lt;/code&gt;, &lt;code&gt;html5ever&lt;/code&gt;, &lt;code&gt;markup5ever_rcdom&lt;/code&gt; and &lt;code&gt;cssparser&lt;/code&gt; and that it doesn't evaluate JavaScript at all, an important limitation.&lt;/p&gt;
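&lt;p&gt;The core idea behind gathering a repo into one stream is simple enough to sketch in a few lines of shell - wrap each file in a tag that records its path, so the model can attribute content to files. (This is an illustration of the idea only, not the tool's actual cxml output format.)&lt;/p&gt;

```shell
# Toy version of the files-to-prompt idea: one stream, each file wrapped
# in a tag carrying its path. The tag names here are illustrative only.
demo=$(mktemp -d)
printf 'fn main() {}\n' > "$demo/main.rs"
printf '[package]\nname = "demo"\n' > "$demo/Cargo.toml"
for f in "$demo"/*; do
  printf '<document path="%s">\n' "$f"
  cat "$f"
  printf '</document>\n'
done
```

&lt;p&gt;Piping a stream like that into a long context model is all the trick really amounts to.&lt;/p&gt;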
&lt;p&gt;I use this trick several times a week. It's a great way to start diving into a new codebase - and often the alternative isn't spending more time on this, it's failing to satisfy my curiosity at all.&lt;/p&gt;
&lt;p&gt;I included three more examples in &lt;a href="https://simonwillison.net/2025/Feb/14/files-to-prompt/"&gt;this recent post&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-artifacts"&gt;claude-artifacts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vibe-coding"&gt;vibe-coding&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-code"&gt;claude-code&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-to-app"&gt;prompt-to-app&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="tools"/><category term="ai"/><category term="github-actions"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="anthropic"/><category term="claude"/><category term="gemini"/><category term="claude-artifacts"/><category term="vibe-coding"/><category term="files-to-prompt"/><category term="coding-agents"/><category term="claude-code"/><category term="prompt-to-app"/></entry><entry><title>monolith</title><link href="https://simonwillison.net/2025/Mar/6/monolith/#atom-tag" rel="alternate"/><published>2025-03-06T15:37:48+00:00</published><updated>2025-03-06T15:37:48+00:00</updated><id>https://simonwillison.net/2025/Mar/6/monolith/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Y2Z/monolith"&gt;monolith&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Neat CLI tool built in Rust that can create a single packaged HTML file of a web page plus all of its dependencies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cargo install monolith # or brew install
monolith https://simonwillison.net/ &amp;gt; simonwillison.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That command produced &lt;a href="https://static.simonwillison.net/static/2025/simonwillison.html"&gt;this 1.5MB single file result&lt;/a&gt;. All of the linked images, CSS and JavaScript assets have had their contents inlined into base64 URIs in their &lt;code&gt;src=&lt;/code&gt; and &lt;code&gt;href=&lt;/code&gt; attributes.&lt;/p&gt;
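&lt;p&gt;The inlining transformation itself is easy to sketch in shell - this toy example shows the general idea of turning an asset into a &lt;code&gt;data:&lt;/code&gt; URI, not monolith's implementation:&lt;/p&gt;

```shell
# Encode a CSS file as a base64 data: URI, the same transformation that
# gets applied to each inlined asset (toy example, not monolith's code).
css=$(mktemp)
printf 'body { color: red; }' > "$css"
printf 'data:text/css;base64,%s\n' "$(base64 < "$css" | tr -d '\n')"
# prints data:text/css;base64,Ym9keSB7IGNvbG9yOiByZWQ7IH0=
```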
&lt;p&gt;I was intrigued as to how it works, so I dumped the whole repository into Gemini 2.0 Pro and asked for an architectural summary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/Y2Z/monolith
cd monolith
files-to-prompt . -c | llm -m gemini-2.0-pro-exp-02-05 \
  -s 'architectural overview as markdown'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/2c80749935ae3339d6f7175dc7cf325b"&gt;what I got&lt;/a&gt;. Short version: it uses the &lt;code&gt;reqwest&lt;/code&gt;, &lt;code&gt;html5ever&lt;/code&gt;, &lt;code&gt;markup5ever_rcdom&lt;/code&gt; and &lt;code&gt;cssparser&lt;/code&gt; crates to fetch and parse HTML and CSS and extract, combine and rewrite the assets. It doesn't currently attempt to run any JavaScript.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42933383#42935115"&gt;Comment on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="scraping"/><category term="ai"/><category term="rust"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="files-to-prompt"/></entry><entry><title>files-to-prompt 0.6</title><link href="https://simonwillison.net/2025/Feb/19/files-to-prompt/#atom-tag" rel="alternate"/><published>2025-02-19T06:12:12+00:00</published><updated>2025-02-19T06:12:12+00:00</updated><id>https://simonwillison.net/2025/Feb/19/files-to-prompt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/files-to-prompt/releases/tag/0.6"&gt;files-to-prompt 0.6&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my CLI tool for turning a whole directory of code into a single prompt ready to pipe or paste into an LLM.&lt;/p&gt;
&lt;p&gt;Here are the full release notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;&lt;li&gt;New &lt;code&gt;-m/--markdown&lt;/code&gt; option for outputting results as Markdown with each file in a fenced code block. &lt;a href="https://github.com/simonw/files-to-prompt/issues/42"&gt;#42&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Support for reading a list of files from standard input. Thanks, &lt;a href="https://github.com/thelastnode"&gt;Ankit Shankar&lt;/a&gt;. &lt;a href="https://github.com/simonw/files-to-prompt/issues/44"&gt;#44&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;p&gt;Here's how to process just files modified within the last day:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;find . -mtime -1 | files-to-prompt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also use the &lt;code&gt;-0/--null&lt;/code&gt; flag to accept lists of file paths separated by null delimiters, which is useful for handling file names with spaces in them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;find . -name "*.txt" -print0 | files-to-prompt -0
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;I also have a potential fix for a reported bug concerning nested &lt;code&gt;.gitignore&lt;/code&gt; files that's currently &lt;a href="https://github.com/simonw/files-to-prompt/pull/45"&gt;sitting in a PR&lt;/a&gt;. I'm waiting for someone else to confirm that it behaves as they would expect. I've left &lt;a href="https://github.com/simonw/files-to-prompt/issues/40#issuecomment-2667571418"&gt;details in this issue comment&lt;/a&gt;, but the short version is that you can try out the version from the PR using this &lt;code&gt;uvx&lt;/code&gt; incantation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uvx --with git+https://github.com/simonw/files-to-prompt@nested-gitignore files-to-prompt
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="annotated-release-notes"/><category term="llms"/><category term="uv"/><category term="files-to-prompt"/></entry><entry><title>files-to-prompt 0.5</title><link href="https://simonwillison.net/2025/Feb/14/files-to-prompt/#atom-tag" rel="alternate"/><published>2025-02-14T04:14:21+00:00</published><updated>2025-02-14T04:14:21+00:00</updated><id>https://simonwillison.net/2025/Feb/14/files-to-prompt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/files-to-prompt/releases/tag/0.5"&gt;files-to-prompt 0.5&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
My &lt;code&gt;files-to-prompt&lt;/code&gt; tool (&lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;originally built using Claude 3 Opus back in April&lt;/a&gt;) had been accumulating a bunch of issues and PRs - I finally got around to spending some time with it and pushed a fresh release:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;New &lt;code&gt;-n/--line-numbers&lt;/code&gt; flag for including line numbers in the output. Thanks, &lt;a href="https://github.com/danclaytondev"&gt;Dan Clayton&lt;/a&gt;. &lt;a href="https://github.com/simonw/files-to-prompt/pull/38"&gt;#38&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fix for utf-8 handling on Windows. Thanks, &lt;a href="https://github.com/david-jarman"&gt;David Jarman&lt;/a&gt;. &lt;a href="https://github.com/simonw/files-to-prompt/pull/36"&gt;#36&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--ignore&lt;/code&gt; patterns are now matched against directory names as well as file names, unless you pass the new &lt;code&gt;--ignore-files-only&lt;/code&gt; flag. Thanks, &lt;a href="https://github.com/nmpowell"&gt;Nick Powell&lt;/a&gt;. &lt;a href="https://github.com/simonw/files-to-prompt/pull/30"&gt;#30&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I use this tool myself on an almost daily basis - it's fantastic for quickly answering questions about code. Recently I've been plugging it into Gemini 2.0 with its 2 million token context length, running recipes like this one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/bytecodealliance/componentize-py
cd componentize-py
files-to-prompt . -c | llm -m gemini-2.0-pro-exp-02-05 \
  -s 'How does this work? Does it include a python compiler or AST trick of some sort?'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I ran that question against the &lt;a href="https://github.com/bytecodealliance/componentize-py"&gt;bytecodealliance/componentize-py&lt;/a&gt; repo - which provides a tool for turning Python code into compiled WASM - and got &lt;a href="https://gist.github.com/simonw/a9d72e7f903417fb49e1d7a531ee8f97"&gt;this really useful answer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here's another example. I decided to have o3-mini review how Datasette handles concurrent SQLite connections from async Python code - so I ran this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/simonw/datasette
cd datasette/datasette
files-to-prompt database.py utils/__init__.py -c | \
  llm -m o3-mini -o reasoning_effort high \
  -s 'Output in markdown a detailed analysis of how this code handles the challenge of running SQLite queries from a Python asyncio application. Explain how it works in the first section, then explore the pros and cons of this design. In a final section propose alternative mechanisms that might work better.'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/76c8c433f4a65cf01a5c9121453683ab"&gt;the result&lt;/a&gt;. It did an extremely good job of explaining how my code works - despite being fed just the Python and none of the other documentation. Then it made some solid recommendations for potential alternatives.&lt;/p&gt;
&lt;p&gt;I added a couple of follow-up questions (using &lt;code&gt;llm -c&lt;/code&gt;) which resulted in &lt;a href="https://gist.github.com/simonw/76c8c433f4a65cf01a5c9121453683ab?permalink_comment_id=5438685#gistcomment-5438685"&gt;a full working prototype&lt;/a&gt; of an alternative threadpool mechanism, plus &lt;a href="https://gist.github.com/simonw/76c8c433f4a65cf01a5c9121453683ab?permalink_comment_id=5438691#gistcomment-5438691"&gt;some benchmarks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One final example: I decided to see if there were any undocumented features in &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt;, so I checked out the repo and ran a prompt against just the &lt;code&gt;.go&lt;/code&gt; files in that project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;git clone https://github.com/benbjohnson/litestream
cd litestream
files-to-prompt . -e go -c | llm -m o3-mini \
  -s 'Write extensive user documentation for this project in markdown'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, o3-mini provided a &lt;a href="https://gist.github.com/simonw/cbf339032f99fee72af5fd5455bc7235"&gt;really impressively detailed&lt;/a&gt; set of unofficial documentation derived purely from reading the source.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/async"&gt;async&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/litestream"&gt;litestream&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="async"/><category term="projects"/><category term="python"/><category term="sqlite"/><category term="ai"/><category term="datasette"/><category term="webassembly"/><category term="litestream"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="gemini"/><category term="llm-reasoning"/><category term="files-to-prompt"/></entry><entry><title>Gemini 2.0 is now available to everyone</title><link href="https://simonwillison.net/2025/Feb/5/gemini-2/#atom-tag" rel="alternate"/><published>2025-02-05T16:37:29+00:00</published><updated>2025-02-05T16:37:29+00:00</updated><id>https://simonwillison.net/2025/Feb/5/gemini-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/"&gt;Gemini 2.0 is now available to everyone&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Big new Gemini 2.0 releases today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gemini 2.0 Pro (Experimental)&lt;/strong&gt; is Google's "best model yet for coding performance and complex prompts" - currently available as a free preview.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; is now generally available.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gemini 2.0 Flash-Lite&lt;/strong&gt; looks particularly interesting:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We’ve gotten a lot of positive feedback on the price and speed of 1.5 Flash. We wanted to keep improving quality, while still maintaining cost and speed. So today, we’re introducing 2.0 Flash-Lite, a new model that has better quality than 1.5 Flash, at the same speed and cost. It outperforms 1.5 Flash on the majority of benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That means Gemini 2.0 Flash-Lite is priced at 7.5c/million input tokens and 30c/million output tokens - half the price of OpenAI's GPT-4o mini (15c/60c).&lt;/p&gt;
&lt;p&gt;Gemini 2.0 Flash isn't &lt;a href="https://ai.google.dev/pricing#2_0flash"&gt;much more expensive&lt;/a&gt;:  10c/million for text/image input, 70c/million for audio input, 40c/million for output. Again, cheaper than GPT-4o mini.&lt;/p&gt;
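&lt;p&gt;To make those rates concrete, here's a quick back-of-envelope comparison for a hypothetical call with 100,000 input tokens and 10,000 output tokens - my arithmetic, using the prices quoted above:&lt;/p&gt;

```shell
# Cost of a 100k-input / 10k-output call at the quoted per-million rates.
awk 'BEGIN {
  printf "flash-lite:  $%.4f\n", (100000*0.075 + 10000*0.30)/1e6
  printf "gpt-4o-mini: $%.4f\n", (100000*0.15  + 10000*0.60)/1e6
}'
# prints flash-lite:  $0.0105
#        gpt-4o-mini: $0.0210
```

&lt;p&gt;Exactly half the price, as advertised.&lt;/p&gt;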
&lt;p&gt;I pushed a new &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plugin release, &lt;a href="https://github.com/simonw/llm-gemini/releases/tag/0.10"&gt;llm-gemini 0.10&lt;/a&gt;, adding support for the three new models:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install -U llm-gemini
llm keys set gemini
# paste API key here
llm -m gemini-2.0-flash "impress me"
llm -m gemini-2.0-flash-lite-preview-02-05 "impress me"
llm -m gemini-2.0-pro-exp-02-05 "impress me"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/411d032a84f02e8300b1f48df54913b7"&gt;the output&lt;/a&gt; for those three prompts.&lt;/p&gt;
&lt;p&gt;I ran &lt;code&gt;Generate an SVG of a pelican riding a bicycle&lt;/code&gt; through the three new models. Here are the results, cheapest to most expensive:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;gemini-2.0-flash-lite-preview-02-05&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="This is not great. The bicycle is a trapezoid. The pelican is very warped and has a orange diamond beak above its head." src="https://static.simonwillison.net/static/2025/gemini-2.0-flash-lite-preview-02-05.svg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;gemini-2.0-flash&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="The bicycle is better but the pelican is yellow and looks more like a baby chick. Its beak is squashed against the side of the image." src="https://static.simonwillison.net/static/2025/gemini-2.0-flash.svg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;gemini-2.0-pro-exp-02-05&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="This one is pleasingly avant-garde. The bicycle does at least have two wheels joined by a frame. The pelican is a fun shape, and it has a beak with a curved orange top and a curved yellow bottom." src="https://static.simonwillison.net/static/2025/gemini-2.0-pro-exp-02-05.svg" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/simonw/e50eea6d4f746fb3bc5543d106ffa8e7"&gt;Full transcripts here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I also ran the same prompt I &lt;a href="https://simonwillison.net/2025/Feb/5/o3-mini-documentation/"&gt;tried with o3-mini the other day&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/simonw/datasette
cd datasette
files-to-prompt datasette -e py -c | \
  llm -m gemini-2.0-pro-exp-02-05 \
  -s 'write extensive documentation for how the permissions system works, as markdown' \
  -o max_output_tokens 10000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/e20697b792dc62843f8ab5c25bfab15b"&gt;the result from that&lt;/a&gt; - you can compare that to &lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3"&gt;o3-mini's result here&lt;/a&gt;.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-pricing"&gt;llm-pricing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pelican-riding-a-bicycle"&gt;pelican-riding-a-bicycle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="gemini"/><category term="llm-pricing"/><category term="pelican-riding-a-bicycle"/><category term="llm-release"/><category term="files-to-prompt"/></entry><entry><title>o3-mini is really good at writing internal documentation</title><link href="https://simonwillison.net/2025/Feb/5/o3-mini-documentation/#atom-tag" rel="alternate"/><published>2025-02-05T06:07:40+00:00</published><updated>2025-02-05T06:07:40+00:00</updated><id>https://simonwillison.net/2025/Feb/5/o3-mini-documentation/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3"&gt;o3-mini is really good at writing internal documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wanted to refresh my knowledge of how the Datasette permissions system works today. I already have &lt;a href="https://docs.datasette.io/en/latest/authentication.html"&gt;extensive hand-written documentation&lt;/a&gt; for that, but I thought it would be interesting to see if I could derive any insights from running an LLM against the codebase.&lt;/p&gt;
&lt;p&gt;o3-mini has an input limit of 200,000 tokens. I used &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and my &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tool to generate the documentation like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
git clone https://github.com/simonw/datasette
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; datasette
files-to-prompt datasette -e py -c &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm -m o3-mini -s \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;write extensive documentation for how the permissions system works, as markdown&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;files-to-prompt&lt;/code&gt; command is fed the &lt;a href="https://github.com/simonw/datasette/tree/main/datasette"&gt;datasette&lt;/a&gt; subdirectory, which contains just the source code for the application - omitting tests (in &lt;code&gt;tests/&lt;/code&gt;) and documentation (in &lt;code&gt;docs/&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-e py&lt;/code&gt; option causes it to only include files with a &lt;code&gt;.py&lt;/code&gt; extension - skipping all of the HTML and JavaScript files in that hierarchy.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; option causes it to output Claude's XML-ish format - a format that works great with other LLMs too.&lt;/p&gt;
&lt;p&gt;You can see the output of that command &lt;a href="https://gist.github.com/simonw/1922544763b08c76f0b904e2ece364ea"&gt;in this Gist&lt;/a&gt;.&lt;/p&gt;
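&lt;p&gt;For reference, the Claude-style wrapping that &lt;code&gt;-c&lt;/code&gt; produces looks roughly like this. Here's a minimal sketch of my own that mimics it (the file path and contents are hypothetical, and the real tool handles far more: ignore rules, encodings, line numbers and so on):&lt;/p&gt;

```python
# Rough sketch of the Claude-style "XML-ish" document wrapping
# produced by files-to-prompt -c. Hypothetical file contents;
# the real tool handles many more edge cases.
def wrap_cxml(files: dict[str, str]) -> str:
    parts = ["<documents>"]
    for index, (path, contents) in enumerate(files.items(), start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{path}</source>")
        parts.append("<document_contents>")
        parts.append(contents)
        parts.append("</document_contents>")
        parts.append("</document>")
    parts.append("</documents>")
    return "\n".join(parts)

print(wrap_cxml({"datasette/app.py": "# application code here"}))
```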
&lt;p&gt;Then I pipe that result into LLM, requesting the &lt;code&gt;o3-mini&lt;/code&gt; OpenAI model and passing the following system prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;write extensive documentation for how the permissions system works, as markdown&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Specifically requesting Markdown &lt;a href="https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/"&gt;is important&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The prompt used 99,348 input tokens and produced 3,118 output tokens (320 of those were invisible reasoning tokens). That's &lt;a href="https://tools.simonwillison.net/llm-prices"&gt;a cost&lt;/a&gt; of 12.3 cents.&lt;/p&gt;
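&lt;p&gt;That figure is easy to sanity-check. A back-of-the-envelope sketch, assuming o3-mini's published pricing at the time ($1.10 per million input tokens, $4.40 per million output tokens, with reasoning tokens billed as output - check current rates before relying on these numbers):&lt;/p&gt;

```python
# Sanity-check the 12.3 cent figure from the token counts.
# Pricing is an assumption: o3-mini's published rates at the time.
INPUT_PER_M = 1.10   # USD per million input tokens (assumed)
OUTPUT_PER_M = 4.40  # USD per million output tokens (assumed)

input_tokens = 99_348
output_tokens = 3_118  # includes the 320 invisible reasoning tokens

cost_usd = (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000
print(f"{cost_usd * 100:.1f} cents")  # → 12.3 cents
```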
&lt;p&gt;Honestly, &lt;a href="https://gist.github.com/simonw/4a13c4b10176d7b8e3d1260f5dcc9de3"&gt;the results&lt;/a&gt; are fantastic. I had to double-check that I hadn't accidentally fed in the documentation by mistake.&lt;/p&gt;
&lt;p&gt;(It's possible that the model is picking up additional information about Datasette in its training set, but I've seen similar &lt;a href="https://gist.github.com/simonw/adf64108d65cd5c10ac9fce953ab437e"&gt;high quality results&lt;/a&gt; from other, newer libraries so I don't think that's a significant factor.)&lt;/p&gt;
&lt;p&gt;In this case I already had extensive written documentation of my own, but this was still a useful refresher to help confirm that the code matched my mental model of how everything works.&lt;/p&gt;
&lt;p&gt;Documentation of project internals as a category is notorious for going out of date. Having tricks like this to derive usable how-it-works documentation from existing codebases in just a few seconds and at a cost of a few cents is wildly valuable.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="llm-reasoning"/><category term="o3"/><category term="files-to-prompt"/></entry><entry><title>OpenAI reasoning models: Advice on prompting</title><link href="https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/#atom-tag" rel="alternate"/><published>2025-02-02T20:56:27+00:00</published><updated>2025-02-02T20:56:27+00:00</updated><id>https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/guides/reasoning#advice-on-prompting"&gt;OpenAI reasoning models: Advice on prompting&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OpenAI's documentation for their o1 and o3 "reasoning models" includes some interesting tips on how to best prompt them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Developer messages are the new system messages:&lt;/strong&gt; Starting with &lt;code&gt;o1-2024-12-17&lt;/code&gt;, reasoning models support &lt;code&gt;developer&lt;/code&gt; messages rather than &lt;code&gt;system&lt;/code&gt; messages, to align with the &lt;a href="https://cdn.openai.com/spec/model-spec-2024-05-08.html#follow-the-chain-of-command"&gt;chain of command behavior described in the model spec&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This appears to be a purely aesthetic change made for consistency with their &lt;a href="https://simonwillison.net/2024/Apr/23/the-instruction-hierarchy/"&gt;instruction hierarchy&lt;/a&gt; concept. As far as I can tell the old &lt;code&gt;system&lt;/code&gt; prompts continue to work exactly as before - you're encouraged to use the new &lt;code&gt;developer&lt;/code&gt; message type but it has no impact on what actually happens.&lt;/p&gt;
&lt;p&gt;Since my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool already bakes in a &lt;code&gt;llm --system "system prompt"&lt;/code&gt; option which works across multiple different models from different providers I'm not going to rush to adopt this new language!&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use delimiters for clarity:&lt;/strong&gt; Use delimiters like markdown, XML tags, and section titles to clearly indicate distinct parts of the input, helping the model interpret different sections appropriately.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anthropic have been encouraging &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags"&gt;XML-ish delimiters&lt;/a&gt; for a while (I say -ish because there's no requirement that the resulting prompt is valid XML). My &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tool has a &lt;code&gt;-c&lt;/code&gt; option which outputs Claude-style XML, and in my experiments this same option works great with o1 and o3 too:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/tursodatabase/limbo
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; limbo/bindings/python

files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m o3-mini \
  -o reasoning_effort high \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a detailed README with extensive usage examples&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limit additional context in retrieval-augmented generation (RAG):&lt;/strong&gt; When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes me think that o1/o3 are not good models to implement RAG on at all - with RAG I like to be able to dump as much extra context into the prompt as possible and leave it to the models to figure out what's relevant.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Try zero shot first, then few shot if needed:&lt;/strong&gt; Reasoning models often don't need few-shot examples to produce good results, so try to write prompts without examples first. If you have more complex requirements for your desired output, it may help to include a few examples of inputs and desired outputs in your prompt. Just ensure that the examples align very closely with your prompt instructions, as discrepancies between the two may produce poor results.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Providing examples remains the single most powerful prompting tip I know, so it's interesting to see advice here to only switch to examples if zero-shot doesn't work out.&lt;/p&gt;
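&lt;p&gt;If you do end up needing examples, the mechanics are simple: concatenate input/output pairs ahead of the real input. A minimal, library-free sketch (the &lt;code&gt;Input:&lt;/code&gt;/&lt;code&gt;Output:&lt;/code&gt; labels are my own arbitrary choice, not anything from OpenAI's guide):&lt;/p&gt;

```python
# Minimal few-shot prompt builder: example pairs first, real input last.
# The "Input:"/"Output:" labels are an arbitrary convention, not an API requirement.
def few_shot_prompt(examples: list[tuple[str, str]], task_input: str) -> str:
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    blocks.append(f"Input: {task_input}\nOutput:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("2 + 2", "4"), ("3 * 5", "15")],
    "7 - 4",
)
print(prompt)
```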
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be very specific about your end goal:&lt;/strong&gt; In your instructions, try to give very specific parameters for a successful response, and encourage the model to keep reasoning and iterating until it matches your success criteria.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This makes sense: reasoning models "think" until they reach a conclusion, so making the goal as unambiguous as possible leads to better results.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Markdown formatting:&lt;/strong&gt; Starting with &lt;code&gt;o1-2024-12-17&lt;/code&gt;, reasoning models in the API will avoid generating responses with markdown formatting. To signal to the model when you &lt;strong&gt;do&lt;/strong&gt; want markdown formatting in the response, include the string &lt;code&gt;Formatting re-enabled&lt;/code&gt; on the first line of your &lt;code&gt;developer&lt;/code&gt; message.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This one was a &lt;em&gt;real shock&lt;/em&gt; to me! I noticed that o3-mini was outputting &lt;code&gt;•&lt;/code&gt; characters instead of Markdown &lt;code&gt;*&lt;/code&gt; bullets and initially thought &lt;a href="https://twitter.com/simonw/status/1886121477822648441"&gt;that was a bug&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I first saw this while running this prompt against &lt;a href="https://github.com/tursodatabase/limbo/tree/main/bindings/python"&gt;limbo/bindings/python&lt;/a&gt; using &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/tursodatabase/limbo
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; limbo/bindings/python

files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m o3-mini \
  -o reasoning_effort high \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Write a detailed README with extensive usage examples&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's the &lt;a href="https://gist.github.com/simonw/f8283d68e9bd7ad3f140d52cad6874a7"&gt;full result&lt;/a&gt;, which includes text like this (note the weird bullets):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Features
--------
• High‑performance, in‑process database engine written in Rust  
• SQLite‑compatible SQL interface  
• Standard Python DB‑API 2.0–style connection and cursor objects
&lt;/code&gt;&lt;/pre&gt;
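&lt;p&gt;Output like that can also be patched up mechanically after the fact. A quick cleanup hack of my own (not part of files-to-prompt or LLM) that rewrites &lt;code&gt;•&lt;/code&gt; bullets as valid Markdown list items:&lt;/p&gt;

```python
import re

# Rewrite "•" bullets as Markdown "-" list items, preserving indentation.
# A quick one-off cleanup hack, not part of any tool mentioned here.
def fix_bullets(text: str) -> str:
    return "\n".join(
        re.sub(r"^(\s*)•\s*", r"\1- ", line) for line in text.splitlines()
    )

print(fix_bullets("Features\n--------\n• Fast\n• SQLite-compatible"))
```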

&lt;p&gt;I ran it again with this modified prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Formatting re-enabled. Write a detailed README with extensive usage examples.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And this time got back &lt;a href="https://gist.github.com/simonw/adf64108d65cd5c10ac9fce953ab437e"&gt;proper Markdown, rendered in this Gist&lt;/a&gt;. That did a really good job, and included bulleted lists using this valid Markdown syntax instead:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make test&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Run tests using pytest.
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make lint&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Run linters (via &lt;span class="pl-s"&gt;[&lt;/span&gt;ruff&lt;span class="pl-s"&gt;]&lt;/span&gt;&lt;span class="pl-s"&gt;(&lt;/span&gt;&lt;span class="pl-corl"&gt;https://github.com/astral-sh/ruff&lt;/span&gt;&lt;span class="pl-s"&gt;)&lt;/span&gt;).
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make check-requirements&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Validate that the &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;requirements.txt&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt; files are in sync with &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;pyproject.toml&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;.
&lt;span class="pl-v"&gt;-&lt;/span&gt; &lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;make compile-requirements&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-s"&gt;**&lt;/span&gt;: Compile the &lt;span class="pl-s"&gt;`&lt;/span&gt;&lt;span class="pl-c1"&gt;requirements.txt&lt;/span&gt;&lt;span class="pl-s"&gt;`&lt;/span&gt; files using pip-tools.&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Py-Limbo. Py-Limbo is a lightweight, in-process, OLTP (Online Transaction Processing) database management system built as a Python extension module on top of Rust. It is designed to be compatible with SQLite in both usage and API, while offering an opportunity to experiment with Rust-backed database functionality. Note: Py-Limbo is a work-in-progress (Alpha stage) project. Some features (e.g. transactions, executemany, fetchmany) are not yet supported. Table of Contents - then a hierarchical nested table of contents." src="https://static.simonwillison.net/static/2025/pylimbo-docs.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;(Using LLMs like this to get me off the ground with under-documented libraries is a trick I use several times a month.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: &lt;a href="https://twitter.com/nikunjhanda/status/1886169547197264226"&gt;OpenAI's Nikunj Handa&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;we agree this is weird! fwiw, it’s a temporary thing we had to do for the existing o-series models. we’ll fix this in future releases so that you can go back to naturally prompting for markdown or no-markdown.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/harjotsgill/status/1886122316767379540"&gt;@harjotsgill&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o3"&gt;o3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/limbo"&gt;limbo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/system-prompts"&gt;system-prompts&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="markdown"/><category term="ai"/><category term="openai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="rag"/><category term="o1"/><category term="llm-reasoning"/><category term="o3"/><category term="limbo"/><category term="files-to-prompt"/><category term="system-prompts"/></entry><entry><title>Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens</title><link href="https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag" rel="alternate"/><published>2025-01-26T18:54:45+00:00</published><updated>2025-01-26T18:54:45+00:00</updated><id>https://simonwillison.net/2025/Jan/26/qwen25-1m/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://qwenlm.github.io/blog/qwen2.5-1m/"&gt;Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Very significant new release from Alibaba's Qwen team. Their openly licensed (sometimes Apache 2, sometimes Qwen license, I've had trouble keeping up) Qwen 2.5 LLM previously had an input token limit of 128,000 tokens. This new model increases that to 1 million, using a new technique called &lt;strong&gt;Dual Chunk Attention&lt;/strong&gt;, first described in &lt;a href="https://arxiv.org/abs/2402.17463"&gt;this paper&lt;/a&gt; from February 2024.&lt;/p&gt;
&lt;p&gt;They've released two models on Hugging Face: &lt;a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M"&gt;Qwen2.5-7B-Instruct-1M&lt;/a&gt; and &lt;a href="https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M"&gt;Qwen2.5-14B-Instruct-1M&lt;/a&gt;, both requiring CUDA and both under an Apache 2.0 license.&lt;/p&gt;
&lt;p&gt;You'll need a &lt;em&gt;lot&lt;/em&gt; of VRAM to run them at their full capacity:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;VRAM Requirement for processing 1 million-token sequences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-7B-Instruct-1M&lt;/strong&gt;: At least 120GB VRAM (total across GPUs).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen2.5-14B-Instruct-1M&lt;/strong&gt;: At least 320GB VRAM (total across GPUs).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M models for shorter tasks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Qwen recommend using their custom fork of vLLM to serve the models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;GGUF quantized versions of the models are already starting to show up. LM Studio's "official model curator" &lt;a href="https://huggingface.co/bartowski"&gt;Bartowski&lt;/a&gt; published &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF&lt;/a&gt; and &lt;a href="https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF"&gt;lmstudio-community/Qwen2.5-14B-Instruct-1M-GGUF&lt;/a&gt; - sizes range from 4.09GB to 8.1GB for the 7B model and 7.92GB to 15.7GB for the 14B.&lt;/p&gt;
&lt;p&gt;These might not work well yet with the full context lengths as the underlying &lt;code&gt;llama.cpp&lt;/code&gt; library may need some changes.&lt;/p&gt;
&lt;p&gt;I tried running the 8.1GB 7B model using &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; on my Mac like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then with &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-ollama
llm models -q qwen # To search for the model ID
# I set a shorter q1m alias:
llm aliases set q1m hf.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF:Q8_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried piping a large prompt in using &lt;a href="https://pypi.org/project/files-to-prompt/"&gt;files-to-prompt&lt;/a&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt ~/Dropbox/Development/llm -e py -c | llm -m q1m 'describe this codebase in detail'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should give me every Python file in my &lt;a href="https://github.com/simonw/llm"&gt;llm project&lt;/a&gt;. Piping that through &lt;a href="https://pypi.org/project/ttok/"&gt;ttok&lt;/a&gt; first told me this was 63,014 OpenAI tokens; I expect that count is similar for Qwen.&lt;/p&gt;
&lt;p&gt;The result &lt;a href="https://gist.github.com/simonw/ace6ff544dddabb0797b8d20d84627a8#response"&gt;was disappointing&lt;/a&gt;: it appeared to describe just the last Python file in that stream. Then I noticed the token usage report:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2,048 input, 999 output
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This suggests to me that something's not working right here - maybe the Ollama hosting framework is truncating the input, or maybe there's a problem with the GGUF I'm using?&lt;/p&gt;
&lt;p&gt;I'll update this post when I figure out how to run longer prompts through the new Qwen model using GGUF weights on a Mac.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It &lt;a href="https://news.ycombinator.com/item?id=42832838#42833427"&gt;turns out&lt;/a&gt; Ollama has a &lt;code&gt;num_ctx&lt;/code&gt; option which defaults to 2048, affecting the input context length. I tried this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt \
  ~/Dropbox/Development/llm \
  -e py -c | \
llm -m q1m 'describe this codebase in detail' \
 -o num_ctx 80000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But I quickly ran out of RAM (I have 64GB but a lot of that was in use already) and hit &lt;code&gt;Ctrl+C&lt;/code&gt; to avoid crashing my computer. I need to experiment a bit to figure out how much RAM is used for what context size.&lt;/p&gt;
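&lt;p&gt;A rough way to estimate the KV cache portion of that memory, assuming Qwen2.5-7B's architecture (28 layers, 4 KV heads via grouped-query attention, head dimension 128) and a 16-bit cache - all of those numbers are my assumptions, so verify them against the model's &lt;code&gt;config.json&lt;/code&gt;, and note this excludes the model weights and any runtime overhead:&lt;/p&gt;

```python
# Rough KV-cache size estimate for a given context length.
# Assumed Qwen2.5-7B-Instruct config: 28 layers, 4 KV heads (GQA),
# head_dim 128, 2 bytes per value (fp16). Verify against config.json.
# Excludes model weights and runtime overhead.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 28, 4, 128, 2

def kv_cache_gb(context_tokens: int) -> float:
    # Factor of 2 covers both the key and the value cache.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * per_token / 1024**3

print(f"{kv_cache_gb(80_000):.1f} GB for the cache alone")
```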
&lt;p&gt;Awni Hannun &lt;a href="https://twitter.com/awnihannun/status/1883611098081099914"&gt;shared tips&lt;/a&gt; for running &lt;a href="https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-4bit"&gt;mlx-community/Qwen2.5-7B-Instruct-1M-4bit&lt;/a&gt; using MLX, which should work for up to 250,000 tokens. They ran 120,000 tokens and reported:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Peak RAM for prompt filling was 22GB&lt;/li&gt;
&lt;li&gt;Peak RAM for generation 12GB&lt;/li&gt;
&lt;li&gt;Prompt filling took 350 seconds on an M2 Ultra&lt;/li&gt;
&lt;li&gt;Generation ran at 31 tokens-per-second on M2 Ultra&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/reach_vb/status/1883560095176708163"&gt;VB&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/qwen"&gt;qwen&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlx"&gt;mlx&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/long-context"&gt;long-context&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama-cpp"&gt;llama-cpp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/lm-studio"&gt;lm-studio&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-in-china"&gt;ai-in-china&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="qwen"/><category term="mlx"/><category term="ollama"/><category term="long-context"/><category term="llama-cpp"/><category term="llm-release"/><category term="files-to-prompt"/><category term="lm-studio"/><category term="ai-in-china"/></entry><entry><title>Introducing Limbo: A complete rewrite of SQLite in Rust</title><link href="https://simonwillison.net/2024/Dec/10/introducing-limbo/#atom-tag" rel="alternate"/><published>2024-12-10T19:25:21+00:00</published><updated>2024-12-10T19:25:21+00:00</updated><id>https://simonwillison.net/2024/Dec/10/introducing-limbo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://turso.tech/blog/introducing-limbo-a-complete-rewrite-of-sqlite-in-rust"&gt;Introducing Limbo: A complete rewrite of SQLite in Rust&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This looks absurdly ambitious:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our goal is to build a reimplementation of SQLite from scratch, fully compatible at the language and file format level,  with the same or higher reliability SQLite is known for, but with full memory safety and on a new, modern architecture.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Turso team behind it have been maintaining their &lt;a href="https://github.com/tursodatabase/libsql"&gt;libSQL&lt;/a&gt; fork for two years now, so they're well equipped to take on a challenge of this magnitude.&lt;/p&gt;
&lt;p&gt;SQLite is justifiably famous for its &lt;a href="https://www.sqlite.org/testing.html"&gt;meticulous approach to testing&lt;/a&gt;. Limbo plans to take an entirely different approach based on "Deterministic Simulation Testing" - a modern technique &lt;a href="https://antithesis.com/blog/is_something_bugging_you/"&gt;pioneered by FoundationDB&lt;/a&gt; and now spearheaded by &lt;a href="https://antithesis.com/"&gt;Antithesis&lt;/a&gt;, the company Turso have been working with on their previous testing projects.&lt;/p&gt;
&lt;p&gt;Another bold claim (emphasis mine):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have both added DST facilities to the core of the database, and partnered with Antithesis to achieve a level of reliability in the database that lives up to SQLite’s reputation.&lt;/p&gt;
&lt;p&gt;[...] With DST, &lt;strong&gt;we believe we can achieve an even higher degree of robustness than SQLite&lt;/strong&gt;, since it is easier to simulate unlikely scenarios in a simulator, test years of execution with different event orderings, and upon finding issues, reproduce them 100% reliably.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The two most interesting features that Limbo is planning to offer are first-party WASM support and fully asynchronous I/O:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;SQLite itself has a synchronous interface, meaning driver authors who want asynchronous behavior need to have the extra complication of using helper threads. Because SQLite queries tend to be fast, since no network round trips are involved, a lot of those drivers just settle for a synchronous interface. [...]&lt;/p&gt;
&lt;p&gt;Limbo is designed to be asynchronous from the ground up. It extends &lt;code&gt;sqlite3_step&lt;/code&gt;, the main entry point API to SQLite, to be asynchronous, allowing it to return to the caller if data is not ready to consume immediately.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; provides an &lt;a href="https://docs.datasette.io/en/stable/internals.html#await-db-execute-sql"&gt;async API&lt;/a&gt; for executing SQLite queries which is backed by all manner of complex thread management - I would be very interested in a native asyncio Python library for talking to SQLite database files.&lt;/p&gt;
&lt;p&gt;I successfully tried out Limbo's &lt;a href="https://github.com/tursodatabase/limbo/tree/main/bindings/python"&gt;Python bindings&lt;/a&gt; against a demo SQLite test database using &lt;code&gt;uv&lt;/code&gt; like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;uv run --with pylimbo python
&amp;gt;&amp;gt;&amp;gt; import limbo
&amp;gt;&amp;gt;&amp;gt; conn = limbo.connect("/tmp/demo.db")
&amp;gt;&amp;gt;&amp;gt; cursor = conn.cursor()
&amp;gt;&amp;gt;&amp;gt; print(cursor.execute("select * from foo").fetchall())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It crashed when I tried against a more complex SQLite database that included SQLite FTS tables.&lt;/p&gt;
&lt;p&gt;The Python bindings aren't yet documented, so I piped them through &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and had the new &lt;code&gt;google-exp-1206&lt;/code&gt; model write &lt;a href="https://gist.github.com/simonw/bd1822f372c406d17ed24772f8b93eea"&gt;this initial documentation&lt;/a&gt; for me:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt limbo/bindings/python -c | llm -m gemini-exp-1206 -s 'write extensive usage documentation in markdown, including realistic usage examples'
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=42378843"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/limbo"&gt;limbo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="open-source"/><category term="python"/><category term="sqlite"/><category term="rust"/><category term="ai-assisted-programming"/><category term="llm"/><category term="uv"/><category term="limbo"/><category term="files-to-prompt"/></entry><entry><title>Generating documentation from tests using files-to-prompt and LLM</title><link href="https://simonwillison.net/2024/Nov/5/docs-from-tests/#atom-tag" rel="alternate"/><published>2024-11-05T22:37:20+00:00</published><updated>2024-11-05T22:37:20+00:00</updated><id>https://simonwillison.net/2024/Nov/5/docs-from-tests/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://til.simonwillison.net/llms/docs-from-tests"&gt;Generating documentation from tests using files-to-prompt and LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was experimenting with the &lt;a href="https://github.com/bytecodealliance/wasmtime-py"&gt;wasmtime-py&lt;/a&gt; Python library today (for executing WebAssembly programs from inside CPython) and I found the existing &lt;a href="https://bytecodealliance.github.io/wasmtime-py/"&gt;API docs&lt;/a&gt; didn't quite show me what I wanted to know.&lt;/p&gt;
&lt;p&gt;The project has a &lt;a href="https://github.com/bytecodealliance/wasmtime-py/tree/main/tests"&gt;comprehensive test suite&lt;/a&gt; so I tried seeing if I could generate documentation using that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/bytecodealliance/wasmtime-py
files-to-prompt -e py wasmtime-py/tests -c | \
  llm -m claude-3.5-sonnet -s \
  'write detailed usage documentation including realistic examples'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;More &lt;a href="https://til.simonwillison.net/llms/docs-from-tests"&gt;notes in my TIL&lt;/a&gt;. You can see the &lt;a href="https://gist.github.com/simonw/351cffbd254af5cbf329377fb95fcc13"&gt;full Claude transcript here&lt;/a&gt; - I think this worked really well!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="ai"/><category term="webassembly"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="claude"/><category term="claude-3-5-sonnet"/><category term="files-to-prompt"/></entry><entry><title>files-to-prompt 0.4</title><link href="https://simonwillison.net/2024/Oct/16/files-to-prompt-04/#atom-tag" rel="alternate"/><published>2024-10-16T23:29:08+00:00</published><updated>2024-10-16T23:29:08+00:00</updated><id>https://simonwillison.net/2024/Oct/16/files-to-prompt-04/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/files-to-prompt/releases/tag/0.4"&gt;files-to-prompt 0.4&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New release of my &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;files-to-prompt tool&lt;/a&gt; adding an option to filter for just the files with a specific extension.&lt;/p&gt;
&lt;p&gt;The following command will output Claude XML-style markup for all Python and Markdown files in the current directory, and copy that to the macOS clipboard ready to be pasted into an LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt . -e py -e md -c | pbcopy
&lt;/code&gt;&lt;/pre&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="python"/><category term="llms"/><category term="files-to-prompt"/></entry><entry><title>An LLM TDD loop</title><link href="https://simonwillison.net/2024/Oct/13/an-llm-tdd-loop/#atom-tag" rel="alternate"/><published>2024-10-13T19:37:47+00:00</published><updated>2024-10-13T19:37:47+00:00</updated><id>https://simonwillison.net/2024/Oct/13/an-llm-tdd-loop/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://codeinthehole.com/tips/llm-tdd-loop-script/"&gt;An LLM TDD loop&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Super neat demo by David Winterbottom, who wrapped my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; and &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; tools in &lt;a href="https://gist.github.com/codeinthehole/d12af317a76b43423b111fd6d508c4fc"&gt;a short Bash script&lt;/a&gt; that can be fed a file full of Python unit tests and an empty implementation file and will then iterate on that file in a loop until the tests pass.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/codeinthehole/status/1845541873651274144"&gt;@codeinthehole&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/testing"&gt;testing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pytest"&gt;pytest&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="testing"/><category term="ai"/><category term="pytest"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="files-to-prompt"/></entry><entry><title>lm.rs: run inference on Language Models locally on the CPU with Rust</title><link href="https://simonwillison.net/2024/Oct/11/lmrs/#atom-tag" rel="alternate"/><published>2024-10-11T19:33:34+00:00</published><updated>2024-10-11T19:33:34+00:00</updated><id>https://simonwillison.net/2024/Oct/11/lmrs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/samuel-vitorino/lm.rs"&gt;lm.rs: run inference on Language Models locally on the CPU with Rust&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Impressive new LLM inference implementation in Rust by Samuel Vitorino. I tried it just now on an M2 Mac with 64GB of RAM and got very snappy performance for &lt;a href="https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS"&gt;this Q8 Llama 3.2 1B&lt;/a&gt;, with Activity Monitor reporting 980% CPU usage over 13 threads.&lt;/p&gt;
&lt;p&gt;Here's how I compiled the library and ran the model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /tmp
git clone https://github.com/samuel-vitorino/lm.rs
cd lm.rs
RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That &lt;code&gt;--show-metrics&lt;/code&gt; option added this at the end of a response:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Speed: 26.41 tok/s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It looks like the performance is helped by two key dependencies: &lt;a href="https://crates.io/crates/wide"&gt;wide&lt;/a&gt;, which provides data types optimized for SIMD operations, and &lt;a href="https://crates.io/crates/rayon"&gt;rayon&lt;/a&gt; for running parallel iterators across multiple cores (used &lt;a href="https://github.com/samuel-vitorino/lm.rs/blob/4a27af0ea07e284cf2a9c7cd1c984e484f143804/src/functional.rs#L136-L153"&gt;for matrix multiplication&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;(I used LLM and &lt;code&gt;files-to-prompt&lt;/code&gt; to &lt;a href="https://gist.github.com/simonw/19ce7d66bcd9a9efc46e25354a2f5b3c"&gt;help figure this out&lt;/a&gt;.)&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41811078"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="rust"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="files-to-prompt"/></entry><entry><title>Llama 3.2</title><link href="https://simonwillison.net/2024/Sep/25/llama-32/#atom-tag" rel="alternate"/><published>2024-09-25T20:28:04+00:00</published><updated>2024-09-25T20:28:04+00:00</updated><id>https://simonwillison.net/2024/Sep/25/llama-32/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"&gt;Llama 3.2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In further evidence that AI labs are terrible at naming things, Llama 3.2 is a &lt;em&gt;huge&lt;/em&gt; upgrade to the Llama 3 series - they've released their first multi-modal vision models!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The 1B and 3B text-only models are exciting too, with a 128,000 token context length and optimized for edge devices (Qualcomm and MediaTek hardware get called out specifically).&lt;/p&gt;
&lt;p&gt;Meta partnered directly with Ollama to help with distribution, &lt;a href="https://ollama.com/blog/llama3.2"&gt;here's the Ollama blog post&lt;/a&gt;. They only support the two smaller text-only models at the moment - this command will get the 3B model (2GB):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama3.2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And for the 1B model (a 1.3GB download):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ollama run llama3.2:1b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I had to first upgrade my Ollama by clicking on the icon in my macOS task tray and selecting "Restart to update". &lt;/p&gt;
&lt;p&gt;The two vision models are coming to Ollama "very soon".&lt;/p&gt;
&lt;p&gt;Once you have fetched the Ollama model you can access it from my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pipx install llm
llm install llm-ollama
llm chat -m llama3.2:1b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tried running my &lt;a href="https://github.com/simonw/djp"&gt;djp codebase&lt;/a&gt; through that tiny 1B model just now and got &lt;a href="https://gist.github.com/simonw/64c5f5b111fe473999144932bef4218b"&gt;a surprisingly good result&lt;/a&gt; - by no means comprehensive, but way better than I would ever expect from a model of that size:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt **/*.py -c | llm -m llama3.2:1b --system 'describe this code'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here's a portion of the output:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The first section defines several test functions using the &lt;code&gt;@djp.hookimpl&lt;/code&gt; decorator from the djp library. These hook implementations allow you to intercept and manipulate Django's behavior.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;test_middleware_order&lt;/code&gt;: This function checks that the middleware order is correct by comparing the &lt;code&gt;MIDDLEWARE&lt;/code&gt; setting with a predefined list.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;test_middleware&lt;/code&gt;: This function tests various aspects of middleware:&lt;/li&gt;
&lt;li&gt;It retrieves the response from the URL &lt;code&gt;/from-plugin/&lt;/code&gt; using the &lt;code&gt;Client&lt;/code&gt; object, which simulates a request to this view.&lt;/li&gt;
&lt;li&gt;It checks that certain values are present in the response:&lt;ul&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware-After&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;X-DJP-Middleware-Before&lt;/code&gt; [...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found the GGUF file that had been downloaded by Ollama in my &lt;code&gt;~/.ollama/models/blobs&lt;/code&gt; directory. The following command let me run that model directly in LLM using the &lt;a href="https://github.com/simonw/llm-gguf"&gt;llm-gguf&lt;/a&gt; plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gguf
llm gguf register-model ~/.ollama/models/blobs/sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45 -a llama321b
llm chat -m llama321b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Meta themselves claim impressive performance against other existing models:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our evaluation suggests that the Llama 3.2 vision models are competitive with leading foundation models, Claude 3 Haiku and GPT4o-mini on image recognition and a range of visual understanding tasks. The 3B model outperforms the Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use, while the 1B is competitive with Gemma.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's the &lt;a href="https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf"&gt;Llama 3.2 collection&lt;/a&gt; on Hugging Face. You need to accept the new Llama 3.2 Community License Agreement there in order to download those models.&lt;/p&gt;
&lt;p&gt;You can try the four new models out via the &lt;a href="https://lmarena.ai/"&gt;Chatbot Arena&lt;/a&gt; - navigate to "Direct Chat" there and select them from the dropdown menu. You can upload images directly to the chat there to try out the vision features.&lt;/p&gt;
&lt;p&gt;&lt;img alt="llama-3.2-vision-90b-instruct  A photo of two capybaras  Prompt: Describe this photo in detail  The image depicts two capybaras standing in a body of water, with one positioned on the edge of a metal grate. The capybara on the right is partially submerged in the water, while the other is standing on the grate.  Key Features:      Capybaras: Two capybaras are visible in the image.     Water: The capybaras are standing in a body of water, which appears to be murky or dirty.     Metal Grate: A metal grate is visible at the bottom right of the image, with one capybara standing on it.     Background: The background of the image is not clearly visible due to the presence of the water and the capybaras." src="https://static.simonwillison.net/static/2024/capybaras-llama-3.2.jpg" /&gt;
&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatbot-arena"&gt;chatbot-arena&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="vision-llms"/><category term="meta"/><category term="ollama"/><category term="llm-release"/><category term="files-to-prompt"/><category term="chatbot-arena"/></entry><entry><title>Solving a bug with o1-preview, files-to-prompt and LLM</title><link href="https://simonwillison.net/2024/Sep/25/o1-preview-llm/#atom-tag" rel="alternate"/><published>2024-09-25T18:41:13+00:00</published><updated>2024-09-25T18:41:13+00:00</updated><id>https://simonwillison.net/2024/Sep/25/o1-preview-llm/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a800b"&gt;Solving a bug with o1-preview, files-to-prompt and LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I added &lt;a href="https://github.com/simonw/djp/issues/10"&gt;a new feature&lt;/a&gt; to DJP this morning: you can now have plugins specify their middleware in terms of how it should be positioned relative to other middleware - inserted directly before or directly after &lt;code&gt;django.middleware.common.CommonMiddleware&lt;/code&gt; for example.&lt;/p&gt;
&lt;p&gt;At one point I got stuck with a weird test failure, and after ten minutes of head scratching I decided to pipe the entire thing into OpenAI's &lt;code&gt;o1-preview&lt;/code&gt; to see if it could spot the problem. I used &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; to gather the code and &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; to run the prompt:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt &lt;span class="pl-k"&gt;**&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;.py -c &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m o1-preview &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;The middleware test is failing showing all of these - why is MiddlewareAfter repeated so many times?&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;['MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware2', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware4', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware2', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware5', 'MiddlewareAfter', 'Middleware3', 'MiddlewareAfter', 'Middleware', 'MiddlewareBefore']&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The model whirled away for a few seconds and spat out &lt;a href="https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a800b#response"&gt;an explanation&lt;/a&gt; of the problem - one of my middleware classes was accidentally calling &lt;code&gt;self.get_response(request)&lt;/code&gt; in two different places.&lt;/p&gt;
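That failure mode is easy to reproduce outside Django. Here's a hypothetical minimal sketch (my own illustration, not the actual djp code) of why a middleware that calls its get_response callable twice makes everything below it run twice, producing exactly the kind of repeated names in the failing test:

```python
# Hypothetical minimal reproduction (not the actual djp code) of the bug:
# a middleware that calls get_response(request) twice re-runs the whole
# chain below it, so downstream names show up repeatedly.
calls = []

def make_middleware(name, get_response, buggy=False):
    def middleware(request):
        calls.append(name)
        response = get_response(request)
        if buggy:
            # The bug: a second call runs every downstream middleware again
            response = get_response(request)
        return response
    return middleware

def view(request):
    calls.append("view")
    return "response"

# "outer" is the buggy middleware wrapping "inner", which wraps the view
chain = make_middleware("outer", make_middleware("inner", view), buggy=True)
chain("request")
print(calls)  # ['outer', 'inner', 'view', 'inner', 'view']
```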
&lt;p&gt;I did enjoy how o1 attempted to reference the &lt;a href="https://docs.djangoproject.com/en/5.1/topics/http/middleware/#writing-your-own-middleware"&gt;relevant Django documentation&lt;/a&gt; and then half-repeated, half-hallucinated a quote from it:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Reference: From the Django documentation on writing middleware: Each middleware component is responsible for doing some specific function. They accept the request, do something, and pass the request to the next middleware component (if needed). They can also modify the response before sending it back to the client." src="https://static.simonwillison.net/static/2024/o1-hallucination.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;This took 2,538 input tokens and 4,354 output tokens - &lt;a href="https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a800b?permalink_comment_id=5207703#gistcomment-5207703"&gt;by my calculations&lt;/a&gt; at $15/million input and $60/million output that prompt cost just under 30 cents.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/o1"&gt;o1&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/djp"&gt;djp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-reasoning"&gt;llm-reasoning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="o1"/><category term="djp"/><category term="llm-reasoning"/><category term="files-to-prompt"/></entry><entry><title>Jiter</title><link href="https://simonwillison.net/2024/Sep/22/jiter/#atom-tag" rel="alternate"/><published>2024-09-22T20:03:07+00:00</published><updated>2024-09-22T20:03:07+00:00</updated><id>https://simonwillison.net/2024/Sep/22/jiter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/pydantic/jiter/tree/main/crates/jiter-python"&gt;Jiter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One of the challenges in dealing with LLM streaming APIs is the need to parse partial JSON - until the stream has ended you won't have a complete valid JSON object, but you may want to display components of that JSON as they become available.&lt;/p&gt;
&lt;p&gt;I've solved this previously using the &lt;a href="https://pypi.org/project/ijson/"&gt;ijson&lt;/a&gt; streaming JSON library, see &lt;a href="https://til.simonwillison.net/json/ijson-stream"&gt;my previous TIL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Today I found out about Jiter, a new option from the team behind Pydantic. It's written in Rust and extracted from &lt;a href="https://github.com/pydantic/pydantic-core"&gt;pydantic-core&lt;/a&gt;, so the Python wrapper for it can be installed using:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pip install jiter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can feed it an incomplete JSON bytes object and use &lt;code&gt;partial_mode="on"&lt;/code&gt; to parse the valid subset:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;jiter&lt;/span&gt;
&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;b'{"name": "John", "age": 30, "city": "New Yor'&lt;/span&gt;
&lt;span class="pl-s1"&gt;jiter&lt;/span&gt;.&lt;span class="pl-en"&gt;from_json&lt;/span&gt;(&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt;, &lt;span class="pl-s1"&gt;partial_mode&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"on"&lt;/span&gt;)
&lt;span class="pl-c"&gt;# {'name': 'John', 'age': 30}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;Or use &lt;code&gt;partial_mode="trailing-strings"&lt;/code&gt; to include incomplete string fields too:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;jiter&lt;/span&gt;.&lt;span class="pl-en"&gt;from_json&lt;/span&gt;(&lt;span class="pl-s1"&gt;partial_json&lt;/span&gt;, &lt;span class="pl-s1"&gt;partial_mode&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"trailing-strings"&lt;/span&gt;)
&lt;span class="pl-c"&gt;# {'name': 'John', 'age': 30, 'city': 'New Yor'}&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;a href="https://github.com/pydantic/jiter/blob/ae5fc7d8548c90ad8762dfdf2ea6461776c2feb6/crates/jiter-python/README.md"&gt;current README&lt;/a&gt; was a little thin, so I submitted &lt;a href="https://github.com/pydantic/jiter/pull/143"&gt;a PR&lt;/a&gt; with some extra examples. I &lt;a href="https://gist.github.com/simonw/264d487db1a18f8585c2ca0c68e50d1e"&gt;got some help&lt;/a&gt; from &lt;code&gt;files-to-prompt&lt;/code&gt; and Claude 3.5 Sonnet:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;cd crates/jiter-python/ &amp;amp;&amp;amp; files-to-prompt -c README.md tests | llm -m claude-3.5-sonnet --system 'write a new README with comprehensive documentation'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41615404#41618393"&gt;jackmpcollins on Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pydantic"&gt;pydantic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="python"/><category term="rust"/><category term="ai-assisted-programming"/><category term="pydantic"/><category term="files-to-prompt"/></entry><entry><title>files-to-prompt 0.3</title><link href="https://simonwillison.net/2024/Sep/9/files-to-prompt-03/#atom-tag" rel="alternate"/><published>2024-09-09T05:57:35+00:00</published><updated>2024-09-09T05:57:35+00:00</updated><id>https://simonwillison.net/2024/Sep/9/files-to-prompt-03/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/files-to-prompt/releases/tag/0.3"&gt;files-to-prompt 0.3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New version of my &lt;code&gt;files-to-prompt&lt;/code&gt; CLI tool for turning a bunch of files into a prompt suitable for piping to an LLM, &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;described here previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It now has a &lt;code&gt;-c/--cxml&lt;/code&gt; flag for outputting the files in Claude XML-ish notation (XML-ish because it's not actually valid XML) using the format Anthropic describe as &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips#essential-tips-for-long-context-prompts"&gt;recommended for long context&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;files-to-prompt llm-*/README.md --cxml | llm -m claude-3.5-sonnet \
  --system 'return an HTML page about these plugins with usage examples' \
  &amp;gt; /tmp/fancy.html
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2024/llm-cxml-demo.html"&gt;Here's what that gave me&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The format itself looks something like this:&lt;/p&gt;
&lt;div class="highlight highlight-text-xml"&gt;&lt;pre&gt;&amp;lt;&lt;span class="pl-ent"&gt;documents&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;document&lt;/span&gt; &lt;span class="pl-e"&gt;index&lt;/span&gt;=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;source&lt;/span&gt;&amp;gt;llm-anyscale-endpoints/README.md&amp;lt;/&lt;span class="pl-ent"&gt;source&lt;/span&gt;&amp;gt;
&amp;lt;&lt;span class="pl-ent"&gt;document_content&lt;/span&gt;&amp;gt;
# llm-anyscale-endpoints
...
&amp;lt;/&lt;span class="pl-ent"&gt;document_content&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;document&lt;/span&gt;&amp;gt;
&amp;lt;/&lt;span class="pl-ent"&gt;documents&lt;/span&gt;&amp;gt;&lt;/pre&gt;&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tools"&gt;tools&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="cli"/><category term="projects"/><category term="tools"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="files-to-prompt"/></entry><entry><title>json-flatten, now with format documentation</title><link href="https://simonwillison.net/2024/Sep/7/json-flatten/#atom-tag" rel="alternate"/><published>2024-09-07T05:43:01+00:00</published><updated>2024-09-07T05:43:01+00:00</updated><id>https://simonwillison.net/2024/Sep/7/json-flatten/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/json-flatten?tab=readme-ov-file#json-flattening-format"&gt;json-flatten, now with format documentation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;json-flatten&lt;/code&gt; is a fun little Python library I put together a few years ago for converting JSON data into a flat key-value format, suitable for inclusion in an HTML form or query string. It lets you take a structure like this one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{"foo": {"bar": [1, True, None]}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And convert it into key-value pairs like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;foo.bar.[0]$int=1
foo.bar.[1]$bool=True
foo.bar.[2]$none=None
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;flatten(dictionary)&lt;/code&gt; function converts to that format, and &lt;code&gt;unflatten(dictionary)&lt;/code&gt; converts back again.&lt;/p&gt;
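A minimal sketch of that key/value format - my own illustration, not the actual json-flatten implementation: nested dictionary keys are joined with ".", list indices become "[i]", and non-string leaves get a "$type" suffix so unflattening can restore the original types:

```python
# Minimal sketch of the flattening format described above (an illustration,
# not the real json-flatten library code).
def flatten_sketch(obj, prefix=""):
    def child(key):
        return f"{prefix}.{key}" if prefix else key

    if isinstance(obj, dict):
        pairs = {}
        for key, value in obj.items():
            pairs.update(flatten_sketch(value, child(key)))
        return pairs
    if isinstance(obj, list):
        pairs = {}
        for i, value in enumerate(obj):
            pairs.update(flatten_sketch(value, child(f"[{i}]")))
        return pairs
    if obj is None:
        return {f"{prefix}$none": "None"}
    if isinstance(obj, bool):  # check bool before int: True is an int in Python
        return {f"{prefix}$bool": str(obj)}
    if isinstance(obj, int):
        return {f"{prefix}$int": str(obj)}
    return {prefix: obj}

flat = flatten_sketch({"foo": {"bar": [1, True, None]}})
for key, value in flat.items():
    print(f"{key}={value}")
# foo.bar.[0]$int=1
# foo.bar.[1]$bool=True
# foo.bar.[2]$none=None
```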
&lt;p&gt;I was considering the library for a project today and realized that &lt;a href="https://github.com/simonw/json-flatten/blob/0.3/README.md"&gt;the 0.3 README&lt;/a&gt; was a little thin - it showed how to use the library but didn't provide full details of the format it used.&lt;/p&gt;
&lt;p&gt;On a hunch, I decided to see if &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;files-to-prompt&lt;/a&gt; plus &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; plus Claude 3.5 Sonnet could write that documentation for me. I ran this command:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;files-to-prompt *.py | llm -m claude-3.5-sonnet --system 'write detailed documentation in markdown describing the format used to represent JSON and nested JSON as key/value pairs, include a table as well'&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That &lt;code&gt;*.py&lt;/code&gt; picked up both &lt;code&gt;json_flatten.py&lt;/code&gt; and &lt;code&gt;test_json_flatten.py&lt;/code&gt; - I figured the test file had enough examples in that it should act as a good source of information for the documentation.&lt;/p&gt;
&lt;p&gt;This worked really well! You can see the &lt;a href="https://gist.github.com/simonw/f5caf4ca24662f0078ec3cffcb040ce4#response"&gt;first draft it produced here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It included before and after examples in the documentation. I didn't fully trust these to be accurate, so I gave it this follow-up prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;llm -c "Rewrite that document to use the Python cog library to generate the examples"&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm a big fan of &lt;a href="https://nedbatchelder.com/code/cog/"&gt;Cog&lt;/a&gt; for maintaining examples in READMEs that are generated by code. Cog has been around for a couple of decades now so it was a safe bet that Claude would know about it.&lt;/p&gt;
&lt;p&gt;This &lt;a href="https://gist.github.com/simonw/f5caf4ca24662f0078ec3cffcb040ce4#response-1"&gt;almost worked&lt;/a&gt; - it produced valid Cog syntax like the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[[[cog
example = {
"fruits": ["apple", "banana", "cherry"]
}

cog.out("```json\n")
cog.out(str(example))
cog.out("\n```\n")
cog.out("Flattened:\n```\n")
for key, value in flatten(example).items():
    cog.out(f"{key}: {value}\n")
cog.out("```\n")
]]]
[[[end]]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But that wasn't entirely right, because it forgot to include the Markdown comments that would hide the Cog syntax, which should have looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;!-- [[[cog --&amp;gt;
...
&amp;lt;!-- ]]] --&amp;gt;
...
&amp;lt;!-- [[[end]]] --&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could have prompted it to correct itself, but at this point I decided to take over and edit the rest of the documentation by hand.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/json-flatten/blob/78c2835bf3b7b7cf068fca04a6cf341347dfa2bc/README.md"&gt;end result&lt;/a&gt; was documentation that I'm really happy with, and that I probably wouldn't have bothered to write if Claude hadn't got me started.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;



</summary><category term="json"/><category term="projects"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="claude-3-5-sonnet"/><category term="files-to-prompt"/></entry><entry><title>Language models on the command-line</title><link href="https://simonwillison.net/2024/Jun/17/cli-language-models/#atom-tag" rel="alternate"/><published>2024-06-17T16:44:39+00:00</published><updated>2024-06-17T16:44:39+00:00</updated><id>https://simonwillison.net/2024/Jun/17/cli-language-models/#atom-tag</id><summary type="html">
    &lt;p&gt;I gave a talk about accessing Large Language Models from the command-line last week as part of the &lt;a href="https://maven.com/parlance-labs/fine-tuning"&gt;Mastering LLMs: A Conference For Developers &amp;amp; Data Scientists&lt;/a&gt; six week long online conference. The talk focused on my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; Python command-line utility and ways you can use it (and &lt;a href="https://llm.datasette.io/en/stable/plugins/index.html"&gt;its plugins&lt;/a&gt;) to explore LLMs and use them for useful tasks.&lt;/p&gt;

&lt;p&gt;The talk was recorded and is available &lt;a href="https://www.youtube.com/watch?v=QUXQNi6jQ30"&gt;on YouTube&lt;/a&gt;. Here I've turned it into an &lt;a href="https://simonwillison.net/tags/annotatedtalks/"&gt;annotated presentation&lt;/a&gt;, with detailed notes and screenshots (there were no slides) to accompany the video.&lt;/p&gt;

&lt;iframe style="max-width: 100%" width="560" height="315" src="https://www.youtube-nocookie.com/embed/QUXQNi6jQ30" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="allowfullscreen"&gt; &lt;/iframe&gt;

  &lt;p style="margin-top: 1em"&gt;&lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; is a tool I started building last year to help run LLM prompts directly from a command-line terminal. Instructions for installing it &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;are here&lt;/a&gt; - you can use &lt;code&gt;pipx install llm&lt;/code&gt; or &lt;code&gt;pip install llm&lt;/code&gt; or &lt;code&gt;brew install llm&lt;/code&gt;.&lt;/p&gt;

&lt;div class="llm-slide" id="frame_000003.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000003.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000003.jpg" alt="The LLM website" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000003.jpg"&gt;#&lt;/a&gt;

&lt;p&gt;Once installed you can use it with OpenAI models by running &lt;code&gt;llm keys set openai&lt;/code&gt; and pasting in your OpenAI key - or &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html#plugin-directory"&gt;install plugins&lt;/a&gt; to use models by other providers, including models you can run locally.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000259.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000259.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000259.jpg" alt="$ llm &amp;#39;five great names for a pet pelican&amp;#39;
Sure, here are five great names for a pet pelican:
1. **Splash**
2. **Captain Beak**
3. **Seabreeze**
4. **Marina**
5. **Pelicano**
These names incorporate elements of the pelican&amp;#39;s natural habitat and distinctive features, adding a fun and fitting touch for your feathered friend!" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000259.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Basic usage is very simple: once you've configured your OpenAI key, you can run prompts against their models like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm 'five great names for a pet pelican'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output will stream to your terminal, or you can redirect it to a file like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm 'five great names for a pet pelican' &amp;gt; pelicans.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000341.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000341.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000341.jpg" alt="$ llm -c &amp;#39;now do walruses&amp;#39;
Of course, here are five fitting names for a pet walrus:
1. **Whiskers**
2. **Tusk**
3. **Gustav**
4. **Blubber**
5. **Wally**

$ llm -c &amp;#39;justify those!&amp;#39;

Certainly! Here&amp;#39;s why each name suits a pet walrus:
1. **Whiskers**
- **Justification:** Walruses have distinctive, prominent whiskers (vibrissae) that the
y use to detect food on the ocean floor. This name highlights one of their most characteristic features.
2. **Tusk**
- **Justification:** Walruses are known for their long, formidable tusks, which they use for various purposes" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000341.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;You can use the &lt;code&gt;-c&lt;/code&gt; (for continue) option to send follow-up prompts as part of the same ongoing conversation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'now do walruses'
# ...
llm -c 'justify those!'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000506.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000506.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000506.jpg" alt="Screenshot of the list of Remote APIs plugins in the LLM plugins directory." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000506.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;LLM supports additional models via plugins. These are listed in the &lt;a href="https://llm.datasette.io/en/stable/plugins/directory.html"&gt;LLM plugins directory&lt;/a&gt;, with dozens of plugins for both remote API-hosted models as well as models you can run directly on your own computer.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000704.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000704.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000704.jpg" alt="$ llm models
Outputs a list of OpenAI models followed by several anthropic models.

$ llm -m claude-3-haiku-20240307 &amp;#39;say hi in spanish with a flourish&amp;#39;
¡Hola, mi amigo! 🌟" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000704.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here I'm using the &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; plugin, which provides access to the Anthropic &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;Claude 3 family&lt;/a&gt; of models.&lt;/p&gt;
&lt;p&gt;I really like these models. Claude 3 Opus is about equivalent to GPT-4o in terms of quality. Claude 3 Haiku is both cheaper and better than GPT-3.5, and can handle 100,000 input tokens including images.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;llm install llm-claude-3
llm keys set claude
# &amp;lt;Paste key here&amp;gt;
# Now list available models
llm models
# Then run a prompt
llm -m claude-3-haiku-20240307 'say hi in spanish with a flourish'
# Or use the haiku alias
llm -m haiku 'say hi in spanish with a flourish'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000818.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000818.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000818.jpg" alt="$ llm logs path
/Users/simon/Library/Application Support/io.datasette.llm/logs.db

$ datasette &amp;quot;$(llm logs path)&amp;quot;
...
Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000818.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Every prompt and response run through the LLM tool is permanently logged to a SQLite database, &lt;a href="https://llm.datasette.io/en/stable/logging.html"&gt;as described here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This command shows the path to that database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm logs path
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you install &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; you can use it to browse your SQLite database like this, using a terminal trick where the output of one command is passed to another (with double quotes to avoid any problems caused by the space in the directory name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;datasette "$(llm logs path)"
&lt;/code&gt;&lt;/pre&gt;
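Because the log is a regular SQLite database, you can also query it directly from Python rather than through Datasette. This is a sketch under assumptions: the &lt;code&gt;responses&lt;/code&gt; table and its &lt;code&gt;model&lt;/code&gt; column match what the Datasette screenshots in this talk show, but check the schema for your own version of LLM before relying on it.

```python
import sqlite3

def prompts_per_model(db_path):
    # Count logged responses per model. The responses table and model
    # column are assumptions based on the Datasette facets shown in
    # this talk - browse the database first to confirm the schema.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT model, COUNT(*) AS n FROM responses "
            "GROUP BY model ORDER BY n DESC"
        ).fetchall()
    finally:
        conn.close()
```

Pass it the path printed by `llm logs path` to get a per-model usage table.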
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000832.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000832.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000832.jpg" alt="Datasette showing the 2,434 responses I have logged as a table, with a search interface at the top." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000832.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here's my searchable database of 2,434 responses I've logged from using LLM on my laptop, running in Datasette.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000853.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000853.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000853.jpg" alt="Datasette screenshot showing a conversation, with 3 linked responses" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000853.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Earlier we ran a prompt and then sent two follow-up prompts to it using the &lt;code&gt;llm -c&lt;/code&gt; option. Those are stored in the database as three responses that are part of the same conversation.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000924.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000924.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000924.jpg" alt="This data as json, copyable, CSV (advanced)

Suggested facets: datetime_utc (date)

model &amp;gt;30

- gpt-3.5-turbo 383
- claude-3-opus-20240229 334
- gpt-4 233
- gpt-4-1106-preview 134
- claude-2 117
- mic-chat-Llama-2-7b-chat-hf-q 4f16 1 93
- mistral-7b-instruct-vo 78
- mic-chat-Llama-2-13b-chat-hf- 94f16 1 74
- claude-3-haiku-20240307 49
- gpt-3.5-turbo-instruct 49
- gpt-40 37
- mic-chat-Mistral-7B-Instruct-v 0.2-q3f16 1 35
- Phi-3-mini-4k-instruct 33
- gemini-pro 33
- gemini-1.5-pro-latest 32
- gguf 28
- gpt-4-turbo-preview 27
- gpt-3.5-turbo-16k 26
- mistral-medium 26
- llama-2-7b-chat.ggmlv3.g8 0 2
- mistral-tiny 23
- Meta-Llama-3-8B-Instruct 21
- llamafile 21
- llama-2-13b.Q8 0 20
- orca-mini-3b 19
- command-r 14
- llama-2-7b.ggmlv3.8 0 14
- mistral-small 14
- mistral-7b-instruct-v0.1.08 0 1 3
- claude 12" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000924.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Datasette supports facets, which provide a quick overview of unique value counts within the data. I've used GPT-3.5 turbo 383 times, Claude 3 Opus 334 times and a whole bunch of other models.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_000940.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_000940.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_000940.jpg" alt="6 rows where search matches &amp;quot;pelican&amp;quot; and model = &amp;quot;claude-3-opus-20240229&amp;quot; sorted by id descending" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_000940.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Datasette provides search and filtering too - here are the prompts I've run against Claude 3 Opus that match "pelican".&lt;/p&gt;
&lt;p&gt;I have an experimental feature that lets me pass images to some models using the &lt;code&gt;-i filename.png&lt;/code&gt; option. More on that &lt;a href="https://github.com/simonw/llm/issues/331"&gt;in this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/llm-cmd"&gt;llm-cmd&lt;/a&gt; is an example of a plugin that adds an extra sub-command to LLM - in this case one that takes a description of a task and turns that into a pre-populated shell command. I wrote more about that in &lt;a href="https://simonwillison.net/2024/Mar/26/llm-cmd/"&gt;llm cmd undo last git commit—a new plugin for LLM&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001335.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001335.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001335.jpg" alt="Screenshot of the GPT4All website: A free-to-use, locally running, privacy-aware chatbot. No GPU or internet required." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001335.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;There are plenty of ways to run local models using LLM. One of my favourites is the &lt;a href="https://github.com/simonw/llm-gpt4all"&gt;llm-gpt4all&lt;/a&gt; plugin, which builds on top of the excellent &lt;a href="https://gpt4all.io/index.html"&gt;GPT4All&lt;/a&gt; Python library by Nomic AI.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001450.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001450.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001450.jpg" alt="$ llm -m mistral-7b-instruct-v@ &amp;#39;five great names for a pet seagull, with explanations&amp;#39; =3
1. &amp;quot;Sally&amp;quot; - This name is perfect for a pet seagull because it&amp;#39;s short and easy to remember,
yet still has a charming and friendly vibe. It also evokes images of a playful and adventuro
us bird that loves to explore the world around her.
2. &amp;quot;Captain Jack&amp;quot; - For any seafaring seagull who loves adventure and exploration, this name
is perfect. Captain Jack suggests a sense of independence and leadership, which could be fitt
ing for a bird who enjoys flying high and exploring new places.
3. &amp;quot;Scarlett&amp;quot; - This name is bold and eye-catching, much like the vibrant feathers of a seagull. Scarlett also has a fiery energy that could reflect the spirit of your adventurous pet.
4. &amp;quot;Nautica&amp;quot; - Nautica means &amp;quot;sailor&amp;quot; in Latin, making it an excellent choice for a seagull

Activity Monitor is also visible, showing a Python process using 350MB of RAM." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001450.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;To install that plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm install llm-gpt4all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then run &lt;code&gt;llm models&lt;/code&gt; to list the new models. Each model will be downloaded the first time you try running a prompt through it.&lt;/p&gt;
&lt;p&gt;I used this to run Mistral-7B Instruct - an extremely high-quality small (~4GB) model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -m mistral-7b-instruct-v0 'five great names for a pet seagull, with explanations'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can run Activity Monitor to see the resources the model is using.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_001544.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_001544.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_001544.jpg" alt="$ llm chat -m mistral-7b-instruct-v0
Chatting with mistral-7b-instruct-v@
Type &amp;#39;exit&amp;#39; or &amp;#39;quit&amp;#39; to exit
Type &amp;#39;!multi&amp;#39; to enter multiple lines, then &amp;#39;!end&amp;#39; to finish
&amp;gt; say hello in spanish

Hola! How can I assist you today?
&amp;gt; now in french
Bonjour! Comment puis-je vous aider aujourd&amp;#39;hui?" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_001544.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Running prompts like this is inefficient, because it loads the full model into memory, runs the prompt and then shuts down the program again.&lt;/p&gt;
&lt;p&gt;Instead, you can use the &lt;code&gt;llm chat&lt;/code&gt; command which keeps the model in memory across multiple prompts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -m mistral-7b-instruct-v0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another option is to run &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;, which runs its own local server hosting models. The &lt;a href="https://github.com/taketwo/llm-ollama"&gt;llm-ollama&lt;/a&gt; plugin can then be used to run prompts through Ollama from LLM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002013.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002013.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002013.jpg" alt="The llamafile website" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002013.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Another really exciting option for running models is &lt;a href="https://github.com/Mozilla-Ocho/llamafile"&gt;llamafile&lt;/a&gt;, a project sponsored by Mozilla that uses Justine Tunney's &lt;a href="https://justine.lol/cosmo3/"&gt;Cosmopolitan&lt;/a&gt; to compile a binary that bundles both a model and the software needed to serve it, in a single file that can execute on several different operating systems.&lt;/p&gt;
&lt;p&gt;I've written more about that here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2023/Nov/29/llamafile/"&gt;llamafile is the new best way to run a LLM on your own computer
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/cosmopolitan/ecosystem"&gt;Catching up with the Cosmopolitan ecosystem
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002521.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002521.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002521.jpg" alt="Screenshot of a browser window titled llama.cpp showing a photograph

User: Describe image

Llama: The image features a person sitting in a chair next to an outdoor setting with a rooster nearby. A white bowl filled with eggs is placed on the ground, and it appears that the person may be preparing or collecting them from their backyard chicken coop. There are several other birds scattered around the scene as well, adding more life to this" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002521.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Running &lt;a href="https://llava-vl.github.io/"&gt;LLaVA&lt;/a&gt; using a Llamafile is particularly fun - it's an openly licensed model that can accept images as input as well. It's pretty amazing the results you can get from that, running as a single binary on your laptop.&lt;/p&gt;
&lt;p&gt;Grab that from &lt;a href="https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile"&gt;Mozilla/llava-v1.5-7b-llamafile&lt;/a&gt; on Hugging Face.&lt;/p&gt;
&lt;p&gt;LLM can talk to Llamafile instances via the &lt;a href="https://github.com/simonw/llm-llamafile"&gt;llm-llamafile&lt;/a&gt; plugin.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002636.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002636.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002636.jpg" alt="#!/bin/bash

# Validate that the first argument is an integer
if [[ ! $1 =~ ^[0-9]+$ ]]; then
  echo &amp;quot;Please provide a valid integer as the first argument.&amp;quot;
  exit 1
fi

id=&amp;quot;$1&amp;quot;

# Parse the optional -m argument
model=&amp;quot;haiku&amp;quot;
if [[ $2 == &amp;quot;-m&amp;quot; &amp;amp;&amp;amp; -n $3 ]]; then
  model=&amp;quot;$3&amp;quot;
fi

# Make API call, parse and summarize the discussion
curl -s &amp;quot;https://hn.algolia.com/api/v1/items/$id&amp;quot; | \
  jq -r &amp;#39;recurse(.children[]) | .author + &amp;quot;: &amp;quot; + .text&amp;#39; | \
  llm -m &amp;quot;$model&amp;quot; -s &amp;#39;Summarize the themes of the opinions expressed here.
  For each theme, output a markdown header.
  Include direct &amp;quot;quotations&amp;quot; (with author attribution) where appropriate.
  You MUST quote directly from users when crediting them, with double quotes.
  Fix HTML entities. Output markdown. Go long.&amp;#39;" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002636.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Now that we can run prompts from our terminal, we can start assembling software by writing scripts.&lt;/p&gt;
&lt;p&gt;Here's a Bash script I wrote to summarize conversations on Hacker News, using longer context models such as Claude 3 Haiku or Google Gemini 1.5 or GPT-4o.&lt;/p&gt;
&lt;p&gt;I wrote more about this in &lt;a href="https://til.simonwillison.net/llms/claude-hacker-news-themes"&gt;Summarizing Hacker News discussion themes with Claude and LLM&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_002850.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_002850.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_002850.jpg" alt="$ hn-summary.sh 40651054 -m 4o

# Dependency and Linking in Software Distribution
Several users discussed issues around dependency management and linking in software distribut
ion:
### Stability and Compatibility Issues
- **e63f67dd-@65b:** &amp;quot;It’s just a classic dependency issue... a combination of ABI instabilit
y and just plain version incompatibility from one distro to the next... My opinion is the opp
osite: I think the old paradigm of distros managing a giant set of system libraries is a bad
one.&amp;quot; —|
- **umanwizard:** &amp;quot;Some distros do actually break out rust dependencies into separate package
s (e.g. Guix does this).&amp;quot;
### Dynamic vs Static Linking
- **jiripospisil:** &amp;quot;Until there&amp;#39;s a vulnerability in one of the dependencies and now you hav
e to rebuild all of the packages which use it... For Rust, there&amp;#39;s also the fact that most pr
ojects use a lock file.&amp;quot;
- **pglata:** &amp;quot;Static linking is essentially easier for people who don&amp;#39;t want to care. But ig
noring security does not mean it solves it" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_002850.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The script works by hitting the Hacker News Algolia API to return the full, nested JSON structure of the conversation (e.g. &lt;a href="https://hn.algolia.com/api/v1/items/40651054"&gt;this JSON endpoint&lt;/a&gt; for &lt;a href="https://news.ycombinator.com/item?id=40651054"&gt;this conversation&lt;/a&gt;), then runs that through &lt;code&gt;jq&lt;/code&gt; to turn it into text, then pipes that into a model using LLM.&lt;/p&gt;
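If you would rather do that flattening step in Python than &lt;code&gt;jq&lt;/code&gt;, here is a hedged sketch of the same traversal. It assumes the Algolia item JSON uses &lt;code&gt;author&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;children&lt;/code&gt; keys, as the jq expression in the script does.

```python
def comment_lines(item):
    # A Python equivalent of the script's jq step -
    # recurse(.children[]) | .author + ": " + .text -
    # walking the nested thread and emitting one "author: text"
    # line per comment (deleted comments have no text and are skipped).
    lines = []
    if item.get("author") and item.get("text"):
        lines.append(f"{item['author']}: {item['text']}")
    for child in item.get("children") or []:
        lines.extend(comment_lines(child))
    return lines
```

Joining those lines with newlines and piping them into &lt;code&gt;llm&lt;/code&gt; with the same system prompt reproduces the shell pipeline.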
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003325.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003325.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003325.jpg" alt="Tutorial: Writing a plugin to support a new model

This tutorial will walk you through developing a new plugin for LLM that adds support for a new Large Language Model.

We will be developing a plugin that implements a simple Markov chain to generate words based on an input string..." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003325.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;If you want to contribute to LLM itself, a great way to do that is to write plugins that support new models. I have &lt;a href="https://llm.datasette.io/en/stable/plugins/tutorial-model-plugin.html"&gt;an extensive tutorial&lt;/a&gt; describing how to do that.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003440.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003440.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003440.jpg" alt="$ llm --system &amp;#39;You are a sentient cheesecake&amp;#39; -m gpt-4o --save cheesecake

$ llm chat -t cheesecake
Chatting with gpt-4o
Type &amp;#39;exit&amp;#39; or &amp;#39;quit&amp;#39; to exit
Type &amp;#39;!multi&amp;#39; to enter multiple lines, then &amp;#39;l!end&amp;#39; to finish
&amp;gt; tell me abotu yourself
Hello! I&amp;#39;m a sentient cheesecake, a delightful fusion of creamy textures and rich flavors. My main ingredients include cream cheese, sugar, eggs, and a touch of vanilla, all resting on a
buttery graham cracker crust. I&amp;#39;m the perfect balance of sweet and tangy, and I take great pride in bringing joy to those who savor a slice of me. Despite my delicate nature, I have a surprisingly strong personality-and apparently, the ability to converse! Is there anything specific you&amp;#39;d like to know about the life of a sentient cheesecake?" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003440.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Another fun trick is to use templates to save and execute parameterized prompts, &lt;a href="https://llm.datasette.io/en/stable/templates.html"&gt;documented here&lt;/a&gt;. The easiest way to create a template is with the &lt;code&gt;--save&lt;/code&gt; option like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm --system 'You are a sentient cheesecake' -m gpt-4o --save cheesecake
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you can chat with a cheesecake:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm chat -t cheesecake
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003532.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003532.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003532.jpg" alt="$ cat demos.md | llm -s &amp;#39;summary, short&amp;#39;
1. **Terminal Basics**
- Quick reminder on basic terminal commands and piping with “|&amp;quot; and *&amp;gt;&amp;quot;, and using ‘cat’.
2. **Ttok Installation &amp;amp; Demonstration**
- Using “pipx install ttok™ to install “ttok™ and demonstrating its usage.
3. **Installing LLM**
- Installation of LLM using either “brew’ or “pipx&amp;quot;.
4. **Configuring for OpenAI** =
- How to configure LLM for OpenAI.
5. **Running a Prompt**
- Example of running a prompt with LLM.
6. **Installing LLM Plugin: 1lm-claude-3**
- Installing the “llm-claude-3&amp;quot; plugin and running a prompt with it.
7. **Browsing Logs with Datasette**
- Demonstrating browsing logs using Datasette.
8. **Using LLM Command**" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003532.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Being able to pipe content into &lt;code&gt;llm&lt;/code&gt; is a really important feature.&lt;/p&gt;
&lt;p&gt;The simplest way to do this is with &lt;code&gt;cat&lt;/code&gt; to send in files. This command summarizes the content of a provided file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat demos.md | llm -s 'summary, short'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003552.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003552.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003552.jpg" alt="$ llm -c &amp;#39;no much much much shorter, and in haikus&amp;#39;

Terminal basics,
Commands and piping galore,
Cat reminds us all.

Install ttok first,
Pipx makes it easy-peasy,
Quick demonstration.

LLM install,
Brew or pipx, your choice,
Configured OpenAI." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003552.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Once you've done this you can send follow-up prompts with &lt;code&gt;-c&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm -c 'no much much much shorter, and in haikus'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003619.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003619.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003619.jpg" alt="Output of files-to-prompt - different files are concatenated together with --- between them followed by their filename" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003619.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;My &lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; command can turn a directory of files into a single prompt, suitable for piping into LLM.&lt;/p&gt;
&lt;p&gt;I wrote more about this, including how I developed it, in &lt;a href="https://simonwillison.net/2024/Apr/8/files-to-prompt/"&gt;Building files-to-prompt entirely using Claude 3 Opus&lt;/a&gt;.&lt;/p&gt;
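&lt;p&gt;The output format is simple enough to sketch. Here is an illustrative Python version of the core idea - not the real implementation, which also handles hidden files, ignore patterns and various options:&lt;/p&gt;

```python
from pathlib import Path


def concat_files(directory):
    # Emit each file's path followed by its contents, separated by
    # "---" markers - mirroring the files-to-prompt output shown above
    parts = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            parts.append(f"{path}\n---\n{path.read_text()}\n---")
    return "\n".join(parts)
```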
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003715.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003715.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003715.jpg" alt="Screeshot of the shot-scraper website

A command-line utility for taking automated screenshots of websites

Quick start:

pip install shot-scraper
shot-scraper install
shot-scraper https://github.com/simonw/shot-scraper -h 900
Produces this screenshot in a file called github-com-simonw-shot-scraper.png

" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003715.jpg"&gt;#&lt;/a&gt;
 &lt;p&gt;Another tool I frequently use with LLM is &lt;a href="https://shot-scraper.datasette.io/"&gt;shot-scraper&lt;/a&gt; - my command-line tool for screenshotting and scraping websites.&lt;/p&gt;  
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003736.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003736.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003736.jpg" alt="shot-scraper JavaScript command documentation 
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003736.jpg"&gt;#&lt;/a&gt;

&lt;p&gt;&lt;code&gt;shot-scraper&lt;/code&gt; can &lt;a href="https://shot-scraper.datasette.io/en/stable/javascript.html"&gt;run JavaScript directly against a page&lt;/a&gt; and output the result back to the terminal, suitable for piping into LLM.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003832.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003832.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003832.jpg" alt="Screenshot of Google search results, with the inspector panel open to show the structure of the HTML." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003832.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Google &lt;em&gt;hate&lt;/em&gt; being scraped. Let's use it to scrape Google.&lt;/p&gt;
&lt;p&gt;Google search results have a structure where each link on the page is an &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; element wrapped in a link.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003844.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003844.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003844.jpg" alt="shot-scraper javascript &amp;#39;https://www.google.com/search?q=nytimes+slop&amp;#39; &amp;#39;
Array.from(
  document.querySelectorAll(&amp;quot;h3&amp;quot;),
  el =&amp;gt; ({href: el.parentNode.href, title: el.innerText})
)&amp;#39;" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003844.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;We can scrape that using the following terminal command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
Array.from(
  document.querySelectorAll("h3"),
  el =&amp;gt; ({href: el.parentNode.href, title: el.innerText})
)'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will load up the search results page in an invisible browser, then execute JavaScript that extracts the results and returns them as JSON.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003856.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003856.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003856.jpg" alt="R T R - A =4 [oBE IR W&amp;#39; R TR |
£ Searc =
= =g _
06_jun_lim — -2sh — 93x24 Rt - .
i 8
&amp;quot;href&amp;quot;: &amp;quot;http://www.nytimes.com/1896/07/31/archives/to-slop.html&amp;quot;, i
“title&amp;quot;: &amp;quot;To Slop.&amp;quot;
i
{
&amp;quot;href&amp;quot;: &amp;quot;https://www.nytimes.com/section/technology”,
&amp;quot;title&amp;quot;: &amp;quot;Technology&amp;quot;
i
1
&amp;quot;href&amp;quot;: &amp;quot;https://www.linkedin.com/posts/luigiraymontanez_slop-is-the-new-name-for-unw
anted-ai-generated-activity-7195821255621574658-MQWK&amp;quot;,
&amp;quot;title&amp;quot;: &amp;quot;Slop is the new name for unwanted AI-generated content”
i
{ e
&amp;quot;href&amp;quot;: &amp;quot;https://www.google.com/search?g=nytimes+slop&amp;amp;sca_esv=a22b124cb26f8e73&amp;amp;ei=F_B
0ZpLQHfet@PEPttSubA4&amp;amp;start=10&amp;amp;sa=N&amp;quot;,
&amp;quot;title&amp;quot;: &amp;quot;More results&amp;quot;
i
o
BTG
&amp;quot;title&amp;quot;: &amp;quot;Try again&amp;quot;
}
1
3
= 7,‘. 54 function findParentWithHveid(element) {
L85 4 __while (element 8§ lelement hasAttribute =
®0A0®17 WO  sSpaces:4 UTF-8 LF Markdown &amp;amp; A17Spell [
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003856.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The results as JSON include the &lt;code&gt;href&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; of each of those search results.&lt;/p&gt;
&lt;p&gt;We could send that to LLM, but I'd like to grab the search snippets as well.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003914.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003914.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003914.jpg" alt="Screenshot showing the command with the more complex JavaScript." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003914.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Here's the more complex recipe that also fetches the search snippets visible on the page:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
() =&amp;gt; {
    function findParentWithHveid(element) {
        while (element &amp;amp;&amp;amp; !element.hasAttribute("data-hveid")) {
            element = element.parentElement;
        }
        return element;
    }
    return Array.from(
        document.querySelectorAll("h3"),
        el =&amp;gt; findParentWithHveid(el).innerText
    );
}' | llm -s 'describe slop'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At the end it pipes them into LLM with instructions to use that context to "describe slop".&lt;/p&gt;
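&lt;p&gt;The &lt;code&gt;findParentWithHveid&lt;/code&gt; trick - walking up the tree until an element carries a particular attribute - is a generally useful pattern. Here is the same logic sketched in Python against a toy node class, purely for illustration (the attribute value is made up):&lt;/p&gt;

```python
class Node:
    """A minimal stand-in for a DOM element."""

    def __init__(self, attrs=None, parent=None):
        self.attrs = attrs or {}
        self.parent = parent


def find_parent_with(node, attribute):
    # Walk upwards until a node carries the attribute, or we run out of parents
    while node is not None and attribute not in node.attrs:
        node = node.parent
    return node


# A result block containing a nested h3, as on a search results page
result_block = Node(attrs={"data-hveid": "example-value"})
h3 = Node(parent=Node(parent=result_block))
```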
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_003931.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_003931.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_003931.jpg" alt="The term &amp;quot;slop&amp;quot; has recently emerged in the context of artificial intelligence to describe questionable or low-quality AI-generated material. This term appears to be gaining traction among tech and media commentators, reflecting growing concerns about the proliferation of such content." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_003931.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;... and it works! We get back an answer from the LLM that summarizes the search results that we just scraped.&lt;/p&gt;
&lt;p&gt;We have implemented basic RAG - Retrieval Augmented Generation, where search results are used to answer a question - using a terminal script that scrapes search results from Google and pipes them into an LLM.&lt;/p&gt;
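&lt;p&gt;Stripped to its essentials, that pipeline is just "retrieve, then prompt". A minimal Python sketch of the pattern, with stand-in functions for the roles that shot-scraper and LLM play in the talk:&lt;/p&gt;

```python
def answer_with_rag(question, retrieve, prompt_model):
    # Basic RAG: fetch context relevant to the question, then ask the
    # model to answer the question using that context
    context = "\n".join(retrieve(question))
    return prompt_model(f"{context}\n\nUsing the context above: {question}")


# Toy stand-ins - in the talk these roles are played by shot-scraper and llm
def fake_retrieve(query):
    return ["Slop is the new name for unwanted AI-generated content."]


def fake_model(prompt):
    return "summarized: " + prompt.splitlines()[0]
```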
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004133.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004133.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004133.jpg" alt="Output of llm --help showing a list of commands that includes cmd, embed, embed-models and embed-multi." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004133.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;Speaking of RAG... a common technique for implementing that pattern is to take advantage of embeddings and vector search to find content that is semantically similar to the user's question, without necessarily matching on exact keywords.&lt;/p&gt;
&lt;p&gt;I wrote an extensive introduction to embeddings in &lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/"&gt;Embeddings: What they are and why they matter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;LLM includes support for calculating, storing and searching embeddings through its &lt;code&gt;llm embed-models&lt;/code&gt;, &lt;code&gt;llm embed&lt;/code&gt; and &lt;code&gt;llm embed-multi&lt;/code&gt; commands, &lt;a href="https://llm.datasette.io/en/stable/embeddings/index.html"&gt;documented here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;llm embed-models&lt;/code&gt; command lists currently available embedding models - the OpenAI models plus any that have been added by plugins.&lt;/p&gt;
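&lt;p&gt;"Semantically similar" here usually means cosine similarity between embedding vectors. A self-contained illustration using toy three-dimensional vectors - real models such as OpenAI's &lt;code&gt;3-small&lt;/code&gt; produce 1,536 dimensions, but the arithmetic is the same:&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag


# Toy 3-dimensional "embeddings" for three documents
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.95],
}
query = [0.85, 0.15, 0.05]

# Brute-force nearest neighbour: the document most similar to the query
best = max(docs, key=lambda name: cosine_similarity(docs[name], query))
```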
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004239.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004239.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004239.jpg" alt="A huge JSON array of floating point numbers, followed by:

$ llm embed -m 3-small -c &amp;#39;hello there&amp;#39; | jq length

1536" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004239.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;code&gt;llm embed&lt;/code&gt; command can embed content. This will output a list of floating point numbers for the specified content, using the OpenAI 3-small embedding model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm embed -m 3-small -c "hello there"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add &lt;code&gt;-f hex&lt;/code&gt; to get that out as hexadecimal. Neither of these formats are particularly useful on their own!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004332.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004332.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004332.jpg" alt="Screenshot of the llm embed-multi documentation." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004332.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;llm embed-multi&lt;/a&gt; command is much more useful. It can run embeddings against content in bulk - from a CSV or JSON file, from a directory full of content or even from a SQLite database. Those embedding vectors will be stored in SQLite ready to be used for search or similarity queries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004418.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004418.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004418.jpg" alt="Screenshot of my blog&amp;#39;s collection of 7178 blogmarks" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004418.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I collect bookmarks (I call them "blogmarks") on my blog - links with short descriptions. I have &lt;a href="https://simonwillison.net/search/?type=blogmark"&gt;7,178 of them&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let's create embeddings for all of those using LLM.&lt;/p&gt;
&lt;p&gt;I used the SQLite database version of my blog available from &lt;code&gt;https://datasette.simonwillison.net/simonwillisonblog.db&lt;/code&gt; (a 90MB file).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004424.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004424.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004424.jpg" alt="Screenshot of the blog_blogmark table in Datasette" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004424.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The content lives in the &lt;a href="https://datasette.simonwillison.net/simonwillisonblog/blog_blogmark"&gt;blog_blogmark&lt;/a&gt; SQLite table, which I can explore using Datasette (in this case the &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt; macOS Electron app).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004444.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004444.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004444.jpg" alt="Custom SQL query returning more than 2,000 rows 

select id, link_url, link_title, commentary from blog_blogmark" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004444.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;First step is to compose a SQL query returning the data we want to embed. The first column will be treated as a unique identifier to store along with the embedding vector, and any subsequent columns will be used as input to the embedding model.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;select id, link_url, link_title, commentary from blog_blogmark
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004502.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004502.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004502.jpg" alt="R I R S P Y @ B e A C m 7 Qg
06_jun_lim — -zsh — 93x24 R ERCTT

$ llm embed-multi links \
-d simonwillisonblog.db \
--sql &amp;#39;select id, link_url, link_title, commentary from blog_blogmark&amp;#39; \
-m 3-small --store" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004502.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I can run this command to embed all of the content:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm embed-multi links \
  -d simonwillisonblog.db \
  --sql 'select id, link_url, link_title, commentary from blog_blogmark' \
  -m 3-small --store
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create an embedding collection called "links". It runs the SQL query we composed before, using the OpenAI &lt;code&gt;3-small&lt;/code&gt; model. The &lt;code&gt;--store&lt;/code&gt; option means it will store a copy of the text in the database as well - without that it would only store identifiers, which we would then need to use to look up the text later on when running queries.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004521.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004521.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004521.jpg" alt="Progress bar - 4% done, 00:01:18" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004521.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The &lt;code&gt;llm embed-multi&lt;/code&gt; command shows a progress bar for how far it has got.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004539.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004539.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004539.jpg" alt="Screenshot of the embeddings table in Datasette, with a bunch of binary data visible." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004539.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;While the command is running we can use Datasette to watch as the &lt;code&gt;embeddings&lt;/code&gt; table is filled with data - one row for each of the items we are embedding, each storing a big ugly binary blob of data representing the embedding vector (in &lt;a href="https://llm.datasette.io/en/stable/embeddings/storage.html"&gt;this storage format&lt;/a&gt;).&lt;/p&gt;
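&lt;p&gt;Those binary blobs are just densely packed floating point numbers. Here is a sketch of the general idea using Python's &lt;code&gt;struct&lt;/code&gt; module - an illustration of the concept, not necessarily the exact encoding LLM uses (see the linked storage format documentation for that):&lt;/p&gt;

```python
import struct


def encode(vector):
    # Pack floats into a compact little-endian float32 blob: 4 bytes per value
    return struct.pack("<%df" % len(vector), *vector)


def decode(blob):
    # Reverse the packing: every 4 bytes becomes one float32 value
    return list(struct.unpack("<%df" % (len(blob) // 4), blob))


# Values chosen to be exactly representable in float32, so the
# round trip is lossless
vec = [0.25, -0.5, 1.0]
blob = encode(vec)
```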
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004720.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004720.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004720.jpg" alt="$ llm similar links -c &amp;#39;things that make me angry&amp;#39; -d simonwillisonblog.db
{&amp;quot;id&amp;quot;: &amp;quot;448&amp;quot;, &amp;quot;score&amp;quot;: 0.31105587659133327, &amp;quot;content&amp;quot;: &amp;quot;http://www.russellbeattie.com/notebook/1006697.html Overreaction Absolutely storming rant from Russell about America&amp;#39;s culture of fear.&amp;quot;, &amp;quot;metadata&amp;quot;: null}
..." /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004720.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;And now we can run searches! This command finds everything in the "links" collection that is most similar to the search term "things that make me angry":&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;llm similar links -c 'things that make me angry' -d simonwillisonblog.db
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_004824.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_004824.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_004824.jpg" alt="cat ~/.local/bin/blog-answer.sh

#!/bin/bash

# Check if a query was provided
if [ &amp;quot;$#&amp;quot; -ne 1 ]; then
    echo &amp;quot;Usage: $0 &amp;#39;Your query&amp;#39;&amp;quot;
    exit 1
fi

llm similar blog-paragraphs -c &amp;quot;query: $1&amp;quot; \
  | jq &amp;#39;.content | sub(&amp;quot;passage: &amp;quot;; &amp;quot;&amp;quot;)&amp;#39; -r \
  | llm -m llamafile \
  &amp;quot;$1&amp;quot; -s &amp;#39;You answer questions as a single paragraph&amp;#39;

# | llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
# /Users/simon/.local/share/virtualenvs/llm-mlc-SwKbovmI/bin/llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1
" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_004824.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;We can implement another version of RAG on top of this as well, by finding similar documents to our search term and then piping those results back into LLM to execute a prompt.&lt;/p&gt;
&lt;p&gt;I wrote more about this in &lt;a href="https://til.simonwillison.net/llms/embed-paragraphs"&gt;Embedding paragraphs from my blog with E5-large-v2&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_005209.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_005209.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_005209.jpg" alt="LLM documentation: OpenAl-compatible models" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_005209.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;During the Q&amp;amp;A I mentioned that LLM can talk to anything that provides an OpenAI-compatible API endpoint using just configuration, no extra code. That's &lt;a href="https://llm.datasette.io/en/stable/other-models.html#openai-compatible-models"&gt;described in the documentation here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_005333.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_005333.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_005333.jpg" alt="Screenshot of the Python API documentation" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_005333.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;I also showed LLM's &lt;a href="https://llm.datasette.io/en/stable/python-api.html"&gt;Python API documentation&lt;/a&gt;, but warned that this is not yet considered stable as I'm not 100% happy with how this API works yet.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="llm-slide" id="frame_010311.jpg"&gt;
  &lt;a class="slide-link" href="https://static.simonwillison.net/static/2024/llm/frame_010311.jpg"&gt;&lt;img src="https://static.simonwillison.net/static/2024/llm/frame_010311.jpg" alt="GitHub Issue with 77 comments: Figure out how to serve an AWS Lambda function with a Function URL from a custom subdomain" /&gt;&lt;/a&gt;
  &lt;a style="float: right; padding-left: 1em; border: none" href="https://simonwillison.net/2024/Jun/17/cli-language-models/#frame_010311.jpg"&gt;#&lt;/a&gt;
  &lt;p&gt;The final question from the audience was about my personal productivity.&lt;/p&gt;
&lt;p&gt;I wrote a bit about this a few years ago in &lt;a href="https://simonwillison.net/2022/Nov/26/productivity/"&gt;Coping strategies for the serial project hoarder&lt;/a&gt; - how I use extensive documentation and unit tests to allow me to work on way more projects at once by ensuring I don't have to remember the details of any of them.&lt;/p&gt;
&lt;p&gt;My other trick is that I tend to pick projects that fit my unique combination of previous experiences. I built LLM because I already had experience with LLM APIs, Python CLI tools (using &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt;) and plugin systems (using &lt;a href="https://pluggy.readthedocs.io/"&gt;Pluggy&lt;/a&gt;). As a result I happened to be one of the best positioned people in the world to build a plugin-based CLI tool for working with LLMs!&lt;/p&gt;
&lt;/div&gt;

&lt;h4 id="llm-colophon"&gt;Colophon&lt;/h4&gt;
&lt;p&gt;Here's how I turned the YouTube video of this talk into an annotated presentation:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;I downloaded a &lt;code&gt;.mp4&lt;/code&gt; version of the talk from YouTube using &lt;a href="https://github.com/yt-dlp/yt-dlp"&gt;yt-dlp&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I ran that through &lt;a href="https://goodsnooze.gumroad.com/l/macwhisper"&gt;MacWhisper&lt;/a&gt; to create my own transcript for copying extracts from into my write-up - although this time I didn't end up using any of the transcript text.&lt;/li&gt;
&lt;li&gt;I played the video (at 2x speed) in QuickTime Player and used the &lt;code&gt;capture.sh&lt;/code&gt; script &lt;a href="https://til.simonwillison.net/macos/quicktime-capture-script"&gt;described here&lt;/a&gt; to grab screenshots of the individual interesting frames that I wanted to use for my post.&lt;/li&gt;
&lt;li&gt;I loaded those screenshots into &lt;a href="https://til.simonwillison.net/tools/annotated-presentations"&gt;my annotated presentation tool&lt;/a&gt; (which I &lt;a href="https://simonwillison.net/2023/Aug/6/annotated-presentations/"&gt;described in this post&lt;/a&gt;) and used that to run OCR against them for alt text and to add commentary to accompany each screenshot.&lt;/li&gt;
&lt;li&gt;I assembled the result into this finished blog entry, adding intro text and the YouTube embed as HTML.&lt;/li&gt;
&lt;/ol&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-talks"&gt;annotated-talks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llamafile"&gt;llamafile&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ollama"&gt;ollama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/macwhisper"&gt;macwhisper&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="my-talks"/><category term="ai"/><category term="datasette"/><category term="openai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="llm"/><category term="anthropic"/><category term="annotated-talks"/><category term="llamafile"/><category term="ollama"/><category term="files-to-prompt"/><category term="macwhisper"/></entry><entry><title>Building files-to-prompt entirely using Claude 3 Opus</title><link href="https://simonwillison.net/2024/Apr/8/files-to-prompt/#atom-tag" rel="alternate"/><published>2024-04-08T20:40:08+00:00</published><updated>2024-04-08T20:40:08+00:00</updated><id>https://simonwillison.net/2024/Apr/8/files-to-prompt/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://github.com/simonw/files-to-prompt"&gt;files-to-prompt&lt;/a&gt; is a new tool I built to help me pipe several files at once into prompts to LLMs such as Claude and GPT-4.&lt;/p&gt;
&lt;p&gt;When combined with my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; command-line tool it lets you do things like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt README.md files_to_prompt &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Update this README to reflect this functionality&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I wrote &lt;code&gt;files-to-prompt&lt;/code&gt; almost entirely using &lt;a href="https://www.anthropic.com/news/claude-3-family"&gt;Claude 3 Opus&lt;/a&gt;, &lt;a href="https://github.com/simonw/llm-claude-3"&gt;llm-claude-3&lt;/a&gt; and &lt;code&gt;files-to-prompt&lt;/code&gt; itself, once it was functional enough to be useful.&lt;/p&gt;
&lt;h4 id="building-the-initial-tool"&gt;Building the initial tool&lt;/h4&gt;
&lt;p&gt;I started with my &lt;a href="https://github.com/simonw/click-app"&gt;click-app&lt;/a&gt; cookiecutter template. This can quickly spin up a skeleton of a new Python command-line tool using the &lt;a href="https://click.palletsprojects.com/"&gt;Click&lt;/a&gt; library:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cookiecutter gh:simonw/click-app
  [1/6] &lt;span class="pl-en"&gt;app_name&lt;/span&gt; (): files-to-prompt
  [2/6] &lt;span class="pl-en"&gt;description&lt;/span&gt; (): Concatenate a directory full of files into a single prompt &lt;span class="pl-k"&gt;for&lt;/span&gt; use with LLMs
  [3/6] hyphenated (files-to-prompt): 
  [4/6] underscored (files_to_prompt): 
  [5/6] &lt;span class="pl-en"&gt;github_username&lt;/span&gt; (): simonw
  [6/6] &lt;span class="pl-en"&gt;author_name&lt;/span&gt; (): Simon Willison&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I opened the &lt;code&gt;files_to_prompt/cli.py&lt;/code&gt; file it created for me and used it to spec out how the initial version should work:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;click&lt;/span&gt;


&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;command&lt;/span&gt;()&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;argument&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"path"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-v"&gt;Path&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;option&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"--include-hidden"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;is_flag&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;help&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Include files and folders starting with ."&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;version_option&lt;/span&gt;()&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;cli&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;include_hidden&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Takes a path to a folder and outputs every file in that folder,&lt;/span&gt;
&lt;span class="pl-s"&gt;    recursively, each one preceeded with its filename like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    path/to/file.py&lt;/span&gt;
&lt;span class="pl-s"&gt;    ----&lt;/span&gt;
&lt;span class="pl-s"&gt;    Contents of file.py goes here&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    ---&lt;/span&gt;
&lt;span class="pl-s"&gt;    path/to/file2.py&lt;/span&gt;
&lt;span class="pl-s"&gt;    ---&lt;/span&gt;
&lt;span class="pl-s"&gt;    ...&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This code defines a tool that takes a positional &lt;code&gt;path&lt;/code&gt; argument and an optional &lt;code&gt;--include-hidden&lt;/code&gt; flag, along with a Python docstring describing what it should do which will be displayed as the &lt;code&gt;--help&lt;/code&gt; output.&lt;/p&gt;
&lt;p&gt;I'd originally intended to actually build the tool... but once I got to this point I realized that this might be enough information for Claude 3 Opus to do the rest of the work for me.&lt;/p&gt;
&lt;p&gt;So I ran this command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat files_to_prompt/cli.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus \
  --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;finish this code for me&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see the &lt;a href="https://gist.github.com/simonw/3492bc9a6ff639ff7fbaa081fa508131"&gt;full transcript here&lt;/a&gt;, but from this single prompt Claude 3 Opus produced the exact implementation I needed:&lt;/p&gt;
&lt;pre&gt;    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;dirs&lt;/span&gt;, &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;walk&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;include_hidden&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;dirs&lt;/span&gt;[:] &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;dirs&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;d&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;)]
            &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;)]

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;file_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;file&lt;/span&gt;)
            &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_path&lt;/span&gt;, &lt;span class="pl-s"&gt;"r"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
                &lt;span class="pl-s1"&gt;file_contents&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;read&lt;/span&gt;()

            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_path&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s"&gt;"---"&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_contents&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;()
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s"&gt;"---"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;I pasted that into my &lt;code&gt;cli.py&lt;/code&gt; module and tried running it... and it worked first time.&lt;/p&gt;
&lt;p&gt;I tried a more challenging prompt next:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat files_to_prompt/cli.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;modify this code to add a --ignore-gitignore flag which defaults toFalse.&lt;/span&gt;
&lt;span class="pl-s"&gt;Without this flag the code should read any .gitignore files and filter&lt;/span&gt;
&lt;span class="pl-s"&gt;content based on the rules within them. With the flag it should not do&lt;/span&gt;
&lt;span class="pl-s"&gt;that.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;	&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I forgot to use the &lt;code&gt;--system&lt;/code&gt; option here, so the prompt I passed to &lt;code&gt;llm&lt;/code&gt; as an argument was concatenated directly onto the end of the code rather than being treated as a separate system/instruction prompt. This turned out not to matter.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;.gitignore&lt;/code&gt; is a tricky format to handle. Claude 3 Opus didn't do a comprehensive implementation, but it did support the subset that I cared about for this step of the development process.&lt;/p&gt;
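&lt;p&gt;To illustrate the gap: a plain &lt;code&gt;fnmatch()&lt;/code&gt; check - which is what Opus reached for - handles simple wildcard rules but not features like trailing-slash directory patterns, &lt;code&gt;!&lt;/code&gt; negation or &lt;code&gt;**&lt;/code&gt; globs. A quick sketch of mine (not from the post):&lt;/p&gt;

```python
from fnmatch import fnmatch

# Illustration of the subset of .gitignore semantics that a bare
# fnmatch() check covers, and one common rule shape it misses.

# Simple wildcard rules work as expected:
assert fnmatch("debug.log", "*.log")

# But real .gitignore features such as trailing-slash directory rules
# (plus "!" negation and "**" patterns) need dedicated handling:
assert not fnmatch("build", "build/")  # a bare fnmatch misses this
```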
&lt;p&gt;Here's what it gave me (&lt;a href="https://gist.github.com/simonw/437b764862c29d9057c11e5b60695b1b"&gt;full transcript here&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;click&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;pathlib&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Path&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;fnmatch&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;fnmatch&lt;/span&gt;


&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;should_ignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;):
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;rule&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;:
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;fnmatch&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;rule&lt;/span&gt;):
            &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-c1"&gt;True&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-c1"&gt;False&lt;/span&gt;


&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;read_gitignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;gitignore_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s"&gt;".gitignore"&lt;/span&gt;)
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;isfile&lt;/span&gt;(&lt;span class="pl-s1"&gt;gitignore_path&lt;/span&gt;):
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;gitignore_path&lt;/span&gt;, &lt;span class="pl-s"&gt;"r"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-k"&gt;return&lt;/span&gt; [&lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;() &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;strip&lt;/span&gt;() &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"#"&lt;/span&gt;)]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; []


&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;command&lt;/span&gt;()&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;argument&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"path"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;type&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-v"&gt;Path&lt;/span&gt;(&lt;span class="pl-s1"&gt;exists&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;file_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;False&lt;/span&gt;, &lt;span class="pl-s1"&gt;dir_okay&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;readable&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;option&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"--include-hidden"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;is_flag&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;help&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Include files and folders starting with ."&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;option&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"--ignore-gitignore"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;is_flag&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s1"&gt;help&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Ignore .gitignore files and include all files"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;version_option&lt;/span&gt;()&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;cli&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;include_hidden&lt;/span&gt;, &lt;span class="pl-s1"&gt;ignore_gitignore&lt;/span&gt;):
    &lt;span class="pl-s"&gt;"""&lt;/span&gt;
&lt;span class="pl-s"&gt;    Takes a path to a folder and outputs every file in that folder,&lt;/span&gt;
&lt;span class="pl-s"&gt;    recursively, each one preceded with its filename like this:&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    path/to/file.py&lt;/span&gt;
&lt;span class="pl-s"&gt;    ----&lt;/span&gt;
&lt;span class="pl-s"&gt;    Contents of file.py goes here&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    ---&lt;/span&gt;
&lt;span class="pl-s"&gt;    path/to/file2.py&lt;/span&gt;
&lt;span class="pl-s"&gt;    ---&lt;/span&gt;
&lt;span class="pl-s"&gt;    ...&lt;/span&gt;
&lt;span class="pl-s"&gt;    """&lt;/span&gt;
    &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [] &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;ignore_gitignore&lt;/span&gt; &lt;span class="pl-k"&gt;else&lt;/span&gt; &lt;span class="pl-en"&gt;read_gitignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;)

    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;dirs&lt;/span&gt;, &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;walk&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;include_hidden&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;dirs&lt;/span&gt;[:] &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;dirs&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;d&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;)]
            &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;startswith&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;)]

        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-s1"&gt;ignore_gitignore&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;.&lt;span class="pl-en"&gt;extend&lt;/span&gt;(&lt;span class="pl-en"&gt;read_gitignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;))
            &lt;span class="pl-s1"&gt;dirs&lt;/span&gt;[:] &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;d&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;dirs&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-en"&gt;should_ignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;d&lt;/span&gt;), &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;)]
            &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-en"&gt;should_ignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;f&lt;/span&gt;), &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;)]

        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;file_path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;root&lt;/span&gt;, &lt;span class="pl-s1"&gt;file&lt;/span&gt;)
            &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_path&lt;/span&gt;, &lt;span class="pl-s"&gt;"r"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
                &lt;span class="pl-s1"&gt;file_contents&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;read&lt;/span&gt;()

            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_path&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s"&gt;"---"&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s1"&gt;file_contents&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;()
            &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-en"&gt;echo&lt;/span&gt;(&lt;span class="pl-s"&gt;"---"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;From that point on I was hooked: could I use the new &lt;code&gt;files-to-prompt&lt;/code&gt; tool to build the rest of itself?&lt;/p&gt;
&lt;p&gt;Since I'd already bootstrapped it to a point where I could use it, I ran it for real for the first time:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt &lt;span class="pl-c1"&gt;.&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;fill out the test_files_to_prompt.py with tests, use tmpdir fixture&lt;/span&gt;
&lt;span class="pl-s"&gt;to create example directories to run the command against - tests&lt;/span&gt;
&lt;span class="pl-s"&gt;should cover both of the optional flags&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here I'm sending it quite a bit of code - the implementation, the README and the Apache 2 License are all included in that prompt.&lt;/p&gt;
&lt;p&gt;It wrote me these tests:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;click&lt;/span&gt;.&lt;span class="pl-s1"&gt;testing&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;files_to_prompt&lt;/span&gt;.&lt;span class="pl-s1"&gt;cli&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;cli&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_basic_functionality&lt;/span&gt;(&lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;runner&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;()
    &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;.&lt;span class="pl-en"&gt;as_cwd&lt;/span&gt;():
        &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;makedirs&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/file1.txt"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"Contents of file1"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/file2.txt"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"Contents of file2"&lt;/span&gt;)
        
        &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;])
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/file1.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"Contents of file1"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/file2.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"Contents of file2"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_include_hidden&lt;/span&gt;(&lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;runner&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;()
    &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;.&lt;span class="pl-en"&gt;as_cwd&lt;/span&gt;():
        &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;makedirs&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/.hidden.txt"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"Contents of hidden file"&lt;/span&gt;)
        
        &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;])
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/.hidden.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        
        &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--include-hidden"&lt;/span&gt;])
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/.hidden.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"Contents of hidden file"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_ignore_gitignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;runner&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;CliRunner&lt;/span&gt;()
    &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;tmpdir&lt;/span&gt;.&lt;span class="pl-en"&gt;as_cwd&lt;/span&gt;():
        &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;makedirs&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/.gitignore"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"ignored.txt"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/ignored.txt"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"This file should be ignored"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"test_dir/included.txt"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;f&lt;/span&gt;.&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s"&gt;"This file should be included"&lt;/span&gt;)
        
        &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;])
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/ignored.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/included.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        
        &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;runner&lt;/span&gt;.&lt;span class="pl-en"&gt;invoke&lt;/span&gt;(&lt;span class="pl-s1"&gt;cli&lt;/span&gt;, [&lt;span class="pl-s"&gt;"test_dir"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--ignore-gitignore"&lt;/span&gt;])
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;exit_code&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/ignored.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"This file should be ignored"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;
        &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s"&gt;"test_dir/included.txt"&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-s1"&gt;output&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;Then I ran them... and they failed. Here's the error message:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    def test_ignore_gitignore(tmpdir):
        runner = CliRunner()
        with tmpdir.as_cwd():
            os.makedirs("test_dir")
            with open("test_dir/.gitignore", "w") as f:
                f.write("ignored.txt")
            with open("test_dir/ignored.txt", "w") as f:
                f.write("This file should be ignored")
            with open("test_dir/included.txt", "w") as f:
                f.write("This file should be included")
    
            result = runner.invoke(cli, ["test_dir"])
            assert result.exit_code == 0
&amp;gt;           assert "test_dir/ignored.txt" not in result.output
E           AssertionError: assert 'test_dir/ignored.txt' not in 'test_dir/in...red\n\n---\n'
E             
E             'test_dir/ignored.txt' is contained here:
E               uded
E               
E               ---
E               test_dir/ignored.txt
E               ---...
E             
E             ...Full output truncated (3 lines hidden), use '-vv' to show
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I saved that message to &lt;code&gt;error.txt&lt;/code&gt; and ran this command to send it back to Claude 3 Opus:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat error.txt &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -c&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;-c&lt;/code&gt; option to &lt;code&gt;llm&lt;/code&gt; stands for "continue" - it takes any input or prompt and passes it back to the model along with the previously logged prompts and responses, equivalent to asking a follow-up question in an interface like ChatGPT.&lt;/p&gt;
&lt;p&gt;Claude gave me back a corrected version of the &lt;code&gt;should_ignore()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;should_ignore&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;, &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;):
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;rule&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;gitignore_rules&lt;/span&gt;:
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;fnmatch&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;basename&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;), &lt;span class="pl-s1"&gt;rule&lt;/span&gt;):
            &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-c1"&gt;True&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;isdir&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;) &lt;span class="pl-c1"&gt;and&lt;/span&gt; &lt;span class="pl-en"&gt;fnmatch&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;basename&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;) &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-s1"&gt;rule&lt;/span&gt;):
            &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-c1"&gt;True&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-c1"&gt;False&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;Sending follow-on prompts with &lt;code&gt;llm -c&lt;/code&gt; seemed to be working well, so I ran this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Fill out the Usage section of the README for this project - just return that section as markdown&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And it gave me back a pretty decent README:&lt;/p&gt;
&lt;blockquote&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;p&gt;To use &lt;code&gt;files-to-prompt&lt;/code&gt;, simply provide the path to the directory you want to process:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt path/to/directory&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will output the contents of every file in the directory and its subdirectories, with each file preceded by its relative path and separated by &lt;code&gt;---&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Options&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--include-hidden&lt;/code&gt;: Include files and folders starting with &lt;code&gt;.&lt;/code&gt; (hidden files and directories).&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt path/to/directory --include-hidden&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;--ignore-gitignore&lt;/code&gt;: Ignore &lt;code&gt;.gitignore&lt;/code&gt; files and include all files.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt path/to/directory --ignore-gitignore&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example&lt;/h3&gt;
&lt;p&gt;Suppose you have a directory structure like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;my_directory/
├── file1.txt
├── file2.txt
├── .hidden_file.txt
└── subdirectory/
    └── file3.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running &lt;code&gt;files-to-prompt my_directory&lt;/code&gt; will output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;my_directory/file1.txt
---
Contents of file1.txt
---
my_directory/file2.txt
---
Contents of file2.txt
---
my_directory/subdirectory/file3.txt
---
Contents of file3.txt
---
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you run &lt;code&gt;files-to-prompt my_directory --include-hidden&lt;/code&gt;, the output will also include &lt;code&gt;.hidden_file.txt&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;my_directory/.hidden_file.txt
---
Contents of .hidden_file.txt
---
...
&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;
&lt;p&gt;I particularly liked the directory structure diagram.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://gist.github.com/simonw/5b7ce53bf85ba1403af81e3be83ca32f"&gt;the full transcript including my follow-ups&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I committed and pushed everything I had so far &lt;a href="https://github.com/simonw/files-to-prompt/commits/main/"&gt;to GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After one last review of the README I noticed it had used the phrase "simply provide the path to the directory". I don't like using words like &lt;em&gt;simply&lt;/em&gt; in documentation, so &lt;a href="https://github.com/simonw/files-to-prompt/commit/c9c60d57a3a92aebe8112c6fdd6158093982ab9f"&gt;I fixed that&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And I shipped &lt;a href="https://github.com/simonw/files-to-prompt/releases/tag/0.1"&gt;version 0.1&lt;/a&gt; of the software! Almost every line of the code, tests and documentation was written by Claude 3 Opus.&lt;/p&gt;
&lt;h4 id="iterating-on-the-project"&gt;Iterating on the project&lt;/h4&gt;
&lt;p&gt;I've added several features since that initial implementation, almost all of which were primarily written by prompting Claude 3 Opus.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/files-to-prompt/issues/2"&gt;Issue #2: Take multiple arguments for files and directories to include&lt;/a&gt; changed the tool such that &lt;code&gt;files-to-prompt README.md tests/&lt;/code&gt; would include both the &lt;code&gt;README.md&lt;/code&gt; file and all files in the &lt;code&gt;tests/&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;The sequence of prompts to get there was as follows:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat files_to_prompt/cli.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;Modify this file. It should take multiple arguments in a variable called paths.&lt;/span&gt;
&lt;span class="pl-s"&gt;Each of those argumets might be a path to a file or it might be a path to a&lt;/span&gt;
&lt;span class="pl-s"&gt;directory - if any of the arguments do not correspoind to a file or directory&lt;/span&gt;
&lt;span class="pl-s"&gt;it should raise a click error.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;It should then do what it does already but for all files &lt;/span&gt;
&lt;span class="pl-s"&gt;files-recursively-contained-within-folders that are passed to it.&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;It should still obey the gitignore logic.&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then these to update the tests:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt files_to_prompt tests &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;rewrite the tests to cover the ability to pass multiple files and&lt;/span&gt;
&lt;span class="pl-s"&gt;folders to the tool&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

files-to-prompt files_to_prompt tests &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;add one last test which tests .gitignore and include_hidden against&lt;/span&gt;
&lt;span class="pl-s"&gt;an example that mixes single files and directories of files together&lt;/span&gt;
&lt;span class="pl-s"&gt;in one invocation&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I didn't like the filenames it was using in that last test, so I used &lt;a href="https://github.com/simonw/symbex"&gt;symbex&lt;/a&gt; to extract just the implementation of that test and told it to rewrite it:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;symbex test_mixed_paths_with_options &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus --system &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;rewrite this test so the filenames are more obvious, thinks like&lt;/span&gt;
&lt;span class="pl-s"&gt;ignored_in_gitignore.txt&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And this to add one last test that combined all of the options:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;add a last bit to that test for&lt;/span&gt;
&lt;span class="pl-s"&gt;["test_dir", "single_file.txt", "--ignore-gitignore", "--include-hidden"]&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/simonw/files-to-prompt/issues/2"&gt;The issue&lt;/a&gt; includes links to the full transcripts for the above.&lt;/p&gt;
&lt;h4 id="updating-a-diff"&gt;Updating a diff from a pull request&lt;/h4&gt;
&lt;p&gt;I quietly released &lt;code&gt;files-to-prompt&lt;/code&gt; two weeks ago. Dipam Vasani spotted it and &lt;a href="https://github.com/simonw/files-to-prompt/pull/4"&gt;opened a pull request&lt;/a&gt; adding the ability to ignore specific files by passing &lt;code&gt;--ignore-patterns '*.md'&lt;/code&gt; as an option.&lt;/p&gt;
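&lt;p&gt;Conceptually the feature is just one more glob check before a file gets emitted. A rough sketch of the idea (the actual PR differs in its details):&lt;/p&gt;

```python
from fnmatch import fnmatch

def matches_ignore_patterns(filename, ignore_patterns):
    """True if the filename matches any of the --ignore-patterns globs."""
    return any(fnmatch(filename, pattern) for pattern in ignore_patterns)

# With --ignore-patterns '*.md':
print(matches_ignore_patterns("README.md", ["*.md"]))  # True
print(matches_ignore_patterns("cli.py", ["*.md"]))     # False
```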
&lt;p&gt;The problem was... I'd landed some of my own changes before I got around to reviewing his PR - so it would no longer cleanly apply.&lt;/p&gt;
&lt;p&gt;It turns out I could resolve that problem using Claude 3 Opus as well, by asking it to figure out the change from Dipam's diff.&lt;/p&gt;
&lt;p&gt;I pulled a copy of his PR as a diff like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;wget &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;https://github.com/simonw/files-to-prompt/pull/4.diff&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Then I fed both the diff and the relevant files from the project into Claude:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt 4.diff files_to_prompt/cli.py tests/test_files_to_prompt.py &lt;span class="pl-k"&gt;|&lt;/span&gt; \
  llm -m opus --system \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Apply the change described in the diff to the project - return updated cli.py and tests&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It didn't quite work - it reverted one of my earlier changes. So I prompted:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;you undid the change where it could handle multiple paths -&lt;/span&gt;
&lt;span class="pl-s"&gt;I want to keep that, I only want to add the new --ignore-patterns option&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And that time it worked! &lt;a href="https://gist.github.com/simonw/8b8394e320c895c792736e7e85c40c23"&gt;Transcript here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I merged Claude's work into the existing PR to ensure Dipam got credit for his work, then &lt;a href="https://github.com/simonw/files-to-prompt/commit/f8af0fad7f206f029869cda7b4a1846b19aee423"&gt;landed it&lt;/a&gt; and pushed it out in a release.&lt;/p&gt;
&lt;h4 id="was-this-worthwhile"&gt;Was this worthwhile?&lt;/h4&gt;
&lt;p&gt;As an exercise in testing the limits of what's possible with command-line LLM access and the most powerful LLM currently available, this was absolutely worthwhile. I got working software with comprehensive tests and documentation, and had a lot of fun experimenting with prompts along the way.&lt;/p&gt;
&lt;p&gt;It's worth noting that this project was &lt;em&gt;incredibly&lt;/em&gt; low stakes. &lt;code&gt;files-to-prompt&lt;/code&gt; is a tiny tool that does something very simple. Any bugs or design flaws really don't matter. It's perfect for trying out this alternative approach to development.&lt;/p&gt;
&lt;p&gt;I also got the software built a whole lot faster than if I'd written it myself, and with features like &lt;code&gt;.gitignore&lt;/code&gt; support (albeit rudimentary) that I might not have bothered with if I'd been working alone. That's a good example of a feature that's just fiddly enough that I might decide not to invest the time needed to get it to work.&lt;/p&gt;
&lt;p&gt;Is this the best possible version of this software? Definitely not. But with comprehensive documentation and automated tests it's high enough quality that I'm not ashamed to release it with my name on it.&lt;/p&gt;
&lt;p&gt;A year ago I might have felt &lt;em&gt;guilty&lt;/em&gt; about using LLMs to write code for me in this way. I'm over that now: I'm still doing the work, but I now have a powerful tool that can help accelerate the process.&lt;/p&gt;
&lt;h4 id="real-work"&gt;Using this pattern for real work&lt;/h4&gt;
&lt;p&gt;I've since used the same pattern for some smaller modifications to some of my more significant projects. This morning I used it to upgrade my &lt;a href="https://github.com/simonw/datasette-cors"&gt;datasette-cors&lt;/a&gt; plugin to add support for new features I had added to the underlying &lt;a href="https://github.com/simonw/asgi-cors"&gt;asgi-cors&lt;/a&gt; library. Here's the prompt sequence I used:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;files-to-prompt ../asgi-cors/asgi_cors.py datasette_cors.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -m opus -s \
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Output a new datasette_cors.py plugin that adds headers and methods and max_age config options&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

files-to-prompt test_datasette_cors.py &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -c \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Update these tests to exercise the new options as well&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

cat README.md &lt;span class="pl-k"&gt;|&lt;/span&gt; llm -c \
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Update the README to document the new config options&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And the &lt;a href="https://gist.github.com/simonw/5e379cc3ed610caf8dfbdfe25b19874a"&gt;full transcript&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I reviewed this code &lt;em&gt;very carefully&lt;/em&gt; &lt;a href="https://github.com/simonw/datasette-cors/commit/53b126b9f9ae52c9f8f2ec8cf751b7f1c9b804de"&gt;before landing it&lt;/a&gt;. It's absolutely what I would have written myself without assistance from Claude.&lt;/p&gt;
&lt;p&gt;Time elapsed for this change? The first prompt was logged at 16:42:11 and the last at 16:44:24, so just over two minutes followed by a couple more minutes for the review. The &lt;a href="https://github.com/simonw/datasette-cors/issues/5"&gt;associated issue&lt;/a&gt; was open for five minutes total.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/files-to-prompt"&gt;files-to-prompt&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="projects"/><category term="ai"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="llm"/><category term="anthropic"/><category term="claude"/><category term="files-to-prompt"/></entry></feed>