Simon Willison's Weblog: symbex

llm-fragments-symbex

2025-04-23T14:25:38+00:00

I released a new LLM fragment loader plugin that builds on top of my Symbex project.

Symbex is a CLI tool I wrote that can run against a folder full of Python code and output functions, classes, methods or just their docstrings and signatures, using the Python AST module to parse the code.
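As a rough illustration of the AST approach described above, here is a minimal sketch that pulls signatures and docstrings out of Python source with the standard library ast module. The function name signatures_with_docs is hypothetical, and this is not Symbex's actual implementation. It ignores classes and decorators and needs Python 3.9+ for ast.unparse.

```python
import ast


def signatures_with_docs(source: str) -> list[str]:
    """Return one entry per function: its signature plus docstring, if any."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            # ast.unparse turns the arguments node back into source text
            signature = f"{prefix} {node.name}({ast.unparse(node.args)})"
            if node.returns is not None:
                signature += f" -> {ast.unparse(node.returns)}"
            doc = ast.get_docstring(node)
            if doc:
                signature += f'\n    """{doc}"""'
            results.append(signature)
    return results


example = 'def add(a: int, b: int) -> int:\n    "Add two numbers"\n    return a + b\n'
print("\n".join(signatures_with_docs(example)))
```

Because only the parsed tree is inspected, the code being examined is never imported or executed.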

llm-fragments-symbex brings that ability directly to LLM. It lets you do things like this:

llm install llm-fragments-symbex
llm -f symbex:path/to/project -s 'Describe this codebase'

I just ran that against my LLM project itself like this:

cd llm
llm -f symbex:. -s 'guess what this code does'

Here's the full output, which starts like this:

This code listing appears to be an index or dump of Python functions, classes, and methods primarily belonging to a codebase related to large language models (LLMs). It covers a broad functionality set related to managing LLMs, embeddings, templates, plugins, logging, and command-line interface (CLI) utilities for interaction with language models. [...]

That page also shows the input generated by the fragment - here's a representative extract:

# from llm.cli import resolve_attachment
def resolve_attachment(value):
    """Resolve an attachment from a string value which could be:
    - "-" for stdin
    - A URL
    - A file path

    Returns an Attachment object.
    Raises AttachmentError if the attachment cannot be resolved."""

# from llm.cli import AttachmentType
class AttachmentType:

    def convert(self, value, param, ctx):

# from llm.cli import resolve_attachment_with_type
def resolve_attachment_with_type(value: str, mimetype: str) -> Attachment:

If your Python code has good docstrings and type annotations, this should hopefully be a shortcut for providing full API documentation to a model without needing to dump in the entire codebase.

The above example used 13,471 input tokens and 781 output tokens, using openai/gpt-4.1-mini. That model is extremely cheap, so the total cost was 0.6638 cents - less than a cent.

The plugin itself was mostly written by o4-mini using the llm-fragments-github plugin to load the simonw/symbex and simonw/llm-hacker-news repositories as example code:

llm \
  -f github:simonw/symbex \
  -f github:simonw/llm-hacker-news \
  -s "Write a new plugin as a single llm_fragments_symbex.py file which
   provides a custom loader which can be used like this:
   llm -f symbex:path/to/folder - it then loads in all of the python
   function signatures with their docstrings from that folder using
   the same trick that symbex uses, effectively the same as running
   symbex . '*' '*.*' --docs --imports -n" \
   -m openai/o4-mini -o reasoning_effort high

Here's the response. 27,819 input, 2,918 output = 4.344 cents.
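The two cost figures quoted in this post can be reproduced with simple arithmetic. The per-million-token prices below are assumptions, not stated in the post: $0.40 input and $1.60 output for gpt-4.1-mini, $1.10 input and $4.40 output for o4-mini. The helper name cost_in_cents is hypothetical.

```python
def cost_in_cents(input_tokens, output_tokens, input_price, output_price):
    """Prices are in US dollars per million tokens; the result is in cents."""
    dollars = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return dollars * 100


# Assumed prices (USD per million tokens), not taken from the post
gpt_41_mini = cost_in_cents(13_471, 781, 0.40, 1.60)
o4_mini = cost_in_cents(27_819, 2_918, 1.10, 4.40)

print(f"{gpt_41_mini:.4f} cents")  # 0.6638 cents
print(f"{o4_mini:.3f} cents")      # 4.344 cents
```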

In working on this project I identified and fixed a minor cosmetic defect in Symbex itself. Technically this is a breaking change (it changes the output) so I shipped that as Symbex 2.0.

Tags: cli, projects, ai, generative-ai, llms, ai-assisted-programming, symbex, llm

Symbex 1.4

2023-09-05T17:29:25+00:00

New release of my Symbex tool for finding symbols (functions, methods and classes) in a Python codebase. Symbex can now output matching symbols in JSON, CSV or TSV in addition to plain text.

I designed this feature for compatibility with the new “llm embed-multi” command—so you can now use Symbex to find every Python function in a nested directory and then pipe them to LLM to calculate embeddings for every one of them.

I tried it on my projects directory and embedded over 13,000 functions in just a few minutes! Next step is to figure out what kind of interesting things I can do with all of those embeddings.

Tags: projects, ai, generative-ai, embeddings, symbex, llm

Weeknotes: Self-hosted language models with LLM plugins, a new Datasette tutorial, a dozen package releases, a dozen TILs

2023-07-16T05:55:54+00:00

A lot of stuff to cover from the past two and a half weeks.

LLM and self-hosted language model plugins

My biggest project was the new version of my LLM tool for interacting with Large Language Models. LLM now accepts plugins for adding alternative language models to the tool, meaning it's now applicable to more than just the OpenAI collection.

I figured out quite a few of the details of this while offline on a camping trip up in the Northern California redwoods, which forced the issue on figuring out how to work with LLMs that I could host on my own computer because I didn't have a connection to access the OpenAI APIs.

Comprehensive documentation is sorely lacking in the world of generative AI. I've decided to push back against that for LLM, so I spent a bunch of time working on an extremely comprehensive tutorial for writing a plugin that adds a new language model to the tool:

Writing a plugin to support a new model

As part of researching this tutorial I finally figured out how to build a Python package using just a pyproject.toml file, with no setup.py or setup.cfg or anything else like that. I wrote that up in detail in Python packages with pyproject.toml and nothing else, and I've started using that pattern for all of my new Python packages.

LLM also now includes a Python API for interacting with models, which provides an abstraction that works the same for the OpenAI models and for other models (including self-hosted models) installed via plugins. Here's the documentation for that - it ends up looking like this:

import llm

model = llm.get_model("gpt-3.5-turbo")
model.key = 'YOUR_API_KEY_HERE'
response = model.prompt("Five surprising names for a pet pelican")
for chunk in response:
    print(chunk, end="")

To use another model, just swap its name in for gpt-3.5-turbo. The self-hosted models provided by the llm-gpt4all plugin work the same way:

pip install llm-gpt4all

Then:

import llm

model = llm.get_model("ggml-vicuna-7b-1")
response = model.prompt("Five surprising names for a pet pelican")
# You can do this instead of looping through the chunks:
print(response.text())

I've released three plugins so far:

llm-gpt4all with 17 self-hosted models from the GPT4All project.
llm-palm with Google's PaLM 2 language model, via their API.
llm-mpt30b providing the 19GB MPT-30B model, using TheBloke/mpt-30B-GGML.

I'm looking forward to someone else following the tutorial and releasing their own plugin!

A new tutorial: Data analysis with SQLite and Python

I presented this as a 2hr45m tutorial at PyCon a few months ago. The video is now available, and I like to try to turn these kinds of things into more permanent documentation.

The Datasette website has a growing collection of tutorials, and I decided to make that the final home for this one too.

Data analysis with SQLite and Python now has the full 2hr45m video plus an improved version of the handout I used for the talk. The written material there should also be valuable for people who don't want to spend nearly three hours watching the video!

As part of putting that page together I solved a problem I've been wanting to figure out for a long time: I figured out a way to build a custom Jinja block tag that looks like this:

{% markdown %}
# This will be rendered as markdown

- Bulleted
- List
{% endmarkdown %}

I released that in datasette-render-markdown 2.2. I also wrote up a TIL on Custom Jinja template tags with attributes describing the pattern I used.

One bonus feature for that tutorial: I decided to drop in a nested table of contents, automatically derived from the HTML headers on the page.
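One way to derive a nested outline from HTML headers, sketched with the standard library html.parser module. This is not the code from the ChatGPT transcript. The names HeadingCollector and outline are hypothetical, and the output is a plain indented list rather than HTML.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline(html: str) -> str:
    """Return an indented bullet list, one line per heading."""
    collector = HeadingCollector()
    collector.feed(html)
    if not collector.headings:
        return ""
    top = min(level for level, _ in collector.headings)
    return "\n".join(
        "  " * (level - top) + "- " + text for level, text in collector.headings
    )


print(outline("<h1>Intro</h1><h2>Setup</h2><h3>Install</h3><h2>Usage</h2>"))
```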

I wrote the code for this entirely using the new ChatGPT Code Interpreter, which can write Python based on your description and, crucially, execute it and see if it works.

Here's my ChatGPT transcript showing how I built the feature.

I've been using ChatGPT Code Interpreter for a few months now, and I'm completely hooked: I think it's the most interesting thing in the whole AI space at the moment.

I participated in a Code Interpreter Latent Space episode to talk about it, which ended up drawing 17,000 listeners on Twitter Spaces and is now also available as a podcast episode, neatly edited together by swyx.

Symbex --check and --rexec

Symbex is my Python CLI tool for quickly finding Python functions and classes and outputting either the full code or just the signature of the matching symbol. I first wrote about that here.

symbex 1.1 adds two new features.

symbex --function --undocumented --check

This new --check mode is designed to run in Continuous Integration environments. If it finds any symbols matching the filters (in this case functions that are missing their docstring) it returns a non-zero exit code, which will fail the CI step.

It's an imitation of black . --check - the idea is that Symbex can now be used to enforce code quality rules, such as requiring docstrings and type annotations.
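The exit-code convention can be sketched in a few lines of Python. This is an illustration of the idea, not how Symbex implements --check. The names undocumented_functions and check are hypothetical.

```python
import ast


def undocumented_functions(source: str) -> list[str]:
    """Return the names of functions in source that have no docstring."""
    return [
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]


def check(source: str) -> int:
    """Return 1 if anything matched the filter, 0 otherwise."""
    matches = undocumented_functions(source)
    for name in matches:
        print(f"missing docstring: {name}")
    return 1 if matches else 0
```

A CI script would pass the result to sys.exit(), so any match fails the step.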

The other new feature is --rexec. This is an extension of the existing --replace feature, which lets you find a symbol in your code and replace its body with new code.

--rexec takes a shell expression. The body of the matching symbol will be piped into that command, and its output will be used as the replacement.

Which means you can do things like this:

symbex my_function \
  --rexec "llm --system 'add type hints and a docstring'"

This will find def my_function() and its body, pass that through llm (using the gpt-3.5-turbo default model, but you can specify -m gpt-4 or any other model to use something else), and then take the output and update the file in-place with the new implementation.
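The pipe-and-replace step can be sketched like this, assuming the start and end lines of the symbol are already known. The function replace_via_command is hypothetical and is not Symbex's code. It works on a string rather than editing a file in place, and assumes a POSIX shell.

```python
import subprocess


def replace_via_command(source: str, start_line: int, end_line: int, command: str) -> str:
    """Pipe lines start_line..end_line (1-indexed, inclusive) through a shell
    command and return the source with those lines replaced by its output."""
    lines = source.splitlines(keepends=True)
    body = "".join(lines[start_line - 1:end_line])
    result = subprocess.run(
        command, shell=True, input=body, capture_output=True, text=True, check=True
    )
    replacement = result.stdout
    if not replacement.endswith("\n"):
        replacement += "\n"
    return "".join(lines[:start_line - 1]) + replacement + "".join(lines[end_line:])
```

Here check=True raises an exception if the command fails, so a failed run never produces a replacement.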

As a demo, I ran it against this:

def my_function(a, b):
    return a + b + 3

And got back:

def my_function(a: int, b: int) -> int:
    """
    Returns the sum of two integers (a and b) plus 3.

    Parameters:
    a (int): The first integer.
    b (int): The second integer.

    Returns:
    int: The sum of a and b plus 3.
    """
    return a + b + 3

Obviously this is fraught with danger, and you should only run this against code that has already been committed to Git and hence can be easily recovered... but it's a really fun trick!

ttok --encode --decode

ttok is my CLI tool for counting tokens, as used by LLMs such as GPT-4. ttok 0.2 adds a requested feature to help make tokens easier to understand, best illustrated by this demo:

ttok Hello world
# Outputs 2 - the number of tokens
ttok Hello world --encode
# Outputs 9906 1917 - the encoded tokens
ttok 9906 1917 --decode
# Outputs Hello world - decoding the tokens back again
ttok Hello world --encode --tokens
# Outputs [b'Hello', b' world']

Being able to easily see the encoded tokens including whitespace (the b' world' part) is particularly useful for understanding how the tokens all fit together.

I wrote more about GPT tokenization in understanding GPT tokenizers.

TIL this week

Using tree-sitter with Python - 2023-07-14
Auto-formatting YAML files with yamlfmt - 2023-07-13
Quickly testing code in a different Python version using pyenv - 2023-07-10
Using git-filter-repo to set commit dates to author dates - 2023-07-10
Using OpenAI functions and their Python library for data extraction - 2023-07-10
Python packages with pyproject.toml and nothing else - 2023-07-08
Syntax highlighted code examples in Datasette - 2023-07-02
Custom Jinja template tags with attributes - 2023-07-02
Local wildcard DNS on macOS with dnsmasq - 2023-06-30
A Discord bot to expand issue links to a private GitHub repository - 2023-06-30
Bulk editing status in GitHub Projects - 2023-06-29
CLI tools hidden in the Python standard library - 2023-06-29

Releases this week

symbex 1.1 - 2023-07-16
Find the Python code for specified symbols
llm-mpt30b 0.1 - 2023-07-12
LLM plugin adding support for the MPT-30B language model
llm-markov 0.1 - 2023-07-12
Plugin for LLM adding a Markov chain generating model
llm-gpt4all 0.1 - 2023-07-12
Plugin for LLM adding support for the GPT4All collection of models
llm-palm 0.1 - 2023-07-12
Plugin for LLM adding support for Google's PaLM 2 model
llm 0.5 - 2023-07-12
Access large language models from the command-line
ttok 0.2 - 2023-07-10
Count and truncate text based on tokens
strip-tags 0.5.1 - 2023-07-09
CLI tool for stripping tags from HTML
pocket-to-sqlite 0.2.3 - 2023-07-09
Create a SQLite database containing data from your Pocket account
datasette-render-markdown 2.2 - 2023-07-02
Datasette plugin for rendering Markdown
asgi-proxy-lib 0.1a0 - 2023-07-01
An ASGI function for proxying to a backend over HTTP
datasette-upload-csvs 0.8.3 - 2023-06-28
Datasette plugin for uploading CSV files and converting them to database tables

Tags: plugins, projects, tutorials, ai, datasette, weeknotes, sqlite-utils, generative-ai, local-llms, llms, symbex, llm

Weeknotes: symbex, LLM prompt templates, a bit of a break

2023-06-27T16:30:57+00:00

I had a holiday to the UK for a family wedding anniversary and mostly took the time off... except for building symbex, which became one of those projects that kept on inspiring new features.

I've also been working on some major improvements to my LLM tool for working with language models from the command-line.

symbex

I introduced symbex in symbex: search Python code for functions and classes, then pipe them into a LLM. It's since grown a bunch more features across 12 total releases.

symbex is a tool for searching Python code. The initial goal was to make it quick to find and output the body of a specific Python function or class, such that you could then pipe it to LLM to process it with GPT-3.5 or GPT-4:

symbex find_symbol_nodes \
  | llm -m gpt4 --system 'Describe this code succinctly'

Output:

This code defines a function find_symbol_nodes that takes in three arguments: code (string), filename (string), and symbols (iterable of strings). The function parses the given code and searches for AST nodes (Class, Function, AsyncFunction) that match the provided symbols. It returns a list of tuple pairs containing matched nodes and their corresponding class names or None.

When piping to a language model, token count is really important - the goal is to provide the smallest amount of text that gives the model enough to produce interesting results.

So... I added a -s/--signatures option which returns just the function or class signature:

symbex find_symbol_nodes -s

Output:

# File: symbex/lib.py Line: 13
def find_symbol_nodes(code: str, filename: str, symbols: Iterable[str]) -> List[Tuple[(AST, Optional[str])]]

Add --docstrings to include the docstring. Add -i/--imports for an import line, and -n/--no-file to suppress that # File comment - so -in combines both of those options:

symbex find_symbol_nodes -s --docstrings -in

# from symbex.lib import find_symbol_nodes
def find_symbol_nodes(code: str, filename: str, symbols: Iterable[str]) -> List[Tuple[(AST, Optional[str])]]
    "Returns ast Nodes matching symbols"

Being able to see type annotations and docstrings tells you a lot about the code. This gave me an idea for an extra set of features: filters that only return symbols that are documented or undocumented, or that have or are missing type annotations:

--async: Filter async functions
--function: Filter functions
--class: Filter classes
--documented: Filter functions with docstrings
--undocumented: Filter functions without docstrings
--typed: Filter functions with type annotations
--untyped: Filter functions without type annotations
--partially-typed: Filter functions with partial type annotations
--fully-typed: Filter functions with full type annotations
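To make the typed and untyped distinction concrete, here is a minimal sketch that classifies a function from its AST node. The rules are an assumption, not Symbex's documented behaviour: every parameter and the return value count as one slot each, and self and cls get no special treatment. The name typing_status is hypothetical.

```python
import ast


def typing_status(func: ast.FunctionDef) -> str:
    """Classify a function as fully-typed, partially-typed or untyped."""
    args = func.args
    params = args.posonlyargs + args.args + args.kwonlyargs
    if args.vararg:
        params.append(args.vararg)
    if args.kwarg:
        params.append(args.kwarg)
    # One boolean per parameter, plus one for the return annotation
    slots = [param.annotation is not None for param in params]
    slots.append(func.returns is not None)
    if all(slots):
        return "fully-typed"
    if any(slots):
        return "partially-typed"
    return "untyped"


def first_function(source: str) -> ast.FunctionDef:
    """Parse source and return its first top-level statement."""
    return ast.parse(source).body[0]


print(typing_status(first_function("def f(a: int, b) -> int: ...")))
```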

So now you can use symbex to get a feel for how well typed or documented your code is:

# See all symbols lacking a docstring:
symbex -s --undocumented

# All functions that are missing type annotations:
symbex -s --function --untyped

The README has comprehensive documentation on everything else the tool can do.

LLM prompt templates

My LLM tool is shaping up in some interesting directions as well.

The big new released feature is prompt templates.

A template is a file that looks like this:

system: Summarize this text in the voice of $voice
model: gpt-4

This can be installed using llm templates edit summarize, which opens a text editor (using the $EDITOR environment variable).

Once installed, you can use it like this:

curl -s 'https://til.simonwillison.net/macos/imovie-slides-and-audio' | \
  strip-tags -m | \
  llm -t summarize -p voice 'Extremely sarcastic GlaDOS'

Oh, bravo, Simon. You've really outdone yourself. Apparently, the highlight of his day was turning an old talk into a video using iMovie. After a truly heart-stopping struggle with the Ken Burns effect, he finally, and I mean finally, tuned the slide duration to match the audio. And then, hold your applause, he met the enormous challenge of publishing it on YouTube. We were all waiting with bated breath. Oh, but wouldn't it be exciting to note that his estimated 1.03GB video was actually a shockingly smaller size? I can't contain my excitement. He also used Pixelmator for a custom title slide, as YouTube prefers a size of 1280x720px - ground-breaking information, truly.

The idea here is to make it easy to create reusable template snippets, for all sorts of purposes. git diff | llm -t diff could generate a commit message, cat file.py | llm -t explain could explain code etc.
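The $voice placeholder in the template above can be illustrated with Python's string.Template. That LLM's templates follow the same substitution rules is an assumption, since the post does not say so. The function render_system_prompt is hypothetical.

```python
from string import Template


def render_system_prompt(template_text: str, params: dict[str, str]) -> str:
    """Fill $name placeholders; raises KeyError if a parameter is missing."""
    return Template(template_text).substitute(params)


system = render_system_prompt(
    "Summarize this text in the voice of $voice",
    {"voice": "Extremely sarcastic GlaDOS"},
)
print(system)
```

Using substitute() rather than safe_substitute() means a forgotten parameter fails loudly instead of sending a literal $voice to the model.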

LLM plugins

These are still baking, but this is the feature I'm most excited about. I'm adding plugins to LLM, inspired by plugins in Datasette.

I'm planning the following categories of plugins to start with:

Command plugins. These will allow extra commands to be added to the llm tool - llm search or llm embed or similar.
Template plugins. Imagine being able to install extra prompt templates using llm install name-of-package.
Model plugins. I want LLM to be able to use more than just GPT-3.5 and GPT-4. I have a branch with an example plugin that can call Google's PaLM 2 model via Google Vertex, and I hope to support many other LLM families with additional plugins, including models that can run locally via llama.cpp and similar.
Function plugins. Once I get the new OpenAI functions mechanism working, I'd like to be able to install plugins that make new functions available to be executed by the LLM!

All of this is under active development at the moment. I'll write more about it once I get it working.

Releases these weeks

sqlite-utils 3.33 - 2023-06-26
Python CLI utility and library for manipulating SQLite databases
datasette-render-images 0.4 - 2023-06-14
Datasette plugin that renders binary blob images using data-uris

TIL these weeks

TOML in Python - 2023-06-26
Automatically maintaining Homebrew formulas using GitHub Actions - 2023-06-21
Using ChatGPT Browse to name a Python package - 2023-06-18
Syncing slide images and audio in iMovie - 2023-06-15
Using fs_usage to see what files a process is using - 2023-06-15
Running OpenAI's large context models using llm - 2023-06-13
Consecutive groups in SQL using window functions - 2023-06-08

Tags: projects, ai, weeknotes, generative-ai, llms, symbex, llm

Symbex: search Python code for functions and classes, then pipe them into a LLM

2023-06-18T22:11:12+00:00

I just released a new Python CLI tool called Symbex. It's a search tool, loosely inspired by ripgrep, which lets you search Python code for functions and classes by name or wildcard, then see just the source code of those matching entities.

Searching for functions and classes

Here's an example of what it can do. Running in my datasette/ folder:

symbex inspect_hash

Output:

# File: datasette/inspect.py Line: 17
def inspect_hash(path):
    """Calculate the hash of a database, efficiently."""
    m = hashlib.sha256()
    with path.open("rb") as fp:
        while True:
            data = fp.read(HASH_BLOCK_SIZE)
            if not data:
                break
            m.update(data)

    return m.hexdigest()

I gave it the name of a function (classes work too) and it searched all subfolders of the current directory, found that function and output it to my terminal.

Why is this more useful than ripgrep or any of the many other tools that can do this?

I partly built this to have fun learning Python's ast module, but it's mainly designed to complement my LLM CLI tool for running large language model prompts.

Code explanations with a large language model

Check this out:

symbex inspect_hash | llm --system 'explain succinctly'

Output:

This function calculates the hash of a database file efficiently by reading the file in blocks and updating the hash object using SHA256 algorithm from the hashlib module. The resulting hash value is returned as a hexadecimal string.

This is pretty cool!

llm --system 'explain succinctly' runs a prompt against gpt-3.5-turbo using "explain succinctly" as the system prompt.

The system prompt says what to do, then the content piped to the tool is treated as the data that should be processed.

Using wildcards and guessing what a tool does from tests

That was a pretty basic example. Here's something more fun:

symbex 'test*csv*' | llm --system 'based on these tests guess what this tool does'

I'm using the test*csv* wildcard here to find all of my test functions that mention CSV in the name. There are 18 of those, with names like test_table_csv_with_labels and test_table_csv_cors_headers.

I can count the number of tokens in the test output like this (see llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs for an intro to my ttok tool):

symbex 'test_*csv*' | ttok

The resulting count is significantly below the 4,000 token limit for gpt-3.5-turbo.

Here's what I get when I ask it "based on these tests guess what this tool does":

This tool seems to be a web application that provides a user interface to query databases and export query results in various formats, including HTML, JSON, and CSV. The tests verify that the tool correctly generates HTML pages with links to export query results in JSON or CSV format, exports tables to CSV format with custom options (such as whether to include labels and nullable labels), and performs streaming of CSV data. The tests also check that the tool sets appropriate HTTP headers, allows CORS, and handles errors and large CSV files gracefully.

That's a good summary of what Datasette does!

I was worried that it might be cheating since ChatGPT already knows about Datasette, but I tried this:

symbex 'test_*csv*' | grep datasette

And got no results, so at least that keyword wasn't being leaked in the test details somehow.

Refactoring code

Let's try something a whole lot more useful:

symbex Request | llm --system 'add type hints to this'

This locates the Request class in Datasette and starts adding Python type hints to it. That code has no type hints at all at the moment. The output started out like this:

from typing import Dict, Any, Awaitable
from http.cookies import SimpleCookie
from urllib.parse import urlunparse, parse_qs, parse_qsl

from .multidict import MultiParams


class Request:
    def __init__(self, scope: Dict[str, Any], receive: Awaitable) -> None:
        self.scope = scope
        self.receive = receive

    def __repr__(self) -> str:
        return '<asgi.Request method="{}" url="{}">'.format(self.method, self.url)

    @property
    def method(self) -> str:
        return self.scope["method"]

    @property
    def url(self) -> str:
        return urlunparse(
            (self.scheme, self.host, self.path, None, self.query_string, None)
        )

    @property
    def url_vars(self) -> Dict[str, str]:
        return (self.scope.get("url_route") or {}).get("kwargs") or {}
    
    # ...

Now this is getting impressive! Obviously I wouldn't just check code like this in without a comprehensive review and likely adjusting many of the decisions it's made, but this is a very good starting point - especially for the tiny amount of effort it takes to get started.

Picking a name for the tool

The most time-consuming part of this project ended up being picking the name!

Originally I planned to call it py-grep. I checked https://pypi.org/project/py-grep/ and it was available, so I spun up the first version of the tool and attempted to upload it to PyPI.

PyPI gave me an error, because the name was too similar to the existing pygrep package. On the one hand that's totally fair, but it was annoying that I couldn't check for availability without attempting an upload.

I turned to ChatGPT to start brainstorming new names. I didn't use regular ChatGPT though: I fired up ChatGPT Browse, which could both read my README and, with some prompting, could learn to check if names were taken itself!

I wrote up the full process for this in a TIL: Using ChatGPT Browse to name a Python package.

Tags: cli, projects, python, ai, generative-ai, chatgpt, llms, symbex