Simon Willison's Weblog: derek-willis

How OpenElections Uses LLMs

2025-06-19T18:26:38+00:00

The OpenElections project collects detailed election data for the USA, all the way down to the precinct level. This is a surprisingly hard problem: while county and state-level results are widely available, precinct-level results are published in thousands of different ad-hoc ways and rarely aggregated once the election result has been announced.

A lot of those precinct results are published as image-filled PDFs.

Derek Willis has recently started leaning on Gemini to help parse those PDFs into CSV data:

For parsing image PDFs into CSV files, Google’s Gemini is my model of choice, for two main reasons. First, the results are usually very, very accurate (with a few caveats I’ll detail below), and second, Gemini’s large context window means it’s possible to work with PDF files that can be multiple MBs in size.

In this piece he shares the process and prompts for a real-world, expert-level data entry project, assisted by Gemini.

This example from Limestone County, Texas is a great illustration of how tricky this problem can get. Getting traditional OCR software to correctly interpret multi-column layouts like this always requires some level of manual intervention:

Derek's prompt against Gemini 2.5 Pro throws in an example, some special instructions and a note about the two column format:

Produce a CSV file from the attached PDF based on this example:

county,precinct,office,district,party,candidate,votes,absentee,early_voting,election_day
Limestone,Precinct 101,Registered Voters,,,,1858,,,
Limestone,Precinct 101,Ballots Cast,,,,1160,,,
Limestone,Precinct 101,President,,REP,Donald J. Trump,879,,,
Limestone,Precinct 101,President,,DEM,Kamala D. Harris,271,,,
Limestone,Precinct 101,President,,LIB,Chase Oliver,1,,,
Limestone,Precinct 101,President,,GRN,Jill Stein,4,,,
Limestone,Precinct 101,President,,,Write-ins,1,,,

Skip Write-ins with candidate names and rows with "Cast Votes", "Not Assigned", "Rejected write-in votes", "Unresolved write-in votes" or "Contest Totals". Do not extract any values that end in "%"

Use the following offices:

President/Vice President -> President
United States Senator -> U.S. Senate
US Representative -> U.S. House
State Senator -> State Senate

Quote all office and candidate values. The results are split into two columns on each page; parse the left column first and then the right column.

A spot-check and a few manual tweaks and the result against a 42 page PDF was exactly what was needed.
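
Part of a spot-check like that can be automated. The following is a minimal sketch, not Derek's actual process: it assumes the model's output uses the header from the example above, and the names `check_precinct_rows` and `SAMPLE` are hypothetical. It confirms the header, flags non-numeric vote counts, and checks that no precinct reports more presidential votes than ballots cast.

```python
import csv
import io

# Header taken from the example in the prompt above.
EXPECTED_HEADER = [
    "county", "precinct", "office", "district", "party",
    "candidate", "votes", "absentee", "early_voting", "election_day",
]

# The example rows from the prompt, used here as test input.
SAMPLE = """county,precinct,office,district,party,candidate,votes,absentee,early_voting,election_day
Limestone,Precinct 101,Registered Voters,,,,1858,,,
Limestone,Precinct 101,Ballots Cast,,,,1160,,,
Limestone,Precinct 101,President,,REP,Donald J. Trump,879,,,
Limestone,Precinct 101,President,,DEM,Kamala D. Harris,271,,,
Limestone,Precinct 101,President,,LIB,Chase Oliver,1,,,
Limestone,Precinct 101,President,,GRN,Jill Stein,4,,,
Limestone,Precinct 101,President,,,Write-ins,1,,,
"""


def check_precinct_rows(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    ballots_cast = {}  # precinct -> ballots cast
    president = {}     # precinct -> summed presidential votes
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        votes_text = (row["votes"] or "").strip()
        if not votes_text.isdigit():
            problems.append(f"line {line_no}: non-numeric votes {votes_text!r}")
            continue
        votes = int(votes_text)
        precinct = row["precinct"]
        if row["office"] == "Ballots Cast":
            ballots_cast[precinct] = votes
        elif row["office"] == "President":
            president[precinct] = president.get(precinct, 0) + votes

    for precinct, total in president.items():
        cast = ballots_cast.get(precinct)
        if cast is not None and total > cast:
            problems.append(
                f"{precinct}: {total} presidential votes exceed {cast} ballots cast"
            )
    return problems
```

A check like this only catches structural slips. It cannot tell whether a digit was misread, so comparing against official totals is still needed.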

How about something harder? The results for Cameron County came as more than 600 pages and looked like this - note the hole-punch holes that obscure some of the text!

This file had to be split into chunks of 100 pages each, and the entire process still took a full hour of work - but the resulting table matched up with the official vote totals.
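
The chunking step is simple to plan in code. This sketch only computes the page ranges; it does not touch a PDF, and the function name `page_chunks` and the 650-page total are assumptions for illustration (the post only says "more than 600 pages").

```python
def page_chunks(total_pages, chunk_size=100):
    """Return inclusive, 1-indexed (first_page, last_page) pairs covering the file."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# Hypothetical page count; the final chunk is shorter than the rest.
for first, last in page_chunks(650):
    print(f"pages {first}-{last}")
```

Each range could then be handed to whatever PDF tool does the actual splitting.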

I love how realistic this example is. AI data entry like this isn't a silver bullet - there's still a bunch of work needed to verify the results and creative thinking needed to work through limitations - but it represents a very real improvement in how small teams can take on projects of this scale.

In the six weeks since we started working on Texas precinct results, we’ve been able to convert them for more than half of the state’s 254 counties, including many image PDFs like the ones on display here. That pace simply wouldn’t be possible with data entry or traditional OCR software.

Via Hacker News

Tags: data-journalism, derek-willis, ocr, ai, generative-ai, llms, gemini, vision-llms, structured-extraction

Political Email Extraction Leaderboard

2025-04-08T23:22:41+00:00

Derek Willis collects "political fundraising emails from just about every committee" - 3,000-12,000 a month - and has created an LLM benchmark from 1,000 of them that he collected last November.

He explains the leaderboard in this blog post. The goal is to have an LLM correctly identify the committee name from the disclaimer text included in the email.

Here's the code he uses to run prompts using Ollama. It uses this system prompt:

Produce a JSON object with the following keys: 'committee', which is the name of the committee in the disclaimer that begins with Paid for by but does not include 'Paid for by', the committee address or the treasurer name. If no committee is present, the value of 'committee' should be None. Also add a key called 'sender', which is the name of the person, if any, mentioned as the author of the email. If there is no person named, the value is None. Do not include any other text, no yapping.
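
To make the scoring idea concrete, here is a rough sketch of how responses to that prompt could be graded. It is not Derek's code: the function names are hypothetical, and exact string matching on the committee name is an assumption about how a leaderboard like this might score answers.

```python
import json


def parse_committee(response_text):
    """Extract the 'committee' value from a model response, or None."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return None  # the model returned something that is not valid JSON
    if not isinstance(data, dict):
        return None
    committee = data.get("committee")
    # The prompt asks for None, which a model may emit as JSON null
    # or as the literal string "None".
    if committee is None or str(committee).strip().lower() in ("", "none"):
        return None
    return str(committee).strip()


def accuracy(responses, expected):
    """Fraction of responses whose committee exactly matches the expected value."""
    if not expected:
        return 0.0
    correct = sum(
        parse_committee(response) == answer
        for response, answer in zip(responses, expected)
    )
    return correct / len(expected)
```

Treating unparseable output as a wrong answer is a deliberate choice here: a model that ignores the "no yapping" instruction and wraps its JSON in commentary gets no credit.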

Gemini 2.5 Pro tops the leaderboard at the moment with 95.40%, but the new Mistral Small 3.1 manages 5th place with 85.70%, pretty good for a local model!

I said we need our own evals in my talk at the NICAR Data Journalism conference last month, without realizing Derek has been running one since January.

Via @dwillis.bsky.social

Tags: data-journalism, derek-willis, ai, prompt-engineering, generative-ai, llms, mistral, gemini, evals, ollama, system-prompts

Six short video demos of LLM and Datasette projects

2025-01-22T02:09:54+00:00

Last Friday Alex Garcia and I hosted a new kind of Datasette Public Office Hours session, inviting members of the Datasette community to share short demos of projects that they had built. The session lasted just over an hour and featured demos from six different people.

We broadcast live on YouTube, but I've now edited the session into separate videos. These are listed below, along with project summaries and show notes for each presentation.

You can also watch all six videos in this YouTube playlist.

llm-logs-feedback by Matthias Lübken

llm-logs-feedback is a plugin by Matthias Lübken for LLM which adds the ability to store feedback on prompt responses, using new llm feedback+1 and llm feedback-1 commands. These also accept an optional comment, and the feedback is stored in a feedback table in SQLite.

You can install the plugin from PyPI like this:

llm install llm-logs-feedback

The full plugin implementation is in the llm_logs_feedback.py file in Matthias' GitHub repository.

llm-model-gateway and llm-consortium by Thomas Hughes

Tommy Hughes has been developing a whole array of LLM plugins, including his llm-plugin-generator which is a plugin that can help write new plugins!

He started by demonstrating llm-model-gateway, a plugin that adds a llm serve command which starts a localhost server running an imitation of the OpenAI API against LLM models:

llm install llm-model-gateway
llm serve

Tommy's main demo was of llm-consortium, a plugin which can use a consortium of collaborating LLM models working together to solve problems.

llm install llm-consortium
llm consortium "Best way to prepare avocado toast" \
  --models claude-3.5-sonnet \
  --models gpt-4o \
  --arbiter gemini-1.5-pro-latest \
  --confidence-threshold 0.8 \
  --max-iterations 3 \
  --output results.json

I ran this and got back:

The best way to prepare avocado toast involves selecting quality ingredients and careful preparation. Start with a thick slice of good quality bread, such as sourdough or whole grain. Toast it to your preferred level of crispness. While the bread is toasting, prepare a ripe avocado. Cut it in half, remove the pit, and scoop the flesh into a bowl. Mash the avocado with a fork, leaving some chunks for texture. Season with salt, black pepper, and a squeeze of fresh lemon juice to prevent browning. Optional additions include a touch of red pepper flakes.

Once the toast is ready, let it cool slightly before spreading the seasoned avocado evenly over it. Consider lightly rubbing a cut garlic clove on the warm toast for an extra layer of flavor (optional).

Enhance your avocado toast with your favorite toppings. Popular choices include: everything bagel seasoning, sliced tomatoes, radishes, a poached or fried egg (for added protein), microgreens, smoked salmon (for a more savory option), feta cheese crumbles, or a drizzle of hot sauce. For a finishing touch, drizzle with high-quality olive oil and sprinkle with sesame or chia seeds for added texture.

Consider dietary needs when choosing toppings. For example, those following a low-carb diet might skip the tomatoes and opt for more protein and healthy fats.

Finally, pay attention to presentation. Arrange the toppings neatly for a visually appealing toast. Serve immediately to enjoy the fresh flavors and crispy toast.

But the really interesting thing is the full log of the prompts and responses sent to Claude 3.5 Sonnet and GPT-4o, followed by a combined prompt to Gemini 1.5 Pro to have it arbitrate between the two responses. You can see the full logged prompts and responses here. Here's that results.json output file.

Congressional Travel Explorer with Derek Willis

Derek Willis teaches data journalism at the Philip Merrill College of Journalism at the University of Maryland. For a recent project his students built a Congressional Travel Explorer interactive using Datasette, AWS Textract and Claude 3.5 Sonnet to analyze travel disclosures from members of Congress.

One of the outcomes from the project was this story in Politico: Members of Congress have taken hundreds of AIPAC-funded trips to Israel in the past decade.

llm-questioncache with Nat Knight

llm-questioncache builds on top of https://llm.datasette.io/ to cache answers to questions, using embeddings to return similar answers if they have already been stored.

Using embeddings for de-duplication of similar questions is an interesting way to apply LLM's embeddings feature.
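
The general technique looks something like the sketch below. This is not the plugin's implementation: the `QuestionCache` class, the 0.9 threshold and the tiny hand-written vectors are all assumptions for illustration. Real embeddings would come from an embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class QuestionCache:
    """Stores (embedding, answer) pairs and returns the closest stored answer."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []

    def add(self, embedding, answer):
        self.entries.append((list(embedding), answer))

    def lookup(self, embedding):
        """Return the most similar cached answer, or None if nothing is close enough."""
        best_answer, best_score = None, self.threshold
        for stored, answer in self.entries:
            score = cosine_similarity(embedding, stored)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer


cache = QuestionCache()
cache.add([1.0, 0.0], "cached answer")
print(cache.lookup([0.99, 0.05]))  # close to the stored vector, so a cache hit
print(cache.lookup([0.0, 1.0]))    # orthogonal to it, so a miss
```

The threshold is the interesting knob: set it too low and different questions share an answer, too high and rephrasings of the same question never match.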

Improvements to Datasette Enrichments with Simon Willison

I demonstrated improvements I've been making to Datasette's Enrichments system over the past few weeks.

Enrichments allow you to apply an operation - such as geocoding, a QuickJS JavaScript transformation or an LLM prompt - against selected rows within a table.

The latest release of datasette-enrichments adds visible progress bars and the ability to pause, resume and cancel an enrichment job that is running against a table.

Datasette comments, pins and write UI with Alex Garcia

We finished with three plugin demos from Alex, showcasing collaborative features we have been developing for Datasette Cloud:

datasette-write-ui provides tools for editing and adding data to Datasette tables. A new feature here is the ability to shift-click a row to open the editing interface for that row.
datasette-pins allows users to pin tables and databases to their Datasette home page, making them easier to find.
datasette-comments adds a commenting interface to Datasette, allowing users to leave comments on individual rows in a table.

Tags: community, data-journalism, demos, derek-willis, youtube, ai, datasette, alex-garcia, generative-ai, llms, llm, enrichments, datasette-public-office-hours

Teaching News Apps with Codespaces

2023-03-23T00:39:33+00:00

Derek Willis used GitHub Codespaces for the latest data journalism class he taught, and it eliminated the painful process of trying to get students on an assortment of Mac, Windows and Chromebook laptops all to a point where they could start working and learning together.

Via @derekwillis

Tags: data-journalism, derek-willis, github, teaching, github-codespaces

How much can you learn from just two columns?

2020-06-15T18:46:57+00:00

Derek Willis shared an intriguing dataset this morning: a table showing every Twitter account followed by an official GOP congressional Twitter account.

He published it here using Datasette. It's a single table containing 385,979 rows - each row is a username, account_name pair, where username is the Twitter account that is being followed and account_name is the congressional Twitter account that's following it.

Here's some sample data:

username	account_name
njhotline	SenatorTimScott
emilykpierce	RobWittman
jessblevinsoh	OHPressSec
familylink	SenMikeLee
howardsnowdon	RepThomasMassie
pattidomm	SenCapito

How much can we learn from just these two columns?

Which accounts have the most GOP congressional followers?

Let's start with a simple aggregation: which accounts on Twitter have the most GOP congressional followers?

select
  username,
  count(*) as num_gop_followers
from
  following
group by
  username
order by
  num_gop_followers desc

All we're doing here is counting the number of times a unique username (an account that is being followed) shows up in our table, then sorting by those counts.
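
If you want to try the same aggregation outside Datasette, here is a small sketch using Python's built-in sqlite3 module. The table layout matches the one described above, but the rows are invented placeholders (`RepAlpha`, `news_outlet` and so on), not data from Derek's dataset, and the helper names are hypothetical.

```python
import sqlite3

# Invented placeholder rows: (username being followed, account_name following it).
DEMO_ROWS = [
    ("party_leader", "RepAlpha"),
    ("party_leader", "RepBravo"),
    ("party_leader", "RepCharlie"),
    ("news_outlet", "RepAlpha"),
    ("news_outlet", "RepBravo"),
    ("local_paper", "RepCharlie"),
]

TOP_FOLLOWED_SQL = """
select
  username,
  count(*) as num_gop_followers
from
  following
group by
  username
order by
  num_gop_followers desc
"""


def build_demo_db(rows):
    """Create an in-memory database with a two-column 'following' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table following (username text, account_name text)")
    conn.executemany("insert into following values (?, ?)", rows)
    return conn


def top_followed(conn):
    """Run the aggregation and return (username, num_gop_followers) tuples."""
    return conn.execute(TOP_FOLLOWED_SQL).fetchall()


for username, count in top_followed(build_demo_db(DEMO_ROWS)):
    print(username, count)
```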

Here are the results. The top ten are:

username	num_gop_followers
housegop	231
gopleader	229
realdonaldtrump	219
vp	216
speakerryan	207
whitehouse	207
stevescalise	198
chadpergram	195
potus	195
foxnews	187

Adding a "view more" link

Wouldn't it be useful if you could see which accounts those 231 followers of @housegop were?

We can do that in Datasette without a SQL query - we can instead use the form on the table page to construct a filter - or construct a querystring URL directly. Here are the 5 GOP congressional accounts following @cityofdallas:

https://official-gop-following.herokuapp.com/following/following?username=cityofdallas

Let's add that link to our original top-followed query. Datasette automatically links any value that begins with https://, so we can use a SQL concatenation trick (with the || concatenation operator) to construct that URL as part of the query:


select
  username,
  count(*) as num_gop_followers,
  'https://official-gop-following.herokuapp.com/following/following?username=' || username as list_of_gop_followers
from
  following
group by
  username
order by
  num_gop_followers desc

Here's that query. The first five rows look like this:

username	num_gop_followers	list_of_gop_followers
housegop	231	https://official-gop-following.herokuapp.com/following/following?username=housegop
gopleader	229	https://official-gop-following.herokuapp.com/following/following?username=gopleader
realdonaldtrump	219	https://official-gop-following.herokuapp.com/following/following?username=realdonaldtrump
vp	216	https://official-gop-following.herokuapp.com/following/following?username=vp
speakerryan	207	https://official-gop-following.herokuapp.com/following/following?username=speakerryan

Congressional accounts who aren't following certain accounts

Since there are only 279 congressional GOP Twitter accounts, how about seeing who are the 279 - 219 = 60 accounts that aren't following @realdonaldtrump?

Let's construct a SQL query for this, using a sub-select:

select
  distinct account_name
from
  following
where
  account_name not in (
    select
      account_name
    from
      following
    where
      username = 'realdonaldtrump'
  )

Here that is in Datasette.

A neat thing we can do here is to parametrize that query. We can swap the hard-coded 'realdonaldtrump' value for a named parameter, :name, instead:

select
  distinct account_name
from
  following
where
  account_name not in (
    select
      account_name
    from
      following
    where
      username = :name
  )

Now when we visit that in Datasette it looks like this:

We can add ?name=realdonaldtrump to the URL (or submit the form and save the resulting URL) to link to results for one individual.

We've essentially created a new mini-application here - complete with an input form - just by bookmarking a URL in Datasette.
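
The same :name syntax works when running the query from Python, since sqlite3 accepts named parameters passed as a dictionary. This is a sketch against invented placeholder rows, not Derek's data. The helper names are hypothetical, and the `order by` clause is an addition so that the output order is predictable.

```python
import sqlite3

# Invented placeholder rows: (username being followed, account_name following it).
DEMO_ROWS = [
    ("party_leader", "RepAlpha"),
    ("news_outlet", "RepAlpha"),
    ("news_outlet", "RepBravo"),
    ("local_paper", "RepCharlie"),
]

NOT_FOLLOWING_SQL = """
select
  distinct account_name
from
  following
where
  account_name not in (
    select
      account_name
    from
      following
    where
      username = :name
  )
order by
  account_name  -- added for a stable result order
"""


def build_demo_db(rows):
    """Create an in-memory database with a two-column 'following' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table following (username text, account_name text)")
    conn.executemany("insert into following values (?, ?)", rows)
    return conn


def accounts_not_following(conn, name):
    """Return the accounts in the table that do not follow the given username."""
    rows = conn.execute(NOT_FOLLOWING_SQL, {"name": name})
    return [row[0] for row in rows]


conn = build_demo_db(DEMO_ROWS)
print(accounts_not_following(conn, "party_leader"))
print(accounts_not_following(conn, "news_outlet"))
```

Binding the value as a parameter, rather than formatting it into the SQL string, also avoids SQL injection if the name comes from user input.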

Let's make the query a bit more interesting by including a count of the number of accounts those congress-people ARE following, and sorting by that.

select
  account_name,
  count(*) as num_accounts_they_follow
from
  following
where
  account_name not in (
    select
      account_name
    from
      following
    where
      username = 'realdonaldtrump'
  )
group by
  account_name
order by
  num_accounts_they_follow desc

Here are the results.

account_name	num_accounts_they_follow
ChuckGrassley	13475
VernBuchanan	8560
CynthiaLummis	5793
GovAbbott	4423
SenatorTimScott	3846

@ChuckGrassley follows 13,475 accounts but none of them are the president!

Most similar accounts, based on number of shared follows

One last query. This time we're going to look at which accounts are "most similar" to each other, based on the largest overlap of follows. Here's the SQL for that:

select
  :name as representative,
  account_name as similar_representative,
  count(*) as num_shared_follows
from
  following
where
  username in (
    select
      username
    from
      following
    where
      account_name = :name
  )
  and account_name != :name
group by
  account_name
order by
  num_shared_follows desc

Again, we're using a :name placeholder. Here are the congressional accounts that are most similar to @MikeKellyPA.
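
To see how the overlap count behaves, here is a sketch that runs the same query through Python's sqlite3 module. The rows are invented placeholders rather than real accounts, and the helper names are hypothetical.

```python
import sqlite3

# Invented placeholder rows: (username being followed, account_name following it).
DEMO_ROWS = [
    ("account_a", "RepAlpha"),
    ("account_b", "RepAlpha"),
    ("account_c", "RepAlpha"),
    ("account_a", "RepBravo"),
    ("account_b", "RepBravo"),
    ("account_a", "RepCharlie"),
    ("account_z", "RepDelta"),
]

SIMILAR_SQL = """
select
  :name as representative,
  account_name as similar_representative,
  count(*) as num_shared_follows
from
  following
where
  username in (
    select
      username
    from
      following
    where
      account_name = :name
  )
  and account_name != :name
group by
  account_name
order by
  num_shared_follows desc
"""


def build_demo_db(rows):
    """Create an in-memory database with a two-column 'following' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table following (username text, account_name text)")
    conn.executemany("insert into following values (?, ?)", rows)
    return conn


def most_similar(conn, name):
    """Return (representative, similar_representative, num_shared_follows) tuples."""
    return conn.execute(SIMILAR_SQL, {"name": name}).fetchall()


for row in most_similar(build_demo_db(DEMO_ROWS), "RepAlpha"):
    print(row)
```

In this toy data RepBravo shares two follows with RepAlpha and RepCharlie shares one, while RepDelta shares none and so does not appear at all. Note that a raw overlap count favors accounts that follow a lot of people. Dividing by the size of each account's follow list would be one way to correct for that.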

What else can you do?

I'm pretty impressed at how much insight can be gained using SQL against just a two column table.

This post started as a Twitter thread. Charles Arthur suggested cross-referencing this against other sources such as the GovTrack ideology analysis of congressional candidates. This is a great idea! It's also very feasible, given that much of the data underlying GovTrack is available on GitHub. Import that into Datasette alongside Derek's follower data and you could construct some very interesting SQL joins indeed.

Tags: derek-willis, politics, sql, twitter, datasette

Represent

2008-12-29T22:10:26+00:00

Andrei Scheinkman and Derek Willis describe how they built the NYTimes Represent feature using GeoDjango and PostGIS.

Tags: andrei-scheinkman, derek-willis, django, geodjango, geospatial, new-york-times, postgis, postgresql, python

Represent and GeoDjango

2008-12-20T21:07:55+00:00

The NYTimes new Represent application is built on GeoDjango.

Tags: derek-willis, django, geodjango, new-york-times, python, represent

Django, iCal and vObject

2007-08-01T11:09:13+00:00

Easy iCal generation for Django using vObject.

Tags: derek-willis, django, icalendar, python, vobject