<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: open-data</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/open-data.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-05-20T18:18:39+00:00</updated><author><name>Simon Willison</name></author><entry><title>cityofaustin/atd-data-tech issues</title><link href="https://simonwillison.net/2025/May/20/data-tech-issues/#atom-tag" rel="alternate"/><published>2025-05-20T18:18:39+00:00</published><updated>2025-05-20T18:18:39+00:00</updated><id>https://simonwillison.net/2025/May/20/data-tech-issues/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/cityofaustin/atd-data-tech/issues"&gt;cityofaustin/atd-data-tech issues&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I stumbled across this today while looking for interesting frequently updated data sources from local governments. It turns out the City of Austin's &lt;a href="https://austinmobility.io/"&gt;Transportation Data &amp;amp; Technology Services&lt;/a&gt; department runs everything out of a public GitHub issues instance, which currently has 20,225 closed and 2,002 open issues. They also publish an &lt;a href="https://data.austintexas.gov/Transportation-and-Mobility/Transportation-Public-Works-Data-Tech-Services-Iss/rzwg-fyv8/about_data"&gt;exported copy&lt;/a&gt; of the issues data through the &lt;a href="https://data.austintexas.gov/"&gt;data.austintexas.gov&lt;/a&gt; open data portal.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-issues"&gt;github-issues&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="open-data"/><category term="github-issues"/></entry><entry><title>OpenTimes</title><link href="https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag" rel="alternate"/><published>2025-03-17T22:49:59+00:00</published><updated>2025-03-17T22:49:59+00:00</updated><id>https://simonwillison.net/2025/Mar/17/opentimes/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://sno.ws/opentimes/"&gt;OpenTimes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spectacular new open geospatial project by &lt;a href="https://sno.ws/"&gt;Dan Snow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenTimes is a database of pre-computed, point-to-point travel times between United States Census geographies. It lets you download bulk travel time data for free and with no limits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's &lt;a href="https://opentimes.org/?id=060816135022&amp;amp;mode=car#9.76/37.5566/-122.3085"&gt;what I get&lt;/a&gt; for travel times by car from El Granada, California:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isochrone map showing driving times from the El Granada census tract to other places in the San Francisco Bay Area" src="https://static.simonwillison.net/static/2025/opentimes.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;The technical details are &lt;em&gt;fascinating&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The entire OpenTimes backend is just static Parquet files on &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/"&gt;Cloudflare's R2&lt;/a&gt;. There's no RDBMS or running service, just files and a CDN. The whole thing costs about $10/month to host and costs nothing to serve. In my opinion, this is a &lt;em&gt;great&lt;/em&gt; way to serve infrequently updated, large public datasets at low cost (as long as you partition the files correctly).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure enough, &lt;a href="https://developers.cloudflare.com/r2/pricing/"&gt;R2 pricing&lt;/a&gt; charges "based on the total volume of data stored" - $0.015 / GB-month for standard storage, then $0.36 / million requests for "Class B" operations which include reads. They charge nothing for outbound bandwidth.&lt;/p&gt;
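&lt;p&gt;A back-of-envelope sketch of how that pricing adds up - the unit prices are the published rates quoted above, but the dataset size and request volume here are invented numbers, not OpenTimes' actual figures:&lt;/p&gt;

```python
# Hypothetical monthly R2 bill for a static-Parquet dataset.
# Unit prices are Cloudflare's published rates; the sizes are made up.
storage_gb = 500                  # invented: total size of the Parquet files
class_b_requests = 2_000_000      # invented: monthly read operations

storage_cost = storage_gb * 0.015                     # $0.015 per GB-month
request_cost = (class_b_requests / 1_000_000) * 0.36  # $0.36 per million Class B ops
egress_cost = 0.0                                     # outbound bandwidth is free

monthly_total = storage_cost + request_cost + egress_cost
print(f"${monthly_total:.2f}/month")  # $8.22/month
```

&lt;p&gt;Storage dominates: at these rates even millions of reads a month cost well under a dollar.&lt;/p&gt;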
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;All travel times were calculated by pre-building the inputs (OSM, OSRM networks) and then distributing the compute over &lt;a href="https://github.com/dfsnow/opentimes/actions/workflows/calculate-times.yaml"&gt;hundreds of GitHub Actions jobs&lt;/a&gt;. This worked shockingly well for this specific workload (and was also completely free).&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here's a &lt;a href="https://github.com/dfsnow/opentimes/actions/runs/13094249792"&gt;GitHub Actions run&lt;/a&gt; of the &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/.github/workflows/calculate-times.yaml#L78-L80"&gt;calculate-times.yaml workflow&lt;/a&gt; which uses a matrix to run 255 jobs!&lt;/p&gt;
&lt;p&gt;&lt;img alt="GitHub Actions run: calculate-times.yaml run by workflow_dispatch taking 1h49m to execute 255 jobs with names like run-job (2020-01) " src="https://static.simonwillison.net/static/2025/opentimes-github-actions.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Relevant YAML:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  matrix:
    year: ${{ fromJSON(needs.setup-jobs.outputs.years) }}
    state: ${{ fromJSON(needs.setup-jobs.outputs.states) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where those JSON files were created by the previous step, which reads in the year and state values from &lt;a href="https://github.com/dfsnow/opentimes/blob/a6a5f7abcdd69559b3e29f360fe0ff0399dbb400/data/params.yaml#L72-L132"&gt;this params.yaml file&lt;/a&gt;.&lt;/p&gt;
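&lt;p&gt;That previous step exposes the values as JSON strings in its job outputs, which is what lets &lt;code&gt;fromJSON()&lt;/code&gt; expand them into matrix dimensions. A minimal sketch of the pattern in Python (the year and state values here are hypothetical, not read from params.yaml):&lt;/p&gt;

```python
import json
import os

# Hypothetical values - the real workflow reads these from params.yaml
years = ["2020", "2024"]
states = ["06", "17"]

# A GitHub Actions step publishes outputs by appending key=value lines
# to the file named by the GITHUB_OUTPUT environment variable.
output_path = os.environ.get("GITHUB_OUTPUT", "github_output.txt")
with open(output_path, "a") as f:
    f.write(f"years={json.dumps(years)}\n")
    f.write(f"states={json.dumps(states)}\n")
```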
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The query layer uses a single DuckDB database file with &lt;em&gt;views&lt;/em&gt; that point to static Parquet files via HTTP. This lets you query a table with hundreds of billions of records after downloading just the ~5MB pointer file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a really creative use of DuckDB's &lt;code&gt;httpfs&lt;/code&gt; feature, which lets you run queries against large remote files from a laptop: HTTP range requests fetch just the byte ranges a query needs, avoiding downloading the whole thing.&lt;/p&gt;
&lt;p&gt;The README shows &lt;a href="https://github.com/dfsnow/opentimes/blob/3439fa2c54af227e40997b4a5f55678739e0f6df/README.md#using-duckdb"&gt;how to use that from R and Python&lt;/a&gt; - I got this working in the &lt;code&gt;duckdb&lt;/code&gt; client (&lt;code&gt;brew install duckdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;INSTALL httpfs;
LOAD httpfs;
ATTACH 'https://data.opentimes.org/databases/0.0.1.duckdb' AS opentimes;

SELECT origin_id, destination_id, duration_sec
  FROM opentimes.public.times
  WHERE version = '0.0.1'
      AND mode = 'car'
      AND year = '2024'
      AND geography = 'tract'
      AND state = '17'
      AND origin_id LIKE '17031%' LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In answer to a question about adding public transit times &lt;a href="https://news.ycombinator.com/item?id=43392521#43393183"&gt;Dan said&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the next year or so maybe. The biggest obstacles to adding public transit are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Collecting all the necessary scheduling data (e.g. GTFS feeds) for every transit system in the county. Not insurmountable since there are services that do this currently.&lt;/li&gt;
&lt;li&gt;Finding a routing engine that can compute nation-scale travel time matrices quickly. Currently, the two fastest open-source engines I've tried (OSRM and Valhalla) don't support public transit for matrix calculations and the engines that do support public transit (R5, OpenTripPlanner, etc.) are too slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://gtfs.org/"&gt;GTFS&lt;/a&gt; is a popular CSV-based format for sharing transit schedules - here's &lt;a href="https://gtfs.org/resources/data/"&gt;an official list&lt;/a&gt; of available feed directories.&lt;/p&gt;
&lt;p&gt;This whole project feels to me like a great example of the &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt; architectural pattern in action.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=43392521"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/census"&gt;census&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openstreetmap"&gt;openstreetmap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-actions"&gt;github-actions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/baked-data"&gt;baked-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/duckdb"&gt;duckdb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http-range-requests"&gt;http-range-requests&lt;/a&gt;&lt;/p&gt;



</summary><category term="census"/><category term="geospatial"/><category term="open-data"/><category term="openstreetmap"/><category term="cloudflare"/><category term="parquet"/><category term="github-actions"/><category term="baked-data"/><category term="duckdb"/><category term="http-range-requests"/></entry><entry><title>Overture Maps Foundation Releases Its First World-Wide Open Map Dataset</title><link href="https://simonwillison.net/2023/Jul/27/overture-maps/#atom-tag" rel="alternate"/><published>2023-07-27T16:45:09+00:00</published><updated>2023-07-27T16:45:09+00:00</updated><id>https://simonwillison.net/2023/Jul/27/overture-maps/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://overturemaps.org/overture-maps-foundation-releases-first-world-wide-open-map-dataset/"&gt;Overture Maps Foundation Releases Its First World-Wide Open Map Dataset&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Overture Maps Foundation is a collaboration led by Amazon, Meta, Microsoft and TomTom dedicated to producing “reliable, easy-to-use, and interoperable open map data”.&lt;/p&gt;

&lt;p&gt;Yesterday they put out their first release and it’s pretty astonishing: four different layers of geodata, covering Places of Interest (shops, restaurants, attractions etc), administrative boundaries, building outlines and transportation networks.&lt;/p&gt;

&lt;p&gt;The data is available as Parquet. I just downloaded the 8GB places dataset and can confirm that it contains 59 million listings from around the world—I filtered to just places in my local town and a spot check showed that recently opened businesses (last 12 months) were present and the details all looked accurate.&lt;/p&gt;

&lt;p&gt;The places data is licensed under “Community Data License Agreement – Permissive”, where the only restriction appears to be that you have to include that license when you further share the data.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parquet"&gt;parquet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/meta"&gt;meta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/overture"&gt;overture&lt;/a&gt;&lt;/p&gt;



</summary><category term="geospatial"/><category term="open-data"/><category term="parquet"/><category term="meta"/><category term="overture"/></entry><entry><title>Quoting Saloni Dattani</title><link href="https://simonwillison.net/2022/Oct/25/saloni-dattani/#atom-tag" rel="alternate"/><published>2022-10-25T22:48:06+00:00</published><updated>2022-10-25T22:48:06+00:00</updated><id>https://simonwillison.net/2022/Oct/25/saloni-dattani/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.wired.com/story/covid-19-open-science-public-health-data/"&gt;&lt;p&gt;Most researchers don’t share their data. If you’ve ever read the words “data is available upon request" in an academic paper, and emailed the authors to request it, the chances that you'll actually receive the data are just 7 percent. The rest of the time, the authors have lost access to their data, changed emails, or are too busy or unwilling.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.wired.com/story/covid-19-open-science-public-health-data/"&gt;Saloni Dattani&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/science"&gt;science&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-data"/><category term="science"/></entry><entry><title>Weeknotes: datasette-socrata, and the last 10%...</title><link href="https://simonwillison.net/2022/Jun/19/weeknotes/#atom-tag" rel="alternate"/><published>2022-06-19T03:26:52+00:00</published><updated>2022-06-19T03:26:52+00:00</updated><id>https://simonwillison.net/2022/Jun/19/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;... takes 90% of the work. I continue to work towards a preview of the new Datasette Cloud, and keep finding new "just one more things" to delay inviting in users.&lt;/p&gt;
&lt;p&gt;Aside from continuing to work on that, my big project in the last week was a blog entry: &lt;a href="https://simonwillison.net/2022/Jun/12/twenty-years/"&gt;Twenty years of my blog&lt;/a&gt;, in which I celebrated twenty years since starting this site by pulling together a selection of highlights from over the years.&lt;/p&gt;
&lt;p&gt;I've actually updated that entry a few times over the past few days as I remembered new highlights I forgot to include - the Twitter thread that accompanies the entry has those updates, &lt;a href="https://twitter.com/simonw/status/1536364064951062528"&gt;starting here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;datasette-socrata&lt;/h4&gt;
&lt;p&gt;I've been thinking a lot about the Datasette Cloud onboarding experience: how can I help new users understand what Datasette can be used for as quickly as possible?&lt;/p&gt;
&lt;p&gt;I want to get them to a point where they are interacting with a freshly created table of data. I can provide some examples, but I've always thought that one of the biggest opportunities for Datasette lies in working with the kind of data released by governments through their Open Data portals. This is especially true for its usage in the field of data journalism.&lt;/p&gt;
&lt;p&gt;Many open data portals - including &lt;a href="https://data.sfgov.org"&gt;the one for San Francisco&lt;/a&gt; - are powered by a piece of software called &lt;a href="https://dev.socrata.com/"&gt;Socrata&lt;/a&gt;. And it offers a pretty comprehensive API.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://datasette.io/plugins/datasette-socrata"&gt;datasette-socrata&lt;/a&gt; is a new Datasette plugin which can import data from Socrata instances. Give it the URL to a Socrata dataset (like &lt;a href="https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq"&gt;this one&lt;/a&gt;, my perennial favourite, listing all 195,000+ trees managed by the city of San Francisco) and it will import that data and its associated metadata into a brand new table.&lt;/p&gt;
&lt;p&gt;It's pretty neat! It even shows you a progress bar, since some of these datasets can get pretty large:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/datasette-socrata.gif" alt="Animated demo of a progress bar, starting at 137,000/179,605 and continuing until the entire set has been imported" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;As part of building this I ran into an interesting question: what should a plugin like this do if the system it is running on runs out of disk space?&lt;/p&gt;
&lt;p&gt;I'm still working through that, but I'm experimenting with a new type of Datasette plugin for it: &lt;a href="https://github.com/simonw/datasette-low-disk-space-hook"&gt;datasette-low-disk-space-hook&lt;/a&gt;, which introduces a new plugin hook (&lt;code&gt;low_disk_space(datasette)&lt;/code&gt;) which other plugins can use to report a situation where disk space is running out.&lt;/p&gt;
&lt;p&gt;I wrote a TIL about that here: &lt;a href="https://til.simonwillison.net/datasette/register-new-plugin-hooks"&gt;Registering new Datasette plugin hooks by defining them in other plugins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I may use this same trick for a future upgrade to &lt;code&gt;datasette-graphql&lt;/code&gt;, to allow additional plugins to register custom GraphQL mutations.&lt;/p&gt;
&lt;h4 id="sqlite-utils-3-27"&gt;sqlite-utils 3.27&lt;/h4&gt;
&lt;p&gt;In working on &lt;code&gt;datasette-socrata&lt;/code&gt; I was inspired to push out a new release of &lt;code&gt;sqlite-utils&lt;/code&gt;. Here are the annotated &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-27"&gt;release notes&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Documentation now uses the &lt;a href="https://github.com/pradyunsg/furo"&gt;Furo&lt;/a&gt; Sphinx theme. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/435"&gt;#435&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote about this &lt;a href="https://simonwillison.net/2022/May/26/weeknotes-building-datasette-cloud/#furo-theme"&gt;a few weeks ago&lt;/a&gt; - the new documentation theme is now &lt;a href="https://sqlite-utils.datasette.io/en/stable/"&gt;live for the stable documentation&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Code examples in documentation now have a "copy to clipboard" button. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/436"&gt;#436&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I made this change &lt;a href="https://github.com/simonw/datasette/issues/1748"&gt;to Datasette first&lt;/a&gt; - the &lt;a href="https://github.com/executablebooks/sphinx-copybutton"&gt;sphinx-copybutton&lt;/a&gt; plugin adds a neat "copy" button next to every code example.&lt;/p&gt;
&lt;p&gt;I also like how this encourages ensuring that every example will work if people directly copy and paste it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sqlite_utils.utils.rows_from_file()&lt;/code&gt; is now a documented API, see &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-rows-from-file"&gt;Reading rows from a file&lt;/a&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/443"&gt;#443&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Francesco Frassinelli filed &lt;a href="https://github.com/simonw/sqlite-utils/issues/440"&gt;an issue&lt;/a&gt; about this utility function, which wasn't actually part of the documented stable API, but I saw no reason not to promote it.&lt;/p&gt;
&lt;p&gt;The function incorporates the logic that the &lt;code&gt;sqlite-utils&lt;/code&gt; CLI tool uses to automatically detect whether a provided file is CSV, TSV or JSON, and to detect the CSV delimiter and other dialect settings.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rows_from_file()&lt;/code&gt; has two new parameters to help handle CSV files with rows that contain more values than are listed in that CSV file's headings: &lt;code&gt;ignore_extras=True&lt;/code&gt; and &lt;code&gt;extras_key="name-of-key"&lt;/code&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/440"&gt;#440&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;It turns out &lt;code&gt;csv.DictReader&lt;/code&gt; in the Python standard library has a mechanism for handling CSV rows that contain too many commas.&lt;/p&gt;
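&lt;p&gt;That mechanism is the &lt;code&gt;restkey&lt;/code&gt; argument: &lt;code&gt;csv.DictReader&lt;/code&gt; collects any values beyond the declared headings into a list under that key. A quick stdlib-only illustration (the column names here are invented):&lt;/p&gt;

```python
import csv
import io

# A row with more values than the header declares
data = io.StringIO("id,name\n1,Cleo,extra1,extra2\n")

# Values with no matching heading are gathered under restkey
reader = csv.DictReader(data, restkey="_rest")
row = next(reader)
print(row)  # {'id': '1', 'name': 'Cleo', '_rest': ['extra1', 'extra2']}
```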
&lt;p&gt;In working on this I found a bug in &lt;code&gt;mypy&lt;/code&gt; which I &lt;a href="https://github.com/python/typeshed/issues/8075"&gt;reported here&lt;/a&gt;, but it turned out to be a dupe of an &lt;a href="https://github.com/python/typeshed/issues/7953"&gt;already fixed issue&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sqlite_utils.utils.maximize_csv_field_size_limit()&lt;/code&gt; helper function for increasing the field size limit for reading CSV files to its maximum, see &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-maximize-csv-field-size-limit"&gt;Setting the maximum CSV field size limit&lt;/a&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/442"&gt;#442&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is a workaround for the following Python error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;_csv.Error: field larger than field limit (131072)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's the error you get when a field in a CSV file is longer than the default field size limit of 131,072 characters.&lt;/p&gt;
&lt;p&gt;Saying "yeah, I want to be able to handle the maximum length possible" is surprisingly hard - Python doesn't offer an unlimited option, and passing too large a value raises an &lt;code&gt;OverflowError&lt;/code&gt; on some platforms. Here's the idiom that works, which is encapsulated by the new utility function:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;field_size_limit&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;maxsize&lt;/span&gt;

&lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;True&lt;/span&gt;:
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;csv_std&lt;/span&gt;.&lt;span class="pl-en"&gt;field_size_limit&lt;/span&gt;(&lt;span class="pl-s1"&gt;field_size_limit&lt;/span&gt;)
        &lt;span class="pl-k"&gt;break&lt;/span&gt;
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;OverflowError&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;field_size_limit&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;field_size_limit&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt;)&lt;/pre&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;table.search(where=, where_args=)&lt;/code&gt; parameters for adding additional &lt;code&gt;WHERE&lt;/code&gt; clauses to a search query. The &lt;code&gt;where=&lt;/code&gt; parameter is available on &lt;code&gt;table.search_sql(...)&lt;/code&gt; as well. See &lt;a href="https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-fts-search"&gt;Searching with table.search()&lt;/a&gt;. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/441"&gt;#441&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was a feature suggestion &lt;a href="https://github.com/simonw/sqlite-utils/issues/441"&gt;from Tim Head&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Fixed bug where &lt;code&gt;table.detect_fts()&lt;/code&gt; and other search-related functions could fail if two FTS-enabled tables had names that were prefixes of each other. (&lt;a href="https://github.com/simonw/sqlite-utils/issues/434"&gt;#434&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was quite a gnarly bug. &lt;code&gt;sqlite-utils&lt;/code&gt; attempts to detect if a table has an associated full-text search table by looking through the schema for another table that has a definition like this one:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;CREATE VIRTUAL TABLE &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;searchable_fts&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
USING FTS4 (
    text1,
    text2,
    [name with . &lt;span class="pl-k"&gt;and&lt;/span&gt; spaces],
    content&lt;span class="pl-k"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;searchable&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I was checking for &lt;code&gt;content="searchable"&lt;/code&gt; using a LIKE query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; name &lt;span class="pl-k"&gt;FROM&lt;/span&gt; sqlite_master
&lt;span class="pl-k"&gt;WHERE&lt;/span&gt; rootpage &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;
&lt;span class="pl-k"&gt;AND&lt;/span&gt;
sql &lt;span class="pl-k"&gt;LIKE&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;%VIRTUAL TABLE%USING FTS%content=%searchable%&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;But this would incorrectly match strings such as &lt;code&gt;content="searchable2"&lt;/code&gt; as well!&lt;/p&gt;
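&lt;p&gt;The false positive is easy to reproduce with SQLite's &lt;code&gt;LIKE&lt;/code&gt; operator directly - this sketch (using an invented table name) shows the pattern intended for &lt;code&gt;content="searchable"&lt;/code&gt; also matching a schema that says &lt;code&gt;content="searchable2"&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema for a hypothetical FTS table whose content table is "searchable2"
schema = 'CREATE VIRTUAL TABLE "searchable2_fts" USING FTS4 (text1, content="searchable2")'

# The detection pattern that was only meant to match content="searchable"
pattern = '%VIRTUAL TABLE%USING FTS%content=%searchable%'

matched = conn.execute("SELECT ? LIKE ?", (schema, pattern)).fetchone()[0]
print(matched)  # 1 - the % wildcard happily matches "searchable2" as well
```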
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-socrata"&gt;datasette-socrata&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-socrata/releases/tag/0.3"&gt;0.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-socrata/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-17
&lt;br /&gt;Import data from Socrata into Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-low-disk-space-hook"&gt;datasette-low-disk-space-hook&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-low-disk-space-hook/releases/tag/0.1"&gt;0.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-low-disk-space-hook/releases"&gt;2 releases total&lt;/a&gt;) - 2022-06-17
&lt;br /&gt;Datasette plugin providing the low_disk_space hook for other plugins to check for low disk space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.27"&gt;3.27&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;101 releases total&lt;/a&gt;) - 2022-06-15
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-ics"&gt;datasette-ics&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-ics/releases/tag/0.5.1"&gt;0.5.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-ics/releases"&gt;4 releases total&lt;/a&gt;) - 2022-06-10
&lt;br /&gt;Datasette plugin for outputting iCalendar files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-upload-csvs"&gt;datasette-upload-csvs&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-upload-csvs/releases/tag/0.7.1"&gt;0.7.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-upload-csvs/releases"&gt;9 releases total&lt;/a&gt;) - 2022-06-09
&lt;br /&gt;Datasette plugin for uploading CSV files and converting them to database tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/networking/http-ipv6"&gt;Making HTTP calls using IPv6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/jinja/format-thousands"&gt;Formatting thousands in Jinja&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/linux/iconv"&gt;Using iconv to convert the text encoding of a file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/datasette/register-new-plugin-hooks"&gt;Registering new Datasette plugin hooks by defining them in other plugins&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/annotated-release-notes"&gt;annotated-release-notes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="open-data"/><category term="plugins"/><category term="datasette"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="sqlite-utils"/><category term="annotated-release-notes"/></entry><entry><title>Usable Data</title><link href="https://simonwillison.net/2019/Jan/11/usable-data/#atom-tag" rel="alternate"/><published>2019-01-11T18:33:18+00:00</published><updated>2019-01-11T18:33:18+00:00</updated><id>https://simonwillison.net/2019/Jan/11/usable-data/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/@ftrain/usable-data-eb7234d64309"&gt;Usable Data&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A Paul Ford essay from February 2016 in which he advocates for SQLite as the ideal format for sharing interesting data. I don’t know how I missed this one—it predates Datasette, but it perfectly captures the benefits that I’m trying to expose with the project. “In my dream universe, there would be a massive searchable torrent site filled with open, explorable data sets, in SQLite format, some with full text search indexes already in place.”

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1083793391631032320"&gt;Twitter&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/paul-ford"&gt;paul-ford&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-data"/><category term="paul-ford"/><category term="sqlite"/><category term="datasette"/></entry><entry><title>Quoting Five stars of open data</title><link href="https://simonwillison.net/2018/Apr/17/five-stars-open-data/#atom-tag" rel="alternate"/><published>2018-04-17T04:20:28+00:00</published><updated>2018-04-17T04:20:28+00:00</updated><id>https://simonwillison.net/2018/Apr/17/five-stars-open-data/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://opendatahandbook.org/glossary/en/terms/five-stars-of-open-data/"&gt;&lt;p&gt;A rating system for open data proposed by Tim Berners-Lee, founder of the World Wide Web. To score the maximum five stars, data must (1) be available on the Web under an open licence, (2) be in the form of structured data, (3) be in a non-proprietary file format, (4) use URIs as its identifiers (see also RDF), (5) include links to other data sources (see linked data). To score 3 stars, it must satisfy all of (1)-(3), etc.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://opendatahandbook.org/glossary/en/terms/five-stars-of-open-data/"&gt;Five stars of open data&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tim-berners-lee"&gt;tim-berners-lee&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-data"/><category term="tim-berners-lee"/></entry><entry><title>GOV.UK Registers</title><link href="https://simonwillison.net/2017/Nov/7/govuk-registers/#atom-tag" rel="alternate"/><published>2017-11-07T15:31:46+00:00</published><updated>2017-11-07T15:31:46+00:00</updated><id>https://simonwillison.net/2017/Nov/7/govuk-registers/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://registers.cloudapps.digital/"&gt;GOV.UK Registers&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Canonical sources of “lists of information” intended for use by GDS teams building software for the UK government, but available for anyone. 17 registers are “ready for use”, 45 are “in progress”. Covers things like the FCO’s country list, the official list of prison estates, and DEFRA’s list of public bodies in England that manage drainage systems.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://shkspr.mobi/blog/2017/11/input-type-country/"&gt;Terence Eden&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/datagov"&gt;datagov&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/government"&gt;government&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gov-uk"&gt;gov-uk&lt;/a&gt;&lt;/p&gt;



</summary><category term="datagov"/><category term="government"/><category term="open-data"/><category term="gov-uk"/></entry><entry><title>Exploring United States Policing Data Using Python</title><link href="https://simonwillison.net/2017/Oct/29/exploring-united-states-policing-data-using-python/#atom-tag" rel="alternate"/><published>2017-10-29T16:58:36+00:00</published><updated>2017-10-29T16:58:36+00:00</updated><id>https://simonwillison.net/2017/Oct/29/exploring-united-states-policing-data-using-python/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.patricktriest.com/police-data-python/"&gt;Exploring United States Policing Data Using Python&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Outstanding introduction to data analysis with Jupyter and Pandas.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-data"/><category term="pandas"/><category term="python"/><category term="jupyter"/></entry><entry><title>OpenCorporates</title><link href="https://simonwillison.net/2010/Dec/22/opencorporates/#atom-tag" rel="alternate"/><published>2010-12-22T11:52:00+00:00</published><updated>2010-12-22T11:52:00+00:00</updated><id>https://simonwillison.net/2010/Dec/22/opencorporates/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://opencorporates.com/"&gt;OpenCorporates&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“The Open Database Of The Corporate World”—a URL for every UK company.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://blog.okfn.org/2010/12/20/opencorporates-the-open-database-of-the-corporate-world/"&gt;Open Knowledge Foundation Blog&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-data"/><category term="recovered"/></entry><entry><title>Doing things with Ordnance Survey OpenData</title><link href="https://simonwillison.net/2010/May/20/os/#atom-tag" rel="alternate"/><published>2010-05-20T15:22:00+00:00</published><updated>2010-05-20T15:22:00+00:00</updated><id>https://simonwillison.net/2010/May/20/os/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://frot.org/t/talks/techmeetup.html"&gt;Doing things with Ordnance Survey OpenData&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jo Walsh’s guide to processing Ordnance Survey OpenData using PostgreSQL and PostGIS.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mapping"&gt;mapping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ordnancesurvey"&gt;ordnancesurvey&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgis"&gt;postgis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postgresql"&gt;postgresql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jo-walsh"&gt;jo-walsh&lt;/a&gt;&lt;/p&gt;



</summary><category term="mapping"/><category term="open-data"/><category term="ordnancesurvey"/><category term="postgis"/><category term="postgresql"/><category term="recovered"/><category term="jo-walsh"/></entry><entry><title>Preview: Freebase Gridworks</title><link href="https://simonwillison.net/2010/Mar/27/gridworks/#atom-tag" rel="alternate"/><published>2010-03-27T18:43:42+00:00</published><updated>2010-03-27T18:43:42+00:00</updated><id>https://simonwillison.net/2010/Mar/27/gridworks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.freebase.com/2010/03/26/preview-freebase-gridworks/"&gt;Preview: Freebase Gridworks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
If my experience with government datasets has taught me anything, it’s that most datasets are collected by human beings (probably using Excel) and human beings are inconsistent. The first step in any data related project inevitably involves cleaning up the data. The Freebase team must run up against this all the time, and it looks like they’re tackling the problem head-on. Freebase Gridworks is just a screencast preview at the moment but an open source release is promised “within a month”—and the tool looks absolutely fantastic. DabbleDB-style data refactoring of spreadsheet data, running on your desktop but with the UI served in a browser. Full undo, a JavaScript-based expression language, powerful faceting and the ability to “reconcile” data against Freebase types (matching up country names, for example). I can’t wait to get my hands on this.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://blog.jonudell.net/2010/03/26/freebase-gridworks-a-power-tool-for-data-scrubbers/"&gt;Jon Udell&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cleanup"&gt;cleanup&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dabbledb"&gt;dabbledb&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data"&gt;data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/freebase"&gt;freebase&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gridworks"&gt;gridworks&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="cleanup"/><category term="dabbledb"/><category term="data"/><category term="freebase"/><category term="gridworks"/><category term="javascript"/><category term="open-data"/></entry><entry><title>No PDFs!</title><link href="https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag" rel="alternate"/><published>2009-11-01T12:04:36+00:00</published><updated>2009-11-01T12:04:36+00:00</updated><id>https://simonwillison.net/2009/Nov/1/pdfs/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.sunlightfoundation.com/2009/06/05/no-pdfs/"&gt;No PDFs!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The Sunlight Foundation point out that PDFs are a terrible way of implementing “more transparent government” due to their general lack of structure. At the Guardian (and I’m sure at other newspapers) we waste an absurd amount of time manually extracting data from PDF files and turning it into something more useful. Even CSV is significantly more useful for many types of information.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/adobe"&gt;adobe&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/csv"&gt;csv&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/opengovernment"&gt;opengovernment&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sunlightfoundation"&gt;sunlightfoundation&lt;/a&gt;&lt;/p&gt;



</summary><category term="adobe"/><category term="csv"/><category term="open-data"/><category term="opengovernment"/><category term="pdf"/><category term="sunlightfoundation"/></entry><entry><title>Show Us a Better Way</title><link href="https://simonwillison.net/2008/Jul/4/show/#atom-tag" rel="alternate"/><published>2008-07-04T09:36:43+00:00</published><updated>2008-07-04T09:36:43+00:00</updated><id>https://simonwillison.net/2008/Jul/4/show/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.showusabetterway.com/"&gt;Show Us a Better Way&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The UK Government’s Power of Information Taskforce are running a mashup competition (a.k.a. “ideas for new products that could improve the way public information is communicated”) with a £20,000 prize fund and gigabytes of brand new data and APIs. This is a great opportunity for the software community to demonstrate how important this kind of open data really is.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mashups"&gt;mashups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/powerofinformation"&gt;powerofinformation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ukgovernment"&gt;ukgovernment&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="mashups"/><category term="open-data"/><category term="powerofinformation"/><category term="ukgovernment"/></entry><entry><title>Freeing the postcode</title><link href="https://simonwillison.net/2006/Nov/17/postcode/#atom-tag" rel="alternate"/><published>2006-11-17T17:29:59+00:00</published><updated>2006-11-17T17:29:59+00:00</updated><id>https://simonwillison.net/2006/Nov/17/postcode/#atom-tag</id><summary type="html">
    &lt;p id="p-0"&gt;&lt;a href="http://en.wikipedia.org/wiki/UK postcodes"&gt;UK postcodes&lt;/a&gt; have some interesting characteristics: a full six character post code identifies an average of around 14 house holds, and postcodes are mainly hierarchical - W1W will always be contained within W1 for example. They're useful for a huge range of interesting things.&lt;/p&gt;

&lt;p id="p-1"&gt;The problem is that the postcode database (of nearly 1.8 million postcodes) is &lt;a href="http://en.wikipedia.org/wiki/Postcode_Address_File" title="Postcode Address File"&gt;owned by the Royal Mail&lt;/a&gt; and licensed at a not inconsiderable fee of between £150 and £9,000 per year.&lt;/p&gt;

&lt;p id="p-2"&gt;&lt;a href="http://www.freethepostcode.org/"&gt;Free the postcode&lt;/a&gt; was set up a while ago to try to remedy this situation, by asking people to enter their postcode along with the latitude/longitude coordinates collected from a GPS. Having people enter coordinates from online mapping services is no good as EU database law may see that as a derivative work. It's had some success, but the GPS requirement has seriously stunted its growth.&lt;/p&gt;

&lt;p id="p-3"&gt;Then a few weeks ago, &lt;a href="http://www.npemap.org.uk/" title="New Popular Edition Maps"&gt;npemap.org.uk&lt;/a&gt; launched. It's an interface for browsing scans of out-of-copyright maps from the 1950s (credits at the bottom of &lt;a href="http://www.npemap.org.uk/FAQ.html"&gt;the FAQ&lt;/a&gt;). The site asks people to enter post codes based on that old mapping data, which can then be placed in the public domain.&lt;/p&gt;

&lt;p id="p-4"&gt;If you haven't already done so, you should go and add any postcodes that you know about now. It takes no time at all, and is especially important if you live in one of the &lt;a href="http://www.npemap.org.uk/stats/missing_district_stats.html"&gt;230 districts&lt;/a&gt; for which no data has yet been collected.&lt;/p&gt;

&lt;p id="p-5"&gt;You can grab the data they've already collected &lt;a href="http://www.npemap.org.uk/data/" title="Download our postcodes"&gt;from here&lt;/a&gt;. There's a really cool &lt;a href="http://www.npemap.org.uk/postcodeine/"&gt;interactive visualisation&lt;/a&gt; of their data here, based on &lt;a href="http://bitter.ukcod.org.uk/~chris/postcodeine/"&gt;previous work&lt;/a&gt; by Chris Lightfoot using the commercially licensed postcode database.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/npemap"&gt;npemap&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-data"&gt;open-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postcode"&gt;postcode&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/royalmail"&gt;royalmail&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="npemap"/><category term="open-data"/><category term="postcode"/><category term="royalmail"/></entry></feed>