Simon Willison's Weblog: inaturalist

inaturalist-clumper 0.1

2026-05-15T23:53:11+00:00

Part of the infrastructure I use for publishing my iNaturalist sightings on my blog. I've been running this in production for a few weeks now, inspiring some iterations on how it works, so I decided to ship a 0.1 release.

You can see an example of the output in this JSON file.

Tags: projects, inaturalist

Sightings

2026-05-02T17:26:40+00:00

/elsewhere/sightings/

I have a new camera (a Canon R6 Mark II) so I'm taking a lot more photos of birds. I share my best wildlife photos on iNaturalist, and based on yesterday's successful prototype I decided to add those to my blog.

I built this feature on my phone using Claude Code for web, as an extension of my beats system for syndicating external content. Here's the PR and prompt.

As with my other forms of incoming syndicated content, sightings show up on the homepage, the date archive pages, and in site search results.

I back-populated over a decade of iNaturalist sightings, which means that if you search for lemur you'll see my lemur photos from Madagascar in 2019!

Tags: blogging, photography, wildlife, ai, inaturalist, generative-ai, llms, ai-assisted-programming, claude-code

iNaturalist Sightings

2026-05-01T19:35:41+00:00

Tool: iNaturalist Sightings

I wanted to see my iNaturalist observations - across two separate accounts - grouped by when they occurred. I'm camping this weekend so I built this entirely on my phone using Claude Code for web.

I started by building an inaturalist-clumper Python CLI for fetching and "clumping" observations - by default clumps use observations within 2 hours and 5km of each other.
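A minimal sketch of how a "within 2 hours and 5km" clumping rule could work. This is not the code from inaturalist-clumper: the Observation record, the function names and the greedy compare-to-the-previous-observation strategy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Observation:
    # Hypothetical minimal record; real iNaturalist observations carry many more fields.
    id: int
    observed_at: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def clump(observations, max_gap=timedelta(hours=2), max_km=5.0):
    """Walk observations in time order and start a new clump whenever the next
    one is too far in time or space from the previous observation."""
    clumps = []
    for obs in sorted(observations, key=lambda o: o.observed_at):
        if clumps:
            last = clumps[-1][-1]
            close_in_time = obs.observed_at - last.observed_at <= max_gap
            close_in_space = haversine_km(last.lat, last.lon, obs.lat, obs.lon) <= max_km
            if close_in_time and close_in_space:
                clumps[-1].append(obs)
                continue
        clumps.append([obs])
    return clumps
```

Comparing each observation only to the one before it keeps the logic simple, but it means a long walk can stay in one clump as long as no single step exceeds the thresholds. Comparing against the first observation in the clump would be a stricter alternative.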

Then I set up simonw/inaturalist-clumps as a Git scraping repository to run that tool and record the result to clumps.json.

That JSON file is hosted on GitHub, which means it can be fetched by JavaScript using CORS.

Finally I ran this prompt against my simonw/tools repo:

Build inat-sightings.html - an app that does a fetch() against https://raw.githubusercontent.com/simonw/inaturalist-clumps/refs/heads/main/clumps.json and then displays all of the observations on one page using the https://static.inaturalist.org/photos/538073008/small.jpg small.jpg URLs for the thumbnails - with loading=lazy - but when a thumbnail is clicked showing the large.jpg in an HTML modal. Both small and large should include the common species names if available
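The small.jpg to large.jpg swap described in that prompt comes down to rewriting the last path segment of the photo URL. A minimal Python sketch, assuming the size name is always the final filename in the URL (the function name is hypothetical and this is not code from the generated app):

```python
import re


def photo_url_for_size(url: str, size: str) -> str:
    """Swap the size component of a photo URL, e.g. small.jpg -> large.jpg.

    Assumes the URL ends in <size>.<extension> with no query string.
    """
    return re.sub(
        r"/(square|small|medium|large|original)\.(\w+)$",
        rf"/{size}.\2",
        url,
    )
```

URLs that do not match the expected pattern are returned unchanged.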

Tags: tools, ai, inaturalist, generative-ai, llms, claude-code

Building and deploying a custom site using GitHub Actions and GitHub Pages

2025-03-18T20:17:34+00:00

I figured out a minimal example of how to use GitHub Actions to run custom scripts to build a website and then publish that static site to GitHub Pages. I turned the example into a template repository, which should make getting started for a new project extremely quick.

I've needed this for various projects over the years, but today I finally put these notes together while setting up a system for scraping the iNaturalist API for recent sightings of the California Brown Pelican and converting those into an Atom feed that I can subscribe to in NetNewsWire.

I got Claude to write me the script that converts the scraped JSON to Atom.
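For a sense of what a JSON-to-Atom conversion involves, here is a minimal sketch using only the Python standard library. It is not the script Claude wrote: the function name and the observation keys used here (uri, species_guess, time_observed_at) are assumptions about the scraped data.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def observations_to_atom(observations, feed_id, feed_title, updated, author_name):
    """Build a minimal Atom feed (as a string) from a list of observation dicts."""
    ET.register_namespace("", ATOM_NS)

    def tag(name):
        return f"{{{ATOM_NS}}}{name}"

    feed = ET.Element(tag("feed"))
    ET.SubElement(feed, tag("id")).text = feed_id
    ET.SubElement(feed, tag("title")).text = feed_title
    ET.SubElement(feed, tag("updated")).text = updated
    author = ET.SubElement(feed, tag("author"))
    ET.SubElement(author, tag("name")).text = author_name
    for obs in observations:
        entry = ET.SubElement(feed, tag("entry"))
        # Assumed keys: each observation has a stable URL, a name and a timestamp
        ET.SubElement(entry, tag("id")).text = obs["uri"]
        ET.SubElement(entry, tag("title")).text = obs["species_guess"]
        ET.SubElement(entry, tag("updated")).text = obs["time_observed_at"]
        ET.SubElement(entry, tag("link"), href=obs["uri"])
    return ET.tostring(feed, encoding="unicode")
```

Timestamps are passed through unchanged, so they need to already be in the RFC 3339 format that Atom expects.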

Update: I just found out iNaturalist have their own Atom feeds! Here's their own feed of recent Pelican observations.

Tags: atom, github, netnewswire, inaturalist, github-actions, git-scraping, ai-assisted-programming

Weeknotes: Rocky Beaches, Datasette 0.48, a commit history of my database

2020-08-21T00:52:16+00:00

This week I helped Natalie launch Rocky Beaches, shipped Datasette 0.48 and several releases of datasette-graphql, upgraded the CSRF protection for datasette-upload-csvs and figured out how to get a commit log of changes to my blog by backing up its database to a GitHub repository.

Rocky Beaches

Natalie released the first version of rockybeaches.com this week. It's a site that helps you find places to go tidepooling (known as rockpooling in the UK) and figure out the best times to go based on low tide times.

I helped out with the backend for the site, mainly as an excuse to further explore the idea of using Datasette to power full websites (previously explored with Niche Museums and my TILs).

The site uses a pattern I've been really enjoying: it's essentially a static dynamic site. Pages are dynamically rendered by Datasette using Jinja templates and a SQLite database, but the database itself is treated as a static asset: it's built at deploy time by this GitHub Actions workflow and deployed (currently to Vercel) as a binary asset along with the code.

The build script uses yaml-to-sqlite to load two YAML files - places.yml and stations.yml - and create the stations and places database tables.

It then runs two custom Python scripts to fetch relevant data for those places from iNaturalist and the NOAA Tides & Currents API.

The data all ends up in the Datasette instance that powers the site - you can browse it at www.rockybeaches.com/data or interact with it using the GraphQL API at www.rockybeaches.com/graphql

The code is a little convoluted at the moment - I'm still iterating towards the best patterns for building websites like this using Datasette - but I'm very pleased with the productivity and performance that this approach produced.

Datasette 0.48

Highlights from the Datasette 0.48 release notes:

Datasette documentation now lives at docs.datasette.io
The extra_template_vars, extra_css_urls, extra_js_urls and extra_body_script plugin hooks now all accept the same arguments. See extra_template_vars(template, database, table, columns, view_name, request, datasette) for details. (#939)
Those hooks now accept a new columns argument detailing the table columns that will be rendered on that page. (#938)

I released a new version of datasette-cluster-map that takes advantage of the new columns argument to only inject Leaflet maps JavaScript onto the page if the table being rendered includes latitude and longitude columns - previously the plugin would load extra code on pages that weren't going to render a map at all. That's now running on https://global-power-plants.datasettes.com/.
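The decision the plugin makes can be sketched as a plain Python function. This is an illustration rather than the datasette-cluster-map source: the function names and the JavaScript URL are hypothetical, and a real plugin would return the list from its extra_js_urls(template, database, table, columns, view_name, request, datasette) hook.

```python
def needs_map(columns, latitude_column="latitude", longitude_column="longitude"):
    """Return True only when the table being rendered has both coordinate columns.

    columns is expected to be a list of column names, or None on pages that
    do not render a table.
    """
    if not columns:
        return False
    return latitude_column in columns and longitude_column in columns


def map_js_urls(columns):
    # Hypothetical asset path, used here only to show the shape of the return value
    if needs_map(columns):
        return ["/-/static-plugins/example-map/leaflet.js"]
    return []
```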

datasette-graphql

Using datasette-graphql for Rocky Beaches inspired me to add two new features:

A new graphql() Jinja custom template function that lets you execute custom GraphQL queries inside a Datasette template page - which turns out to be a pretty elegant way for the template to load exactly the data that it needs in order to render the page. Here's how Rocky Beaches uses that. Issue 50.
Some of the iNaturalist data that Rocky Beaches uses is stored as JSON data in text columns in SQLite - mainly because I was too lazy to model it out as tables. This was coming out of the GraphQL API as strings-containing-JSON, so I added a json_columns plugin configuration mechanism for turning those into Graphene GenericScalar fields - see issue 53 for details.

I also landed a big performance improvement. The plugin works by introspecting the database and generating a GraphQL schema that represents those tables, columns and views. For databases with a lot of tables this can get expensive, and the introspection was being run on every request.

I didn't want to require a server restart any time the schema changed, so I didn't want to cache the schema in-memory. Ideally it would be cached but the cache would become invalid any time the schema itself changed.

It turns out SQLite has a mechanism for this: the PRAGMA schema_version statement, which returns an integer version number that changes any time the underlying schema is changed (e.g. a table is added or modified).

I built a quick datasette-schema-versions plugin to try this feature out (in less than twenty minutes thanks to my datasette-plugin cookiecutter template) and prove to myself that it works. Then I built a caching mechanism for datasette-graphql that uses the current schema_version as the cache key. See issue 51 for details.

asgi-csrf and datasette-upload-csvs

datasette-upload-csvs is a Datasette plugin that adds a form for uploading CSV files and converting them to SQLite tables.

Datasette 0.44 added CSRF protection, which broke the plugin. I fixed that this week, but it took some extra work because file uploads use the multipart/form-data HTTP mechanism and my asgi-csrf library didn't support that.

I added that support this week, but the code was quite complicated. Since asgi-csrf is a security library I decided to aim for 100% code coverage, the first time I've done that for one of my projects.

I got there with the help of codecov.io and pytest-cov. I wrote up what I learned about those tools in a TIL.

Backing up my blog database to a GitHub repository

I really like keeping content in a git repository (see Rocky Beaches and Niche Museums). Every content management system I've ever worked on has eventually desired revision control, and modeling that in a database and adding it to an existing project is always a huge pain.

I have 18 years of content on this blog. I want that backed up to git - and this week I realized I have the tools to do that already.

db-to-sqlite is my tool for taking any SQLAlchemy supported database (so far tested with MySQL and PostgreSQL) and exporting it into a SQLite database.

sqlite-diffable is a very early stage tool I built last year. The idea is to dump a SQLite database out to disk in a way that is designed to work well with git diffs. Each table is dumped out as newline-delimited JSON, one row per line.
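To illustrate the one-row-per-line idea, here is a minimal sketch using the standard library. It is not the sqlite-diffable code and its output format may well differ from what that tool writes: the function name, the .ndjson file naming and the choice of one JSON object per row are assumptions.

```python
import json
import sqlite3
from pathlib import Path


def dump_database(db_path, out_dir):
    """Write each table to <table>.ndjson, one JSON object per row.

    Keeping one row per line means a changed row shows up as a
    single-line change in a git diff.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    names = [
        row[0]
        for row in conn.execute(
            "select name from sqlite_master where type = 'table' order by name"
        )
    ]
    # Skip SQLite's internal bookkeeping tables
    tables = [name for name in names if not name.startswith("sqlite_")]
    for table in tables:
        with open(out / f"{table}.ndjson", "w") as fp:
            for row in conn.execute(f'select * from "{table}"'):
                # sort_keys keeps output stable; repr() is a fallback for binary values
                fp.write(json.dumps(dict(row), sort_keys=True, default=repr) + "\n")
    conn.close()
    return tables
```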

So... how about converting my blog's PostgreSQL database to SQLite, then dumping it to disk with sqlite-diffable and committing the result to a git repository? And then running that in a GitHub Action?

Here's the workflow. It does exactly that, with a few extra steps: it only grabs a subset of my tables, and it redacts the password column from my auth_user table so that my hashed password isn't exposed in the backup.

I now have a commit log of changes to my blog's database!

I've set it to run nightly, but I can trigger it manually by clicking a button too.

TIL this week

Releases this week

datasette-graphql 0.14 - 2020-08-20
datasette-graphql 0.13 - 2020-08-19
datasette-schema-versions 0.1 - 2020-08-19
datasette-graphql 0.12.3 - 2020-08-19
github-to-sqlite 2.5 - 2020-08-18
datasette-publish-vercel 0.8 - 2020-08-17
datasette-cluster-map 0.12 - 2020-08-16
datasette 0.48 - 2020-08-16
datasette-graphql 0.12.2 - 2020-08-16
datasette-saved-queries 0.2.1 - 2020-08-15
datasette 0.47.3 - 2020-08-15
datasette-upload-csvs 0.5 - 2020-08-15
asgi-csrf 0.7 - 2020-08-15
asgi-csrf 0.7a0 - 2020-08-15
datasette-cluster-map 0.11.1 - 2020-08-14
datasette-cluster-map 0.11 - 2020-08-14
datasette-graphql 0.12.1 - 2020-08-13

Tags: csrf, databases, git, github, natalie-downe, projects, graphql, datasette, inaturalist, weeknotes

Practical Deep Learning for Coders 2019

2019-01-26T00:32:52+00:00

The deep learning evening course I took a few months ago has now been shared online in full, and it’s outstanding. “After the first lesson you’ll be able to train a state-of-the-art image classification model on your own data”—can confirm: after just the first lesson I built a bobcat vs. cougar classifier using photos from iNaturalist.

The biggest thing I learned from the course is how powerful transfer learning is. I used to think you needed a huge amount of data to get good results from deep learning. That’s no longer true: you can take an existing model (e.g. ResNet for image classification) and train on top of it.

ResNet can classify images into 1,000 classes (house, cat, etc)—training on a few hundred extra images of e.g. Bobcats vs Cougars only takes a couple of minutes on a GPU and can give you crazily accurate results.

It works because the pre-trained model can already pick up really subtle details—fur patterns, ear shapes etc—so you only need to train a few more layers on it for it to be able to classify against the patterns in your new set of training images.

And this doesn’t just work for image classification! Natural language processing benefits from transfer learning too: take an existing model trained on the entire corpus of Wikipedia (so it knows patterns from sentence structures) and you can build IMDB sentiment analysis on top. That’s in lesson 4.

Via @simonw

Tags: machine-learning, inaturalist

Develop Your Naturalist Superpowers with Observable Notebooks and iNaturalist

2018-12-18T22:39:19+00:00

Natalie’s article for this year’s 24 ways advent calendar shows how you can use Observable notebooks to quickly build interactive visualizations against web APIs. She uses the iNaturalist API to show species of Nudibranchs that you might see in a given month, plus a Vega-powered graph of sightings over the course of the year. This really inspired me to think harder about how I can use Observable to solve some of my API debugging needs, and I’ve already spun up a couple of private Notebooks to exercise new APIs that I’m building at work. It’s a huge productivity boost.

Via @natbat

Tags: natalie-downe, webapis, inaturalist, observable, nudibranchs

Automatically playing science communication games with transfer learning and fastai

2018-10-29T03:16:33+00:00

This weekend was the 9th annual Science Hack Day San Francisco, which was also the 100th Science Hack Day held worldwide.

Natalie and I decided to combine our interests and build something fun.

I’m currently enrolled in Jeremy Howard’s Deep Learning course so I figured this was a great opportunity to try out some computer vision.

Natalie runs the SciComm Games calendar and accompanying @SciCommGames bot to promote and catalogue science communication hashtag games on Twitter.

Hashtag games? Natalie explains them here - essentially they are games run by scientists on Twitter to foster public engagement around an animal or topic by challenging people to identify if a photo is a #cougarOrNot or participate in a #TrickyBirdID or identify #CrowOrNo or many others.

Combining the two… we decided to build a bot that automatically plays these games using computer vision. So far it’s just trying #cougarOrNot - you can see the bot in action at @critter_vision.

Training data from iNaturalist

In order to build a machine learning model, you need to start out with some training data.

I’m a big fan of iNaturalist, a citizen science project that encourages users to upload photographs of wildlife (and plants) they have seen and have their observations verified by a community. Natalie and I used it to build owlsnearme.com earlier this year - the API in particular is fantastic.

iNaturalist has over 5,000 verified sightings of felines (cougars, bobcats, domestic cats and more) in the USA.

The raw data is available as a paginated JSON API. The medium sized photos are just the right size for training a neural network.
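A sketch of walking a paginated JSON API until enough items have been collected. The page-fetching function is passed in so the loop can be tried without network access. The function names are hypothetical, and the response shape (a "results" list per page, photos whose "url" contains a size name such as square) is an assumption about the API rather than something stated in this post.

```python
from typing import Callable, Dict, Iterator, List


def iter_results(fetch_page: Callable[[int], Dict], limit: int = 5000) -> Iterator[Dict]:
    """Yield up to `limit` items from a paginated JSON API.

    fetch_page(page) must return a dict with a "results" list; an empty
    list signals that there are no more pages.
    """
    page, seen = 1, 0
    while seen < limit:
        results = fetch_page(page).get("results", [])
        if not results:
            return
        for item in results:
            yield item
            seen += 1
            if seen >= limit:
                return
        page += 1


def medium_photo_urls(observation: Dict) -> List[str]:
    # Assumption: each photo dict has a "url" pointing at the square thumbnail
    return [
        photo["url"].replace("/square.", "/medium.")
        for photo in observation.get("photos", [])
    ]
```

In real use fetch_page would make an HTTP request for the given page number and return the decoded JSON.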

I started by grabbing 5,000 images and saving them to disk with a filename that reflected their identified species:

Bobcat_9005106.jpg
Domestic-Cat_10068710.jpg
Bobcat_15713672.jpg
Domestic-Cat_6755280.jpg
Mountain-Lion_9075705.jpg
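Filenames in that shape could be produced by a small helper like the following. This is a sketch, not the code used for the hack: the function name is hypothetical, and it assumes the species name and numeric observation ID are already in hand.

```python
import re


def image_filename(species: str, observation_id: int) -> str:
    """Build a filename such as Domestic-Cat_10068710.jpg.

    Runs of non-alphanumeric characters in the species name become single
    hyphens, so the underscore only ever separates the label from the ID.
    """
    label = re.sub(r"[^A-Za-z0-9]+", "-", species.strip()).strip("-")
    return f"{label}_{observation_id}.jpg"
```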

Building a model

I’m only one week into the fast.ai course so this really isn’t particularly sophisticated yet, but it was just about good enough to power our hack.

The main technique we are learning in the course is called transfer learning, and it really is shockingly effective. Instead of training a model from scratch you start out with a pre-trained model and use some extra labelled images to train a small number of extra layers.

The initial model we are using is ResNet-34, a 34-layer neural network trained on 1,000 labelled categories in the ImageNet corpus.

In class, we learned to use this technique to get 94% accuracy against the Oxford-IIIT Pet Dataset - around 7,000 images covering 12 cat breeds and 25 dog breeds. In 2012 the researchers at Oxford were able to get 59.21% using a sophisticated model - in 2018 we can get 94% with transfer learning and just a few lines of code.

I started with an example provided in class, which loads and trains images from files on disk using a regular expression that extracts the labels from the filenames.
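The label extraction itself can be tried on its own with Python's re module. A small sketch, assuming filenames in the Species_ID.jpg shape shown earlier (the helper name and the example path are hypothetical):

```python
import re

# Capture everything between the last "/" and the final "_<digits>.jpg"
LABEL_RE = re.compile(r"/([^/]+)_\d+\.jpg$")


def label_from_path(path: str) -> str:
    """Return the species label encoded in an image path."""
    match = LABEL_RE.search(path)
    if match is None:
        raise ValueError(f"No label found in {path!r}")
    return match.group(1)
```

Because the capture group is greedy, a label that itself contains an underscore is still handled: only the last underscore followed by digits is treated as the separator.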

My full Jupyter notebook is inaturalist-cats.ipynb - the key training code is as follows:

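from fastai import *
from fastai.vision import *
# Directory of downloaded iNaturalist photos, named <Species>_<id>.jpg
cat_images_path = Path('/home/jupyter/.fastai/data/inaturalist-usa-cats/images')
cat_fnames = get_image_files(cat_images_path)
# Build the dataset, using a regular expression to pull each label out of its filename
cat_data = ImageDataBunch.from_name_re(
    cat_images_path,
    cat_fnames,
    r'/([^/]+)_\d+\.jpg$',
    ds_tfms=get_transforms(),
    size=224
)
# Normalize using the ImageNet statistics the pre-trained model expects
cat_data.normalize(imagenet_stats)
# Transfer learning: start from a pre-trained ResNet-34 and train for four epochs
cat_learn = ConvLearner(cat_data, models.resnet34, metrics=error_rate)
cat_learn.fit_one_cycle(4)
# Save the generated model to disk
cat_learn.save("usa-inaturalist-cats")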

Calling cat_learn.save("usa-inaturalist-cats") created an 84MB file on disk at /home/jupyter/.fastai/data/inaturalist-usa-cats/images/models/usa-inaturalist-cats.pth - I used scp to copy that model down to my laptop.

This model gave me a 24% error rate which is pretty terrible - others on the course have been getting error rates less than 10% for all kinds of interesting problems. My focus was to get a model deployed as an API though so I haven’t spent any additional time fine-tuning things yet.

Deploying the model as an API

The fastai library strongly encourages training against a GPU, using pytorch and PyCUDA. I’ve been using an n1-highmem-8 Google Cloud Platform instance with an attached Tesla P4, then running everything in a Jupyter notebook there. This costs around $0.38 an hour - fine for a few hours of training, but way too expensive to permanently host a model.

Thankfully, while a GPU is essential for productively training models it’s not nearly as important for evaluating them against new data. pytorch can run in CPU mode for that just fine on standard hardware, and the fastai README includes instructions on installing it for a CPU using pip.

I started out by ensuring I could execute my generated model on my own laptop (since pytorch doesn’t yet work with the GPU built into the MacBook Pro). Once I had that working, I used the resulting code to write a tiny Starlette-powered API server. The code for that can be found in cougar.py.

fastai is under very heavy development and the latest version doesn’t quite have a clean way of loading a model from disk without also including the initial training images, so I had to hack around quite a bit to get this working using clues from the fastai forums. I expect this to get much easier over the next few weeks as the library continues to evolve based on feedback from the current course.

To deploy the API I wrote a Dockerfile and shipped it to Zeit Now. Now remains my go-to choice for this kind of project, though unfortunately their new (and brilliant) v2 platform imposes a 100MB image size limit - not nearly enough when the model file itself weighs in at 83 MB. Thankfully it’s still possible to specify their v1 cloud which is more forgiving for larger applications.

Here’s the result: an API which can accept either the URL to an image or an uploaded image file: https://cougar-or-not.now.sh/ - try it out with a cougar and a bobcat.

The Twitter Bot

Natalie built the Twitter bot. It runs as a scheduled task on Heroku and works by checking for new #cougarOrNot tweets from Dr. Michelle LaRue, extracting any images, passing them to my API and replying with a tweet that summarizes the results. Take a look at its recent replies to get a feel for how well it is doing.

Amusingly, Dr. LaRue frequently tweets memes to promote upcoming competitions and marks them with the same hashtag. The bot appears to think that most of the memes are bobcats! I should definitely spend some time tuning that model.

Science Hack Day was great fun. A big thanks to the organizing team, and congrats to all of the other participants. I’m really looking forward to the next one.

Plus… we won a medal!

Enjoyed #scienceHackday this weekend, made & launched a cool machine learning hack to process images & work out if they have a cougar in them or not! #CougarOrNot @critter_vision

... we won a medal!

Bot code: https://t.co/W2jZcGCnFr
Machine learning API: https://t.co/swNiKlcTp0 pic.twitter.com/dcdIhNZy63
— Natbat (@Natbat) October 29, 2018

Tags: computer-vision, machine-learning, natalie-downe, inaturalist, fastai, transferlearning, jeremy-howard, starlette

owlsnearme source code on GitHub

2018-02-04T22:33:34+00:00

Here’s the source code for our new owlsnearme.com project. It’s a single-page React application that pulls all of its data from the iNaturalist API. We built it this weekend with the SuperbOwl kick-off as a hard deadline so it’s not the most beautiful React code, but it’s a nice demonstration of how React (and create-react-app in particular) can be used for rapid development.

Tags: github, javascript, natalie-downe, projects, react, inaturalist

Owls Near Me

2018-02-04T22:26:29+00:00

Back in 2010 Natalie and I shipped owlsnearyou.com—a website for finding your nearest owls, using data from the sadly deceased WildlifeNearYou (RIP). To celebrate #SuperbOwl Sunday we rebuilt the same concept on top of the excellent iNaturalist API. Search for a place to see which owls have been spotted there, or click the magic button to geolocate your device and see which owls have been spotted in your nearby area!

Tags: natalie-downe, projects, wildlifenearyou, inaturalist

6M observations total! Where has iNaturalist grown in 80 days with 1 million new observations?

2018-01-28T20:18:58+00:00

Citizen science app iNaturalist is seeing explosive growth at the moment—they’ve been around for nearly a decade but 1/6 of the observations posted to the site were added in just the past few months. Having tried the latest version of their iPhone app it’s easy to see why: snap a photo of some nature and upload it to the app and it will use surprisingly effective machine learning to suggest the genus or even the individual species. Submit the observation and within a few minutes other iNaturalist community members will confirm the identification or suggest a correction. It’s brilliantly well executed and an utter delight to use.

Tags: computer-vision, crowdsourcing, machine-learning, science, citizenscience, inaturalist