Simon Willison's Weblog: crowdsourcing

Trying to end the pandemic a little earlier with VaccinateCA

2021-02-28T05:40:28+00:00

This week I got involved with the VaccinateCA effort. We are trying to end the pandemic a little earlier, by building the most accurate database possible of vaccination locations and availability in California.

VaccinateCA

I’ve been following this project for a while through Twitter, mainly via Patrick McKenzie - here’s his tweet about the project from January 20th.

https://t.co/JrD5mb4TAN calls medical professionals daily to ask who they could vaccinate and how to get in line. We publish this, covering the entire state of California, to help more people get their vaccines faster. Please tell your friends and networks.
- Patrick McKenzie (@patio11) January 20, 2021

The core idea is one of those things that sounds obviously correct the moment you hear it. The Covid vaccination roll-out is decentralized and pretty chaotic. VaccinateCA realized that the best way to find out where the vaccine is available is to call the places that are distributing it - pharmacies, hospitals, clinics - as often as possible and ask if they have any in stock, who is eligible for the shot and how people can sign up for an appointment.

What We've Learned (So Far) by Patrick talks about lessons learned in the first 42 days of the project.

There are three public-facing components to VaccinateCA:

www.vaccinateca.com is a website to help you find available vaccines near you.
help.vaccinateca is the web app used by volunteers who make calls - it provides a script and buttons to submit information gleaned from the call. If you’re interested in volunteering there’s information on the website.
api.vaccinateca is the public API, which is documented here and is also used by the end-user facing website. It provides a full dump of collected location data, plus information on county policies and large-scale providers (pharmacy chains, health care providers).

The system currently mostly runs on Airtable, and takes advantage of pretty much every feature of that platform.

Why I got involved

Jesse Vincent convinced me to get involved. It turns out to be a perfect fit for both my interests and my skills and experience.

I’ve built crowdsourcing platforms before - for MPs’ expense reports at the Guardian, and then for conference and event listings with our startup, Lanyrd.

VaccinateCA is a very data-heavy organization: the key goal is to build a comprehensive database of vaccine locations and availability. My background in data journalism and the last three years I’ve spent working on Datasette have given me a wealth of relevant experience here.

And finally… VaccinateCA are quickly running up against the limits of what you can sensibly do with Airtable - especially given Airtable’s hard limit at 100,000 records. They need to port critical tables to a custom PostgreSQL database, while maintaining as much as possible the agility that Airtable has enabled for them.

Django is a great fit for this kind of challenge, and I know quite a bit about both Django and using Django to quickly build robust, scalable and maintainable applications!

So I spent this week starting a Django replacement for the Airtable backend used by the volunteer calling application. I hope to get to feature parity (at least as an API backend that the application can write to) in the next few days, to demonstrate that a switch-over is both possible and a good idea.

What about Datasette?

On Monday I spun up a Datasette instance at vaccinateca.datasette.io (underlying repository) against data from the public VaccinateCA API. The map visualization of all of the locations instantly proved useful in helping spot locations that had incorrectly been located with latitudes and longitudes outside of California.
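
The same kind of sanity check can be approximated in code. This is a minimal sketch, assuming each record has `latitude` and `longitude` keys (hypothetical field names, not taken from the VaccinateCA API) and using a rough bounding box rather than the real state boundary, so a point in western Nevada would still pass.

```python
# Approximate bounding box for California. This is an assumption for
# illustration; a stricter check would use the state's boundary polygon.
CA_MIN_LAT, CA_MAX_LAT = 32.5, 42.0
CA_MIN_LON, CA_MAX_LON = -124.5, -114.1


def outside_california(locations):
    """Return the locations whose coordinates fall outside the rough box."""
    suspicious = []
    for loc in locations:
        lat, lon = loc["latitude"], loc["longitude"]
        in_box = (
            CA_MIN_LAT <= lat <= CA_MAX_LAT
            and CA_MIN_LON <= lon <= CA_MAX_LON
        )
        if not in_box:
            suspicious.append(loc)
    return suspicious


# Hypothetical records, not real VaccinateCA data.
sample = [
    {"name": "Clinic A", "latitude": 37.77, "longitude": -122.42},
    {"name": "Clinic B", "latitude": 37.77, "longitude": 122.42},  # sign flipped
]
print([loc["name"] for loc in outside_california(sample)])
```

A flipped longitude sign, as in the second record, is the sort of data-entry slip that a check like this (or a map) makes obvious.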

I hope to use Datasette for a variety of tasks like this, but it shouldn’t be the core of the solution. VaccinateCA is the perfect example of a problem that needs to be solved with Boring Technology - it needs to Just Work, and time that could be spent learning exciting new technologies needs to be spent building what’s needed as quickly, robustly and risk-free as possible.

That said, I’m already starting to experiment with the new JSONField introduced in Django 3.1 - I’m hoping that a few JSON columns can help compensate for the lack of flexibility compared to Airtable, which makes it ridiculously easy for anyone to add additional columns.

(To be fair, JSONField has been a feature of Django’s PostgreSQL extension since version 1.9 in 2015, so it’s just about made it into the boring technology bucket by now.)
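
To illustrate the idea of fixed columns plus a flexible JSON column, here is a minimal sketch that uses the standard library's `sqlite3` and a JSON-encoded text column as a stand-in for Django's JSONField on PostgreSQL. The table, column and helper names are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed, well-understood columns plus one free-form "extra" column.
conn.execute(
    "CREATE TABLE location ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "extra TEXT NOT NULL DEFAULT '{}')"
)


def add_location(name, **extra):
    """Insert a location; any ad-hoc fields are stored as JSON."""
    conn.execute(
        "INSERT INTO location (name, extra) VALUES (?, ?)",
        (name, json.dumps(extra)),
    )


def get_extra(name):
    """Return the ad-hoc fields for a location as a dict."""
    row = conn.execute(
        "SELECT extra FROM location WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0])


add_location("Clinic A")
# A new field appears without any schema change.
add_location("Clinic B", accepts_walk_ins=True)
print(get_extra("Clinic B"))
```

The trade-off is that fields living inside the JSON blob get none of the validation or indexing that a real column would have, so this works best for a handful of experimental attributes.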

Also this week

Working on VaccinateCA has given me a chance to use some of my tools in new and interesting ways, so I got to ship a bunch of small fixes, detailed in Releases this week below.

On Friday I gave a talk at Speakeasy JS, "the JavaScript meetup for 🥼 mad science, 🧙‍♂️ hacking, and 🧪 experiments" about why "SQL in your client-side JavaScript is a great idea". The video for that is on YouTube and I plan to provide a full write-up soon.

I also recorded a five minute lightning talk about Git Scraping for next week's NICAR 2021 data journalism conference.

I also made a few small cosmetic upgrades to the way tags are displayed on my blog - they now show with a rounded border and purple background, and include a count of items published with that tag. My tags page is one example of where I've now applied this style.

Releases this week

flatten-single-item-arrays: 0.1 - 2021-02-25
Given a JSON list of objects, flatten any keys which always contain single item arrays to just a single value
datasette-auth-github: 0.13.1 - (25 releases total) - 2021-02-25
Datasette plugin that authenticates users against GitHub
datasette-block: 0.1.1 - (2 releases total) - 2021-02-25
Block all access to specific path prefixes
github-contents: 0.2 - 2021-02-24
Python class for reading and writing data to a GitHub repository
csv-diff: 1.1 - (9 releases total) - 2021-02-23
Python CLI tool and library for diffing CSV and JSON files
sqlite-transform: 0.4 - (5 releases total) - 2021-02-22
Tool for running transformations on columns in a SQLite database
airtable-export: 0.5 - (7 releases total) - 2021-02-22
Export Airtable data to YAML, JSON or SQLite files on disk
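
The behaviour described for flatten-single-item-arrays in the list above can be sketched in a few lines. This is an independent illustration, not the tool's actual implementation, and it assumes that objects missing a key do not stop that key from being flattened.

```python
def flatten_single_item_arrays(rows):
    """Flatten keys whose value is a one-item list in every row that has them."""
    all_keys = {key for row in rows for key in row}
    flattenable = {
        key
        for key in all_keys
        if all(
            isinstance(row[key], list) and len(row[key]) == 1
            for row in rows
            if key in row
        )
    }
    return [
        {
            key: (value[0] if key in flattenable else value)
            for key, value in row.items()
        }
        for row in rows
    ]


# Hypothetical input: "county" is always a single-item list, "tags" is not.
rows = [
    {"id": 1, "county": ["Alameda"], "tags": ["a", "b"]},
    {"id": 2, "county": ["Marin"], "tags": ["c"]},
]
print(flatten_single_item_arrays(rows))
```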

Tags: crowdsourcing, django, postgresql, patrick-mckenzie, datasette, weeknotes, covid19, vaccinate-ca, personal-news, jesse-vincent

free-for.dev

2019-12-26T10:03:27+00:00

It’s pretty amazing how much you can build on free tiers these days—perfect for experimenting with side-projects. free-for.dev collects free SaaS tools for developers via pull request, and has had contributions from over 500 people.

Via @addyosmani

Tags: crowdsourcing, tools

6M observations total! Where has iNaturalist grown in 80 days with 1 million new observations?

2018-01-28T20:18:58+00:00

Citizen science app iNaturalist is seeing explosive growth at the moment—they’ve been around for nearly a decade but 1/6 of the observations posted to the site were added in just the past few months. Having tried the latest version of their iPhone app it’s easy to see why: snap a photo of some nature and upload it to the app and it will use surprisingly effective machine learning to suggest the genus or even the individual species. Submit the observation and within a few minutes other iNaturalist community members will confirm the identification or suggest a correction. It’s brilliantly well executed and an utter delight to use.

Tags: computer-vision, crowdsourcing, machine-learning, science, citizenscience, inaturalist

Color Survey Results

2010-05-05T15:59:00+00:00

XKCD asked anonymous netizens to provide names for random colours. The results (collated from 222,500 user sessions that named over 5 million colours) are fascinating.

Tags: crowdsourcing, science, surveys, xkcd, recovered, colours

WildlifeNearYou talk at £5 app, and being Wired (not Tired)

2010-04-11T20:42:11+00:00

Two quick updates about WildlifeNearYou. First up, I gave a talk about the site at £5 app, my favourite Brighton evening event which celebrates side projects and the joy of Making Stuff. I talked about the site's genesis on a fort, crowdsourcing photo ratings, how we use Freebase and DBpedia and how integrating with Flickr's machine tags gave us a powerful location API for free. Here's the video of the talk, courtesy of Ian Ozsvald:

£5 App #22 WildLifeNearYou by Simon Willison and Natalie Downe from IanProCastsCoUk on Vimeo.

Secondly, I'm excited to note that WildlifeNearYou spin-off OwlsNearYou.com is featured in UK Wired magazine's Wired / Tired / Expired column... and we're Wired!

Tags: api, crowdsourcing, fivepoundapp, flickr, freebase, natalie-downe, owlsnearyou, my-talks, wildlifenearyou, wired

Help pick the best photos, but watch out, it's addictive!

2010-01-25T00:36:35+00:00

My favourite WildlifeNearYou feature yet—our new tool asks you to pick the best from two photos, then uses the results to rank all of the photos for each species. It’s surprisingly addictive—we had over 5,000 votes in the first two hours, peaking at 4 or 5 votes a second. The feature seems to be staying nice and speedy thanks to Redis under the hood. Photos in the top three for any given species display a medal on their photo page.
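
The post doesn't say how the votes are turned into a ranking. One simple possibility, sketched here purely as an assumption, is to rank each photo by the fraction of its head-to-head comparisons that it won. The in-memory dictionaries stand in for whatever counters the real site keeps in Redis.

```python
from collections import defaultdict

wins = defaultdict(int)         # photo_id -> comparisons won
appearances = defaultdict(int)  # photo_id -> comparisons shown in


def record_vote(winner, loser):
    """Record one 'pick the best of two' vote."""
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1


def ranked(photo_ids):
    """Order photos by win fraction, best first (hypothetical scoring)."""

    def score(photo_id):
        shown = appearances[photo_id]
        return wins[photo_id] / shown if shown else 0.0

    return sorted(photo_ids, key=score, reverse=True)


record_vote("a", "b")
record_vote("a", "c")
record_vote("b", "c")
print(ranked(["c", "b", "a"]))
```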

Tags: crowdsourcing, photos, projects, redis, wildlifenearyou

WildlifeNearYou: Help identify animals in other people's photos

2010-01-15T01:35:07+00:00

The first of a number of crowdsourcing-style features we have planned for WildlifeNearYou—users can now help identify the animals in each other’s photos, and photo owners get a simple queue interface to approve or reject the suggestions.

Tags: crowdsourcing, projects, wildlifenearyou

Crowdsourced document analysis and MP expenses

2009-12-20T12:07:53+00:00

As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here's what we built:

http://mps-expenses2.guardian.co.uk/ Updated March 2021: all links now go to the Internet Archive

It's a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents - around 30,000 pages of claim forms, scanned receipts and hand-written letters, all scanned and published as PDFs.

This is the second time we've tried this - the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week's attempt was an opportunity to apply the lessons we learnt the first time round.

Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice - I heard about the new document release the Thursday before, giving us less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don't manage to get useful data out of it quickly, the story will move on before we can impact it.

ScaleCamp on the Friday meant that development work didn't properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.

The system was written using Django, MySQL (InnoDB), Redis and memcached.

Asking the right question

The biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either "claims" or "receipts" and to rank them as "not interesting", "a bit interesting", "interesting but already known" and "someone should investigate this". We also asked users to optionally enter any numbers they saw on the page as categorised "line items", with the intention of adding these up later.

The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.

The categorisations worked reasonably well but weren't particularly interesting - knowing if a document is a claim or receipt is useful only if you're going to collect line items. The "investigate this" button worked very well though.

We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categorise each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility - we changed the tags shortly before launch based on the characteristics of the documents - and had the potential to be a lot more fun as well. I'm particularly fond of the "hand-written" tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.

Sticking to an editorially assigned set of tags provided a powerful tool for directing people's investigations, and also ensured our users didn't start creating potentially libelous tags of their own.

Breaking it up in to assignments

For the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar became quite depressing.

This time round, we added a new concept of "assignments". Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one "main" assignment and several alternative assignments right on the homepage.

Clicking "start reviewing" on an assignment sets a cookie for that assignment, and adds the assignment's progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.
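
A minimal in-memory sketch of the assignment concept: a set of pages, a review threshold, a progress figure and a random pick from the pages still below the threshold. The class and method names are hypothetical, and the real system stored this in MySQL and Redis rather than in Python objects.

```python
import random
from collections import Counter


class Assignment:
    def __init__(self, page_ids, threshold=1):
        self.page_ids = set(page_ids)
        self.threshold = threshold      # reviews needed per page
        self.review_counts = Counter()  # page_id -> reviews so far

    def record_review(self, page_id):
        if page_id in self.page_ids:
            self.review_counts[page_id] += 1

    def unreviewed(self):
        """Pages that have not yet reached the review threshold."""
        return {
            page_id
            for page_id in self.page_ids
            if self.review_counts[page_id] < self.threshold
        }

    def progress(self):
        """Fraction of pages that have reached the threshold."""
        if not self.page_ids:
            return 1.0
        done = len(self.page_ids) - len(self.unreviewed())
        return done / len(self.page_ids)

    def next_page(self):
        """Pick a random page still needing review, or None when complete."""
        remaining = self.unreviewed()
        return random.choice(sorted(remaining)) if remaining else None


assignment = Assignment([1, 2, 3], threshold=2)
assignment.record_review(1)
assignment.record_review(1)
print(assignment.progress(), assignment.next_page())
```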

The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.

Get the button right!

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big "start reviewing" buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL "ORDER BY RAND()" statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn't matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slowdowns and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.

This solved the performance problem, but meant that our user activity wasn't nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page - and a random distribution is almost certainly the easiest way to achieve that.
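
The cron-plus-cache workaround described above can be sketched roughly as follows. A plain dict stands in for memcached, and the function and key names are hypothetical.

```python
import random

POOL_SIZE = 1000
cache = {}  # stand-in for memcached


def refresh_pool(unreviewed_ids):
    """Run periodically (for example from cron): cache a pool of unreviewed IDs."""
    cache["review_pool"] = list(unreviewed_ids)[:POOL_SIZE]


def next_page_id():
    """Pick a random page from the cached pool without touching the database."""
    pool = cache.get("review_pool") or []
    return random.choice(pool) if pool else None


refresh_pool(range(5000))
print(next_page_id())
```

Because every visitor draws from the same 1,000 IDs until the next refresh, reviews cluster on that pool instead of spreading across the whole document set, which is the targeting problem the paragraph above describes.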

The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages that had been reviewed. It then uses redis set difference (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.
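
The set logic is easy to see with plain Python sets. This sketch mirrors the two Redis commands in memory rather than calling Redis itself: SDIFFSTORE stores the difference between sets, and SRANDMEMBER returns a random member. The function name is hypothetical.

```python
import random


def pick_unreviewed(assignment_ids, reviewed_ids):
    """Return a random page in the assignment that has not been reviewed."""
    # Equivalent of SDIFFSTORE: pages in the assignment minus reviewed pages.
    unreviewed = assignment_ids - reviewed_ids
    if not unreviewed:
        return None
    # Equivalent of SRANDMEMBER: one random member of the resulting set.
    return random.choice(tuple(unreviewed))


assignment_ids = {101, 102, 103, 104}
reviewed_ids = {101, 103, 999}  # 999 belongs to a different assignment
print(pick_unreviewed(assignment_ids, reviewed_ids))
```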

This is where the bug crept in. Redis was just being used as an optimisation - the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some - the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.

This meant the "next page" button would occasionally turn up a page that didn't exist. I had some very poorly considered fallback logic for that - if the random page didn't exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page - which subsequently attracted well over a thousand individual reviews.

Next time, I'm going to try and make the "next" button completely bulletproof! I'm also going to maintain a "denormalisation dictionary" documenting every denormalisation in the system in detail - such a thing would have saved me several hours of confused debugging.
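
One way a more defensive "next" button might look, as a sketch only: check each randomly chosen ID against the source of truth, throw away stale IDs, and return nothing rather than falling back to a fixed page. `candidate_ids` and `page_exists` are hypothetical stand-ins for the denormalised Redis set and a MySQL lookup.

```python
import random


def next_existing_page(candidate_ids, page_exists):
    """Pick a random candidate that really exists, pruning stale IDs.

    candidate_ids: mutable set of page IDs from the denormalised store.
    page_exists: callable that checks an ID against the source of truth.
    """
    while candidate_ids:
        page_id = random.choice(tuple(candidate_ids))
        if page_exists(page_id):
            return page_id
        # Stale entry: remove it so it can never be served again.
        candidate_ids.discard(page_id)
    # Nothing valid left: signal completion instead of sending every
    # user to the same fallback page.
    return None


existing_pages = {1, 2}
candidates = {1, 2, 3, 4}  # 3 and 4 were deleted from the database
print(next_existing_page(candidates, existing_pages.__contains__))
```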

Exposing the results

The biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added - mainly because we spent much of the intervening time figuring out the scaling issues.

This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.

This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you're trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit "refresh" and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.

We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I'll pay a lot more attention to getting the pagination options right from the start.
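
The pagination rule described above (order by when an item was first added, ascending, and continue from where you left off) amounts to cursor-based pagination. A minimal sketch, assuming items are `(added_at, page_id)` tuples and that new items always arrive with a later `added_at` than anything already seen:

```python
def page_after(items, cursor=None, page_size=2):
    """Return the next page of items after `cursor`, oldest first.

    items: iterable of (added_at, page_id) tuples.
    cursor: the last (added_at, page_id) tuple already seen, or None.
    """
    ordered = sorted(items)
    if cursor is not None:
        ordered = [item for item in ordered if item > cursor]
    page = ordered[:page_size]
    next_cursor = page[-1] if page else cursor
    return page, next_cursor


items = [(1, "a"), (2, "b"), (3, "c")]
first_page, cursor = page_after(items)

# New content arrives while the reader is paginating.
items.append((4, "d"))
second_page, cursor = page_after(items, cursor)
print(first_page, second_page)
```

Because new items can only land after the cursor, nothing already read shifts position and nothing new is skipped.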

Rewarding our contributors

The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn't want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.

For the new version, we tried to provide a much better feeling of activity around the site. We added "top reviewer" tables to every assignment, MP and political party as well as a "most active reviewers in the past 48 hours" table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.

Most importantly, we added a concept of discoveries - editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.

Light-weight registration

For both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.
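
A rough sketch of this flow, with an in-memory dict in place of the database. How the number in the anon-454 style username was actually chosen isn't stated, so the sequential counter here is an assumption, and password handling is left out entirely.

```python
import itertools
import secrets

_anon_counter = itertools.count(1)  # assumed: sequential anonymous numbers
users_by_cookie = {}                # cookie token -> user record


def get_or_create_user(cookie_token=None):
    """Return (token, user), creating an anonymous user on first review."""
    if cookie_token in users_by_cookie:
        return cookie_token, users_by_cookie[cookie_token]
    token = secrets.token_hex(16)
    user = {"username": f"anon-{next(_anon_counter)}", "registered": False}
    users_by_cookie[token] = user
    return token, user


def claim_username(cookie_token, username):
    """Upgrade an anonymous user to a named account, keeping their history."""
    user = users_by_cookie[cookie_token]
    user["username"] = username
    user["registered"] = True
    return user


token, user = get_or_create_user()
print(user["username"])
```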

It's difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it's hard to tell how true that is. The UI for this process in the first project was quite confusing - we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.

Overall lessons

News-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the "next thing to review" behaviour rock solid. I'm looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs' expenses.

Tags: crowdsourcing, django, guardian, innodb, memcached, mpsexpenses, mysql, nosql, politics, projects, python, redis

Four crowdsourcing lessons from the Guardian's (spectacular) expenses-scandal experiment

2009-06-24T15:31:59+00:00

Michael Andersen from the Nieman Journalism Lab interviewed me about the MP expenses crowdsourcing site.

Tags: crowdsourcing, guardian, interviews, mpsexpences

The breakneck race to build an application to crowdsource MPs' expenses

2009-06-19T22:16:04+00:00

Charles Arthur wrote up a very nice piece on the development effort behind the Guardian’s crowdsourcing expenses app.

Tags: charles-aurthur, crowdsourcing, guardian, mpsexpenses

Investigate your MP's expenses

2009-06-18T23:16:43+00:00

Launched today, this is the project that has been keeping me ultra-busy for the past week—we’re crowdsourcing the analysis of the 700,000+ scanned MP expenses documents released this morning. It’s the Guardian’s first live Django-powered application, and also the first time we’ve hosted something on EC2.

Tags: crowdsourcing, django, ec2, guardian, mpexpenses, projects, python

The Straight Choice | The election leaflet project

2009-06-08T16:23:30+00:00

Nice crowdsourcing app by Richard Pope, Francis Irving and Julian Todd—UK political leaflets are hard to keep tabs on due to the way they are distributed over small geographical areas, so this site encourages you to take photos of leaflets delivered to your home and tag them with postcode, party and key topics.

Tags: crowdsourcing, francis-irving, julian-todd, politics, richard-pope, thestraightchoice, ukpolitics

Flickr Shapefiles Public Dataset 1.0

2009-05-22T18:12:10+00:00

Another awesome Geo dataset from the Yahoo! stable—this time it’s Flickr releasing shapefiles (geometrical shapes) for hundreds of thousands of places around the world, under the CC0 license which makes them essentially public domain. The shapes themselves have been crowdsourced from geocoded photos uploaded to Flickr, where users can “correct” the textual location assigned to each photo. Combine this with the GeoPlanet WOE data and you get a huge, free dataset describing the human geography of the world.

Tags: creativecommons, crowdsourcing, flickr, geoplanet, geospatial, maps, shapefiles, yahoo

ScenicOrNot

2009-05-12T13:32:53+00:00

MySociety are crowdsourcing opinions on how “scenic” different parts of the UK are, by rating representative photos from Geograph.

Tags: crowdsourcing, geograph, mysociety, scenicornot

Map Maker for Developers

2009-02-21T09:05:57+00:00

Tiles from Google’s Map Maker crowdsourcing effort are now available in the JS and static maps APIs on an opt-in basis. Maybe I’m misunderstanding something here, but Google Map Maker seems like a big step backwards for open geographic data. People donate their mapping efforts to Google, who keep them—unlike OpenStreetMap, where the donated efforts are made available under a Creative Commons license.

Tags: creativecommons, crowdsourcing, google, googlemapmaker, google-maps-api, openstreetmap, staticmaps