Simon Willison's Weblog: jsk

Weeknotes: Datasette 0.40, various projects, Dogsheep photos

2020-04-22T23:09:10+00:00

A new release of Datasette, two new projects and progress towards a Dogsheep photos solution.

Datasette 0.40

I released Datasette 0.40 last night. Full release notes are here, but the key feature in this release is the ability to provide metadata in a metadata.yaml file as an alternative to metadata.json. This is particularly useful for embedded multi-line SQL queries: I've upgraded simonw/museums and simonw/til to take advantage of this, since they both use their metadata to define SQL queries that power their search pages and Atom feeds.
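To illustrate why YAML helps with multi-line SQL, here is a minimal sketch using only Python's standard library. It shows what happens to such a query when it is serialized as JSON. The database name (demo), query name (search) and the SQL itself are hypothetical placeholders, not taken from the museums or til repositories.

```python
import json

# Hypothetical canned query - table and column names are illustrative only.
sql = """select
  title,
  url
from
  museums
where
  title like :q
order by
  title"""

# Datasette-style metadata: databases -> database name -> queries -> query name.
metadata = {"databases": {"demo": {"queries": {"search": {"sql": sql}}}}}

# JSON has no multi-line string literal, so every newline in the query becomes
# a \n escape and the whole statement ends up on a single line.
as_json = json.dumps(metadata)
print(as_json)

# In YAML the same value can be written as a literal block scalar ("sql: |-"),
# which keeps the query readable with one clause per line.
```

The JSON form round-trips correctly, it is just hard to read and edit by hand, which is the problem a YAML metadata file avoids.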

A JSK fellows directory and twitter-to-sqlite 0.21

My JSK Fellowship at Stanford ends in a few months. JSK has extremely talented and influential alumni, and one of the benefits of the fellowship is becoming part of that network afterwards.

The @JSKStanford Twitter account maintains lists of fellows on Twitter - journalists love Twitter! - so I decided to use my twitter-to-sqlite tool to build a Datasette-powered search engine of them.

That search engine is now running at jsk-fellows.datasettes.com. It's updated daily by a GitHub Action to capture any bio changes or new list entrants.

It's a neat example of taking advantage of SQLite views to build faceted search across a subset of data. A script constructs the jsk_fellows view at build time, then metadata.json configures that view to run full-text search and facet by the derived fellowship column.
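The view-plus-facet pattern can be sketched with Python's built-in sqlite3 module. Everything about the schema here is an assumption made for illustration: the users, lists and list_members tables, the list naming scheme and the way the fellowship column is derived are hypothetical, and only the jsk_fellows view name and the fellowship column name come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table users (id integer primary key, name text, description text);
create table lists (id integer primary key, name text);
create table list_members (list_id integer, user_id integer);

insert into users values
  (1, 'Ada', 'Data reporter'),
  (2, 'Ben', 'Editor'),
  (3, 'Cy', 'Photojournalist');
insert into lists values (10, 'jsk-class-of-2019'), (11, 'jsk-class-of-2020');
insert into list_members values (10, 1), (11, 2), (11, 3);

-- The view exposes only list members, plus a derived "fellowship" column.
create view jsk_fellows as
select users.id, users.name, users.description,
       replace(lists.name, 'jsk-class-of-', '') as fellowship
from users
join list_members on list_members.user_id = users.id
join lists on lists.id = list_members.list_id;
""")

# A facet is essentially a count of rows grouped by one column.
facet_counts = conn.execute(
    "select fellowship, count(*) from jsk_fellows "
    "group by fellowship order by count(*) desc, fellowship"
).fetchall()
print(facet_counts)
```

Because the view behaves like a table for read queries, a tool that can facet a table can facet the view in the same way.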

I shipped twitter-to-sqlite 0.21 with a new twitter-to-sqlite lists username command as part of this project.

TILs and datasette-template-sql 1.0

I described my new TILs project on Monday. I've published 15 so far - the format is working really well for me.

Hacking on simonw/tils reminded me of a feature gap in my datasette-template-sql plugin: it didn't have a solution for safely escaping parameters in SQL queries, leading to nasty string concatenated SQL queries.

datasette-template-sql 1.0 fixes that issue, at the cost of backwards compatibility with previous releases. I'm using it for both til and museums now.
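The difference between string-concatenated SQL and bound parameters can be shown with Python's sqlite3 module. This is a generic sketch of the underlying idea, not the plugin's template syntax (which is not shown here); the til table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table til (topic text, title text)")
conn.executemany(
    "insert into til values (?, ?)",
    [("sqlite", "Views"), ("o'reilly", "A topic with a quote in its name")],
)

topic = "o'reilly"  # Untrusted input containing a single quote.

# Unsafe: the quote ends the SQL string literal early. Here that causes a
# syntax error; with crafted input it becomes a SQL injection hole.
try:
    conn.execute("select title from til where topic = '" + topic + "'")
    concatenation_failed = False
except sqlite3.OperationalError:
    concatenation_failed = True

# Safe: the value is bound as a named parameter and is never parsed as SQL.
rows = conn.execute(
    "select title from til where topic = :topic", {"topic": topic}
).fetchall()
print(concatenation_failed, rows)
```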

github-to-sqlite 2.0

I released github-to-sqlite 2.0 with a small backwards incompatible change to the database schema (hence the major version increment). It builds on 1.1 from a few days ago which added a new github-to-sqlite contributors command for fetching statistics on contributors to repositories.

More importantly, I improved the live demo running at github-to-sqlite.dogsheep.net.

The demo now updates once a day using GitHub Actions and pulls in releases, commits, issues, issue comments and contributors for all of my Dogsheep projects plus datasette and sqlite-utils.

This means I can browse and execute SQL queries across 929 issues, 1,505 commits and 132 releases across 14 repositories!

Want to see which of my projects have had the most releases? Facet releases by repo.

I've also installed the datasette-search-all plugin there, so you can search across all commits, releases, issues etc for "zeit now" for example.

Bringing all of my different project data together in one place like this is really powerful.

I think it's a great illustration of the Datasette/Dogsheep philosophy of pulling down a complete SQLite-powered copy of data from external services so you can query and join across your data without being limited to the functionality that those services provide through their own interfaces or APIs.

photos-to-sqlite alpha

Dogsheep is about bringing all of my interesting personal and social data into a single, private place.

The biggest thing missing at the moment is photos. I want to be able to query my photos with SQL, and eventually combine them with tweets, checkins etc in a unified timeline.

Last week I took a step towards this goal with heic-to-jpeg, a proxy to let me display my iPhone's HEIC photos online.

This week I started work on photos-to-sqlite - the set of tools which I'll use to turn my photos into something I can run queries against.

So far I've mainly been figuring out how to get them into an S3 bucket that I control. Once configured, running photos-to-sqlite upload photos.db ~/Pictures/Photos\ Library.photoslibrary/originals will start uploading every photo it can find in that directory to the S3 bucket.

The filename it uses is the sha256 hash of the photo file contents, which I'm hoping will let me de-dupe photos from multiple sources in the future. It also writes basic metadata on the photos to that photos.db SQLite database.
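Content-addressed naming of this kind can be sketched with hashlib. This is an illustration of the general technique rather than the tool's actual code: the content_key helper, the chunk size and the sample filenames are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Two copies of the same bytes under different names, plus a different file.
    a = Path(tmp, "IMG_0001.jpg")
    b = Path(tmp, "copy-of-IMG_0001.jpg")
    c = Path(tmp, "IMG_0002.jpg")
    a.write_bytes(b"same photo bytes")
    b.write_bytes(b"same photo bytes")
    c.write_bytes(b"different photo bytes")
    keys = {p.name: content_key(p) for p in (a, b, c)}

# Identical contents produce identical keys, whatever the original filename.
unique_keys = set(keys.values())
print(len(keys), "files,", len(unique_keys), "unique keys")
```

Reading in chunks keeps memory use flat for large files. Note that this only de-duplicates byte-identical files: a re-encoded or resized copy of the same photo would hash differently.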

This is going to be a big project. I'm investigating osxphotos to liberate the metadata from Apple Photos, and various Python libraries for extracting EXIF data from the files themselves.

Once I've got that working, I can experiment with things like piping photos through Google Cloud Vision to label them based on their contents.

This is all a very, very early alpha at the moment, but I'm cautiously optimistic about progress so far.

Tags: github, projects, twitter, datasette, jsk, dogsheep, weeknotes, sqlite-utils

Weeknotes: Improv at Stanford, planning Datasette Cloud

2020-01-14T00:22:18+00:00

Last week was the first week of the quarter at Stanford - which is called "shopping week" here because students are expected to try different classes to see which ones they are going to stick with.

I've settled on three classes this quarter: Beginning Improvising, Designing Machine Learning and Entrepreneurship: Formation of New Ventures.

Beginning Improvising is the Stanford improv theater course. It's a big time commitment: three two-hour sessions a week for ten weeks is nearly 60 hours of improv!

It's already proving to be really interesting though: it turns out the course is a thinly disguised applied psychology course.

Improv is about creating a creative space for other people to shine. The applications to professional teamwork are obvious and fascinating to me. I'll probably write more about this as the course continues.

Designing Machine Learning is a class at the Stanford d.School taught by Michelle Carney and Emily Callaghan. It focuses on multidisciplinary applications of machine learning, mixing together students from many different disciplines around Stanford.

I took a fast.ai deep learning course last year which gave me a basic understanding of the code side of neural networks, but I'm much more interested in figuring out applications, so this seems like a better option than a more code-focused course.

The class started out building some initial models using Google's Teachable Machine tool, which is fascinating. It lets you train transfer learning models for image, audio and posture recognition entirely in your browser - no data is transferred to Google's servers at all. You can then export those models and use them with a variety of different libraries - I've got them to work with both JavaScript and Python already.

I'm taking Entrepreneurship: Formation of New Ventures because of the rave reviews I heard from other JSK fellows who took it last quarter. It's a classic case-study business school class: each session features a guest speaker who is a successful entrepreneur, and the class discusses their case for the first two thirds of the session while they listen in - then finds out how well the discussion matched what actually happened.

Planning Datasette Cloud

Shopping week kept me pretty busy so I've not done much actual development over the past week, but I have started planning out and researching my next major project, which I'm currently calling Datasette Cloud.

Datasette Cloud will be an invite-only hosted SaaS version of Datasette. It's designed to help get news organizations on board with the software without having to talk them through figuring out their own hosting, so I can help them solve real problems and learn more about how the ecosystem should evolve to support them.

I'd love to be able to run this on serverless hosting platforms like Google Cloud Run or Heroku, but sadly those tools aren't an option for me due to a key problem: I'm trying to build a stateful service (SQLite databases need to live on a local disk) in 2020.

I posed this challenge on Twitter back in October:

What's the easiest way of running a stateful web application these days?

Stateful as in it supports a process which can accept web requests and is allowed to write to a durable disk

So not Heroku/Zeit Now/Cloud Run etc
- Simon Willison (@simonw) October 9, 2019

I've been exploring my options since then, and I think I've settled on a decidedly 2010-era way of doing this: I'm going to run my own instances! So I've been exploring hosting Datasette on both AWS Lightsail and Digital Ocean Droplets over the past few months.

My current plan is to have each Datasette Cloud account run as a Datasette instance in its own Docker container, primarily to ensure filesystem isolation: different accounts must not be able to see each other's database files.

I started another discussion about this on Twitter and had several recommendations for Traefik as a load balancer for assigning hostnames to different Docker containers, which is exactly what I need to do.

So this afternoon I made my way through Digital Ocean's outstanding tutorial How To Use Traefik as a Reverse Proxy for Docker Containers on Ubuntu 18.04 and I think I've convinced myself that this is a smart way forward.

So, mostly a research week but I've got a solid plan for my next steps.

This week's Niche Museums

Jelly Belly Factory in Fairfield, CA
Bevolo Gas Light Museum in New Orleans, LA
Museo de las Misiones de Baja California in Loreto
Fort Point in San Francisco, CA
Donner Memorial State Park Visitor Center in Nevada County, CA
Anja Community Reserve in Madagascar
Palace of Fine Arts in San Francisco, CA

I also finally got around to implementing a map.

Tags: stanford, docker, jsk, weeknotes, datasette-cloud, digitalocean

Building tools to bring data-driven reporting to more newsrooms

2019-12-20T11:17:41+00:00

I wrote about my fellowship project so far and my goals for the next few months for the JSK Medium publication. My next priority: an invite-only hosted version for newsrooms so that figuring out how to install and manage the software isn’t the biggest barrier to entry.

Tags: data-journalism, datasette, jsk

Better presentations through storytelling and STAR moments

2019-12-10T00:00:42+00:00

Last week I completed GSBGEN 315: Strategic Communication at the Stanford Graduate School of Business.

The course has a stellar, well deserved reputation. It's principally about public speaking, and I gained a huge amount from it despite having over fifteen years of experience speaking at conferences.

Some of the things that really stood out for me (partially in the form of catchy acronyms):

Every talk should start with an AIM: Audience, Intent, Message. Who are the audience for the talk? What do you intend to achieve by giving the presentation? With those two things in mind, you can construct the message - the actual content of the talk.
Try to include at least one STAR moment - Something They'll Always Remember. This can be a gimmick, a repeated theme, a well-selected video or audio clip. Something to help the talk stand out.
Presentations are most interesting if they are structured with contrasts. These can be emotional high and low points, or content that illustrates what is compared to what could be. Sparklines are a tool that can be used to think about this structure.
The human brain is incredibly attuned to stories. If you can find an excuse to tell a story, no matter how thin that excuse is, take it.

That last point about stories is where things get really interesting. We reviewed the classic hero's journey story structure... but with a twist.

When giving a talk, position your audience as the hero. They start in a position of comfort and safety. Your job is to call them to adventure - guide them towards a dangerous and unknown realm, encourage them to take on new challenges, learn new things and finish the adventure in a new, advanced state of mind.

You're not the hero - you're more the mentor who they meet along the way.

One of the course texts was Nancy Duarte's Resonate, which explains this model of presenting in great detail. It's a really clever and surprising way of thinking about a presentation.

My JSK backstory

The backstory is a core tradition of the JSK fellowship I'm participating in this year at Stanford. Each week, one of the 19 fellows tells the story of their career and how they came to journalism.

Last Wednesday was my turn. The timing couldn't have been more fortunate, as I got to apply the lessons I'd learned from Strategic Communications in putting together my presentation.

I think it was one of the best pieces of public speaking I'd ever done. Backstories include details that aren't necessarily intended for a public audience so I won't be sharing much of it here, but mindfully constructing an emotional sparkline and seeking out STAR moments worked out really well for me.

Since GSBGEN 315 is only available to Stanford GSB students, I'll throw in a strong recommendation for reading Resonate as an alternative if this has sparked your interest.

Also this week

Preparing my backstory took up much of my time this week. I ended up losing my streaks against both email checking and Datasette contributions, but I'm hoping to pick those back up again now that the presentation is out of the way.

I posted the following museums to Niche Museums - one of which, the Centennial Light, we got to see on Saturday:

Lynton and Lynmouth Cliff Railway in Devon
Clouds Hill in Dorset
Pioneertown in California
Teddy Bear Kingdom in Huis Ten Bosch near Nagasaki
The Centennial Light in Livermore
Dejima in Nagasaki
Museum of Dartmoor Life in Devon

I'm getting concerned about how many not-quite-finished Datasette features I have outstanding now (I started exploring another one just the other day). I'm going to try to resist the temptation to pick up any more until I've shipped at least some of the 47 currently open feature tickets.

Tags: speaking, datasette, jsk, weeknotes

Weeknotes: first week of Stanford classes

2019-09-30T16:28:12+00:00

One of the benefits of the JSK fellowship is that I can take classes and lectures at Stanford, on a somewhat ad-hoc basis (I don’t take exams or earn credits).

With thousands of courses to choose from, figuring out how best to take advantage of this isn’t at all easy - especially since I want to spend a big portion of my time focusing on my fellowship project.

This week was the first week of classes, which Stanford calls “shopping week” - because students are encouraged to try out lots of different things and literally walk out half way through a lecture if they decide it’s not for them! Feels really rude to me, but apparently that’s how it works here.

For this term I’ve settled on four classes:

Strategic Communications, at the Stanford Graduate School of Business. This is an extremely highly regarded course on public speaking and effective written communication. As you might expect from a class on public speaking the lectures themselves have been case studies in how to communicate well. I’ve given dozens of conference talks and I’m already learning a huge amount from this that will help me perform better in the future.
Classical Guitar. I’m taking this with three other fellows. It turns out my cheap acoustic guitar (bought on an impulse a couple of years ago from Amazon Prime Now) isn’t the correct instrument for this class (Classical Guitars are nylon stringed and a different shape) but the instructor thinks it will be fine for the moment. Great opportunity to do something musical!
Biostatistics. I want to firm up my fundamental knowledge of statistics, and I figured learning it from the biology department would be much more interesting than the corresponding maths or computer science classes.
Media Innovation. This is a lunchtime series of guest lectures from different professionals in different parts of the media industry. As such it doesn’t have much homework (wow, Stanford courses have a lot of homework) which makes it a good fit for my schedule, and the variety of speakers look to be really informative.

Combined with the JSK afternoon sessions on Monday, Wednesday and Friday I’ll be on campus every weekday, which will hopefully help me build a schedule that incorporates plenty of useful conversations with people about my project, plus actual time to get some code written.

… what with all the shopping for classes, I wrote almost no code at all this week!

I did some experimentation with structlog - I have an unfinished module which can write structlog entries to a SQLite database using sqlite-utils (here’s a Gist) - and I’ve been messing around with Python threads in a Jupyter notebook as part of ongoing research into smarter connection pooling for Datasette, but aside from that I’ve been concentrating on figuring out Stanford.

Books

Stanford classes come with all sorts of required reading, but I’ve also made some progress on Just Enough Research by Erika Hall (mentioned last week). I’m about half way through and it’s fantastic - really fun to read and packed with useful tips on getting the most out of user interviews and associated techniques. Hopefully I’ll get to start putting it into practice next week!

Tags: music, reading, speaking, stanford, jsk, weeknotes

Weeknotes: Design thinking for journalists, genome-to-sqlite, datasette-atom

2019-09-20T20:13:01+00:00

I haven’t had much time for code this week: we’ve had a full five day workshop at JSK with Tran Ha (a JSK alum) learning how to apply Design Thinking to our fellowship projects and generally to challenges facing journalism.

I’ve used aspects of design thinking in building software products, but I’d never really thought about how it could be applied outside of digital product design. It’s been really interesting - especially seeing the other fellows (who, unlike me, are generally not planning to build software during their fellowship) start to apply it to a much wider and more interesting range of problems.

I’ve been commuting in to Stanford on the Caltrain, which did give me a bit of time to work on some code.

genome-to-sqlite

I’m continuing to build out a family of tools for personal analytics, where my principal goal is to reclaim the data that various internet companies have collected about me and pull it into a local SQLite database so I can analyze, visualize and generally have fun with it.

A few years ago I shared my DNA with 23andMe. I don’t think I’d make the decision to do that today: it’s incredibly personal data, and the horror stories about people making unpleasant discoveries about their family trees keep on building. But since I’ve done it, I decided to see if I could extract out some data…

… and it turns out they let you download your entire genome! You can export it as a zipped up TSV file - mine decompresses to 15MB of data (which feels a little small - I know little about genetics, but I’m presuming that’s because the genome they record and share is just the interesting known genetic markers, not the entire DNA sequence - UPDATE: confirmed, thanks @laurencerowe).

So I wrote a quick utility, genome-to-sqlite, which loads the TSV file (directly from the zip or a file you’ve already extracted) and writes it to a simple SQLite table. Load it into Datasette and you can even facet by chromosome, which is exciting!
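The general shape of a TSV-to-SQLite loader like this can be sketched with the standard library. This is not the tool's actual code. It assumes the export uses "#"-prefixed comment lines followed by four tab-separated columns (rsid, chromosome, position, genotype); the genome table name and the sample rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the assumed layout - not real genome data.
sample = (
    "# This is a comment line\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\tCC\n"
)

# Skip comment lines, then let the csv module split each row on tabs.
data_lines = (line for line in io.StringIO(sample) if not line.startswith("#"))
rows = list(csv.reader(data_lines, delimiter="\t"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table genome (rsid text primary key, chromosome text, "
    "position integer, genotype text)"
)
conn.executemany("insert into genome values (?, ?, ?, ?)", rows)

# Faceting by chromosome boils down to a grouped count.
by_chromosome = conn.execute(
    "select chromosome, count(*) from genome "
    "group by chromosome order by chromosome"
).fetchall()
print(by_chromosome)
```

The chromosome column is stored as text because values such as X are not numeric.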

This is where my knowledge runs out. I’m confident someone with more insight than me could construct some interesting SQL queries against this - maybe one that determines if you are likely to have red hair? - so I’m hoping someone will step in and provide a few examples.

I filed a help wanted issue on GitHub. I also put a request out on Twitter for an UPDATE statement that could turn me into a dinosaur.

datasette-atom

This is very much a work-in-progress right now: datasette-atom will be a Datasette plugin that adds .atom as an output format (using the register_output_renderer plugin hook contributed by Russ Garrett a few months ago).

The aim is to allow people to subscribe to the output of a query in their feed reader (and potentially through that via email and other mechanisms) - particularly important for databases which are being updated over time.

It’s a slightly tricky plugin to design because valid Atom feed entries require a globally unique ID, a title and an “updated” date - and not all SQL queries produce obvious candidates for these values. As such, I’m going to have the plugin prompt the user for those fields and then persist them in the feed URL that you subscribe to.

This also means you won’t be able to generate an Atom feed for a query that doesn’t return at least one datetime column. I think I’m OK with that.
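The constraint described above can be made concrete with a small sketch that turns one query result row into an Atom entry and refuses rows that lack any of the three required values. This is not the plugin's code: the row_to_entry helper and the id, title and updated dictionary keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
REQUIRED = ("id", "title", "updated")


def row_to_entry(row):
    """Build an Atom <entry> element from a dict of query result values."""
    missing = [key for key in REQUIRED if not row.get(key)]
    if missing:
        raise ValueError("row is missing required fields: " + ", ".join(missing))
    entry = ET.Element("{%s}entry" % ATOM_NS)
    for key in REQUIRED:
        ET.SubElement(entry, "{%s}%s" % (ATOM_NS, key)).text = str(row[key])
    return entry


# Hypothetical row: a unique ID, a title and an ISO 8601 timestamp.
row = {
    "id": "tag:example.com,2019:release-1",
    "title": "New release",
    "updated": "2019-09-20T20:13:01+00:00",
}

ET.register_namespace("", ATOM_NS)  # Serialize without a namespace prefix.
entry_xml = ET.tostring(row_to_entry(row), encoding="unicode")
print(entry_xml)
```

A query with no datetime column has nothing sensible to put in the updated element, which is why such queries cannot produce a valid feed.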

github-to-sqlite

I released one new feature for github-to-sqlite this week: the github-to-sqlite repos github.db command, which populates a database table of all of the repositories available to the authenticated user. Or use github-to-sqlite repos github.db dogsheep to pull the repos owned by a specific user or organization.

The command configures a SQLite full-text search index against the repo titles and descriptions, so if you have a lot of GitHub repos (I somehow have nearly 300!) you can search through them and use Datasette to facet them against different properties.
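A full-text index over names and descriptions can be sketched with Python's sqlite3 module. Several things here are assumptions: the text does not say which SQLite FTS version the command uses, so FTS5 is picked for illustration (it must be compiled into your SQLite build), and the repos table layout and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table repos (id integer primary key, name text, description text)"
)
conn.executemany(
    "insert into repos values (?, ?, ?)",
    [
        (1, "datasette", "A tool for exploring and publishing data"),
        (2, "github-to-sqlite", "Save data from GitHub to a SQLite database"),
        (3, "sqlite-utils", "CLI utility and library for manipulating SQLite"),
    ],
)

# An external-content FTS5 table indexes the text without storing a second
# copy of it; rowid links each index entry back to repos.id.
conn.execute(
    "create virtual table repos_fts using fts5("
    "name, description, content='repos', content_rowid='id')"
)
conn.execute(
    "insert into repos_fts (rowid, name, description) "
    "select id, name, description from repos"
)

matches = conn.execute(
    "select repos.name from repos "
    "join repos_fts on repos.id = repos_fts.rowid "
    "where repos_fts match ? order by repos.name",
    ("sqlite",),
).fetchall()
print(matches)
```

The default tokenizer splits on punctuation and ignores case, so a search for sqlite matches both hyphenated repository names and descriptions that mention SQLite.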

github-to-sqlite currently has two other useful subcommands: starred fetches details of every repository a user has starred, and issues pulls details of the issues (but sadly not yet their comment threads) attached to a repository.

Books

I’m trying to spend more time reading books - so I’m going to start including book stuff in my weeknotes in the hope of keeping myself on track.

I acquired two new books this week:

Just Enough Research by Erika Hall (recommended by Tom Coates and Tran Ha), because I need to spend the next few months interviewing as many journalists (and other project stakeholders) as possible to ensure I am solving the right problems for them.
Producing Open Source Software by Karl Fogel, because my fellowship goal is to build a thriving open source ecosystem around tooling for data journalism and this book looks like it covers a lot of the topics I need to really do a good job of that.

Next step: actually read them! Hopefully I’ll have some notes to share in next week’s update.

Tags: genetics, projects, reading, sqlite, datasette, jsk, weeknotes, design-thinking

My JSK Fellowship: Building an open source ecosystem of tools for data journalism

2019-09-10T23:29:12+00:00

I started a new chapter of my career last week: I began a year long fellowship with the John S. Knight Journalism Fellowships program at Stanford.

I’m going to spend the year thinking about and working on tools for data journalism. More details below, but the short version is that I want to help make the kind of data reporting we’re seeing from well funded publications like the New York Times, the Washington Post and the LA Times more accessible to smaller publications that don’t have the budget for full-time software engineers.

I’ve worked with newspapers a few times in the past: I helped create what would later become Django at the Lawrence Journal-World fifteen years ago, and I spent two years working on data journalism projects at the Guardian in London before being sucked into the tech startup world. My Datasette project was inspired by the challenges I saw at the Guardian, and I’m hoping to evolve it (and its accompanying ecosystem) in as useful a way as possible.

This fellowship is a chance for me to get fully embedded back in that world. I could not be more excited about it!

I’m at the Online News Association conference in New Orleans this week: if you’d like to meet up for a chat please drop me a line on Twitter or via email (swillison is my Gmail).

Here’s the part of my fellowship application (written back in January) which describes what I’m hoping to do. The program is extremely flexible and there is plenty of opportunity for me to change my focus if something more useful emerges from my research, but this provides a good indication of where my current thinking lies.

What is your fellowship proposal?

Think of this as your title or headline for your proposal. (25 words or less)

How might we grow an open source ecosystem of tools to help data journalists collect, analyze and publish the data underlying their stories?

Now, tell us more about your proposal. Why is it important to the challenges facing journalism and journalists today? How might it create meaningful change or advance the work of journalists? (600 words or less)

Data journalism is a crucial discipline for discovering and explaining true stories about the modern world - but effective data-driven reporting requires tools and skills that are still not widely available outside of large, well funded news organizations.

Making these techniques readily available to smaller, local publications can help them punch above their weight, producing more impactful stories that overcome the challenges posed by their constrained resources.

Tools that work for smaller publications can work for larger publications as well. Reducing the time and money needed to produce great data journalism raises all boats and enables journalists to re-invest their improved productivity in ever more ambitious reporting projects.

Academic journals are moving towards publishing both the code and data that underlie their papers, encouraging reproducibility and better sharing of the underlying techniques. I want to encourage the same culture for data journalism, in the hope that “showing your working” can help fight misinformation and improve readers’ trust in the stories that are derived from the data.

I would like to use a JSK fellowship to build an ecosystem of data journalism tools that make data-driven reporting as productive and reproducible as possible, while opening it up to a much wider group of journalists.

At the core of my proposal is my Datasette open source project. I’ve been running this as a side-project for a year with some success: newspapers that have used it include the Baltimore Sun, who used it for their public salary records project: https://salaries.news.baltimoresun.com/. By dedicating myself to the project full-time I anticipate being able to greatly accelerate the pace of development and my ability to spend time teaching news organizations how to take advantage of it.

More importantly, the JSK fellowship would give me high quality access to journalism students, professors and professionals. A large portion of my fellowship would be spent talking to a wide pool of potential users and learning exactly what people need from the project.

I do not intend to be the only developer behind Datasette: I plan to deliberately grow the pool of contributors, both to the Datasette core project and to the tools and plugins that enhance the project’s capabilities. The great thing about a plugin ecosystem is that it removes the need for a gatekeeper: anyone can build and release a plugin independent of Datasette core, which both lowers the barriers to entry and dramatically increases the rate at which new functionality becomes available to all Datasette users.

My goal for the fellowship is to encourage the growth of open source tools that can be used by data journalists to increase the impact of their work. My experience at the Guardian led me to Datasette as a promising avenue for this, but in talking to practitioners and students I hope to find other opportunities for tools that can help. My experience as a startup founder, R&D software engineer and an open source contributor puts me in an excellent position to help create these tools in partnership with the wider open source community.

Tags: data-journalism, journalism, open-source, careers, datasette, jsk, personal-news

JSK Journalism Fellowships names Class of 2019-2020 (and I'm in it!)

2019-05-01T16:43:54+00:00

In personal news... I’ve been accepted for a ten month journalism fellowship at Stanford (starting September)! My work there will involve “Improving the impact of investigative stories by expanding the open-source ecosystem of tools that allows journalists to share the underlying data”.

Via @simonw

Tags: data-journalism, journalism, stanford, datasette, jsk, personal-news