<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: jupyter</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/jupyter.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-10-31T13:57:51+00:00</updated><author><name>Simon Willison</name></author><entry><title>CoreWeave adds Marimo to their 2025 acquisition spree</title><link href="https://simonwillison.net/2025/Oct/31/coreweave-acquires-marimo/#atom-tag" rel="alternate"/><published>2025-10-31T13:57:51+00:00</published><updated>2025-10-31T13:57:51+00:00</updated><id>https://simonwillison.net/2025/Oct/31/coreweave-acquires-marimo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://marimo.io/blog/joining-coreweave"&gt;Marimo is Joining CoreWeave&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I don't usually cover startup acquisitions here, but this one feels relevant to several of my interests.&lt;/p&gt;
&lt;p&gt;Marimo (&lt;a href="https://simonwillison.net/tags/marimo/"&gt;previously&lt;/a&gt;) provide an open source (Apache 2 licensed) notebook tool for Python, with first-class support for an additional WebAssembly build plus an optional hosted service. It's effectively a reimagining of Jupyter notebooks as a reactive system, where cells automatically update based on changes to other cells - similar to how &lt;a href="https://observablehq.com/"&gt;Observable&lt;/a&gt; JavaScript notebooks work.&lt;/p&gt;
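To illustrate the reactive model (this is a toy sketch of the idea, not Marimo's actual API): cells form a dependency graph, and changing one value automatically re-runs everything downstream.

```python
# Toy sketch of a reactive cell graph - illustrative only, not Marimo's real API
class Cell:
    def __init__(self, fn, deps=()):
        self.fn, self.deps, self.subscribers = fn, deps, []
        for d in deps:
            d.subscribers.append(self)
        self.recompute()

    def recompute(self):
        self.value = self.fn(*[d.value for d in self.deps])
        for s in self.subscribers:  # changes propagate automatically
            s.recompute()

    def set(self, value):
        self.value = value
        for s in self.subscribers:
            s.recompute()

a = Cell(lambda: 2)
b = Cell(lambda x: x * 10, deps=(a,))
print(b.value)  # 20
a.set(5)        # updating a re-runs b, Observable-style
print(b.value)  # 50
```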
&lt;p&gt;The first public Marimo release was in January 2024 and the tool has "been in development since 2022" (&lt;a href="https://news.ycombinator.com/item?id=44304607#44330375"&gt;source&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;CoreWeave are a &lt;em&gt;big&lt;/em&gt; player in the AI data center space. They started out as an Ethereum mining company in 2017, then pivoted to cloud computing infrastructure for AI companies after the 2018 cryptocurrency crash. They IPOd in March 2025 and today they operate more than 30 data centers worldwide and have announced a number of eye-wateringly sized deals with companies such as Cohere and OpenAI. I found &lt;a href="https://en.wikipedia.org/wiki/CoreWeave"&gt;their Wikipedia page&lt;/a&gt; very helpful.&lt;/p&gt;
&lt;p&gt;They've also been on an acquisition spree this year, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Weights &amp;amp; Biases, the AI training observability platform, &lt;a href="https://www.coreweave.com/blog/coreweave-completes-acquisition-of-weights-biases"&gt;in March 2025&lt;/a&gt; (deal closed in May).&lt;/li&gt;
&lt;li&gt;OpenPipe &lt;a href="https://www.coreweave.com/news/coreweave-to-acquire-openpipe-leader-in-reinforcement-learning"&gt;in September 2025&lt;/a&gt; - a reinforcement learning platform, authors of the &lt;a href="https://github.com/OpenPipe/ART"&gt;Agent Reinforcement Trainer&lt;/a&gt; Apache 2 licensed open source RL framework.&lt;/li&gt;
&lt;li&gt;Monolith AI &lt;a href="https://investors.coreweave.com/news/news-details/2025/CoreWeave-to-Acquire-Monolith-Expanding-AI-Cloud-Platform-into-Industrial-Innovation/default.aspx"&gt;in October 2025&lt;/a&gt;, a UK-based AI model SaaS platform focused on AI for engineering and industrial manufacturing.&lt;/li&gt;
&lt;li&gt;And now Marimo.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Marimo's own announcement emphasizes continued investment in that tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Marimo is joining CoreWeave. We’re continuing to build the open-source marimo notebook, while also leveling up molab with serious compute. Our long-term mission remains the same: to build the world’s best open-source programming environment for working with data.&lt;/p&gt;
&lt;p&gt;marimo is, and always will be, free, open-source, and permissively licensed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given that CoreWeave's buying spree only really started this year, it's impossible to say how well these acquisitions are likely to play out - they haven't yet established a track record.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://x.com/marimo_io/status/1983916371869364622"&gt;@marimo_io&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/entrepreneurship"&gt;entrepreneurship&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/startups"&gt;startups&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/marimo"&gt;marimo&lt;/a&gt;&lt;/p&gt;



</summary><category term="entrepreneurship"/><category term="open-source"/><category term="python"/><category term="startups"/><category term="ai"/><category term="jupyter"/><category term="marimo"/></entry><entry><title>Announcing Deno 2</title><link href="https://simonwillison.net/2024/Oct/10/announcing-deno-2/#atom-tag" rel="alternate"/><published>2024-10-10T04:11:02+00:00</published><updated>2024-10-10T04:11:02+00:00</updated><id>https://simonwillison.net/2024/Oct/10/announcing-deno-2/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://deno.com/blog/v2.0"&gt;Announcing Deno 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The big focus of Deno 2 is compatibility with the existing Node.js and npm ecosystem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Deno 2 takes all of the features developers love about Deno 1.x — zero-config, all-in-one toolchain for JavaScript and TypeScript development, web standard API support, secure by default — and makes it fully backwards compatible with Node and npm (in ESM).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The npm support &lt;a href="https://docs.deno.com/runtime/fundamentals/node/#using-npm-packages"&gt;is documented here&lt;/a&gt;. You can write a script like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;emoji&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s"&gt;"npm:node-emoji"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-smi"&gt;console&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;log&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;emoji&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;emojify&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;`:sauropod: :heart:  npm`&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And when you run it Deno will automatically fetch and cache the required dependencies:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;deno run main.js
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Another new feature that caught my eye was this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;deno jupyter&lt;/code&gt; now supports outputting images, graphs, and HTML&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Deno has apparently shipped with &lt;a href="https://docs.deno.com/runtime/reference/cli/jupyter/"&gt;a Jupyter notebook kernel&lt;/a&gt; for a while, and it's had a major upgrade in this release.&lt;/p&gt;
&lt;p&gt;Here's &lt;a href="https://www.youtube.com/watch?v=d35SlRgVxT8&amp;amp;t=1829s"&gt;Ryan Dahl's demo&lt;/a&gt; of the new notebook support in his Deno 2 release video.&lt;/p&gt;
&lt;p&gt;I tried this out myself, and it's really neat. First you need to install the kernel:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;deno jupyter --install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I was curious to find out what this actually did, so I dug around &lt;a href="https://github.com/denoland/deno/blob/251840a60d1e2ba4ceca85029bd8cc342b6cd038/cli/tools/jupyter/install.rs#L48-L57"&gt;in the code&lt;/a&gt; and then further &lt;a href="https://github.com/runtimed/runtimed/blob/e2cd9b1d88e44842e1b1076d3a1d1f202fcf7879/runtimelib/src/jupyter/dirs.rs#L81-L99"&gt;in the Rust runtimed dependency&lt;/a&gt;. It turns out installing Jupyter kernels, at least on macOS, involves creating a directory in &lt;code&gt;~/Library/Jupyter/kernels/deno&lt;/code&gt; and writing a &lt;code&gt;kernel.json&lt;/code&gt; file containing the following:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"argv"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/opt/homebrew/bin/deno&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;jupyter&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;--kernel&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;--conn&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;{connection_file}&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"display_name"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Deno&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"language"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;typescript&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That file is picked up by any Jupyter servers running on your machine, and tells them to run &lt;code&gt;deno jupyter --kernel ...&lt;/code&gt; to start a kernel.&lt;/p&gt;
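As a sketch of what that installation step amounts to (not what the Deno CLI literally does internally), here's how you might register the same kernelspec from Python; the demo writes to a temporary directory rather than the real <code>~/Library/Jupyter/kernels</code> path:

```python
import json
import pathlib
import tempfile

def install_deno_kernelspec(kernels_dir: pathlib.Path) -> pathlib.Path:
    """Write a kernel.json matching the spec shown above into kernels_dir/deno."""
    kernel_dir = kernels_dir / "deno"
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec = {
        "argv": ["/opt/homebrew/bin/deno", "jupyter", "--kernel",
                 "--conn", "{connection_file}"],
        "display_name": "Deno",
        "language": "typescript",
    }
    path = kernel_dir / "kernel.json"
    path.write_text(json.dumps(spec, indent=2))
    return path

# Demo against a temporary directory instead of ~/Library/Jupyter/kernels
with tempfile.TemporaryDirectory() as tmp:
    written = install_deno_kernelspec(pathlib.Path(tmp))
    spec_back = json.loads(written.read_text())
    print(spec_back["display_name"])  # Deno
```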
&lt;p&gt;I started Jupyter like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;jupyter-notebook /tmp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then started a new notebook, selected the Deno kernel and it worked as advertised:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Jupyter notebook running the Deno kernel. I run 4 + 5 and get 9, then Deno.version and get back 2.0.0. I import Observable Plot and the penguins data, then render a plot which shows as a scatter chart." src="https://static.simonwillison.net/static/2024/deno-jupyter.jpg" /&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-ts"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-smi"&gt;Plot&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s"&gt;"npm:@observablehq/plot"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-smi"&gt;document&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;penguins&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s"&gt;"jsr:@ry/jupyter-helper"&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;let&lt;/span&gt; &lt;span class="pl-s1"&gt;p&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-en"&gt;penguins&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;

&lt;span class="pl-smi"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;plot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c1"&gt;marks&lt;/span&gt;: &lt;span class="pl-kos"&gt;[&lt;/span&gt;
    &lt;span class="pl-smi"&gt;Plot&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;dot&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;p&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;toRecords&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-c1"&gt;x&lt;/span&gt;: &lt;span class="pl-s"&gt;"culmen_depth_mm"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
      &lt;span class="pl-c1"&gt;y&lt;/span&gt;: &lt;span class="pl-s"&gt;"culmen_length_mm"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
      &lt;span class="pl-c1"&gt;fill&lt;/span&gt;: &lt;span class="pl-s"&gt;"species"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
    &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  &lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
  document&lt;span class="pl-kos"&gt;,&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nodejs"&gt;nodejs&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/npm"&gt;npm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/typescript"&gt;typescript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/deno"&gt;deno&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;



</summary><category term="javascript"/><category term="nodejs"/><category term="npm"/><category term="jupyter"/><category term="typescript"/><category term="deno"/><category term="observable-plot"/></entry><entry><title>Anthropic's Prompt Engineering Interactive Tutorial</title><link href="https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial/#atom-tag" rel="alternate"/><published>2024-08-30T02:52:04+00:00</published><updated>2024-08-30T02:52:04+00:00</updated><id>https://simonwillison.net/2024/Aug/30/anthropic-prompt-engineering-interactive-tutorial/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/anthropics/courses/tree/master/prompt_engineering_interactive_tutorial"&gt;Anthropic&amp;#x27;s Prompt Engineering Interactive Tutorial&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Anthropic continue their trend of offering the best documentation of any of the leading LLM vendors. This tutorial is delivered as a set of Jupyter notebooks - I used it as an excuse to try &lt;a href="https://docs.astral.sh/uv/guides/tools/"&gt;uvx&lt;/a&gt; like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/anthropics/courses
uvx --from jupyter-core jupyter notebook courses&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This installed a working Jupyter system, started the server and launched my browser within a few seconds.&lt;/p&gt;
&lt;p&gt;The first few chapters are pretty basic, demonstrating simple prompts run through the Anthropic API. I used &lt;code&gt;%pip install anthropic&lt;/code&gt; instead of &lt;code&gt;!pip install anthropic&lt;/code&gt; to make sure the package was installed in the correct virtual environment, &lt;a href="https://github.com/anthropics/courses/issues/30"&gt;then filed an issue and a PR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One new-to-me trick: in the first chapter the tutorial suggests running this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-v"&gt;API_KEY&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"your_api_key_here"&lt;/span&gt;
&lt;span class="pl-c1"&gt;%&lt;/span&gt;&lt;span class="pl-s1"&gt;store&lt;/span&gt; &lt;span class="pl-v"&gt;API_KEY&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;This stashes your Anthropic API key in the &lt;a href="https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html"&gt;IPython store&lt;/a&gt;. In subsequent notebooks you can restore the &lt;code&gt;API_KEY&lt;/code&gt; variable like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-c1"&gt;%&lt;/span&gt;&lt;span class="pl-s1"&gt;store&lt;/span&gt; &lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-s1"&gt;r&lt;/span&gt; &lt;span class="pl-v"&gt;API_KEY&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;I poked around and on macOS those variables are stored in files of the same name in &lt;code&gt;~/.ipython/profile_default/db/autorestore&lt;/code&gt;.&lt;/p&gt;
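A sketch of reading those files back directly, assuming (as with IPython's pickleshare-backed store) that each file holds a single pickled value - the demo recreates the directory layout in a temp dir rather than touching the real one:

```python
import pathlib
import pickle
import tempfile

# Mimic the ~/.ipython/profile_default/db/autorestore layout in a temp dir,
# assuming each file is one pickled value (as with IPython's pickleshare store)
with tempfile.TemporaryDirectory() as tmp:
    store = pathlib.Path(tmp) / "autorestore"
    store.mkdir()
    (store / "API_KEY").write_bytes(pickle.dumps("your_api_key_here"))

    restored = {f.name: pickle.loads(f.read_bytes()) for f in store.iterdir()}
    print(restored)  # {'API_KEY': 'your_api_key_here'}
```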
&lt;p&gt;&lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/04_Separating_Data_and_Instructions.ipynb"&gt;Chapter 4: Separating Data and Instructions&lt;/a&gt; included some interesting notes on Claude's support for content wrapped in XML-tag-style delimiters:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While Claude can recognize and work with a wide range of separators and delimeters, we recommend that you &lt;strong&gt;use specifically XML tags as separators&lt;/strong&gt; for Claude, as Claude was trained specifically to recognize XML tags as a prompt organizing mechanism. Outside of function calling, &lt;strong&gt;there are no special sauce XML tags that Claude has been trained on that you should use to maximally boost your performance&lt;/strong&gt;. We have purposefully made Claude very malleable and customizable this way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Plus this note on the importance of avoiding typos, with a nod back to the &lt;a href="https://simonwillison.net/2023/Apr/5/sycophancy-sandbagging/"&gt;problem of sandbagging&lt;/a&gt; where models match their intelligence and tone to that of their prompts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This is an important lesson about prompting: &lt;strong&gt;small details matter&lt;/strong&gt;! It's always worth it to &lt;strong&gt;scrub your prompts for typos and grammatical errors&lt;/strong&gt;. Claude is sensitive to patterns (in its early years, before finetuning, it was a raw text-prediction tool), and it's more likely to make mistakes when you make mistakes, smarter when you sound smart, sillier when you sound silly, and so on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/05_Formatting_Output_and_Speaking_for_Claude.ipynb"&gt;Chapter 5: Formatting Output and Speaking for Claude&lt;/a&gt; includes notes on one of Claude's most interesting features: &lt;em&gt;prefill&lt;/em&gt;, where you can tell it how to start its response:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-s1"&gt;messages&lt;/span&gt;.&lt;span class="pl-en"&gt;create&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"claude-3-haiku-20240307"&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;max_tokens&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;100&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;messages&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
        {&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"user"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;"JSON facts about cats"&lt;/span&gt;},
        {&lt;span class="pl-s"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;"assistant"&lt;/span&gt;, &lt;span class="pl-s"&gt;"content"&lt;/span&gt;: &lt;span class="pl-s"&gt;"{"&lt;/span&gt;}
    ]
)&lt;/pre&gt;
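Since the prefilled <code>"{"</code> is part of the conversation rather than the completion, you prepend it yourself before parsing the response - a sketch using a hypothetical completion string:

```python
import json

# The "{" prefill is not echoed back in the completion,
# so prepend it before parsing (this completion text is hypothetical)
prefill = "{"
completion = '"fact": "Cats sleep up to 16 hours a day", "source": "hypothetical"}'
data = json.loads(prefill + completion)
print(data["fact"])
```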

&lt;p&gt;Things start to get really interesting in &lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/06_Precognition_Thinking_Step_by_Step.ipynb"&gt;Chapter 6: Precognition (Thinking Step by Step)&lt;/a&gt;, which suggests using XML tags to help the model consider different arguments prior to generating a final answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Is this review sentiment positive or negative? First, write the best arguments for each side in &amp;lt;positive-argument&amp;gt; and &amp;lt;negative-argument&amp;gt; XML tags, then answer.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The tags make it easy to strip out the "thinking out loud" portions of the response.&lt;/p&gt;
&lt;p&gt;It also warns about Claude's sensitivity to ordering. If you give Claude two options (e.g. for sentiment analysis):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In most situations (but not all, confusingly enough), &lt;strong&gt;Claude is more likely to choose the second of two options&lt;/strong&gt;, possibly because in its training data from the web, second options were more likely to be correct.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This effect can be reduced using the thinking out loud / brainstorming prompting techniques.&lt;/p&gt;
&lt;p&gt;A related tip is proposed in &lt;a href="https://github.com/anthropics/courses/blob/master/prompt_engineering_interactive_tutorial/Anthropic%201P/08_Avoiding_Hallucinations.ipynb"&gt;Chapter 8: Avoiding Hallucinations&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How do we fix this? Well, a great way to reduce hallucinations on long documents is to &lt;strong&gt;make Claude gather evidence first.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;In this case, we &lt;strong&gt;tell Claude to first extract relevant quotes, then base its answer on those quotes&lt;/strong&gt;. Telling Claude to do so here makes it correctly notice that the quote does not answer the question.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I really like the example prompt they provide here, for answering complex questions against a long document:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;&amp;lt;question&amp;gt;What was Matterport's subscriber base on the precise date of May 31, 2020?&amp;lt;/question&amp;gt;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Please read the below document. Then, in &amp;lt;scratchpad&amp;gt; tags, pull the most relevant quote from the document and consider whether it answers the user's question or whether it lacks sufficient detail. Then write a brief numerical answer in &amp;lt;answer&amp;gt; tags.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41395921"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xml"&gt;xml&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/uv"&gt;uv&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="xml"/><category term="ai"/><category term="jupyter"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="uv"/></entry><entry><title>Anthropic cookbook: multimodal</title><link href="https://simonwillison.net/2024/Jul/10/anthropic-cookbook-multimodal/#atom-tag" rel="alternate"/><published>2024-07-10T18:38:10+00:00</published><updated>2024-07-10T18:38:10+00:00</updated><id>https://simonwillison.net/2024/Jul/10/anthropic-cookbook-multimodal/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/anthropics/anthropic-cookbook/tree/main/multimodal"&gt;Anthropic cookbook: multimodal&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'm currently on the lookout for high quality sources of information about vision LLMs, including prompting tricks for getting the most out of them.&lt;/p&gt;
&lt;p&gt;This set of Jupyter notebooks from Anthropic (published four months ago to accompany the original Claude 3 models) is the best I've found so far. &lt;a href="https://github.com/anthropics/anthropic-cookbook/blob/main/multimodal/best_practices_for_vision.ipynb"&gt;Best practices for using vision with Claude&lt;/a&gt; includes advice on multi-shot prompting with examples, plus this interesting think step-by-step style prompt for improving Claude's ability to count the dogs in an image:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You have perfect vision and pay great attention to detail which makes you an expert at counting objects in images. How many dogs are in this picture? Before providing the answer in &lt;code&gt;&amp;lt;answer&amp;gt;&lt;/code&gt; tags, think step by step in &lt;code&gt;&amp;lt;thinking&amp;gt;&lt;/code&gt; tags and analyze every part of the image.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/vision-llms"&gt;vision-llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="jupyter"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/><category term="vision-llms"/></entry><entry><title>marimo.app</title><link href="https://simonwillison.net/2024/Jun/29/marimo-app/#atom-tag" rel="alternate"/><published>2024-06-29T23:07:42+00:00</published><updated>2024-06-29T23:07:42+00:00</updated><id>https://simonwillison.net/2024/Jun/29/marimo-app/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://marimo.app/"&gt;marimo.app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Marimo reactive notebook (&lt;a href="https://simonwillison.net/2024/Jan/12/marimo/"&gt;previously&lt;/a&gt;) - a Python notebook that's effectively a cross between Jupyter and Observable - now also has a version that runs entirely in your browser using WebAssembly and Pyodide. Here's &lt;a href="https://docs.marimo.io/guides/wasm.html"&gt;the documentation&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/marimo"&gt;marimo&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="jupyter"/><category term="observable"/><category term="webassembly"/><category term="pyodide"/><category term="marimo"/></entry><entry><title>fastlite</title><link href="https://simonwillison.net/2024/May/27/fastlite/#atom-tag" rel="alternate"/><published>2024-05-27T21:14:01+00:00</published><updated>2024-05-27T21:14:01+00:00</updated><id>https://simonwillison.net/2024/May/27/fastlite/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://answerdotai.github.io/fastlite/"&gt;fastlite&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;New Python library from Jeremy Howard that adds some neat utility functions and syntactic sugar to my &lt;a href="https://sqlite-utils.datasette.io/"&gt;sqlite-utils&lt;/a&gt; Python library, specifically for interactive use in Jupyter notebooks.&lt;/p&gt;
&lt;p&gt;The autocomplete support through newly exposed dynamic properties is particularly neat, as is the &lt;code&gt;diagram(db.tables)&lt;/code&gt; utility for rendering a graphviz diagram showing foreign key relationships between all of the tables.&lt;/p&gt;
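The foreign-key information a diagram like that needs is available from SQLite itself - a stdlib-only sketch of the introspection (not fastlite's actual implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES author(id)
);
""")

# PRAGMA foreign_key_list returns (id, seq, table, from, to, ...) rows -
# exactly the edges a relationship diagram needs
edges = [
    (row[2], row[3], row[4])
    for row in conn.execute("PRAGMA foreign_key_list(book)")
]
print(edges)  # [('author', 'author_id', 'id')]
```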

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/jeremyphoward/status/1795170005367050655"&gt;@jeremyphoward&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jeremy-howard"&gt;jeremy-howard&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="sqlite"/><category term="jupyter"/><category term="sqlite-utils"/><category term="jeremy-howard"/></entry><entry><title>Interesting ideas in Observable Framework</title><link href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#atom-tag" rel="alternate"/><published>2024-03-03T17:54:21+00:00</published><updated>2024-03-03T17:54:21+00:00</updated><id>https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#atom-tag</id><summary type="html">
    &lt;p&gt;Mike Bostock, &lt;a href="https://observablehq.com/blog/observable-2-0"&gt;Announcing: Observable Framework&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today we’re launching &lt;a href="https://observablehq.com/product"&gt;Observable 2.0&lt;/a&gt; with a bold new vision: an open-source static site generator for building fast, beautiful data apps, dashboards, and reports.&lt;/p&gt;
&lt;p&gt;Our mission is to help teams communicate more effectively with data. Effective presentation of data is critical for deep insight, nuanced understanding, and informed decisions. Observable notebooks are great for ephemeral, &lt;em&gt;ad hoc&lt;/em&gt; data exploration. But notebooks aren't well-suited for polished dashboards and apps.&lt;/p&gt;
&lt;p&gt;Enter &lt;a href="https://observablehq.com/framework/"&gt;Observable Framework&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are a lot of &lt;em&gt;really&lt;/em&gt; interesting ideas in Observable Framework.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#static-site-dashboards"&gt;A static site generator for data projects and dashboards&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#javascript-in-markdown"&gt;JavaScript in Markdown&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#everything-reactive"&gt;Everything is still reactive&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#only-code-you-use"&gt;Only include the code that you use&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#cache-data-at-build"&gt;Cache your data at build time&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#comparison-to-observable-notebooks"&gt;Comparison to Observable Notebooks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/#change-in-strategy"&gt;A change in strategy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="static-site-dashboards"&gt;A static site generator for data projects and dashboards&lt;/h4&gt;
&lt;p&gt;At its heart, Observable Framework is a static site generator. You give it a mixture of Markdown and JavaScript (and potentially other languages too) and it compiles them all together into fast loading interactive pages.&lt;/p&gt;
&lt;p&gt;It ships with a full featured hot-reloading server, so you can edit those files in your editor, hit save and see the changes reflected instantly in your browser.&lt;/p&gt;
&lt;p&gt;Once you're happy with your work you can run a build command to turn it into a set of static files ready to deploy to a server - or you can use the &lt;code&gt;npm run deploy&lt;/code&gt; command to deploy it directly to Observable's own authenticated sharing platform.&lt;/p&gt;
&lt;h4 id="javascript-in-markdown"&gt;JavaScript in Markdown&lt;/h4&gt;
&lt;p&gt;The key to the design of Observable Framework is the way it uses JavaScript in Markdown to create interactive documents.&lt;/p&gt;
&lt;p&gt;Here's what that looks like:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-mh"&gt;# &lt;span class="pl-en"&gt;This is a document&lt;/span&gt;&lt;/span&gt;

Markdown content goes here.

This will output 1870:

&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt;
&lt;span class="pl-c1"&gt;34&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;55&lt;/span&gt;
&lt;span class="pl-s"&gt;```&lt;/span&gt;

And here's the current date and time, updating constantly:

&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt;
&lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-en"&gt;Date&lt;/span&gt;(now)
&lt;span class="pl-s"&gt;```&lt;/span&gt;

The same thing as an inline string: ${new Date(now)}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Any Markdown code block tagged &lt;code&gt;js&lt;/code&gt; will be executed as JavaScript in the user's browser. This is an &lt;em&gt;incredibly&lt;/em&gt; powerful abstraction - anything you can do in JavaScript (which these days is effectively anything at all) can now be seamlessly integrated into your document.&lt;/p&gt;
&lt;p&gt;In the above example the &lt;code&gt;now&lt;/code&gt; value is interesting - it's a special variable that provides the current time in milliseconds since the epoch, updating constantly. Because &lt;code&gt;now&lt;/code&gt; updates constantly, the display value of the cell and that inline expression will update constantly as well.&lt;/p&gt;
&lt;p&gt;If you've used Observable Notebooks before this will feel familiar - but notebooks involve code and markdown authored in separate cells. With Framework they are all now part of a single text document.&lt;/p&gt;
&lt;p&gt;Aside: when I tried the above example I found that the &lt;code&gt;${new Date(now)}&lt;/code&gt; inline expression displayed as &lt;code&gt;Mon Feb 19 2024 20:46:02 GMT-0800 (Pacific Standard Time)&lt;/code&gt; while the &lt;code&gt;js&lt;/code&gt; block displayed as &lt;code&gt;2024-02-20T04:46:02.641Z&lt;/code&gt;. That's because inline expressions use the JavaScript default string representation of the object, while the &lt;code&gt;js&lt;/code&gt; block uses the Observable &lt;code&gt;display()&lt;/code&gt; function which has its own rules for how to display different types of objects, &lt;a href="https://github.com/observablehq/inspector/blob/main/src/inspect.js"&gt;visible in inspect/src/inspect.js&lt;/a&gt;.&lt;/p&gt;
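&lt;p&gt;You can see the difference with plain JavaScript outside of Framework. Here's a sketch, with a hypothetical fixed timestamp standing in for the reactive &lt;code&gt;now&lt;/code&gt; variable:&lt;/p&gt;

```javascript
// A fixed milliseconds-since-epoch value stands in for Framework's
// reactive `now` variable (hypothetical, for illustration only).
const now = 1708404362641;

// Inline ${...} expressions coerce the object to its default string form,
// which for Date is timezone-dependent, e.g.
// "Mon Feb 19 2024 20:46:02 GMT-0800 (Pacific Standard Time)":
const inline = String(new Date(now));

// The js block goes through display(), which shows Dates in ISO form:
const block = new Date(now).toISOString();
// block is "2024-02-20T04:46:02.641Z"
```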
&lt;h4 id="everything-reactive"&gt;Everything is still reactive&lt;/h4&gt;
&lt;p&gt;The best feature of Observable Notebooks is their &lt;em&gt;reactivity&lt;/em&gt; - the way cells automatically refresh when other cells they depend on change. This is a big difference from Python's popular Jupyter notebooks, and is the signature feature of &lt;a href="https://marimo.io/"&gt;marimo&lt;/a&gt;, a new Python notebook tool.&lt;/p&gt;
&lt;p&gt;Observable Framework retains this feature in its new JavaScript Markdown documents.&lt;/p&gt;
&lt;p&gt;This is particularly useful when working with form inputs. You can drop an input onto a page and refer to its value throughout the rest of the document, adding realtime interactivity to documents incredibly easily.&lt;/p&gt;
&lt;p&gt;Here's an example: I ported one of my &lt;a href="https://observablehq.com/@simonw/datasette-downloads-per-day-with-observable-plot"&gt;favourite notebooks&lt;/a&gt; - a tool for viewing download statistics for my various Python packages - to Framework.&lt;/p&gt;
&lt;p&gt;The Observable Framework version can be found at &lt;a href="https://simonw.github.io/observable-framework-experiments/package-downloads"&gt;https://simonw.github.io/observable-framework-experiments/package-downloads&lt;/a&gt; - source code &lt;a href="https://github.com/simonw/observable-framework-experiments/blob/main/docs/package-downloads.md?plain=1"&gt;here on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/pypi-dashboard.gif" alt="Animated demo showing PyPI download stats for Datasette projects - as I switch a select menu between sqlite-utils and csv-diff and shot-scraper the displayed chart updates to match." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This entire thing is just 57 lines of Markdown. Here's the code with additional comments (and presented in a slightly different order - the order of code blocks doesn't matter in Observable thanks to reactivity).&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-mh"&gt;# &lt;span class="pl-en"&gt;PyPI download stats for Datasette projects&lt;/span&gt;&lt;/span&gt;

Showing downloads for &lt;span class="pl-s"&gt;**&lt;/span&gt;${packageName}&lt;span class="pl-s"&gt;**&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It starts with a Markdown &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt; heading and text that shows the name of the selected package.&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt; echo
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-c1"&gt;packageName&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;view&lt;/span&gt;(&lt;span class="pl-smi"&gt;Inputs&lt;/span&gt;.&lt;span class="pl-c1"&gt;select&lt;/span&gt;(packages, {
  value&lt;span class="pl-k"&gt;:&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sqlite-utils&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  label&lt;span class="pl-k"&gt;:&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Package&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}));
&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This block displays the select widget allowing the user to pick one of the items from the &lt;code&gt;packages&lt;/code&gt; array (defined later on).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Inputs.select()&lt;/code&gt; is a built-in method provided by Framework, described in the &lt;a href="https://observablehq.com/framework/lib/inputs"&gt;Observable Inputs&lt;/a&gt; documentation.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;view()&lt;/code&gt; function is new in Observable Framework - it's the thing that enables the reactivity, ensuring that updates to the input selection are acted on by other code blocks in the document.&lt;/p&gt;
&lt;p&gt;Because &lt;code&gt;packageName&lt;/code&gt; is defined with &lt;code&gt;const&lt;/code&gt; it becomes a variable that is visible to other &lt;code&gt;js&lt;/code&gt; blocks on the page. It's used by this next block:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt; echo
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-c1"&gt;data&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;d3&lt;/span&gt;.&lt;span class="pl-en"&gt;json&lt;/span&gt;(
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;https://datasette.io/content/stats.json?_size=max&amp;amp;package=&lt;span class="pl-s1"&gt;&lt;span class="pl-pse"&gt;${&lt;/span&gt;packageName&lt;span class="pl-pse"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;amp;_sort_desc=date&amp;amp;_shape=array&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;
);&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here we are fetching the data that we need for the chart. I'm using &lt;code&gt;d3.json()&lt;/code&gt; (all of D3 is available in Framework) to fetch the data from a URL that includes the selected package name.&lt;/p&gt;
&lt;p&gt;The data is coming from &lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt;, using the Datasette JSON API. I have a SQLite table at &lt;a href="https://datasette.io/content/stats"&gt;datasette.io/content/stats&lt;/a&gt; that's updated once a day with the latest PyPI package statistics via a convoluted series of GitHub Actions workflows, &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/#baked-data-datasette-io"&gt;described previously&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Adding &lt;code&gt;.json&lt;/code&gt; to that URL returns the JSON, then I ask for rows for that particular package, sorted descending by date and returning the maximum number of rows (1,000) as a JSON array of objects.&lt;/p&gt;
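&lt;p&gt;Assembled with the standard &lt;code&gt;URL&lt;/code&gt; API rather than a template literal, that request looks something like this - a sketch, with the URL and parameter names taken from the code above:&lt;/p&gt;

```javascript
// Build the Datasette JSON API URL for one package's download stats.
const packageName = "sqlite-utils";
const url = new URL("https://datasette.io/content/stats.json");
url.searchParams.set("_size", "max");         // maximum number of rows (1,000)
url.searchParams.set("package", packageName); // filter to the selected package
url.searchParams.set("_sort_desc", "date");   // newest rows first
url.searchParams.set("_shape", "array");      // plain JSON array of row objects
```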
&lt;p&gt;Now that we have &lt;code&gt;data&lt;/code&gt; as a variable we can manipulate it slightly for use with Observable Plot - parsing the SQLite string dates into JavaScript &lt;code&gt;Date&lt;/code&gt; objects:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt; echo
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-c1"&gt;data_with_dates&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-smi"&gt;data&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-k"&gt;function&lt;/span&gt;(&lt;span class="pl-smi"&gt;d&lt;/span&gt;) {
  d.date = d3.&lt;span class="pl-en"&gt;timeParse&lt;/span&gt;(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;%Y-%m-%d&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;)(d.date);
  return d;
})
```&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This code is ready to render as a chart. I'm using &lt;a href="https://observablehq.com/plot"&gt;Observable Plot&lt;/a&gt; - also packaged with Framework:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```js echo
Plot.plot({
  y: {
    grid: true,
    label: `${packageName} PyPI downloads per day`
  },
  width: width,
  marginLeft: 60,
  marks: [
    Plot.line(data_with_dates, {
      x: "date",
      y: "downloads",
      title: "downloads",
      tip: true
    })
  ]
})
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So we have one cell that lets the user pick the package they want, a cell that fetches that data, a cell that processes it and a cell that renders it as a chart.&lt;/p&gt;
&lt;p&gt;There's one more piece of the puzzle: where does that list of packages come from? I fetch that with another API call to Datasette. Here I'm using a SQL query executed against the &lt;a href="https://datasette.io/content"&gt;/content&lt;/a&gt; database directly:&lt;/p&gt;
&lt;div class="highlight highlight-text-md"&gt;&lt;pre&gt;&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt; echo
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-c1"&gt;packages_sql&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;select package from stats group by package order by max(downloads) desc&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;```&lt;/span&gt;
&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;span class="pl-en"&gt;js&lt;/span&gt; echo
&lt;span class="pl-k"&gt;const&lt;/span&gt; &lt;span class="pl-c1"&gt;packages&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;fetch&lt;/span&gt;(
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;https://datasette.io/content.json?sql=&lt;span class="pl-s1"&gt;&lt;span class="pl-pse"&gt;${&lt;/span&gt;&lt;span class="pl-c1"&gt;encodeURIComponent&lt;/span&gt;(&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s1"&gt;    packages_sql&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s1"&gt;  )&lt;span class="pl-pse"&gt;}&lt;/span&gt;&lt;/span&gt;&amp;amp;_size=max&amp;amp;_shape=arrayfirst&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;
).&lt;span class="pl-c1"&gt;then&lt;/span&gt;((&lt;span class="pl-smi"&gt;r&lt;/span&gt;) &lt;span class="pl-k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;r&lt;/span&gt;.&lt;span class="pl-en"&gt;json&lt;/span&gt;());
&lt;span class="pl-s"&gt;```&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;_shape=arrayfirst&lt;/code&gt; is a shortcut for getting back a JSON array of the first column of the resulting rows.&lt;/p&gt;
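&lt;p&gt;In other words, where &lt;code&gt;_shape=arrays&lt;/code&gt; would return one array per row, &lt;code&gt;arrayfirst&lt;/code&gt; flattens that down to just the first column. A sketch of the difference, with hypothetical rows:&lt;/p&gt;

```javascript
// What _shape=arrays would return for the packages query (hypothetical rows):
const rows = [["datasette"], ["sqlite-utils"], ["csv-diff"]];

// _shape=arrayfirst is equivalent to keeping only the first column:
const arrayfirst = rows.map((row) => row[0]);
// arrayfirst is ["datasette", "sqlite-utils", "csv-diff"]
```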
&lt;p&gt;That's all there is to it! It's a pretty tiny amount of code for a full interactive dashboard.&lt;/p&gt;
&lt;h4 id="only-code-you-use"&gt;Only include the code that you use&lt;/h4&gt;
&lt;p&gt;You may have noticed that my dashboard example uses several additional libraries - &lt;code&gt;Inputs&lt;/code&gt; for the form element, &lt;code&gt;d3&lt;/code&gt; for the data fetching and &lt;code&gt;Plot&lt;/code&gt; for the chart rendering.&lt;/p&gt;
&lt;p&gt;Observable Framework is smart about these. It implements lazy loading in development mode, so code is only loaded the first time you attempt to use it in a cell.&lt;/p&gt;
&lt;p&gt;When you build and deploy your application, Framework automatically loads just the referenced library code from the &lt;a href="https://www.jsdelivr.com/"&gt;jsDelivr CDN&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id="cache-data-at-build"&gt;Cache your data at build time&lt;/h4&gt;
&lt;p&gt;One of the most interesting features of Framework is its &lt;a href="https://observablehq.com/framework/loaders"&gt;Data loader&lt;/a&gt; mechanism.&lt;/p&gt;
&lt;p&gt;Dashboards built using Framework can load data at runtime from anywhere using &lt;code&gt;fetch()&lt;/code&gt; requests (or wrappers around them). This is how Observable Notebooks work too, but it leaves the performance of your dashboard at the mercy of whatever backends you are talking to.&lt;/p&gt;
&lt;p&gt;Dashboards benefit from fast loading times. Framework encourages a pattern where you build the data for the dashboard at deploy time, bundling it together into static files containing just the subset of the data needed for the dashboard. These can be served lightning fast from the same static hosting as the dashboard code itself.&lt;/p&gt;
&lt;p&gt;The design of the data loaders is beautifully simple and powerful. A data loader is a script that can be written in &lt;em&gt;any&lt;/em&gt; programming language. At build time, Framework executes that script and saves whatever it outputs to a file.&lt;/p&gt;
&lt;p&gt;A data loader can be as simple as the following, saved as &lt;code&gt;quakes.json.sh&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;curl https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When the application is built, that filename tells Framework the destination file (&lt;code&gt;quakes.json&lt;/code&gt;) and the loader to execute (&lt;code&gt;.sh&lt;/code&gt;).&lt;/p&gt;
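&lt;p&gt;The convention is simple enough to sketch in a few lines - strip the final extension to get the output filename, and use that extension to pick the interpreter (illustrative code, not Framework's actual implementation):&lt;/p&gt;

```javascript
// Split a data loader filename into its output file and loader extension.
function parseLoaderName(filename) {
  const lastDot = filename.lastIndexOf(".");
  return {
    output: filename.slice(0, lastDot), // file the script's output is saved to
    loader: filename.slice(lastDot),    // extension that selects the interpreter
  };
}
// parseLoaderName("quakes.json.sh") gives { output: "quakes.json", loader: ".sh" }
```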
&lt;p&gt;This means you can load data from any source using any technology you like, provided it has the ability to output JSON or CSV or some other useful format to standard output.&lt;/p&gt;
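&lt;p&gt;A JavaScript version of the same pattern might look like this - a hypothetical loader saved as &lt;code&gt;top-packages.csv.js&lt;/code&gt;, with inline rows standing in for data you would normally fetch:&lt;/p&gt;

```javascript
// At build time Framework would run this script and save whatever it writes
// to standard output as top-packages.csv. The rows here are hypothetical.
const rows = [
  { package: "datasette", downloads: 4927 },
  { package: "sqlite-utils", downloads: 4101 },
];
const csv = [
  "package,downloads",
  ...rows.map((r) => `${r.package},${r.downloads}`),
].join("\n");
process.stdout.write(csv);
```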
&lt;h4 id="comparison-to-observable-notebooks"&gt;Comparison to Observable Notebooks&lt;/h4&gt;
&lt;p&gt;Mike introduced Observable Framework as &lt;em&gt;Observable 2.0&lt;/em&gt;. It's worth reviewing how this system compares to the original Observable Notebook platform.&lt;/p&gt;
&lt;p&gt;I've been a huge fan of Observable Notebooks for years - &lt;a href="https://simonwillison.net/tags/observable/"&gt;38 blog posts and counting&lt;/a&gt;! The most obvious comparison is to Jupyter Notebooks, where they have some key differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Observable notebooks use JavaScript, not Python.&lt;/li&gt;
&lt;li&gt;The notebook editor itself isn't open source - it's a hosted product provided on &lt;a href="https://observablehq.com/"&gt;observablehq.com&lt;/a&gt;. You can export the notebooks as static files and run them anywhere you like, but the editor itself is a proprietary product.&lt;/li&gt;
&lt;li&gt;Observable cells are &lt;em&gt;reactive&lt;/em&gt;. This is the key difference with Jupyter: any time you change a cell all other cells that depend on that cell are automatically re-evaluated, similar to Excel.&lt;/li&gt;
&lt;li&gt;The JavaScript syntax they use isn't quite standard JavaScript - they had to invent a new &lt;code&gt;viewof&lt;/code&gt; keyword to support their reactivity model.&lt;/li&gt;
&lt;li&gt;Editable notebooks are a pretty complex proprietary file format. They don't play well with tools like Git, to the point that Observable ended up implementing their own custom version control and collaboration systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Observable Framework reuses many of the ideas (and code) from Observable Notebooks, but with some crucial differences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Notebooks (really documents) are now &lt;strong&gt;single text files&lt;/strong&gt; - Markdown files with embedded JavaScript blocks. It's all still reactive, but the file format is much simpler and can be edited using any text editor, and checked into Git.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;all open source&lt;/strong&gt;. Everything is under an ISC license (OSI approved) and you can run the full editing stack on your own machine.&lt;/li&gt;
&lt;li&gt;It's all just standard JavaScript now - &lt;strong&gt;no custom syntax&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="change-in-strategy"&gt;A change in strategy&lt;/h4&gt;
&lt;p&gt;Reading the tea leaves a bit, this also looks to me like a strategic change of direction for Observable as a company. Their previous focus was on building great collaboration tools for data science and analytics teams, based around the proprietary Observable Notebook editor.&lt;/p&gt;
&lt;p&gt;With Framework they appear to be leaning more into the developer tools space.&lt;/p&gt;
&lt;p&gt;On Twitter &lt;a href="http://twitter.com/observablehq"&gt;@observablehq&lt;/a&gt; describes itself as "The end-to-end solution for developers who want to build and host dashboards that don’t suck" - the Internet Archive copy &lt;a href="https://web.archive.org/web/20231003212202/https://twitter.com/observablehq"&gt;from October 3rd 2023&lt;/a&gt; showed "Build data visualizations, dashboards, and data apps that impact your business — faster."&lt;/p&gt;
&lt;p&gt;I'm excited to see where this goes. I've limited my usage of Observable Notebooks a little in the past purely due to the proprietary nature of their platform and the limitations placed on free accounts (mainly the lack of free private notebooks), while still having enormous respect for the technology and enthusiastically adopting their open source libraries such as &lt;a href="https://observablehq.com/plot/"&gt;Observable Plot&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Observable Framework addresses basically all of my reservations. It's a fantastic new expression of the ideas that made Observable Notebooks so compelling, and I expect to use it for all sorts of interesting projects in the future.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pypi"&gt;pypi&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/d3"&gt;d3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mike-bostock"&gt;mike-bostock&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-framework"&gt;observable-framework&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable-plot"&gt;observable-plot&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="javascript"/><category term="open-source"/><category term="pypi"/><category term="d3"/><category term="jupyter"/><category term="observable"/><category term="mike-bostock"/><category term="observable-framework"/><category term="observable-plot"/></entry><entry><title>Marimo</title><link href="https://simonwillison.net/2024/Jan/12/marimo/#atom-tag" rel="alternate"/><published>2024-01-12T21:17:57+00:00</published><updated>2024-01-12T21:17:57+00:00</updated><id>https://simonwillison.net/2024/Jan/12/marimo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://marimo.io/"&gt;Marimo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is a really interesting new twist on Python notebooks.&lt;/p&gt;

&lt;p&gt;The most powerful feature is that these notebooks are reactive: if you change the value or code in a cell (or change the value in an input widget) every other cell that depends on that value will update automatically. It’s the same pattern implemented by Observable JavaScript notebooks, but now it works for Python.&lt;/p&gt;

&lt;p&gt;There are a bunch of other nice touches too. The notebook file format is a regular Python file, and those files can be run as “applications” in addition to being edited in the notebook interface. The interface is very nicely built, especially for such a young project—they even have GitHub Copilot integration for their CodeMirror cell editors.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=38971966"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-copilot"&gt;github-copilot&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/marimo"&gt;marimo&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="python"/><category term="jupyter"/><category term="observable"/><category term="github-copilot"/><category term="marimo"/></entry><entry><title>Bottleneck T5 Text Autoencoder</title><link href="https://simonwillison.net/2023/Oct/10/bottleneck/#atom-tag" rel="alternate"/><published>2023-10-10T02:12:38+00:00</published><updated>2023-10-10T02:12:38+00:00</updated><id>https://simonwillison.net/2023/Oct/10/bottleneck/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://linus.zone/contra-colab"&gt;Bottleneck T5 Text Autoencoder&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Colab notebook by Linus Lee demonstrating his Contra Bottleneck T5 embedding model, which can take up to 512 tokens of text, convert that into an embedding vector of 1,024 floating point numbers... and then reconstruct the original text (or a close imitation) from the embedding again.&lt;/p&gt;

&lt;p&gt;This allows for some fascinating tricks, where you can do things like generate embeddings for two completely different sentences and then reconstruct a new sentence that combines the weights from both.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/thesephist/status/1711597804974739530"&gt;@thesephist&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="ai"/><category term="jupyter"/><category term="generative-ai"/><category term="llms"/><category term="embeddings"/></entry><entry><title>codespaces-jupyter</title><link href="https://simonwillison.net/2023/Apr/14/codespaces-jupyter/#atom-tag" rel="alternate"/><published>2023-04-14T22:38:21+00:00</published><updated>2023-04-14T22:38:21+00:00</updated><id>https://simonwillison.net/2023/Apr/14/codespaces-jupyter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/github/codespaces-jupyter"&gt;codespaces-jupyter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is really neat. Click “Use this template” -&gt; “Open in a codespace” and you get a full in-browser VS Code interface where you can open existing notebook files (or create new ones) and start playing with them straight away.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://fedi.simonwillison.net/@simon/110199563721187965"&gt;@simon thread about online notebooks&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github-codespaces"&gt;github-codespaces&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="python"/><category term="jupyter"/><category term="github-codespaces"/></entry><entry><title>Grokking Stable Diffusion</title><link href="https://simonwillison.net/2022/Sep/4/grokking-stable-diffusion/#atom-tag" rel="alternate"/><published>2022-09-04T18:50:29+00:00</published><updated>2022-09-04T18:50:29+00:00</updated><id>https://simonwillison.net/2022/Sep/4/grokking-stable-diffusion/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wFCHhGLFooW_pf1?usp=sharing"&gt;Grokking Stable Diffusion&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jonathan Whitaker built this interactive Jupyter notebook that walks through how to use Stable Diffusion from Python step-by-step, and then dives deep into helping understand the different components of the implementation, including how text is encoded, how the diffusion loop works and more. This is by far the most useful tool I’ve seen yet for understanding how this model actually works. You can run Jonathan’s notebook directly on Google Colab, with a GPU.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/johnowhitaker/status/1565710033463156739"&gt;@johnowhitaker&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="text-to-image"/></entry><entry><title>storysniffer</title><link href="https://simonwillison.net/2022/Aug/1/storysniffer/#atom-tag" rel="alternate"/><published>2022-08-01T23:40:13+00:00</published><updated>2022-08-01T23:40:13+00:00</updated><id>https://simonwillison.net/2022/Aug/1/storysniffer/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://palewi.re/docs/storysniffer/"&gt;storysniffer&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ben Welsh built a small Python library that guesses if a URL points to an article on a news website, or if it’s more likely to be a category page or /about page or similar. I really like this as an example of what you can do with a tiny machine learning model: the model is bundled as a ~3MB pickle file as part of the package, and the repository includes the Jupyter notebook that was used to train it.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://twitter.com/palewire/status/1554115271073452032"&gt;@palewire&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="python"/><category term="jupyter"/><category term="ben-welsh"/></entry><entry><title>Weeknotes: datasette-jupyterlite, s3-credentials and a Python packaging talk</title><link href="https://simonwillison.net/2021/Nov/5/datasette-jupyterlite/#atom-tag" rel="alternate"/><published>2021-11-05T05:04:58+00:00</published><updated>2021-11-05T05:04:58+00:00</updated><id>https://simonwillison.net/2021/Nov/5/datasette-jupyterlite/#atom-tag</id><summary type="html">
    &lt;p&gt;My big project this week was &lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials, described yesterday&lt;/a&gt; - but I also put together a fun experimental Datasette plugin bundling JupyterLite and wrote up my PyGotham talk on Python packaging.&lt;/p&gt;
&lt;h4&gt;datasette-jupyterlite&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://github.com/jupyterlite/jupyterlite"&gt;JupyterLite&lt;/a&gt; is absolutely incredible: it's a full, working distribution of Jupyter that runs entirely in a browser, thanks to a Python interpreter (and various other parts of the scientific Python stack) that has been compiled to WebAssembly by the &lt;a href="https://pyodide.org//"&gt;Pyodide&lt;/a&gt; project.&lt;/p&gt;
&lt;p&gt;Since it's just static JavaScript (and WASM modules) it's possible to host it anywhere that can run a web server.&lt;/p&gt;
&lt;p&gt;Datasette runs a web server...&lt;/p&gt;
&lt;p&gt;So, I built &lt;a href="https://datasette.io/plugins/datasette-jupyterlite"&gt;datasette-jupyterlite&lt;/a&gt; - a Datasette plugin that bundles JupyterLite and serves it up as part of the Datasette instance.&lt;/p&gt;
&lt;p&gt;You can try &lt;a href="https://latest-with-plugins.datasette.io/jupyterlite/"&gt;a live demo here&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2021/datasette-jupyterlite.png" alt="Screenshot showing JupyterLite hosted by Datasette" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's some Python code that will retrieve data from the associated Datasette instance and pull it into a Pandas DataFrame:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pandas&lt;/span&gt;, &lt;span class="pl-s1"&gt;pyodide&lt;/span&gt;
&lt;span class="pl-s1"&gt;pandas&lt;/span&gt;.&lt;span class="pl-en"&gt;read_csv&lt;/span&gt;(&lt;span class="pl-s1"&gt;pyodide&lt;/span&gt;.&lt;span class="pl-en"&gt;open_url&lt;/span&gt;(
  &lt;span class="pl-s"&gt;"https://latest-with-plugins.datasette.io/github/stars.csv"&lt;/span&gt;)
)&lt;/pre&gt;
&lt;p&gt;(I haven't yet found a way to do this with a relative rather than absolute URL.)&lt;/p&gt;
&lt;p&gt;The best part of this is that it works in &lt;a href="https://datasette.io/desktop"&gt;Datasette Desktop&lt;/a&gt;! You can install the plugin using the "Install and manage plugins" menu item to get a version of Jupyter running in Python running in WebAssembly running in V8 running in Chromium running in Electron.&lt;/p&gt;
&lt;p&gt;The plugin implementation is &lt;a href="https://github.com/simonw/datasette-jupyterlite/blob/0.1a0/datasette_jupyterlite/__init__.py"&gt;just 30 lines of code&lt;/a&gt; - it uses the &lt;a href="https://pypi.org/project/jupyterlite/"&gt;jupyterlite&lt;/a&gt; Python package which bundles a &lt;code&gt;.tgz&lt;/code&gt; file containing all of the required static assets, then serves files directly out of that tarfile.&lt;/p&gt;
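&lt;p&gt;The general pattern here - reading static assets directly out of a &lt;code&gt;.tgz&lt;/code&gt; rather than unpacking it to disk first - is a neat trick in its own right. Here's a rough sketch of it (illustrative names only, not the plugin's actual code):&lt;/p&gt;

```python
import io
import tarfile

# Stand-in for the bundled archive of static assets
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"JupyterLite static assets"
    info = tarfile.TarInfo(name="package/index.html")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

def serve_member(archive: io.BytesIO, path: str) -> bytes:
    """Read one file's bytes straight out of the tarball."""
    archive.seek(0)
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        return tar.extractfile(path).read()

print(serve_member(buf, "package/index.html").decode())
```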
&lt;h4&gt;Also this week&lt;/h4&gt;
&lt;p&gt;My other projects from this week are already written about on the blog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2021/Nov/3/s3-credentials/"&gt;s3-credentials: a tool for creating credentials for S3 buckets&lt;/a&gt; introduces a new CLI tool I built that automates the process of creating new, long-lived credentials that provide read-only, read-write or write-only access to just a single specified S3 bucket.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2021/Nov/4/publish-open-source-python-library/"&gt;How to build, test and publish an open source Python library&lt;/a&gt; is a detailed write-up of the 10 minute workshop I presented at PyGotham this year showing how to create a Python library, bundle it up as a package using &lt;code&gt;setup.py&lt;/code&gt;, publish it to &lt;a href="https://pypi.org/"&gt;PyPI&lt;/a&gt; and then set up GitHub Actions to test and publish future releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.4"&gt;0.4&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;4 releases total&lt;/a&gt;) - 2021-11-04
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-notebook"&gt;datasette-notebook&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-notebook/releases/tag/0.2a0"&gt;0.2a0&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-notebook/releases"&gt;4 releases total&lt;/a&gt;) - 2021-11-02
&lt;br /&gt;A markdown wiki and dashboarding system for Datasette&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-jupyterlite"&gt;datasette-jupyterlite&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-jupyterlite/releases/tag/0.1a0"&gt;0.1a0&lt;/a&gt; - 2021-11-01
&lt;br /&gt;JupyterLite as a Datasette plugin&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/pytest-recording-vcr"&gt;Using VCR and pytest with pytest-recording&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/pytest-mock-calls"&gt;Quick and dirty mock testing with mock_calls&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pyodide"&gt;pyodide&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="projects"/><category term="jupyter"/><category term="datasette"/><category term="webassembly"/><category term="weeknotes"/><category term="pyodide"/></entry><entry><title>Proof of concept: sqlite_utils magic for Jupyter</title><link href="https://simonwillison.net/2020/Oct/21/magic/#atom-tag" rel="alternate"/><published>2020-10-21T17:26:43+00:00</published><updated>2020-10-21T17:26:43+00:00</updated><id>https://simonwillison.net/2020/Oct/21/magic/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://gist.github.com/psychemedia/3187bce18ffdd79bf258d6011ec301b3"&gt;Proof of concept: sqlite_utils magic for Jupyter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Tony Hirst has been experimenting with building a Jupyter “magic” that adds special syntax for using sqlite-utils to insert data and run queries. Query results come back as a Pandas DataFrame, which Jupyter then displays as a table.
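&lt;p&gt;Stripped of the magic registration itself, the underlying behaviour is roughly this (a sketch using sqlite3 and pandas directly, not Tony's actual code):&lt;/p&gt;

```python
import sqlite3

import pandas as pd

def query_to_df(conn: sqlite3.Connection, sql: str) -> pd.DataFrame:
    """Run a SQL query and hand back a DataFrame - which Jupyter
    then renders as an HTML table."""
    return pd.read_sql_query(sql, conn)

# Demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.executescript(
    "create table stars (repo text, count integer);"
    "insert into stars values ('datasette', 100), ('sqlite-utils', 50);"
)
df = query_to_df(conn, "select * from stars order by count desc")
print(df)
```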

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/psychemedia/status/1318962507650912257"&gt;@psychemedia&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tony-hirst"&gt;tony-hirst&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/p&gt;



</summary><category term="pandas"/><category term="sqlite"/><category term="tony-hirst"/><category term="jupyter"/><category term="sqlite-utils"/></entry><entry><title>Estimating COVID-19's Rt in Real-Time</title><link href="https://simonwillison.net/2020/Apr/20/estimating-covid-19s-rt-real-time/#atom-tag" rel="alternate"/><published>2020-04-20T15:06:53+00:00</published><updated>2020-04-20T15:06:53+00:00</updated><id>https://simonwillison.net/2020/Apr/20/estimating-covid-19s-rt-real-time/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/k-sys/covid-19/blob/master/Realtime%20R0.ipynb"&gt;Estimating COVID-19&amp;#x27;s Rt in Real-Time&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I’m not qualified to comment on the mathematical approach, but this is a really nice example of a Jupyter Notebook explanatory essay by Kevin Systrom.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/covid19"&gt;covid19&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="covid19"/></entry><entry><title>Weeknotes: Datasette 0.39 and many other projects</title><link href="https://simonwillison.net/2020/Mar/25/weeknotes/#atom-tag" rel="alternate"/><published>2020-03-25T05:33:19+00:00</published><updated>2020-03-25T05:33:19+00:00</updated><id>https://simonwillison.net/2020/Mar/25/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;This week's theme: Well, I'm not going anywhere. So a ton of progress to report on various projects.&lt;/p&gt;

&lt;h3 id="weeknotes-datasette-0-39"&gt;Datasette 0.39&lt;/h3&gt;

&lt;p&gt;This evening I shipped &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-39"&gt;Datasette 0.39&lt;/a&gt;. The two big features are a mechanism for setting the default sort order for tables and a new &lt;code&gt;base_url&lt;/code&gt; configuration setting.&lt;/p&gt;

&lt;p&gt;You can see the new default sort order in action &lt;a href="https://covid-19.datasettes.com/covid/daily_reports"&gt;on my Covid-19 project&lt;/a&gt; - the daily reports now default to sort by day descending so the most recent figures show up first. Here's &lt;a href="https://covid-19.datasettes.com/-/metadata.json"&gt;the metadata&lt;/a&gt; that makes it happen, and here's the &lt;a href="https://datasette.readthedocs.io/en/stable/metadata.html#setting-a-default-sort-order"&gt;new documentation&lt;/a&gt;.&lt;/p&gt;
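&lt;p&gt;The relevant fragment of that metadata looks something like this (keys per the linked documentation; database and table names from the Covid project):&lt;/p&gt;

```json
{
    "databases": {
        "covid": {
            "tables": {
                "daily_reports": {
                    "sort_desc": "day"
                }
            }
        }
    }
}
```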

&lt;p&gt;I had to do some extra work on that project this morning when the underlying data &lt;a href="https://github.com/simonw/covid-19-datasette/issues/4"&gt;changed its CSV column headings&lt;/a&gt; without warning.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;base_url&lt;/code&gt; feature has been &lt;a href="https://github.com/simonw/datasette/issues/394"&gt;an open issue&lt;/a&gt; since January 2019. It lets you run Datasette behind a proxy on a different URL prefix - &lt;code&gt;/tools/datasette/&lt;/code&gt; for example. The trigger for finally getting this solved was &lt;a href="https://twitter.com/betatim/status/1242217777282285572"&gt;a Twitter conversation&lt;/a&gt; about running Datasette on Binder in coordination with a Jupyter notebook.&lt;/p&gt;

&lt;p&gt;Tony Hirst &lt;a href="https://github.com/psychemedia/jupyterserverproxy-datasette-demo"&gt;did some work on this&lt;/a&gt; last year, but was stumped by the lack of a &lt;code&gt;base_url&lt;/code&gt; equivalent. Terry Jones &lt;a href="https://github.com/simonw/datasette/pull/652"&gt;shared an implementation&lt;/a&gt; in December. I finally found the inspiration to pull it all together, and ended up with &lt;a href="https://github.com/simonw/jupyterserverproxy-datasette-demo"&gt;a working fork&lt;/a&gt; of Tony's project which does indeed launch Datasette on Binder - &lt;a href="https://mybinder.org/v2/gh/simonw/jupyterserverproxy-datasette-demo/master?urlpath=datasette"&gt;try launching your own here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="weeknotes-github-to-sqlite"&gt;github-to-sqlite&lt;/h3&gt;

&lt;p&gt;I've not done much work on my &lt;a href="https://simonwillison.net/tags/dogsheep/"&gt;Dogsheep&lt;/a&gt; family of tools in a while. That changed this week: in particular, I shipped a &lt;a href="https://github.com/dogsheep/github-to-sqlite/releases/tag/1.0"&gt;1.0&lt;/a&gt; of &lt;a href="https://github.com/dogsheep/github-to-sqlite"&gt;github-to-sqlite&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As you might expect, it's a tool for importing GitHub data into a SQLite database. Today it can handle repositories, releases, release assets, commits, issues and issue comments. You can see a live demo built from &lt;a href="https://github.com/dogsheep"&gt;Dogsheep organization&lt;/a&gt; data at &lt;a href="https://github-to-sqlite.dogsheep.net/"&gt;github-to-sqlite.dogsheep.net&lt;/a&gt; (deployed by &lt;a href="https://github.com/dogsheep/github-to-sqlite/blob/master/.github/workflows/deploy-demo.yml"&gt;this GitHub action&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;I built this tool primarily to help me better keep track of all of my projects. Pulling the issues into a single database means I can run queries against all open issues across all of my repositories, and importing commits and releases is handy for when I want to write my weeknotes and need to figure out what I've worked on lately.&lt;/p&gt;
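&lt;p&gt;The kind of cross-repository query this unlocks looks something like the following - sketched here against a simplified stand-in schema (the real github-to-sqlite column names may differ):&lt;/p&gt;

```python
import sqlite3

# Simplified stand-in for a github-to-sqlite database
# (hypothetical column names - check the real schema before reusing)
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table repos (id integer primary key, full_name text);
create table issues (id integer primary key,
                     repo integer references repos(id),
                     title text, state text);
insert into repos values (1, 'simonw/datasette'), (2, 'simonw/sqlite-utils');
insert into issues values
    (10, 1, 'Fix sort order', 'open'),
    (11, 1, 'Docs typo', 'closed'),
    (12, 2, 'Add --flag', 'open');
""")

# All open issues across every repository in a single query
open_issues = conn.execute("""
select repos.full_name, issues.title
from issues join repos on issues.repo = repos.id
where issues.state = 'open'
order by repos.full_name
""").fetchall()

for repo, title in open_issues:
    print(f"{repo}: {title}")
```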

&lt;h3 id="weeknotes-datasette-render-markdown"&gt;datasette-render-markdown&lt;/h3&gt;

&lt;p&gt;GitHub issues use Markdown. To correctly display them it's useful to be able to render that Markdown. I built &lt;a href="https://github.com/simonw/datasette-render-markdown"&gt;datasette-render-markdown&lt;/a&gt; back &lt;a href="https://simonwillison.net/2019/Nov/11/weeknotes-8/#datasetterendermarkdown_81"&gt;in November&lt;/a&gt;, but this week I made some substantial upgrades: you can now &lt;a href="https://github.com/simonw/datasette-render-markdown/blob/1.1.1/README.md#usage"&gt;configure which columns should be rendered&lt;/a&gt;, and it includes &lt;a href="https://github.com/simonw/datasette-render-markdown/blob/1.1.1/README.md#markdown-extensions"&gt;support for Markdown extensions&lt;/a&gt; including GitHub-Flavored Markdown.&lt;/p&gt;
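&lt;p&gt;The column configuration lives in plugin metadata - something along these lines (check the linked README for the exact keys):&lt;/p&gt;

```json
{
    "plugins": {
        "datasette-render-markdown": {
            "columns": ["body"]
        }
    }
}
```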

&lt;p&gt;You can see it in action on &lt;a href="https://github-to-sqlite.dogsheep.net/github/issues"&gt;the github-to-sqlite demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I also upgraded &lt;a href="https://github.com/simonw/datasette-render-timestamps"&gt;datasette-render-timestamps&lt;/a&gt; with the same explicit column configuration pattern.&lt;/p&gt;

&lt;h3 id="weeknotes-datasette-publish-fly"&gt;datasette-publish-fly&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt; is a relatively new hosting provider which lets you host applications bundled as Docker containers in load-balanced data centers geographically close to your users.&lt;/p&gt;

&lt;p&gt;It has a couple of characteristics that make it a really good fit for Datasette.&lt;/p&gt;

&lt;p&gt;Firstly, the &lt;a href="https://fly.io/docs/pricing/"&gt;pricing model&lt;/a&gt;: Fly will currently host a tiny (128MB of RAM) container for $2.67/month - and they give you $10/month of free service credit, enough for 3 containers.&lt;/p&gt;

&lt;p&gt;It turns out Datasette runs just fine in 128MB of RAM, so that's three always-on Datasette containers! (Unlike Heroku and Cloud Run, Fly keeps your containers running rather than scaling them to zero).&lt;/p&gt;

&lt;p&gt;Secondly, deployment works by shipping Fly a Dockerfile. This means building &lt;a href="https://datasette.readthedocs.io/en/stable/publish.html"&gt;datasette publish&lt;/a&gt; support for it is really easy.&lt;/p&gt;

&lt;p&gt;I added the &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#publish-subcommand-publish"&gt;publish_subcommand&lt;/a&gt; plugin hook to Datasette all the way back in &lt;a href="https://datasette.readthedocs.io/en/stable/changelog.html#v0-25"&gt;0.25&lt;/a&gt; in September 2018, but I've never actually built anything with it. That's now changed: &lt;a href="https://github.com/simonw/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; uses the hook to add a &lt;code&gt;datasette publish fly&lt;/code&gt; command for publishing databases directly to your Fly account.&lt;/p&gt;

&lt;h3 id="weeknotes-hacker-news-to-sqlite"&gt;hacker-news-to-sqlite&lt;/h3&gt;

&lt;p&gt;It turns out I created my &lt;a href="https://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt; account in 2007, and I've posted 2,167 comments and submitted 131 stories since then. Since my personal Dogsheep project is about pulling my data from multiple sources into a single place it made sense to build a tool for importing from Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dogsheep/hacker-news-to-sqlite"&gt;hacker-news-to-sqlite&lt;/a&gt; uses the official &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News API&lt;/a&gt; to import every comment and story posted by a specific user. It can also use one or more item IDs to suck the entire discussion tree around those items.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/dogsheep/hacker-news-to-sqlite/blob/c8697b3e4ef044412209b52c70548fedbcb346c7/README.md#browsing-your-data-with-datasette"&gt;README&lt;/a&gt; includes detailed documentation on how to best browse your data using Datasette once you have imported it.&lt;/p&gt;

&lt;h3 id="weeknotes-other-projects"&gt;Other projects&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt; gained some improvements to the way it suggests types for existing columns.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/dogsheep/twitter-to-sqlite"&gt;twitter-to-sqlite&lt;/a&gt; now offers &lt;code&gt;--sql&lt;/code&gt; and &lt;code&gt;--attach&lt;/code&gt; for more of its subcommands.&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/simonw/datasette-show-errors"&gt;datasette-show-errors&lt;/a&gt; is a new plugin which exposes 500 errors as tracebacks, like Django does with &lt;code&gt;DEBUG=True&lt;/code&gt;. It's built on top of &lt;a href="https://www.starlette.io/middleware/"&gt;Starlette's ServerErrorMiddleware&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I upgraded &lt;a href="https://github.com/dogsheep/inaturalist-to-sqlite"&gt;inaturalist-to-sqlite&lt;/a&gt; to work with &lt;code&gt;sqlite-utils&lt;/code&gt; 2.x.&lt;/li&gt;&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dogsheep"&gt;dogsheep&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="projects"/><category term="sqlite"/><category term="markdown"/><category term="jupyter"/><category term="datasette"/><category term="dogsheep"/><category term="weeknotes"/><category term="fly"/></entry><entry><title>selenium-demoscraper</title><link href="https://simonwillison.net/2019/Nov/4/selenium-demoscraper/#atom-tag" rel="alternate"/><published>2019-11-04T15:05:38+00:00</published><updated>2019-11-04T15:05:38+00:00</updated><id>https://simonwillison.net/2019/Nov/4/selenium-demoscraper/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/psychemedia/selenium-demoscraper"&gt;selenium-demoscraper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really useful minimal example of a Binder project. Click the button to launch a Jupyter notebook in Binder that can take screenshots of URLs using Selenium-controlled headless Firefox. The binder/ folder uses an apt.txt file to install Firefox, requirements.txt to get some Python dependencies and a postBuild Python script to download the Gecko Selenium driver.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/psychemedia/status/1191360534832205824"&gt;Tony Hirst&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/firefox"&gt;firefox&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/selenium"&gt;selenium&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tony-hirst"&gt;tony-hirst&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="firefox"/><category term="selenium"/><category term="tony-hirst"/><category term="jupyter"/></entry><entry><title>Los Angeles Weedmaps analysis</title><link href="https://simonwillison.net/2019/May/30/los-angeles-weedmaps-analysis/#atom-tag" rel="alternate"/><published>2019-05-30T04:35:42+00:00</published><updated>2019-05-30T04:35:42+00:00</updated><id>https://simonwillison.net/2019/May/30/los-angeles-weedmaps-analysis/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nbviewer.jupyter.org/github/datadesk/la-weedmaps-analysis/blob/master/notebook.ipynb"&gt;Los Angeles Weedmaps analysis&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ben Welsh at the LA Times published this Jupyter notebook showing the full working behind a story they published about LA’s black market weed dispensaries. I picked up several useful tricks from it—including how to load points into a geopandas GeoDataFrame (in epsg:4326 aka WGS 84) and how to then join that against the LA Times neighborhoods GeoJSON boundaries file.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/palewire/status/1133723284116086784"&gt;Ben Welsh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/geospatial"&gt;geospatial&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/latimes"&gt;latimes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pandas"&gt;pandas&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="geospatial"/><category term="latimes"/><category term="pandas"/><category term="jupyter"/><category term="ben-welsh"/></entry><entry><title>Fast Autocomplete Search for Your Website</title><link href="https://simonwillison.net/2018/Dec/19/fast-autocomplete-search-your-website/#atom-tag" rel="alternate"/><published>2018-12-19T00:26:32+00:00</published><updated>2018-12-19T00:26:32+00:00</updated><id>https://simonwillison.net/2018/Dec/19/fast-autocomplete-search-your-website/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://24ways.org/2018/fast-autocomplete-search-for-your-website/"&gt;Fast Autocomplete Search for Your Website&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I wrote a tutorial for the 24 ways advent calendar on building fast autocomplete search for a website on top of Datasette and SQLite. I built the demo against 24 ways itself—I used wget to recursively fetch all 330 articles as HTML, then wrote code in a Jupyter notebook to extract the raw data from them (with BeautifulSoup) and load them into SQLite using my sqlite-utils Python library. I deployed the resulting database using Datasette, then wrote some vanilla JavaScript to implement autocomplete using fast SQL queries against the Datasette JSON API.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1075181158327934976"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/24-ways"&gt;24-ways&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/autocomplete"&gt;autocomplete&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/beautifulsoup"&gt;beautifulsoup&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite"&gt;sqlite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="24-ways"/><category term="autocomplete"/><category term="beautifulsoup"/><category term="search"/><category term="sqlite"/><category term="jupyter"/><category term="datasette"/></entry><entry><title>The _repr_html_ method in Jupyter notebooks</title><link href="https://simonwillison.net/2018/Dec/12/repr-html-jupyter/#atom-tag" rel="alternate"/><published>2018-12-12T18:09:23+00:00</published><updated>2018-12-12T18:09:23+00:00</updated><id>https://simonwillison.net/2018/Dec/12/repr-html-jupyter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://ipython.readthedocs.io/en/stable/config/integrating.html#rich-display"&gt;The _repr_html_ method in Jupyter notebooks&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Today I learned that if you add a _repr_html_ method returning a string of HTML to any Python class, Jupyter notebooks will render that HTML inline to represent that object.
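&lt;p&gt;A minimal example (the class here is made up for illustration):&lt;/p&gt;

```python
class Color:
    """Jupyter will call _repr_html_ and render the returned HTML
    in place of the plain repr()."""

    def __init__(self, name, hex_code):
        self.name = name
        self.hex_code = hex_code

    def _repr_html_(self):
        return f'<strong style="color: {self.hex_code}">{self.name}</strong>'

# Outside a notebook you can call it directly to see the HTML
# Jupyter would render:
print(Color("rebeccapurple", "#663399")._repr_html_())
```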

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://gist.github.com/ontouchstart//ea1631f69e507a81a9d9ec56b79e4d11"&gt;A Gist by Sam Liu&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/></entry><entry><title>repo2docker</title><link href="https://simonwillison.net/2018/Nov/28/repo2docker/#atom-tag" rel="alternate"/><published>2018-11-28T22:06:42+00:00</published><updated>2018-11-28T22:06:42+00:00</updated><id>https://simonwillison.net/2018/Nov/28/repo2docker/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jupyter/repo2docker"&gt;repo2docker&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Neat tool from the Jupyter project team: run “jupyter-repo2docker https://github.com/norvig/pytudes” and it will pull a GitHub repository, create a new Docker container for it, install Jupyter and launch a Jupyter instance for you to start trying out the library. I’ve been doing this by hand using virtual environments, but using Docker for even cleaner isolation seems like a smart improvement.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/WillingCarol/status/1067893024821321737"&gt;Carol Willing&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/docker"&gt;docker&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="docker"/><category term="jupyter"/></entry><entry><title>Helicopter accident analysis notebook</title><link href="https://simonwillison.net/2018/Nov/19/helicopter-accident-analysis/#atom-tag" rel="alternate"/><published>2018-11-19T18:25:07+00:00</published><updated>2018-11-19T18:25:07+00:00</updated><id>https://simonwillison.net/2018/Nov/19/helicopter-accident-analysis/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://nbviewer.jupyter.org/github/datadesk/helicopter-accident-analysis/blob/master/notebook.ipynb"&gt;Helicopter accident analysis notebook&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Ben Welsh worked on an article for the LA Times about helicopter accident rates, and has published the underlying analysis as an extremely detailed Jupyter notebook. Lots of neat new (to me) notebook tricks in here as well.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/palewire/status/1064189967289659392"&gt;Ben Welsh&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/data-journalism"&gt;data-journalism&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ben-welsh"&gt;ben-welsh&lt;/a&gt;&lt;/p&gt;



</summary><category term="data-journalism"/><category term="jupyter"/><category term="ben-welsh"/></entry><entry><title>Tracking Jupyter: Newsletter, the Third...</title><link href="https://simonwillison.net/2018/Nov/9/tracking-jupyter-newsletter/#atom-tag" rel="alternate"/><published>2018-11-09T17:42:28+00:00</published><updated>2018-11-09T17:42:28+00:00</updated><id>https://simonwillison.net/2018/Nov/9/tracking-jupyter-newsletter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tinyletter.com/TrackingJupyter/letters/tracking-jupyter-newsletter-the-third"&gt;Tracking Jupyter: Newsletter, the Third...&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Tony Hirst’s Tracking Jupyter newsletter is fantastic. The Jupyter ecosystem is incredibly exciting and fast moving at the moment as more and more groups discover how productive it is, and Tony’s newsletter is a wealth of information on what’s going on out there.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/mrchrisadams/status/1060861087594307585"&gt;Chris Adams&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/tony-hirst"&gt;tony-hirst&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="tony-hirst"/><category term="jupyter"/></entry><entry><title>Computational and Inferential Thinking: The Foundations of Data Science</title><link href="https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag" rel="alternate"/><published>2018-08-25T22:13:51+00:00</published><updated>2018-08-25T22:13:51+00:00</updated><id>https://simonwillison.net/2018/Aug/25/computational-and-inferential-thinking-foundations-data-science/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.inferentialthinking.com/"&gt;Computational and Inferential Thinking: The Foundations of Data Science&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Free online textbook written for the UC Berkeley Foundations of Data Science class. The examples are all provided as Jupyter notebooks, using the mybinder web application to allow students to launch interactive notebooks for any of the examples without having to install any software on their own machines.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/education"&gt;education&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/data-science"&gt;data-science&lt;/a&gt;&lt;/p&gt;



</summary><category term="education"/><category term="jupyter"/><category term="data-science"/></entry><entry><title>The Future of Notebooks: Lessons from JupyterCon</title><link href="https://simonwillison.net/2018/Aug/25/future-notebooks-lessons-jupytercon/#atom-tag" rel="alternate"/><published>2018-08-25T21:55:20+00:00</published><updated>2018-08-25T21:55:20+00:00</updated><id>https://simonwillison.net/2018/Aug/25/future-notebooks-lessons-jupytercon/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://willcrichton.net/notes/lessons-from-jupytercon/"&gt;The Future of Notebooks: Lessons from JupyterCon&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It sounds like reactive notebooks (where cells keep track of their dependencies on other cells and re-evaluate when those update) were a hot topic at JupyterCon this year.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=17839188"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/></entry><entry><title>Quoting Jake VanderPlas</title><link href="https://simonwillison.net/2018/Aug/25/jake-vanderplas/#atom-tag" rel="alternate"/><published>2018-08-25T03:16:26+00:00</published><updated>2018-08-25T03:16:26+00:00</updated><id>https://simonwillison.net/2018/Aug/25/jake-vanderplas/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/jakevdp/status/1032810402227187712"&gt;&lt;p&gt;In case you missed it: @GoogleColab can open any @ProjectJupyter notebook directly from @github!&lt;/p&gt;
&lt;p&gt;To run the notebook, just replace "github.com" with "colab.research.google.com/github/" in the notebook URL, and it will be loaded into Colab.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/jakevdp/status/1032810402227187712"&gt;Jake VanderPlas&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="github"/><category term="jupyter"/></entry><entry><title>I don't like Jupyter Notebooks - a presentation by Joel Grus</title><link href="https://simonwillison.net/2018/Aug/25/i-dont-jupyter-notebooks/#atom-tag" rel="alternate"/><published>2018-08-25T03:04:45+00:00</published><updated>2018-08-25T03:04:45+00:00</updated><id>https://simonwillison.net/2018/Aug/25/i-dont-jupyter-notebooks/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit"&gt;I don&amp;#x27;t like Jupyter Notebooks - a presentation by Joel Grus&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fascinating talk by Joel Grus at the Jupyter conference in New York. He highlights some of the drawbacks of the Jupyter way of working, including the huge confusion that can come from the ability to execute cells out of order (something Observable notebooks solve brilliantly using spreadsheet-style reactive cell associations). He also makes strong arguments that notebooks encourage a way of working that discourages people from producing stable, repeatable and well tested code.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/dehora/status/1033117100703997952"&gt;@dehora&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/observable"&gt;observable&lt;/a&gt;&lt;/p&gt;



</summary><category term="jupyter"/><category term="observable"/></entry><entry><title>Beyond Interactive: Notebook Innovation at Netflix</title><link href="https://simonwillison.net/2018/Aug/18/netflix-jupyter/#atom-tag" rel="alternate"/><published>2018-08-18T17:55:42+00:00</published><updated>2018-08-18T17:55:42+00:00</updated><id>https://simonwillison.net/2018/Aug/18/netflix-jupyter/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;Beyond Interactive: Notebook Innovation at Netflix&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Netflix have been investing heavily in their internal Jupyter notebooks infrastructure: it’s now the most popular tool for working with data at Netflix. They also use parameterized notebooks to make it easy to create templates for reusable operations, and scheduled notebooks for recurring tasks. “When a Spark or Presto job executes from the scheduler, the source code is injected into a newly-created notebook and executed. That notebook then becomes an immutable historical record, containing all related artifacts — including source code, parameters, runtime config, execution logs, error messages, and so on.”


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/netflix"&gt;netflix&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="netflix"/><category term="jupyter"/></entry><entry><title>Quoting Netflix Technology Blog</title><link href="https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag" rel="alternate"/><published>2018-08-18T17:35:45+00:00</published><updated>2018-08-18T17:35:45+00:00</updated><id>https://simonwillison.net/2018/Aug/18/netflix-technology-blog/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;&lt;p&gt;Every day more than 1 trillion events are written into a streaming ingestion pipeline, which is processed and written to a 100PB cloud-native data warehouse. And every day, our users run more than 150,000 jobs against this data, spanning everything from reporting and analysis to machine learning and recommendation algorithms.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://medium.com/netflix-techblog/notebook-innovation-591ee3221233"&gt;Netflix Technology Blog&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/netflix"&gt;netflix&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="netflix"/><category term="big-data"/><category term="jupyter"/></entry><entry><title>Quoting Chris Rogers</title><link href="https://simonwillison.net/2018/Jun/5/chris-rogers/#atom-tag" rel="alternate"/><published>2018-06-05T19:37:10+00:00</published><updated>2018-06-05T19:37:10+00:00</updated><id>https://simonwillison.net/2018/Jun/5/chris-rogers/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://news.ycombinator.com/item?id=17237318"&gt;&lt;p&gt;At Harvard we've built out an infrastructure to allow us to deploy JupyterHub to courses with authentication managed by Canvas. It has allowed us to easily deploy complex set-ups to students so they can do really cool stuff without having to spend hours walking them through setup. Instructors are writing their lectures as IPython notebooks, and distributing them to students, who then work through them in their JupyterHub environment. Our most ambitious so far has been setting up each student in the course with a p2.xlarge machine with cuda and TensorFlow so they could do deep learning work for their final projects. We supported 15 courses last year, and got deployment time for an implementation down to only 2-3 hours.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://news.ycombinator.com/item?id=17237318"&gt;Chris Rogers&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/education"&gt;education&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;



</summary><category term="education"/><category term="python"/><category term="jupyter"/></entry></feed>