Simon Willison's Weblog: restructuredtext

The subset of reStructuredText worth committing to memory

2018-08-25T18:44:29+00:00

reStructuredText is the standard for documentation in the Python world.

It’s a bit weird. It’s like Markdown but older, more feature-filled and in my experience significantly harder to remember.

There are plenty of guides and cheatsheets out there, but when writing simple documentation for software projects I think there’s a subset that is worth committing to memory. I’ll describe that subset here.

First though: when writing reStructuredText having a live preview render is extremely useful. I use rst.ninjs.org for this. If you don’t trust that hosted version (it round-trips your documentation through the server in order to render it) you can run a local copy instead using the underlying source code.

Paragraphs

Paragraphs work the same way as Markdown and plain text. They are nice and easy.

This is the first paragraph. No need to wrap the text (though you can wrap at e.g. 80 characters without affecting rendering).

This is the second paragraph.

Headings

reStructuredText section headings are a little surprising.

Markdown has multiple levels of heading, each with a different number of prefix hashes:

# Markdown heading level 1
## Markdown heading level 2
..
###### Markdown heading level 6

In reStructuredText there is no single format for these different levels. Instead, the format you use first will be treated as an H1, the next format as an H2 and so on. Here’s the description from the official documentation:

Sections are identified through their titles, which are marked up with adornment: “underlines” below the title text, or underlines and matching “overlines” above the title. An underline/overline is a single repeated punctuation character that begins in column 1 and forms a line extending at least as far as the right edge of the title text. Specifically, an underline/overline character may be any non-alphanumeric printable 7-bit ASCII character. […] There may be any number of levels of section titles, although some output formats may have limits (HTML has 6 levels).

This is deeply confusing. I suggest instead standardizing on the following:

=====================
 This is a heading 1
=====================

This heading has = signs both above and below, and they extend past the text by a single character in each direction.

This is a heading 2
===================

This is a heading 3
-------------------

This is a heading 4
~~~~~~~~~~~~~~~~~~~

If you need more levels, you can invent them using whatever character you like - but try to stay consistent within your project.
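
If you generate documentation from scripts, the convention above is easy to automate. Here's a minimal sketch, assuming the four levels suggested in this post; the `rst_heading` helper is a hypothetical name and not part of docutils or Sphinx:

```python
def rst_heading(title: str, level: int) -> str:
    """Return `title` marked up as a heading using the convention above."""
    if level == 1:
        # Level 1: overline and underline of = signs, one character
        # wider than the title on each side, with the title indented.
        bar = "=" * (len(title) + 2)
        return f"{bar}\n {title}\n{bar}"
    underline_chars = {2: "=", 3: "-", 4: "~"}
    if level not in underline_chars:
        raise ValueError("this sketch only covers heading levels 1 to 4")
    # Levels 2 to 4: underline only, exactly as long as the title.
    return f"{title}\n{underline_chars[level] * len(title)}"


print(rst_heading("This is a heading 1", 1))
print()
print(rst_heading("This is a heading 3", 3))
```

The underline is always at least as long as the title, which is the one hard requirement in the specification quoted above.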

Bulleted lists

As with headings, you can use a variety of characters for these. I suggest sticking with asterisks.

A blank line is required before starting a bulleted list.

* A bullet point
* Another bullet point

If you decide to wrap your text (I tend not to) you must maintain the indentation on the wrapped lines:

* A bulleted list item. Since the text is wrapped each subsequent
  line of text must be indented by two spaces.
* Second list item.

Nested lists are supported, but you MUST leave a blank line above the first inner list bullet point or they won't work:

* This is the first bullet list item. Here comes a sub-list:

  * Hello sublist
  * Sublist two

* Back to the parent list.

Inline markup

I only use three inline markup features: bold, italic and code.

**Bold text** is surrounded by two asterisks.

*Italic text* is surrounded by single asterisks.

``inline code`` uses two backticks at either side of the code.

Links

Links are my least favorite feature of reStructuredText. There are several different ways of including them, but the one I use most often (and hence have committed to memory) is this one:

`a link, note the trailing underscores <http://example.com>`__

So that’s a backtick at the start, then the link text, then the URL contained in angle brackets (less than / greater than symbols), then another backtick and then TWO underscores to finish it off.

Why two underscores? Because if you only use one, the text part of the link is remembered and can be used to duplicate your link later on - see example below. In my experience this is more trouble than it’s worth.

A more complex link syntax example (documented here) looks like this:

See the `Python home page`_ for info.

This link_ is an alias to the link above.

.. _Python home page: http://www.python.org
.. _link: `Python home page`_

I can’t remember this at all, so I stick with the anonymous hyperlink syntax instead.

Code blocks

The easiest way to embed a block of code is like this:

::

    # This is a code example
    print("It needs to be indented")

The :: indicates that a code block is coming up. The blank line after the :: before the indentation starts is required.

Most renderers have the ability to apply syntax highlighting. To specify that a block should have syntax highlighting for a specific language, replace the :: in the above example with one of the following:

.. code-block:: sql

.. code-block:: javascript

.. code-block:: python

Images

There are plenty of options for embedding images, but the most basic syntax (worth remembering) looks like this:

.. image:: full_text_search.png
   :alt: alternate text

This will embed an image of that filename that sits in the same directory as the document itself.

Internal references

In my opinion this is the key feature that makes reStructuredText more powerful than Markdown for larger documentation projects.

Again, there is a vast and complex array of options around this, but the key thing to remember is how to add a reference name to a specific section and how to link to that section later on.

Names are applied to section headings, by adding some magic text before the heading itself. For example:

.. _full_text_search:

Full-text search
================

Note the format: two periods, then a space, then an underscore, then the label, then a colon at the end.

The label full_text_search is now associated with that heading. I can link to it from any page in my documentation project like so:

:ref:`full_text_search`

Note that the leading underscore isn’t included in this reference.

The link text displayed will be the text of the heading, in this case “Full-text search”. If I want to replace that link text with something custom, I can do so like this:

Learn about the :ref:`search feature <full_text_search>`.

This syntax is similar to the inline hyperlink syntax described above.

Learning more

I extracted the patterns I describe in this post from the Datasette documentation - I encourage you to dig around in the source code to see how it all works.

The definitive guide to reStructuredText is the reStructuredText Markup Specification. My favourite of the various quick references is the Restructured Text (reST) and Sphinx CheatSheet by Thomas Cokelaer.

I'm a huge fan of Read the Docs for hosting documentation - it's the key reason I use reStructuredText in my projects. Unsurprisingly, they offer extensive documentation to help you make the most of their platform.

Tags: documentation, python, restructuredtext, sphinx-docs, read-the-docs

Documentation unit tests

2018-07-28T15:59:55+00:00

Or: Test-driven documentation.

Keeping documentation synchronized with an evolving codebase is difficult. Without extreme discipline, it’s easy for documentation to get out-of-date as new features are added.

One thing that can help is keeping the documentation for a project in the same repository as the code itself. This allows you to construct the ideal commit: one that includes the code change, the updated unit tests AND the accompanying documentation all in the same unit of work.

When combined with a code review system (like Phabricator or GitHub pull requests) this pattern lets you enforce documentation updates as part of the review process: if a change doesn’t update the relevant documentation, point that out in your review!

Good code review systems also execute unit tests automatically and attach the results to the review. This provides an opportunity to have the tests enforce other aspects of the codebase: for example, running a linter so that no-one has to waste their time arguing over standard coding style.

I’ve been experimenting with using unit tests to ensure that aspects of a project are covered by the documentation. I think it’s a very promising technique.

Introspect the code, introspect the docs

The key to this trick is introspection: interrogating the code to figure out what needs to be documented, then parsing the documentation to see if each item has been covered.

I’ll use my Datasette project as an example. Datasette’s test_docs.py module contains three relevant tests:

test_config_options_are_documented checks that every one of Datasette’s configuration options is documented.
test_plugin_hooks_are_documented ensures all of the plugin hooks (powered by pluggy) are covered in the plugin documentation.
test_view_classes_are_documented iterates through all of the *View classes (corresponding to pages in the Datasette user interface) and makes sure they are covered.

In each case, the test uses introspection against the relevant code areas to figure out what needs to be documented, then runs a regular expression against the documentation to make sure it is mentioned in the correct place.

Obviously the tests can’t confirm the quality of the documentation, so they are easy to cheat: but they do at least protect against adding a new option but forgetting to document it.

Testing that Datasette’s view classes are covered

Datasette’s view classes use a naming convention: they all end in View. The current list of view classes is DatabaseView, TableView, RowView, IndexView and JsonDataView.

Since these classes are all imported into the datasette.app module (in order to be hooked up to URL routes) the easiest way to introspect them is to import that module, then run dir(app) and grab any class names that end in View. We can do that with a Python list comprehension:

from datasette import app
views = [v for v in dir(app) if v.endswith("View")]
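
One caveat with dir() is that it returns every name in the module, so a function or constant whose name happened to end in View would be picked up too. Here's a slightly stricter sketch that only keeps classes. It uses a stand-in module called `fake_app` (a hypothetical placeholder, so the example runs without Datasette installed):

```python
import inspect
import types

# Stand-in for datasette.app: a throwaway module object with two
# view classes and one non-class attribute whose name ends in "View".
fake_app = types.ModuleType("fake_app")


class TableView:
    pass


class RowView:
    pass


fake_app.TableView = TableView
fake_app.RowView = RowView
fake_app.render_View = lambda: None  # not a class, should be ignored


def view_class_names(module):
    """Names of classes in `module` that follow the *View convention."""
    return [
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if name.endswith("View")
    ]


print(view_class_names(fake_app))
```

inspect.getmembers() returns its results sorted by name, so the order is stable between runs.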

I’m using reStructuredText labels to mark the place in the documentation that addresses each of these classes. This also ensures that each documentation section can be linked to, for example:

http://datasette.readthedocs.io/en/latest/pages.html#tableview

The reStructuredText syntax for that label looks like this:

.. _TableView:

Table
=====

The table page is the heart of Datasette...

We can extract these labels using a regular expression:

from pathlib import Path
import re

docs_path = Path(__file__).parent.parent / 'docs'
label_re = re.compile(r'\.\. _([^\s:]+):')

def get_labels(filename):
    contents = (docs_path / filename).read_text()
    return set(label_re.findall(contents))

Since Datasette’s documentation is spread across multiple *.rst files, and I want the freedom to document a view class in any one of them, I iterate through every file to find the labels and pull out the ones ending in View:

def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
                view_labels.add(first_word)
    return view_labels

We now have a set of class names and a set of labels across all of our documentation. Writing a basic unit test comparing the two is trivial:

def test_view_documentation():
    view_labels = documented_views()
    view_classes = set(v for v in dir(app) if v.endswith("View"))
    assert view_labels == view_classes
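
To see the label extraction in isolation, here's a self-contained sketch that runs the same regular expression against a temporary docs directory. The file names and the `RowView_expand` label are made up for illustration, and `documented_views` takes the directory as an argument here so the example doesn't depend on Datasette's layout:

```python
import re
import tempfile
from pathlib import Path

label_re = re.compile(r"\.\. _([^\s:]+):")


def documented_views(docs_dir):
    """Collect the first word of every label ending in View under docs_dir."""
    view_labels = set()
    for path in Path(docs_dir).glob("*.rst"):
        for label in label_re.findall(path.read_text()):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
                view_labels.add(first_word)
    return view_labels


with tempfile.TemporaryDirectory() as tmp:
    # Two tiny documentation files with three labels between them.
    Path(tmp, "pages.rst").write_text(
        ".. _TableView:\n\nTable\n=====\n\n"
        ".. _RowView_expand:\n\nRow\n===\n"
    )
    Path(tmp, "search.rst").write_text(
        ".. _full_text_search:\n\nFull-text search\n================\n"
    )
    found = documented_views(tmp)

print(sorted(found))
```

The `full_text_search` label is skipped because its first word doesn't end in View, while `RowView_expand` counts as documentation for RowView.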

Taking advantage of pytest

Datasette uses pytest for its unit tests, and documentation unit tests are a great opportunity to take advantage of some advanced pytest features.

Parametrization

The first of these is parametrization: pytest provides a decorator which can be used to execute a single test function multiple times, each time with different arguments.

This example from the pytest documentation shows how parametrization works:

import pytest
@pytest.mark.parametrize("test_input,expected", [
    ("3+5", 8),
    ("2+4", 6),
    ("6*9", 42),
])
def test_eval(test_input, expected):
    assert eval(test_input) == expected

pytest treats this as three separate unit tests, even though they share a single function definition.

We can combine this pattern with our introspection to execute an independent unit test for each of our view classes. Here’s what that looks like:

@pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
def test_view_classes_are_documented(view):
    assert view in documented_views()

Here’s the output from pytest if we execute just this unit test (and one of our classes is undocumented):

$ pytest -k test_view_classes_are_documented -v
=== test session starts ===
collected 249 items / 244 deselected

tests/test_docs.py::test_view_classes_are_documented[DatabaseView] PASSED [ 20%]
tests/test_docs.py::test_view_classes_are_documented[IndexView] PASSED [ 40%]
tests/test_docs.py::test_view_classes_are_documented[JsonDataView] PASSED [ 60%]
tests/test_docs.py::test_view_classes_are_documented[RowView] PASSED [ 80%]
tests/test_docs.py::test_view_classes_are_documented[TableView] FAILED [100%]

=== FAILURES ===

view = 'TableView'

    @pytest.mark.parametrize("view", [v for v in dir(app) if v.endswith("View")])
    def test_view_classes_are_documented(view):
>       assert view in documented_views()
E       AssertionError: assert 'TableView' in {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'}
E        +  where {'DatabaseView', 'IndexView', 'JsonDataView', 'RowView', 'Table2View'} = documented_views()

tests/test_docs.py:77: AssertionError
=== 1 failed, 4 passed, 244 deselected in 1.13 seconds ===

Fixtures

There’s a subtle inefficiency in the above test: for every view class, it calls the documented_views() function - and that function then iterates through every *.rst file in the docs/ directory and uses a regular expression to extract the labels. With 5 view classes and 17 documentation files that’s 85 executions of get_labels(), and that number will only increase as Datasette’s code and documentation grow larger.

We can use pytest’s neat fixtures to reduce this to a single call to documented_views() that is shared across all of the tests. Here’s what that looks like:

@pytest.fixture(scope="session")
def documented_views():
    view_labels = set()
    for filename in docs_path.glob("*.rst"):
        for label in get_labels(filename):
            first_word = label.split("_")[0]
            if first_word.endswith("View"):
                view_labels.add(first_word)
    return view_labels

@pytest.mark.parametrize("view_class", [
    v for v in dir(app) if v.endswith("View")
])
def test_view_classes_are_documented(documented_views, view_class):
    assert view_class in documented_views

Fixtures in pytest are an example of dependency injection: pytest introspects every test_* function and checks if it has a function argument with a name matching something that has been annotated with the @pytest.fixture decorator. If it finds any matching arguments, it executes the matching fixture function and passes its return value in to the test function.

By default, pytest will execute the fixture function once for every test execution. In the above code we use the scope="session" argument to tell pytest that this particular fixture should be executed only once for every pytest command-line execution of the tests, and that single return value should be passed to every matching test.

What if you haven’t documented everything yet?

Adding unit tests to your documentation in this way faces an obvious problem: when you first add the tests, you may have to write a whole lot of documentation before they can all pass.

Having tests that protect against future code being added without documentation is only useful once you’ve added them to the codebase - but blocking that on documenting your existing features could prevent that benefit from ever manifesting itself.

Once again, pytest to the rescue. The @pytest.mark.xfail decorator allows you to mark a test as “expected to fail” - if it fails, pytest will take note but will not fail the entire test suite.

This means you can add deliberately failing tests to your codebase without breaking the build for everyone - perfect for tests that look for documentation that hasn’t yet been written!

I used xfail when I first added view documentation tests to Datasette, then removed it once the documentation was all in place. Any future code in pull requests without documentation will cause a hard test failure.

Here’s what the test output looks like when some of those tests are marked as “expected to fail”:

$ pytest tests/test_docs.py
collected 31 items

tests/test_docs.py ..........................XXXxx.                [100%]

============ 26 passed, 2 xfailed, 3 xpassed in 1.06 seconds ============

Since this reports both the xfailed and the xpassed counts, it shows how much work is still left to be done before the xfail decorator can be safely removed.
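
Here's a sketch of one way to combine xfail with parametrization, so that only the not-yet-documented cases are marked as expected failures. It isn't necessarily how Datasette did it: the view names, the `NOT_YET_DOCUMENTED` set, the `DOCUMENTED_LABELS` set and the `view_params` helper are all hypothetical stand-ins. It relies on pytest.param(), which lets you attach marks to individual parameters:

```python
import pytest

ALL_VIEWS = ["DatabaseView", "IndexView", "RowView", "TableView"]
NOT_YET_DOCUMENTED = {"RowView"}  # shrink this set as docs get written
# Stand-in for the labels that would be extracted from the docs.
DOCUMENTED_LABELS = {"DatabaseView", "IndexView", "TableView"}


def view_params(views, undocumented):
    """Build a parametrize list, marking known gaps as expected failures."""
    params = []
    for view in views:
        if view in undocumented:
            params.append(
                pytest.param(
                    view,
                    marks=pytest.mark.xfail(reason="docs not written yet"),
                )
            )
        else:
            params.append(view)
    return params


@pytest.mark.parametrize(
    "view_class", view_params(ALL_VIEWS, NOT_YET_DOCUMENTED)
)
def test_view_classes_are_documented(view_class):
    assert view_class in DOCUMENTED_LABELS
```

With this arrangement a brand new undocumented view still causes a hard failure, because it won't be in the expected-failure set.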

Structuring code for testable documentation

A benefit of comprehensive unit testing is that it encourages you to design your code in a way that is easy to test. In my experience this leads to much higher code quality in general: it encourages separation of concerns and cleanly decoupled components.

My hope is that documentation unit tests will have a similar effect. I’m already starting to think about ways of restructuring my code such that I can cleanly introspect it for the areas that need to be documented. I’m looking forward to discovering code design patterns that help support this goal.

Tags: design-patterns, documentation, restructuredtext, testing, datasette, pytest

Restructured Text to Anything

2007-09-13T15:54:44+00:00

Slick set of online tools for converting Restructured Text (one of the more mature wiki-style markup languages) to HTML or PDF. Includes a nice looking API. Powered by Django.

Tags: django, html, pdf, python, restructuredtext

A myriad of markup systems

2004-04-13T04:58:54+00:00

It's hard to avoid the legions of custom markup systems out there these days. Every Wiki has its own syntactical quirks, while packages like Markdown, Textile, BBCode (in dozens of variants) and reStructuredText offer easy ways of hooking markup conversion in to existing applications. When it comes to being totally over-implemented and infuriatingly inconsistent, markup systems are rapidly catching up with template packages. Never one to miss out on an opportunity to reinvent the wheel, I've worked on several of each ;)

My most recent markup handling attempt has just been published as part of my SitePoint article on Bookmarklets (cliché). It's a structured markup language in a bookmarklet: activate the bookmarklet to convert the text in any textarea on a page to XHTML. The syntax is ridiculously simple, and serves my limited needs just fine:


= This is a header

Here is a paragraph.

* This is a list of items
* Another item in the list

Converts to:


<h4>This is a header</h4>

<p>Here is a paragraph.</p>

<ul>
 <li>This is a list of items</li>
 <li>Another item in the list</li>
</ul>

The algorithm is simple, and easily portable to any language you care to mention:

Normalise newlines to \n, for cross-platform consistency.
Split the text up on double newlines, to create a list of blocks.
For each block:
1. If it starts with an equals sign, wrap it in header tags.
2. If it starts with an asterisk, split it in to lines, make each a list item (stripping off the asterisk at the start of the line if required) and glue them all together inside a <ul>.
3. Otherwise, wrap it in a <p> tag provided it doesn't have one already.
Glue everything back together again with a couple of newlines, to make the underlying XHTML look pretty.
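
The bookmarklet itself is JavaScript, but the steps above port directly. Here's a minimal Python sketch, assuming headers become <h4> elements (as in the example output above) and that "has one already" means the block starts with a <p> tag; `expand_shorthand` is a hypothetical name:

```python
import re


def expand_shorthand(text: str) -> str:
    """Convert the simple shorthand described above to XHTML."""
    # Normalise newlines to \n for cross-platform consistency.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on blank lines to get a list of blocks.
    blocks = [b.strip() for b in re.split(r"\n{2,}", text) if b.strip()]
    output = []
    for block in blocks:
        if block.startswith("="):
            # Header: drop the leading equals sign(s) and wrap in tags.
            output.append("<h4>" + block.lstrip("=").strip() + "</h4>")
        elif block.startswith("*"):
            # List: one item per line, minus the leading asterisk.
            items = []
            for line in block.split("\n"):
                line = line.strip()
                if line.startswith("*"):
                    line = line[1:].strip()
                items.append(" <li>" + line + "</li>")
            output.append("<ul>\n" + "\n".join(items) + "\n</ul>")
        elif block.lower().startswith("<p"):
            # Already a paragraph: leave it alone.
            output.append(block)
        else:
            output.append("<p>" + block + "</p>")
    # Glue the blocks back together with a couple of newlines.
    return "\n\n".join(output)


sample = (
    "= This is a header\r\n\r\n"
    "Here is a paragraph.\r\n\r\n"
    "* This is a list of items\r\n"
    "* Another item in the list"
)
print(expand_shorthand(sample))
```

Running this on the sample input reproduces the XHTML shown earlier.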

The bookmarklet comes in two flavours: Expand HTML Shorthand (the full version) and Expand HTML Shorthand IE, which loses header support in order to fit within IE's crippling 508 character limit. A more capable bookmarklet could be built using the import-script-stub method described in my article, but the implementation of such a thing is left as an exercise for the reader (I've always wanted to say that).

Incidentally, there's a very common bug in markup systems that allow inline styles that proves extremely difficult to fix: that of improperly nested tags. Say you have a system where *text* is bold and _text_ is italic; what happens when the user enters _italic*italic-bold_bold*? Most systems (and that includes Markdown, Textile and my home-rolled Python solution) use naive regular expressions for inline markup processing and will output badly formed XHTML: <em>italic<strong>italic-bold</em>bold</strong>. To truly solve this problem requires a context-sensitive parser, which involves an unpleasantly large amount of effort to solve what looks like a simple bug.
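
The bug is easy to reproduce. Here's a sketch of the naive approach, with one regular expression per inline style applied in sequence. The choice of <strong> and <em> as the output tags is an assumption made for this example, and `naive_inline` and `is_well_formed` are hypothetical helper names:

```python
import re
import xml.etree.ElementTree as ET


def naive_inline(text: str) -> str:
    """Apply each inline style with its own independent regex pass."""
    # Pass 1: *text* becomes bold.
    text = re.sub(r"\*(.+?)\*", r"<strong>\1</strong>", text)
    # Pass 2: _text_ becomes italic, with no knowledge of pass 1.
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text)
    return text


def is_well_formed(fragment: str) -> bool:
    """Check whether the fragment parses as XML inside a wrapper element."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


result = naive_inline("_italic*italic-bold_bold*")
print(result)
print(is_well_formed(result))
```

Each pass is correct on its own; the overlap only shows up when the two styles are interleaved, which is why a regex-per-style design can't fix it without tracking which tags are currently open.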

Tags: bookmarklets, restructuredtext, markdown