Writing Effective Notebooks

Slide 3 from the slide deck (see link in body)

This page provides advice about writing notebooks (Jupyter, Quarto, Rnotebook, etc.) that are effective and easy to read. I originally developed it for my data science course, and provide a revised version here for reference.

Notebooks are communication tools. It is not enough to create a notebook that contains code to compute correct results; you also need to ensure that your notebooks are well-structured documents that communicate your work and findings.

I provide a checklist at the end of this page.

Video

This video from CS 533 @ Boise State University discusses the material here. It is not a replacement for this document, and talks primarily about Python, but is hopefully a useful supplement if you prefer video content (slides also available):

Standalone Documents

As I discuss in the video, a notebook is first a document — a document that incorporates code and visuals to present data with written explanation.

This means that you need to provide text that walks your reader through the story of your data analysis: what you are doing, why, how, and what we learn. This does not mean you should describe every little detail (too much detail can actually make the notebook harder to read), but you need to provide the context for what the analyses mean and what we can learn from them.

In the context of a course assignment, the solution notebook should be readable even without having read the original assignment description: it should stand alone.

The document also needs to include the code outputs, where applicable — the reader should be able to read it without re-running the code, although the code is there so that they can. Without the output, the notebook does not work as a communicative document.

Complete Runs

The final, exported, submitted version of a notebook should contain the outputs from a complete run: make sure that re-running the notebook from top to bottom (e.g. “Restart and Run All” in Jupyter) works, and that the outputs included for submission are the results of this. This makes sure you submit the notebook in a working state, and that the results still match the code as written. Otherwise, we can get situations where e.g. you change a data file, but don’t rerun everything, and therefore some outputs do not reflect the data file loaded in the data load code.
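One rough way to check for a stale run before submitting is to inspect the execution counts saved in a Jupyter `.ipynb` file: after a clean “Restart and Run All”, the code cells are numbered 1, 2, 3, and so on, in order. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, it assumes that empty code cells receive no execution count, and it applies only to Jupyter notebooks (not Quarto or Rmd sources).

```python
import json
from pathlib import Path


def find_run_order_problems(notebook: dict) -> list[str]:
    """Return messages for code cells whose saved execution count is not
    what a single top-to-bottom run would produce (1, 2, 3, ...)."""
    problems = []
    expected = 1
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # Assumption: empty code cells are never sent to the kernel,
        # so they carry no execution count and are skipped here.
        if not "".join(cell.get("source", "")).strip():
            continue
        count = cell.get("execution_count")
        if count is None:
            problems.append(f"cell {index}: never executed")
            continue
        if count != expected:
            problems.append(
                f"cell {index}: ran as step {count}, expected {expected}"
            )
        # Resynchronize so one out-of-order cell is reported only once.
        expected = count + 1
    return problems


def check_notebook_file(path: str) -> list[str]:
    """Load an .ipynb file (which is JSON) and check its run order."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    return find_run_order_problems(notebook)
```

An empty list means the saved counts are consistent with one complete run. It does not prove the outputs are correct, so re-running the notebook yourself remains the real test.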

Formatting Notebook Text

In the common notebook tools, the text content (text cells in Jupyter, document content in Quarto or Rmd) of a notebook is formatted with Markdown. LaTeX math markup, delimited with $ characters, is also widely supported. For example, this code:

The equation $a^n + b^n = c^n$ does not hold for integers $n>2$.

will yield:

The equation $a^n + b^n = c^n$ does not hold for integers $n>2$.

Notebooks also support block-mode math, delimited with $$ or \[.
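As a small illustration (the sample-mean formula is just an example chosen here, not one from the course), block-mode math in a text cell might look like this:

```latex
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
```

This displays the definition as a centered equation on its own line instead of inline with the surrounding sentence.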

Use Markdown formatting judiciously to highlight important points and make your notebooks easier to read. For example, using code formatting to mark up Python or R functions is often helpful:

The `train_test_split` function from SciKit Learn will helpfully
partition our data for us.

This renders as something like this, except using the notebook tool’s theme instead of my website’s:

The train_test_split function from SciKit Learn will helpfully partition our data for us.

Markdown also supports strong emphasis (bold) and emphasis (italics), which can be very useful for highlighting your argument. In addition to the common Markdown syntax, Jupyter and Quarto both support various extensions, such as ~~strikethrough~~.

Markdown also supports **strong emphasis** (bold) and *emphasis* (italics),
which can be very useful for highlighting your argument.  In addition to the
common Markdown syntax, Jupyter and Quarto both support extensions, such as
~~strikethrough~~.

Pay particular attention to your use of section headings, as discussed in the video. Section headings are indicated with # characters, as in:

# Document Title

## Level 2 heading

Section headings are a crucial tool for structuring your document and making it easier to read. It’s important to note, however, that these have actual meaning: ## does not mean “large bold font”, it means “level 2 heading”. Properly structuring headings makes your document easier to read (recall that a notebook is a document), and also enables tooling that supports navigating the document. JupyterLab and extensions to the notebook server both provide notebook outlines using the section headings, as do RStudio and Visual Studio Code. Section headings should also be short.
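To make the idea of heading structure concrete, here is a minimal sketch that pulls the heading outline out of Markdown text and flags places where a level is skipped (for example, a level 4 heading directly under a level 2 heading). The function names are hypothetical, and it only recognizes `#`-style headings, ignoring anything inside fenced code blocks.

```python
import re

# A '#'-style heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def extract_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for headings, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = HEADING.match(line)
        if match and not in_fence:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def nesting_problems(headings: list[tuple[int, str]]) -> list[str]:
    """Report headings that are more than one level deeper than the last."""
    problems = []
    previous = 0  # level 0 stands for "before the first heading"
    for level, title in headings:
        if level > previous + 1:
            problems.append(
                f"'{title}' jumps from level {previous} to level {level}"
            )
        previous = level
    return problems
```

For a Jupyter notebook, you could join the sources of the Markdown cells and pass the result to `extract_headings`; the outline it returns is roughly what an outline panel would show.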

Finally, it is often helpful to use lists, either numbered or bulleted. For further reference on Markdown features and syntax, see:

CommonMark syntax: the common Markdown syntax supported almost everywhere.
GitHub Flavored Markdown (GFM): a superset of CommonMark, mostly supported by Jupyter.
Jupyter Markdown docs: brief docs on using Markdown in Jupyter.
Pandoc manual: Pandoc is a document processing tool for a heavily extended version of Markdown (along with other formats). Quarto notebooks are written in Pandoc's Markdown, and Jupyter uses Pandoc to create LaTeX (and LaTeX-derived PDFs).

Rmd and Rnotebook mostly use CommonMark and GFM syntax.

Process

I recommend leaving time before something is due to go back through the notebook and clean it up for final presentation.

Sometimes it works best to start with the notebook you have and delete unnecessary code, remove excess debugging outputs, and improve the writing.

Sometimes it works best to start a fresh notebook, start putting together the structure, and copy over the code you actually need for the final solution.

Either way, you should produce a final notebook that is:

readable
executable from top to bottom
free of extraneous or unnecessary outputs, particularly outputs without explanation

This last point is to avoid the “sea of charts” effect. If there are a lot of charts and tables that don’t advance your story, the notebook is much harder to read. Not every output you created in the process of figuring out how to solve the problem will be useful to your reader. Additional debugging or deep-dive outputs can be moved to a separate file (that should also be executable!) and linked as an appendix to your main report.

Exporting Notebooks

Once you have your notebook ready and complete, you usually want to export it to a standalone file so that you can share it, submit it as an assignment solution, etc. without requiring the reader to open it in the notebook server. For Quarto, Rmd, or Rnotebook, an export is also the only way to provide a file that includes the outputs, since they are not saved in the source notebook file.

HTML Export

All of these tools support HTML export, and it is often the easiest to produce. From Jupyter, you can choose “File” → “Save and Export Notebook As…” → “HTML”, and it will create a single HTML file that contains the text, code, and all outputs.

When writing Quarto or Rmd/Rnotebook in RStudio, you want to “Knit” the file to HTML. Make sure it is set to produce a self-contained HTML file; this is the default in recent installations of RStudio.

If you are using Quarto from the command line, self-contained files are not the default, but you can configure Quarto to produce them (see the Quarto docs for details on this).
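If you are unsure whether an exported HTML file is self-contained, one rough check is to look for images, scripts, and stylesheets that point at other files instead of being embedded as `data:` URIs. The sketch below uses Python's standard `html.parser` module; the class and function names are hypothetical, and it does not look inside CSS or JavaScript for further references.

```python
from html.parser import HTMLParser


class ExternalResourceFinder(HTMLParser):
    """Collect image, script, and stylesheet references that are not embedded."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script"):
            target = attributes.get("src")
        elif tag == "link" and "stylesheet" in (attributes.get("rel") or "").lower():
            target = attributes.get("href")
        else:
            target = None
        # Embedded resources use data: URIs; anything else lives elsewhere.
        if target and not target.startswith("data:"):
            self.external.append(target)


def external_resources(html_text: str) -> list[str]:
    """Return the non-embedded resource references found in html_text."""
    finder = ExternalResourceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

A non-empty result is a prompt to look more closely, not necessarily an error: an export may still load a script from the web (for math rendering, for example) while embedding all of your plots.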

Course management systems typically don’t allow students to upload HTML files, so you will usually need to compress your HTML file into a Zip file and upload that. This can work with non-standalone HTML files too, so long as the Zip file also contains the images and other files the HTML refers to.
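Your operating system's file manager can create the Zip file, but it can also be scripted. This is a minimal sketch using Python's standard `zipfile` module; the function name `zip_export` and the example file names are hypothetical, and any extra files are assumed to live in the same directory as the HTML file or below it.

```python
import zipfile
from pathlib import Path


def zip_export(html_path, extra_files=(), zip_path=None):
    """Write html_path (plus any extra files it needs) into a Zip archive.

    Extra files must be inside the HTML file's directory; otherwise
    Path.relative_to raises ValueError.
    """
    html_file = Path(html_path)
    archive = Path(zip_path) if zip_path else html_file.with_suffix(".zip")
    base = html_file.parent
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for item in [html_file, *map(Path, extra_files)]:
            # Store paths relative to the HTML file so image links keep working.
            zf.write(item, arcname=item.relative_to(base).as_posix())
    return archive
```

For example, `zip_export("report.html", ["figures/plot.png"])` would write `report.zip` next to the HTML file, with the image stored under `figures/` inside the archive.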

PDF Export

PDF files are a little trickier to create well, but they have a few benefits:

They are always self-contained.
They can be uploaded directly as assignment solutions.
When submitted as an assignment solution, the instructor or grader can view the file directly in the grading interface as well as annotate it with notes, highlights, etc. to provide feedback.
They can be easily annotated with other PDF annotation software like Highlights (this is particularly useful for reviewing research results).

Jupyter, Quarto, and Rmd/Rnotebook can all export PDF files. Their default PDF exports require a working LaTeX installation; Quarto provides a command-line option to install a minimal one that’s enough to build its output.

If you don’t have LaTeX installed, it can work well to produce PDFs from HTML. Jupyter has built-in support for this; it just requires a couple of installs:

pip install playwright
playwright install chromium
Select “Webpdf” from Jupyter’s save and export menu
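The same export can also be run from a script through the `jupyter nbconvert` command-line tool, which accepts `--to webpdf` (and `--to html`) as well as `--execute` to re-run the notebook first. The sketch below only builds and runs that command line; the helper names and the file name `analysis.ipynb` are hypothetical, and it assumes `jupyter` is on your path with the installs above already done.

```python
import subprocess


def nbconvert_command(notebook, to="webpdf", execute=False):
    """Build the `jupyter nbconvert` command line for one notebook."""
    command = ["jupyter", "nbconvert", "--to", to]
    if execute:
        # Re-run the notebook from top to bottom before exporting.
        command.append("--execute")
    command.append(str(notebook))
    return command


def export_notebook(notebook, to="webpdf", execute=False):
    """Run the export; raises CalledProcessError if nbconvert fails."""
    subprocess.run(nbconvert_command(notebook, to, execute), check=True)
```

Calling `export_notebook("analysis.ipynb", execute=True)` would re-run the notebook and write the PDF, which combines the complete-run check with the export step.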

Quarto also theoretically supports HTML-based PDF workflows, but I haven’t figured out how to get those working yet.

You can also create a PDF from any HTML file a few different ways:

Open it in your browser and print it to a PDF file.
Use WeasyPrint:
```
conda install weasyprint
weasyprint file.html file.pdf
```
You can also install weasyprint with pip, but then you also need to make sure the appropriate Cairo development libraries are installed. Conda automates that step, which is especially hard to do by hand on Windows. On macOS, Homebrew is also a good way to install weasyprint.
Use wkhtmltopdf or another HTML-to-PDF tool.

Checklist

This checklist is to help you ensure your notebook is well-structured and well-written. I may expand or revise it as we progress through the semester.

Structure

Does the notebook re-run without error from top to bottom?
Does re-running the notebook produce correct charts and results?
Does the notebook begin with the document title as an H1 heading?
Are headings correctly nested (H2 within H1, H3 in H2, etc.)?
Are headings short titles? (No full sentences!)

Writing and Output

Does the introduction state the notebook’s purpose?
Does either the introduction or the data section describe where the data come from?
- If it’s publicly downloadable, is there a link?
- If there are multiple options, which one is used?
Does it use correct grammar and spelling?
Does it use formatting to provide appropriate emphasis and clarity?
- Are variable, function, etc. names marked as code?
- Are lists used when helpful to break down points?
Is mathematical notation used to precisely define quantities or intended computation when it will increase clarity?
Do all outputs help advance the notebook’s story? Have you removed ones only for debugging or trying things out?
Do charts & conclusion-supporting outputs have surrounding text explaining their purpose and any extra information needed for accurate interpretation?
Do all relevant outputs have clear and accurate interpretation? Do they have text that explains their purpose in the context of the original problem or research question?
Does it avoid unnecessary detail, such as discussions of things that are obvious to anyone with basic familiarity with the programming environment, or lengthy expositions that do not provide clarity to the analysis?

Graphics

Do all charts have properly labeled axes and legends (color codes, etc.)?
Do charts have titles if purpose is not clear from axes or immediately preceding text?
- If there are multiple variants of a chart w/ same axes, they must have titles to quickly distinguish.
Are charts legible?
Do charts present lessons learned without distortions?
If you did not create the chart, would you be able to interpret it correctly?
Do facets, colors, and axes draw the reader to the most important comparisons or patterns?
Does surrounding text, if any, accurately interpret the chart?

The Data Visualization Checklist is useful, if opinionated.

Content

Are observations and conclusions substantiated by data and/or sound argument?
Are goals and observations made clear, both for the document and for individual pieces of analysis?

Post-Export

Does the PDF or HTML export include all outputs?
Are plots correctly displayed in the resulting PDF or HTML file?