Remote Data Analysis with Jupyter and ngrok

In my previous post, I mentioned that we're using Jupyter notebooks for a lot of our data analysis, even in R.

This post is a quick ‘howto’ for doing such analyses on remote compute servers (in the university data center, Amazon or Azure clouds, or wherever).

The Pieces

  • Jupyter, with its Notebook system
  • IRKernel, if you are using R
  • ngrok to make the tunnel
  • tmux to split the terminal screen & keep the session alive

I'm doing all of this on a Linux server; our current compute server runs Red Hat Enterprise Linux.

I install tmux from my distribution repositories.

I download the ngrok binary from the ngrok web site, and put it in ~/bin (for myself) or /usr/local/bin (so it's available for my students too).

For Jupyter and IRKernel, there are many ways to get them! I most often use Anaconda or Miniconda to install them, along with my Python and R:

$ conda install notebook
$ conda install -c r r r-irkernel

You also need to edit your Jupyter configuration to allow remote connections; edit ~/.jupyter/jupyter_notebook_config.py to contain the following line:

c.NotebookApp.allow_remote_access = True
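
If you would rather script that change than edit the file by hand, something like the following should work. This is a minimal sketch: enable_remote_access is a made-up helper name, and it assumes the config lives in ~/.jupyter unless the JUPYTER_CONFIG_DIR environment variable points somewhere else.

```shell
# Append the remote-access setting to the Jupyter config, but only once.
# enable_remote_access is a hypothetical helper name, not part of Jupyter.
enable_remote_access() {
    local dir="${JUPYTER_CONFIG_DIR:-$HOME/.jupyter}"
    local cfg="$dir/jupyter_notebook_config.py"
    local line='c.NotebookApp.allow_remote_access = True'

    mkdir -p "$dir"
    touch "$cfg"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$cfg"; then
        printf '%s\n' "$line" >> "$cfg"
    fi
}
```

Running enable_remote_access a second time leaves the file unchanged, so it is safe to keep in a setup script.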

Finally, you will need to create an ngrok account and set up your ngrok installation to connect to it.

Setting Up a Session

First, I SSH in to the compute server, and go to my project directory:

localhost:~$ ssh big-data-monster.cs.txstate.edu
big-data-monster:~$ cd /data/my-project

I then launch a tmux session to contain my analysis & allow me to split my session into multiple terminals:

big-data-monster:~$ tmux

I typically split my tmux window into two panes with Ctrl+B ". Once this is done, Ctrl+B o will jump between panes.
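
If you set this up often, the tmux steps can be scripted. The sketch below is one way to do it, not the only one: start_analysis_session and the default session name "analysis" are names I made up for illustration. Giving the session a name also makes it easy to reattach after a dropped SSH connection.

```shell
# Start (or reattach to) a named tmux session with two stacked panes.
# start_analysis_session and the default name "analysis" are hypothetical.
start_analysis_session() {
    local session="${1:-analysis}"

    # Reuse the session if it already exists, e.g. after a dropped connection
    if tmux has-session -t "$session" 2>/dev/null; then
        tmux attach-session -t "$session"
        return
    fi

    # -d: create the session detached, -s: give it a name
    tmux new-session -d -s "$session"
    # Same split as Ctrl+B ": one pane above the other
    tmux split-window -v -t "$session"
    tmux attach-session -t "$session"
}
```

Run it from the project directory (after the cd above) so both panes start there.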

In one pane, I start the notebook server:

big-data-monster:~$ jupyter notebook

In the other pane, I launch ngrok; watch the notebook server's output to see the port to use:

big-data-monster:~$ ngrok http 8888

Once this is done, ngrok will give you a URL to connect to; connect to the HTTPS version, and you have your notebook! Jupyter will prompt you for a token; enter the token (the part after ?token= in your Jupyter console output). If you don't have the token handy, start another terminal pane and run jupyter notebook list.
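
Both the port and the token can be pulled out of the jupyter notebook list output rather than copied by eye. The helpers below are a rough sketch: notebook_port and notebook_token are hypothetical names, and they assume each server is listed on a line of the form http://localhost:PORT/?token=TOKEN :: DIRECTORY.

```shell
# Both helpers read `jupyter notebook list` output on stdin.
# notebook_port and notebook_token are hypothetical helper names.

# Print the port of the first server listed.
notebook_port() {
    sed -n 's|^https\{0,1\}://[^:/]*:\([0-9][0-9]*\)/.*|\1|p' | head -n 1
}

# Print the token of the first server listed.
notebook_token() {
    sed -n 's|^.*[?&]token=\([^ &]*\).*|\1|p' | head -n 1
}
```

With those defined, ngrok http "$(jupyter notebook list | notebook_port)" tunnels whichever port the first listed server picked, and jupyter notebook list | notebook_token prints the token to paste into the login prompt.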

Benefits and Rationale

One of the big benefits of this setup is that it doesn't require administering additional server software. Access control is handled entirely by the system's SSH daemon, authentication, and file system permissions. Users with accounts on the compute server can spin up their own notebooks, with their own Python and/or R environments, and perform their analysis with minimal need for sysadmin support.

Using ngrok instead of plain SSH tunnels allows me to build the tunnel after the fact, without having to anticipate what port will be available for my notebook server. When several users share a compute server, Jupyter will automatically pick an available port. I set up the ngrok tunnel after the notebook server is running, so I can tunnel whatever port it found.

Also, ngrok is easier to set up than multi-hop tunnels when connecting through a bastion host.

This setup is secure enough for most of our work. For an analysis with sensitive human subjects data, however, I may drop ngrok and do the extra work to set up a plain SSH tunnel for the notebook server.
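
For reference, the plain SSH tunnel looks roughly like this. It is a sketch under a few assumptions: open_notebook_tunnel is a hypothetical helper name, the optional third argument is a bastion host that you would substitute yourself, and the -J option needs OpenSSH 7.3 or newer.

```shell
# Forward a local port to the notebook server's port on the compute server.
# Run this on your own machine, not on the server.
# open_notebook_tunnel is a hypothetical helper name.
open_notebook_tunnel() {
    local host="$1"          # compute server to connect to
    local port="${2:-8888}"  # port Jupyter reported on the server
    local jump="${3:-}"      # optional bastion host to hop through

    if [ -n "$jump" ]; then
        # -J: connect via the bastion first (OpenSSH 7.3 or newer)
        ssh -N -J "$jump" -L "$port:localhost:$port" "$host"
    else
        # -N: run no remote command, just hold the tunnel open
        # -L: local port -> localhost:port as seen from the server
        ssh -N -L "$port:localhost:$port" "$host"
    fi
}
```

After open_notebook_tunnel big-data-monster.cs.txstate.edu 8888 is running, the notebook is reachable at http://localhost:8888 on the local machine and the traffic never leaves the SSH connection.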

Alternatives to ngrok

ngrok is not the only source of this service; if you want something that does not require an account, try serveo:

ssh -R 443:localhost:8888 serveo.net

It will print out the URL you can connect to, and the rest proceeds just like logging in with ngrok.

Advanced Setup: Per-Project Environments

Conda, the Anaconda/Miniconda package manager, has a useful concept of environments that allows you to have multiple installations of Python, R, and libraries and switch between them on a per-project basis. If each environment has the notebook package installed, then launching Jupyter from an activated environment starts a notebook server that uses the software set up for that project.

I combine this with the wonderful direnv, which I've been using for a while to improve my shell environment setup.

I have the following in ~/.direnvrc:

use_conda() {
    source activate "$1"
}

I then put the following in the .envrc file for a particular project:

use conda doascience

When I cd into the experiment directory, the direnv shell hooks will automatically activate the proper Conda environment, and running Jupyter will spin up a notebook server with that project's requirements.
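
Newer Conda releases recommend conda activate over source activate. A variant of the use_conda function above for that case might look like the following. It is a sketch that assumes the conda command is on the PATH when direnv runs and that your Conda is recent enough to provide the shell.bash hook subcommand.

```shell
# Variant of use_conda for ~/.direnvrc on newer Conda releases.
# Assumes `conda` is on the PATH and supports `conda shell.bash hook`.
use_conda() {
    # direnv evaluates .envrc in bash, so load conda's bash integration first
    eval "$(conda shell.bash hook)"
    conda activate "$1"
}
```

The .envrc line stays the same: use conda doascience.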