Remote Data Analysis with Jupyter and ngrok
In my previous post, I mentioned that we’re using Jupyter notebooks for a lot of our data analysis, even in R.
This post is a quick "howto" for doing such analyses on remote compute servers (in the university data center, Amazon or Azure clouds, or whatever). I'm doing all of this on a Linux server; our current compute server runs Red Hat Enterprise Linux.

## The Pieces

I install `tmux` from my distribution repositories.

I download the `ngrok` binary from the ngrok web site, and put it in `~/bin` (for myself) or `/usr/local/bin` (so it's available for my students too).

For Jupyter and IRKernel, there are many ways to get them! I most often use Anaconda or Miniconda to install them, along with my Python and R:

```shell
$ conda install notebook
$ conda install -c r r r-irkernel
```

You also need to edit your Jupyter configuration to allow remote connections; edit `~/.jupyter/jupyter_notebook_config.py` to contain the following line:

```python
c.NotebookApp.allow_remote_access = True
```

If this file doesn't yet exist, you can ask Jupyter to generate it first:

```shell
jupyter notebook --generate-config
```

Finally, you will need to create an ngrok account and set up your ngrok installation to connect to it.

## Setting Up a Session

First, I SSH in to the compute server, and go to my project directory:

```shell
localhost:~$ ssh big-data-monster.cs.txstate.edu
big-data-monster:~$ cd /data/my-project
```

I then launch a `tmux` session to contain my analysis & allow me to split my session into multiple terminals:

```shell
big-data-monster:~$ tmux
```

I typically split my `tmux` into two panes with Ctrl+B ". Once this is done, Ctrl+B o will jump between panes.

In one pane, I start the notebook server:

```shell
big-data-monster:~$ jupyter notebook
```

In the other pane, I launch `ngrok`; watch the notebook server's output to see the port to use:

```shell
big-data-monster:~$ ngrok http 8888
```

Once this is done, `ngrok` will give you a URL to connect to; connect to the HTTPS version, and you have your notebook! Jupyter will prompt you for a token; enter the token (the part after `?token=` in your Jupyter console output). If you don't have the token handy, start another terminal pane and run `jupyter notebook list`.

## Benefits and Rationale

One of the big benefits of this setup is that it doesn't require administering additional server software. Access control is handled entirely by the system's SSH daemon, authentication, and file system permissions. Users with accounts on the compute server can spin up their own notebooks, with their own Python and/or R environments, and perform their analysis with minimal need for sysadmin support.

Using `ngrok` instead of plain SSH tunnels allows me to build the tunnel after the fact, without having to anticipate what port will be available for my network server. When several users share a compute server, Jupyter will automatically pick an available port. I set up the `ngrok` tunnel after the notebook server is running, so I can tunnel whatever port it found.

Also, `ngrok` is easier to set up than multi-hop tunnels when connecting through a bastion host.

This setup is secure enough for most of our work. For an analysis with sensitive human subjects data, however, I may drop `ngrok` and do the extra work to set up a plain SSH tunnel for the notebook server.

## Alternatives to ngrok

`ngrok` is not the only source of this service; if you want something that does not require an account, try serveo:

```shell
ssh -R 443:localhost:8888 serveo.net
```

It will print out the URL you can connect to, and the rest proceeds just like logging in with `ngrok`.

## Advanced Setup: Per-Project Environments

Conda, the Anaconda/Miniconda package manager, has a useful concept of environments that allows you to have multiple installations of Python, R, and libraries and switch between them on a per-project basis. If each environment has `notebook` installed, then this setup method starts a notebook server that's using the software environment set up for that project.

I combine this with the wonderful direnv, which I've been using for a while to improve my shell environment setup. I have the following in `~/.direnvrc`:

```shell
use_conda() {
    source activate "$1"
}
```

I then put the following in the `.envrc` file for a particular project:

```shell
use conda doascience
```

When I `cd` into the experiment directory, the direnv shell hooks will automatically activate the proper Conda environment, and running Jupyter will spin up a notebook server with that project's requirements.
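As an aside: since the workflow involves reading the port and token out of the notebook server's output by eye, a couple of lines of shell can do that mechanically. This is a sketch, not part of the original workflow; the URL below is a made-up example of the kind of line `jupyter notebook list` prints.

```shell
# Hypothetical example of a URL reported by the notebook server / `jupyter notebook list`
url='http://localhost:8889/?token=0123abcd'

# Pull out the port (useful for `ngrok http "$port"`) and the login token
port=$(echo "$url" | sed -E 's|.*localhost:([0-9]+)/.*|\1|')
token="${url#*token=}"

echo "port=$port token=$token"   # → port=8889 token=0123abcd
```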
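For the sensitive-data case where I drop ngrok, the plain SSH tunnel looks roughly like this; it's a sketch that assumes the notebook server came up on port 8888 (substitute whatever port Jupyter actually reported):

```shell
# Forward local port 8888 to the notebook server on the compute server,
# then browse to http://localhost:8888 and log in with the token as usual
ssh -L 8888:localhost:8888 big-data-monster.cs.txstate.edu
```

Unlike ngrok, this has to be set up from the client side and you need to know the port up front, which is exactly the inconvenience the ngrok approach avoids.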
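One caveat on the `use_conda` helper: newer Conda releases deprecate `source activate` in favor of `conda activate`, which needs Conda's shell hook loaded first. A variant for `~/.direnvrc` might look like this (a sketch, assuming `conda` itself is on your `PATH` when direnv runs):

```shell
use_conda() {
    # load Conda's shell integration, then activate the named environment
    eval "$(conda shell.bash hook)"
    conda activate "$1"
}
```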