Blog Articles 61–65

Remote Data Analysis with Jupyter and ngrok

In my previous post, I mentioned that we’re using Jupyter notebooks for a lot of our data analysis, even in R.

This post is a quick ‘howto’ for doing such analyses on remote compute servers (in the university data center, Amazon or Azure clouds, or whatever).

I'll Keep Using R

During my two years at Texas State, I’ve been engaged in a bit of an experiment on statistics & data analysis tools. Some of my graduate students have been using R for data analysis, but some have been working with the PyData stack. I’ve also been learning PyData, doing some new analysis or data processing in it and trying to convert a few old analyses.

I learned R in grad school, and used it throughout my Ph.D. work. My R style dramatically changed over time, as I learned ggplot2, then data.table, and finally plyr and dplyr. I’d like to think that I’ve become fairly proficient at R. I even like the language.

But Python is a far more usable general-purpose language. If we can do the things we currently need from R in Python, in a reasonable fashion, then using Python decreases the number of different things that students need to master to contribute to research. There are a few things it just can’t do yet (I have yet to find a structural equation modeling package for Python), but the core capability is there for most of what we do.

Curried Chickpeas and Peanuts

Curried chickpeas (garbanzo beans) and peanuts is one of our favorite foods. The recipe is pretty flexible — I am writing the rough contours here, but you can always add or subtract ingredients to suit your preferences.

This dish can be made either in the crock pot or on the stove. We tend to like the stovetop version better. It requires a fair amount of forethought due to the bean prep, plus about 15 minutes of prep for the crock pot or 30-60 minutes of prep & cooking for the stove.

I haven’t indicated quantities on most of these ingredients. Again, this is flexible! Put together a quantity of food that seems right, and adjust the spices to taste. I have yet to have it come out terribly.

  • Garbanzo beans, either dried or canned; we use a full 1 lb bag in the crock pot, half a bag on the stove
  • Peanuts (we use lightly-salted roasted peanuts; raw peanuts work as well)
  • Jalapeño, chopped finely
  • Onion, chopped
  • ½ block extra firm tofu, pressed and cubed
  • Broccoli (frozen is fine)
  • Salt
  • Curry spices
  • Vegetable (or other) stock (can be omitted — we make good curry just with the milk for liquid)
  • Milk-like substance (we have had good success with unsweetened almond milk, coconut milk, and cashew milk)

Secure Self-Hosted Backups for Windows

Some time ago, I wrote up my data protection strategy. I’m no longer using most of what I describe in that article for two reasons:

  • For various reasons, we are now using Windows as our primary laptop computing platform.
  • Backing up to an external hard drive is not robust against ransomware such as CryptoLocker that encrypts your files and holds them hostage until you pay the criminal or organization deploying the attack.

This second point is particularly important: in addition to general practices to keep from running ransomware (or other malicious software) in the first place, the best protection against CryptoLocker-style attacks is a good backup strategy where the backups are kept out of reach of the backed-up system. These attacks typically try to encrypt all the data files they can find, including on external drives and network shares, so if your backups are stored on such a drive they’ll be scrambled too.

However, if you have good backups out of reach of your computer, and you get hit, you can recover relatively easily: reset the computer from install media, restore data from backups, and go on your merry way. Be careful, of course, to avoid opening potentially malicious files, as it’s quite possible that the infection arrived through some document that is saved in your backups.

Dependency Injection

This week, Michael Ludwig and I had a paper published that we’ve been working on for quite some time. ‘Dependency Injection with Static Analysis and Context-Aware Policy’, describing our Grapht dependency injector, has been published in the Journal of Object Technology.

We wrote Grapht to support algorithm configuration in LensKit, our open-source recommender systems toolkit. We didn’t want to write a new dependency injection framework; initially, we built LensKit on top of Guice, and then migrated to PicoContainer when Guice wasn’t a good fit for our needs. But PicoContainer also made it difficult to implement the kind of configuration support we wanted to provide, so we finally broke down and wrote our own.

Grapht provides two major advantages over existing injectors:

  • It pre-computes component graphs before instantiating any objects. These graphs can then be analyzed and manipulated to implement whatever diagnostics, visualization, or configuration transformation an application desires. In LensKit, we use this to generate graphs of recommender configurations and to automatically separate pre-computable model components from those that must be instantiated at runtime.

  • It supports context-sensitive policy, making it far easier to configure object graphs that include arbitrary compositions and re-use of components. This is important in LensKit, because it allows us to create simple components and remix them in arbitrary ways to implement the full recommender.
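To make those two ideas a bit more concrete, here is a toy sketch in Python. It is not Grapht's API (Grapht is a Java library), and every name in it (ToyInjector, bind, plan, instantiate, edges, and the recommender-flavored classes) is hypothetical, made up for illustration. It only assumes that dependencies are declared as type-annotated constructor parameters. The sketch first plans a component graph without creating any objects, then picks a different implementation for the same type depending on which component is asking for it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One planned component: its class plus the nodes it depends on."""
    cls: type
    deps: dict = field(default_factory=dict)  # parameter name -> Node


class ToyInjector:
    """Hypothetical toy injector; none of these names come from Grapht."""

    def __init__(self):
        # (context class or None, requested type) -> implementation class
        self.bindings = {}

    def bind(self, requested, impl, within=None):
        self.bindings[(within, requested)] = impl

    def plan(self, requested, context=()):
        """Build the component graph without instantiating anything."""
        impl = requested
        # Prefer a binding scoped to the nearest enclosing component,
        # then fall back to the context-free binding.
        for scope in (*reversed(context), None):
            if (scope, requested) in self.bindings:
                impl = self.bindings[(scope, requested)]
                break
        # Assumption: dependencies are the annotated constructor parameters.
        hints = getattr(impl.__init__, "__annotations__", {})
        deps = {
            name: self.plan(dep, context + (impl,))
            for name, dep in hints.items()
            if name != "return"
        }
        return Node(impl, deps)

    def instantiate(self, node):
        """Turn a planned graph into live objects."""
        kwargs = {name: self.instantiate(dep) for name, dep in node.deps.items()}
        return node.cls(**kwargs)


def edges(node):
    """Walk a planned graph, yielding (owner, parameter, dependency) names."""
    for name, dep in node.deps.items():
        yield (node.cls.__name__, name, dep.cls.__name__)
        yield from edges(dep)


# Hypothetical components, loosely recommender-shaped.
class Baseline:
    pass


class GlobalMean(Baseline):
    pass


class ItemMean(Baseline):
    pass


class Normalizer:
    def __init__(self, baseline: Baseline):
        self.baseline = baseline


class Scorer:
    def __init__(self, baseline: Baseline, normalizer: Normalizer):
        self.baseline = baseline
        self.normalizer = normalizer


injector = ToyInjector()
injector.bind(Baseline, ItemMean)                        # default choice
injector.bind(Baseline, GlobalMean, within=Normalizer)   # context-specific choice

graph = injector.plan(Scorer)         # nothing has been constructed yet
for edge in edges(graph):             # the graph can be inspected or rewritten here
    print(edge)
scorer = injector.instantiate(graph)  # objects are only created now
```

In this sketch the scorer's own baseline resolves to ItemMean, while the baseline nested inside the Normalizer resolves to GlobalMean. The same small components get reused with different wiring depending on where they sit in the graph, and the whole plan is available for inspection before anything is instantiated.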