Michael Ekstrand

I'll Keep Using R

Published on 22 Apr 2016 and tagged with research, programming, and R.

During my two years at Texas State, I've been engaged in a bit of an experiment on statistics & data analysis tools. Some of my graduate students have been using R for data analysis, but some have been working with the PyData stack. I've also been learning PyData, doing some new analysis or data processing in it and trying to convert a few old analyses.

I learned R in grad school, and used it throughout my Ph.D. work. My R style changed dramatically over time, as I learned ggplot2, then data.table, and finally plyr and dplyr. I'd like to think that I've become fairly proficient at R. I even like the language.

But Python is a far more usable general-purpose language. If we can do the things we currently need from R in a reasonable fashion, then using Python decreases the number of different things that students need to master to contribute to research. There are a few things it just can't do yet — I have yet to find a structural equation modeling package for Python — but the core capability is there for most of what we do.

So, as students have joined my group, I've encouraged them to pick whichever they want between Python and R to do their analyses. 2–3 students chose Python (+ Pandas + Seaborn), and others chose R. I've tried to support each decision as best I can.

Based on my experience helping my students and trying PyData myself, I'll be encouraging all my new research students to learn R, at least for a while yet. One of my students who chose Python has since switched to R (and got up to speed rather quickly).

Before going further, I'd like to note that I'm really impressed with the work that PyData developers have been doing on making it a great tool for statistics and machine learning. It's good work, well done, and having options in this space is good. Being able to do data analysis in a useful general-purpose language like Python is a good thing.

Also, this analysis is with respect to our particular context, which involves a lot of ‘medium data’. We're regularly working with data sets that push my 16GB MacBook Pro, but don't need a cluster.

The tipping point for my decision was a recent data processing task on the MovieLens 20M data set. I wrote the code in Python using Pandas; it's fairly standard data table munging (grouping, splitting, etc.). While I do not know Pandas nearly as deeply as I know R, I am comfortable with vectorization and the split-apply-combine paradigm for writing efficient code. (It is quite possible that there's some fantastic resource of tricks that I'm missing — if so, please let me know; I've already read relevant portions of the Pandas documentation and the Pandas cookbook, as well as portions of Python for Data Analysis.)
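For readers unfamiliar with the term: split-apply-combine means splitting a table into groups, applying a function to each group, and combining the results; in Pandas, that's the groupby machinery. A toy sketch (the column names here are illustrative, not the actual MovieLens schema):

```python
import pandas as pd

# Toy ratings frame standing in for the real data set.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.0],
})

# Split by user, apply a mean to each group, combine into one Series.
user_means = ratings.groupby("userId")["rating"].mean()
print(user_means)
```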

The initial version of the code was not super-fast, but it was usable. I refactored and extended it, with what seemed to me a fairly straightforward application of Pandas split-apply-combine to restructure my data tables, and the program promptly chewed up 180GB of RAM. Yes, there are three digits in that number.

I rewrote the code in R with dplyr, and it was

  • easier to read
  • easier to extend
  • faster than the original Python

It may well be possible to write the code I needed in Python and have it be performant. But reading the Pandas documentation, already familiar with the core programming paradigm, I could not figure out how; attempting to apply Pandas methods in a straightforward fashion produced code that either didn't work or blew up fairly badly.

Even before the Great ML20M Performance Breakdown, I've been having challenges with the readability of Pandas code. It reminds me a lot of data.table code: terse and heavily dependent on preexisting state of the data structures. The index is extremely important in Pandas, like keys are in data.table; it affects the performance of many operations, but more importantly it defines the behavior. The problem is that the index is not apparent at the site of a join, group, or other operation: in order to understand what a particular line of Pandas code does, you have to trace back up through the previous code to understand exactly how the indexes are structured for the tables involved, and combine this with a deep knowledge of the various indexing modes supported by Pandas.
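To make the index's role concrete, here's a minimal illustration (mine, not from the analysis in question) of the index defining behavior: arithmetic between two Series aligns on index labels, not on positions.

```python
import pandas as pd

# Same shapes, different indexes.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Alignment is by label: "x" and "w" appear in only one operand, so
# the result holds NaN there; "y" and "z" are matched by label.
total = a + b
print(total)
```

The same alignment rules drive joins and grouping, which is why you need to know each table's index state to predict what a line of Pandas does.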

The paradigm is also very object oriented: smart data that knows its state and structure. This is useful in many cases, but in our kinds of data analysis, it makes it harder to understand code locally. Writing code in R with dplyr is more functional: the data is ‘dumb’, and the functions define its behavior. Data structure state can affect performance — using a data.table for dplyr is much faster than a standard R data frame in many cases — but the semantics of the code are described in the expression being studied.

user.summary = ratings %>%
    group_by(User) %>%
    summarize(MeanRating = mean(rating),
              RatingCount = n(),
              SessionCount = max(SessionId))

It takes some familiarity with the basic principles of dplyr and function pipelining to fully understand this code, but with those principles in mind, the meaning of the code is fairly explicit: group by user and compute summary statistics. The basic building blocks combine to express much more complicated computations, but they retain this kind of readability: looking at dplyr code, I can understand its intent far more easily than I can the equivalent data.table or Pandas code, particularly as the computation in question becomes more complex.
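For comparison, a rough Pandas equivalent of that summary might look like the following (this uses named aggregation, a syntax added to Pandas well after this post was written; column names follow the dplyr snippet):

```python
import pandas as pd

# Toy frame with the same columns as the dplyr example.
ratings = pd.DataFrame({
    "User":      [1, 1, 2],
    "rating":    [4.0, 2.0, 5.0],
    "SessionId": [1, 2, 1],
})

# Group by user and compute the same three summary statistics.
user_summary = ratings.groupby("User").agg(
    MeanRating=("rating", "mean"),
    RatingCount=("rating", "size"),
    SessionCount=("SessionId", "max"),
).reset_index()
print(user_summary)
```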

This kind of readability is important for statistical code, because I need to be able to review my students' (and collaborators') analysis code to verify that the analysis is actually doing what we think it is, so we can interpret the results. Part of it is a familiarity issue, but there are things about dplyr that I think make it fundamentally easier to interpret than Pandas code can be, especially when multiple tables are being joined (e.g., with inner_join in dplyr). Good written explanations in a Jupyter notebook help, but we still need to verify that the code is actually performing the analysis described in the text.

dplyr is also very fast, and when it isn't fast enough we can drop down to plyr and write slightly more obscure code that can execute in parallel.

So, to sum up (and add a few things), here are the things keeping me on R:

  • dplyr makes data manipulation much easier to read and write, and in some cases it is faster.
  • ggplot2 is unparalleled for generating statistical graphics. Seaborn makes pretty plots, as long as you want one of the built-in plot types. There is a Python ggplot, but when I tried to use it there were some bugs with composing a complex chart. I can do whatever I want in raw matplotlib, but the code is verbose and difficult.
  • Parallelism with foreach and plyr is more immediately usable in R. There are multiprocessing packages for Python, but when I have tried to understand how to use those with Pandas in a Jupyter notebook, it hasn't been as accessible as foreach.

For the work I do, in 2016, R seems to still be the best tool for the job. If a student comes to me already fluent in Python, and wants to use the PyData stack, I won't tell them they can't. But when a student asks what they should use to do analysis, my answer will be R. For another few years, anyway.

I recommend that they learn it with R Programming for Data Science, as it teaches dplyr, and that they skip R's built-in plotting functions and just learn ggplot2.

Update: Even though we'll keep using R, there's one huge benefit that we get from the Python data ecosystem: I've mostly switched from RStudio to using Jupyter notebooks with IRKernel. It is fantastic. We've also been using Anaconda to install R and it's worked pretty well.

Appendix: Example Code

By request, here's the Python and dplyr code that blew up. The goal is to sessionize user ratings by saying that a new session begins after 60 minutes of inactivity (see this paper); the output is a session ID for every rating. At this point, the ratings data frame has been sorted by user ID and timestamp.

def sessionize(df):
    # Time gap to the user's previous rating; NaN for the first row.
    delta_t = df.timestamp - df.timestamp.shift(1)
    # A new session starts after 60 minutes (3600 s) of inactivity.
    is_new = delta_t >= 3600
    is_new.iloc[0] = True  # the first rating always starts a session
    session_id = is_new.cumsum()
    return session_id

session_ids = ratings.set_index(['userId', 'movieId']).groupby(level=['userId']).apply(sessionize)
session_ids.head()

And here's the dplyr; as a bonus, the ratings frame is not already sorted, and it outputs a whole data frame instead of just the session ID vector:

sessionize = function(ts) {
    delta.t = diff(ts)
    is.new = c(TRUE, delta.t >= 3600)
    cumsum(is.new)
}
sessions = ratings %>%
    group_by(userId) %>%
    arrange(timestamp) %>%
    mutate(sessionId = sessionize(timestamp))
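As an aside (this is my own sketch, not something from the original analysis): one way to sidestep groupby.apply in Pandas here is to vectorize the session-break test over the whole sorted frame, assuming the same userId/timestamp columns:

```python
import pandas as pd

# Toy ratings; in practice this would be the MovieLens frame.
ratings = pd.DataFrame({
    "userId":    [1, 1, 1, 2],
    "timestamp": [0, 100, 4000, 50],
})

# Sort so each user's ratings are contiguous and time-ordered.
ratings = ratings.sort_values(["userId", "timestamp"]).reset_index(drop=True)

# A rating continues a session only if it belongs to the same user as
# the previous row and arrived less than 60 minutes after it.
same_user = ratings["userId"].eq(ratings["userId"].shift())
small_gap = ratings["timestamp"].diff() < 3600
is_new = ~(same_user & small_gap)

# Per-user running count of session starts gives the session ID.
ratings["sessionId"] = is_new.groupby(ratings["userId"]).cumsum()
print(ratings)
```

Whether this pattern would have avoided the memory blowup on the full 20M data set I can't say for certain, but it avoids the per-group apply entirely.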

Update: Wes McKinney has responded with a hint about reducing the memory blowup; if you want to do similar computations, consider it.
