When Recommenders Fail

This notebook reproduces the analysis in ‘When Recommenders Fail’.

There is some blending of dplyr and data.table code here. I try to mostly use dplyr, but there are a few places where I do data.table joins directly.

Setup and Support Code

Let us first load some libraries.

In [1]:
library(plyr)
library(data.table)
library(dplyr)
library(reshape2)
library(ggplot2)
library(ROCR)
library(lazyeval)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:data.table’:

    between, last

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘reshape2’

The following objects are masked from ‘package:data.table’:

    dcast, melt

Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

The following function samples a vector, returning a vector with of n TRUE values at the selected items. It makes it easier to sample other data structures.

In [2]:
sample.true = function(n, l) {
  result = vector(length=l)
  result[sample(1:l, n)] = TRUE
  result
}

Another little function to check if a file is up to date.

In [3]:
file.current = function(dst, src) {
  if (!file.exists(dst)) {
    FALSE
  } else if (!file.exists(src)) {
    TRUE
  } else {
    file.info(dst)$mtime >= file.info(src)$mtime
  }
}

Load Data

We now want to load the results of our LensKit experiment.

In [4]:
fn.cache = "build/predictions.Rdata"
fn.src = "build/individual.csv.gz"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  preds.tall = data.table(read.csv(gzfile(fn.src)))
  preds.tall = preds.tall[,list(User,Item,Rating,Algorithm,Prediction)]
  setkey(preds.tall, User, Item, Algorithm)
  preds.tall = mutate(preds.tall, Error = Rating - Prediction)
  users = summarize(group_by(preds.tall, User), n=length(unique(Item)))
  users.usable = filter(users, n >= 10)
  preds.tall = preds.tall[users.usable[,list(User)]]
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       preds.tall, users, users.usable)
}
algorithms = levels(preds.tall$Algorithm)
loading cache file build/predictions.Rdata

Let's pivot the prediction table to have a column per algorithm.

In [7]:
predictions = preds.tall %>%
    select(-Error) %>%
    dcast.data.table(User + Item + Rating ~ Algorithm, value.var='Prediction')
setkey(predictions, User, Item)
head(predictions)
Out[7]:
UserItemRatingFunkSVDItemItemLuceneMeanUserUser
152833.9772233.8912844.5042574.0458963.968387
253053.9645573.7418513.6146073.6146073.849966
355243.5619453.5036522.8735773.5204953.937678
4524943.8033183.5852623.4932233.653793.802025
5530834.0261133.6694733.4200053.8878844.268066
6532133.9686483.7528314.0489233.7331464.083675

We now extract the ratings from the predictions table to get the test ratings.

In [8]:
ratings = select(predictions, User, Item, Rating)
setkey(ratings, User, Item)

For some of our analysis, we need the entire rating history. So let's load it.

In [9]:
fn.cache = "build/ratings.Rdata"
fn.src = "data/ml-10m/ratings.dat"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  all.ratings = read.csv(pipe("sed -e 's/::/,/g' -e 's/,[[:digit:]]*$//' data/ml-10m/ratings.dat"),
                         header=FALSE, col.names=c("User", "Item", "Rating"))
  all.ratings = data.table(all.ratings, key=c('User', 'Item'))
  train.ratings = all.ratings[!ratings]
  rm(all.ratings)
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       train.ratings)
}
loading cache file build/ratings.Rdata

Processing Data

Now that the data is loaded, we need to do some processing of it.

Probe Ratings

First step: pulling apart probe ratings (for training hybrids) and test ratings (for testing everything).

LensKit output a set of predictions for test items for each user. We pick 5 of those items for each user as probe items, and the rest as the actual test items. These actual test items will be our test items for the rest of the analysis.

To do this, we will first identify probe pairs: for each user, 5 random items. This will be stored in a frame test.pair.purpose.

In [10]:
test.pair.purpose = ratings %>% select(User, Item) %>% group_by(User) %>% mutate(IsTest=sample(n()) > 5)

To make this easier to use, we will create a function to filter a table down to test or train ratings. If TRUE (the default), it will look for test ratings, otherwise, probe ratings.

In [11]:
filter.pairs = function(tbl, want.test=TRUE) {
    inner_join(tbl, test.pair.purpose) %>% filter(!IsTest) %>% select(-IsTest)
}

We will then select the predictions for these probe pairs to be the probe predictions, and the rest as test predictions. We'll use data.table's joining capabilities for this.

In [50]:
probe.preds = filter.pairs(preds.tall, FALSE)
test.preds = filter.pairs(preds.tall)
test.rating.count = nrow(test.preds)
dim(probe.preds)
dim(test.preds)
Joining by: c("User", "Item")
Joining by: c("User", "Item")
Out[50]:
  1. 1115350
  2. 6
Out[50]:
  1. 1115350
  2. 6

Errors

Now, we will convert the predictions table to an errors table. This will include both test and probe predictions.

In [51]:
errors.full = dcast.data.table(select(preds.tall, -Rating, -Prediction),
                               User + Item ~ Algorithm)
head(errors.full)
Using 'Error' as value column. Use 'value.var' to override
Out[51]:
UserItemFunkSVDItemItemLuceneMeanUserUser
1528-0.9772226-0.8912842-1.504257-1.045896-0.9683868
25301.0354431.2581491.3853931.3853931.150034
35520.4380550.49634751.1264230.47950450.06232184
452490.19668210.41473850.50677710.34620950.1979746
55308-1.026113-0.669473-0.4200048-0.8878837-1.268066
65321-0.9686483-0.7528306-1.048923-0.7331463-1.083675
In [52]:
test.errors = dcast.data.table(select(test.preds, -Rating, -Prediction),
                               User + Item ~ Algorithm)
head(test.errors)
Using 'Error' as value column. Use 'value.var' to override
Out[52]:
UserItemFunkSVDItemItemLuceneMeanUserUser
15520.4380550.49634751.1264230.47950450.06232184
25321-0.9686483-0.7528306-1.048923-0.7331463-1.083675
355321.9360241.9256631.7654951.975191.645022
455620.95041011.0343820.94552451.2139910.6759712
5511991.0581550.67188570.56507720.98335230.4217419
6750-0.225067-0.2867515-0.5417287-0.458344-0.3280759

Identifying Best Predictions

And we will process our test predictions to find out the algorithm that makes the best prediction for each test rating.

In [53]:
best.preds = test.preds %>%
    group_by(User, Item) %>%
    summarise(Algorithm=Algorithm[which.min(abs(Error))],
              Prediction=Prediction[which.min(abs(Error))],
              # Rating should be same value. We'll fail if not.
              Rating=unique(Rating)) %>%
    mutate(Error = Rating - Prediction)
setkey(best.preds, User, Item)
head(best.preds)
Out[53]:
UserItemAlgorithmPredictionRatingError
1552UserUser3.93767840.06232184
25321Mean3.7331463-0.7331463
35532UserUser3.35497851.645022
45562UserUser4.32402950.6759712
551199UserUser4.57825850.4217419
6750FunkSVD4.2250674-0.225067

Summarize these results:

In [54]:
best.algos.pred = best.preds %>%
    group_by(Algorithm) %>%
    summarise(Count=n()) %>%
    mutate(By='Prediction')
best.algos.pred
Error in dimnames.data.table(x): data.table inherits from data.frame (from v1.5) but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table? The correct class is c('data.table','data.frame').
Out[54]:
Source: local data table [5 x 3]

  Algorithm Count         By
     (fctr) (int)      (chr)
1  UserUser 49277 Prediction
2      Mean 32726 Prediction
3   FunkSVD 47407 Prediction
4  ItemItem 45653 Prediction
5    Lucene 48007 Prediction

Aggregate Errors by User

Now, we will compute per-user error (rather than per item). We will then see how often each algorithm is best by user RMSE or by user # correct.

First, we group things by user.

In [55]:
errors.by.user = test.preds %>%
    group_by(User, Algorithm) %>%
    summarise(MAE = mean(abs(Error)),
              RMSE = sqrt(mean(Error*Error)),
              Correct = sum(abs(Error <= 0.5)),
              NPreds = n()) %>%
    mutate(FracCorrect = Correct / NPreds)
head(errors.by.user)
Out[55]:
UserAlgorithmMAERMSECorrectNPredsFracCorrect
15FunkSVD1.0702581.174847250.4
25ItemItem0.97622181.099333250.4
35Lucene1.0902891.157529150.2
45Mean1.0770371.192439250.4
55UserUser0.77774640.9506982350.6
67FunkSVD0.58502410.6961736450.8

Next, we pick the best algorithm for each user, by two different metrics (RMSE and # Correct)

In [56]:
user.best = errors.by.user %>% group_by(User) %>%
    summarise(Algorithm=Algorithm[which.min(RMSE)],
              RMSE=min(RMSE))
user.best.correct = errors.by.user %>% group_by(User) %>%
    summarise(Algorithm=Algorithm[which.max(Correct)],
              Correct=max(Correct))

And add it to the best algorithms table.

In [57]:
best.algos = rbind(best.algos.pred %>% mutate(Frac=Count / sum(Count)),
                   user.best %>% group_by(Algorithm) %>%
                       summarise(Count=n()) %>% mutate(By='User RMSE', Frac=Count / sum(Count)),
                   user.best.correct %>% group_by(Algorithm) %>%
                       summarise(Count=n()) %>% mutate(By='User # Correct', Frac=Count / sum(Count)))
best.algos$By=as.factor(best.algos$By)
best.algos
Out[57]:
AlgorithmCountByFrac
1UserUser49277Prediction0.2209038
2Mean32726Prediction0.1467073
3FunkSVD47407Prediction0.2125207
4ItemItem45653Prediction0.2046577
5Lucene48007Prediction0.2152105
6UserUser9252User RMSE0.2073788
7FunkSVD11540User RMSE0.2586632
8ItemItem10980User RMSE0.2461111
9Lucene7806User RMSE0.1749675
10Mean5036User RMSE0.1128794
11UserUser3174User # Correct0.07114359
12ItemItem8014User # Correct0.1796297
13FunkSVD27278User # Correct0.6114224
14Lucene4637User # Correct0.103936
15Mean1511User # Correct0.03386829

Analyze Best Algorithms

First, what algorithms are best by each metric?

In [58]:
options(repr.plot.width=7, repr.plot.height=3)
ggplot(best.algos) +
    aes(x=Algorithm, y=Frac) +
    geom_bar(stat='identity') +
    facet_wrap(~ By) +
    theme(axis.text.x=element_text(angle=45, vjust=0.5)) +
    ylab("Frac. of Preds/Users")

Correlation

How correlated are the errors of our predictors?

In [59]:
errors.test = filter.pairs(errors.full)
error.cor.matrix = cor(select(errors.test, -User, -Item))
error.cor.matrix
Joining by: c("User", "Item")
Out[59]:
FunkSVDItemItemLuceneMeanUserUser
FunkSVD1.00000000.95319850.88999770.92120870.9357213
ItemItem0.95319851.00000000.89220380.91304270.9264254
Lucene0.88999770.89220381.00000000.91135730.8882744
Mean0.92120870.91304270.91135731.00000000.9420421
UserUser0.93572130.92642540.88827440.94204211.0000000

Binary Accuracy

How often is each algorithm correct?

In [60]:
algo.correct = summarise(group_by(test.preds, Algorithm),
                         N.Good.05 = sum(abs(Error) <= 0.5),
                         N.Good.07 = sum(abs(Error) <= 0.75),
                         N.Good.10 = sum(abs(Error) <= 1.0),
                         Good.05 = mean(abs(Error) <= 0.5),
                         Good.07 = mean(abs(Error) <= 0.75),
                         Good.10 = mean(abs(Error) <= 1.0))
correct.tall = melt(select(algo.correct, Algorithm, starts_with('Good')),
                    id.vars='Algorithm')
correct.tall = mutate(correct.tall, Thresh = c('ε ≤ 0.5', 'ε ≤ 0.75', 'ε ≤ 1.0')[as.integer(variable)])
ggplot(correct.tall) +
  aes(x=Algorithm, y=value) +
  geom_bar(stat='identity') +
  facet_wrap(~ Thresh) +
  ylab("Fraction Correct") + xlab(NULL) +
  theme(axis.text.x=element_text(angle=45, vjust=0.5))
Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <a4>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <a4>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call(L_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <a4>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.5' in 'mbcsToSbcs': dot substituted for <a4>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 0.75' in 'mbcsToSbcs': dot substituted for <a4>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <ce>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <b5>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <e2>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <89>Warning message:
In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, : conversion failure on 'ε ≤ 1.0' in 'mbcsToSbcs': dot substituted for <a4>

Let's rank the algorithms by # correct at 0.5 level.

In [61]:
algo.success = algo.correct %>%
    select(Algorithm, N.Good.05, Good.05) %>%
    arrange(-Good.05)
algo.success
Error in dimnames.data.table(x): data.table inherits from data.frame (from v1.5) but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table? The correct class is c('data.table','data.frame').
Out[61]:
Source: local data table [5 x 3]

  Algorithm N.Good.05   Good.05
     (fctr)     (int)     (dbl)
1  ItemItem    113181 0.5073788
2   FunkSVD    111625 0.5004035
3  UserUser    109590 0.4912808
4    Lucene    105392 0.4724616
5      Mean    100834 0.4520285

Per-User Analysis

How often do binary accuracy and RMSE pick the same 'best' algorithm for each user? Each panel represents users for which RMSE picked that algorithm to be best, and the bars indicate the fraction of those users for which '# Correct' picked each algorithm to be best.

In [62]:
user.picked = merge(user.best, user.best.correct, by='User',
                    suffixes=c('.RMSE', '.Correct'))
user.picked.summary = ddply(user.picked, .(Algorithm.RMSE), function(adf) {
  result = summarise(group_by(adf, Algorithm.Correct), Count=n())
  mutate(result, Frac=Count / sum(Count))
})
ggplot(user.picked.summary) +
  aes(x=Algorithm.Correct, y=Frac) +
  geom_bar(stat='identity') +
  facet_grid(~ Algorithm.RMSE) +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))

Marginal Benefit

Now, we want to analyze the marginal benefit: how much does one algorithm get correct, after accounting for the correct predictions from other algorithms?

First up: we need a function to identify the ratings that a particular algorithm got correct, and remove them so we can test the next algorithm(s).

In [63]:
remove.correct = function(err.tbl, algo, thresh=0.5) {
    message("removing algorithm ", algo)
    # make a formula to select things where the error on algo excedes the threshold
    sel = interp(~abs(algo) > thresh, algo=as.name(as.character(algo)), thresh=thresh)
    filter_(err.tbl, sel)
}

With this function, we are going to create a marginal improvement matrix: in turn, remove each algorithm (the rows). For each other algorithm, compute how many predictions the first algorithm (A) missed but the second algorithm (B) got correct.

In [64]:
algos.ordered = as.character(algo.success$Algorithm)
algo.factor = function(as) {
  factor(as, levels=algos.ordered, ordered=TRUE)
}
marginal.table = ldply(algos.ordered,
                       function(algo) {
                           message("finding marginal improvements from ", algo)
                           remaining = remove.correct(test.errors, algo)
                           rem.tall = melt(select_(remaining, paste("-", algo, sep="")),
                                           id.vars=c("User", "Item"),
                                           variable.name="Algorithm", value.name="Error")
                           improve = remaining %>%
                               select_(paste("-", algo, sep="")) %>%
                               melt(id.vars=c("User", "Item"),
                                    variable.name="Algorithm", value.name="Error") %>%
                               group_by(Algorithm) %>%
                               summarise(Good = sum(abs(Error) <= 0.5))
                           data.table(Removed=algo.factor(algo),
                                      Algorithm=algo.factor(improve$Algorithm),
                                      Good=improve$Good)
                        })
marginal.matrix = acast(marginal.table, Removed ~ Algorithm)
marginal.matrix
finding marginal improvements from ItemItem
removing algorithm ItemItem
finding marginal improvements from FunkSVD
removing algorithm FunkSVD
finding marginal improvements from UserUser
removing algorithm UserUser
finding marginal improvements from Lucene
removing algorithm Lucene
finding marginal improvements from Mean
removing algorithm Mean
Using Good as value column: use value.var to override.
Out[64]:
ItemItemFunkSVDUserUserLuceneMean
ItemItem NA15540187991914417295
FunkSVD17096 NA195502158216541
UserUser2239021585 NA2337616491
Lucene269332781527574 NA18612
Mean29642273322524723170 NA

Incremental Marginal Benefit

Now, we are going to do this iteratively - remove the best algorithm's predictions, then the one that is the best on the remaining predictions, etc.

First, we need a function to perform ths cumulative removal. This function takes several parameters:

tbl : The error table to work on

use : The algorithms to start with, if we want to start with something other than the best.

algos.left : The remaining algorithms (callers should never set this, used only for recursive calls).

thresh : The error threshold

In [65]:
cum.marginal.good = function(tbl, use=NULL, algos.left=algorithms, thresh=0.5) {
    if (length(algos.left) == 0) {
        # No algorithms left - finish up
        tbl.sum = summarise(group_by(tbl, User, Item), n=n())
        data.frame(Algorithm = 'Unclaimed', Good = nrow(tbl.sum))
    } else {
        # make formulas to summarize each algorithm's column
        formulas = lapply(algos.left, function(algo) {
            interp(~sum(abs(algo) <= thresh), algo=as.name(algo), thresh=thresh)
        })
        # pass it off to summarise_, getting a summary table
        algo.good = as.data.frame(do.call(summarise_, c(list(tbl), setNames(formulas, algos.left))))
        if (length(use) == 0) {
            cur.algo = with(melt(algo.good, id.vars=c()), variable[which.max(value)])
            use.next = NULL
        } else {
            cur.algo = use[1]
            use.next = tail(use, -1)
        }
        left.next = setdiff(algos.left, cur.algo)
        row = data.table(Algorithm = cur.algo,
                         Good = algo.good[[cur.algo]])
        tbl.next = remove.correct(tbl, cur.algo, thresh=thresh)
        rbind(row,
              cum.marginal.good(tbl.next, use=use.next, algos.left=left.next, thresh=thresh))
    }
}

A function to enhance the output with some additional stats.

In [66]:
cum.table.enhance = function(tbl, ntotal = test.rating.count) {
  mutate(tbl, Frac = Good / ntotal, CumFrac = cumsum(Frac))
}

Now, run and pick the best!

In [67]:
cum.best = cum.table.enhance(cum.marginal.good(test.errors))
cum.best
removing algorithm ItemItem
removing algorithm Lucene
removing algorithm UserUser
removing algorithm FunkSVD
removing algorithm Mean
Out[67]:
AlgorithmGoodFracCumFrac
1ItemItem1131810.10147580.1014758
2Lucene191440.017164120.1186399
3UserUser106860.0095808490.1282207
4FunkSVD51450.0046129020.1328336
5Mean26330.0023606940.1351943
6Unclaimed722810.064805670.2

Let's try removing Funk-SVD right after Item-Item.

In [68]:
cum.best.ii.svd = cum.table.enhance(cum.marginal.good(test.errors,
                                                      use=c("ItemItem", "FunkSVD")))
cum.best.ii.svd
removing algorithm ItemItem
removing algorithm FunkSVD
removing algorithm Lucene
removing algorithm UserUser
removing algorithm Mean
Out[68]:
AlgorithmGoodFracCumFrac
1ItemItem1131810.10147580.1014758
2FunkSVD155400.013932850.1154086
3Lucene124220.011137310.1265459
4UserUser70130.0062877120.1328336
5Mean26330.0023606940.1351943
6Unclaimed722810.064805670.2

Let's try with Funk-SVD first.

In [69]:
cum.best.svd = cum.table.enhance(cum.marginal.good(test.errors,
                                                   use=c("FunkSVD")))
cum.best.svd
removing algorithm FunkSVD
removing algorithm Lucene
removing algorithm UserUser
removing algorithm ItemItem
removing algorithm Mean
Out[69]:
AlgorithmGoodFracCumFrac
1FunkSVD1116250.10008070.1000807
2Lucene215820.019349980.1194307
3UserUser101160.0090697990.1285005
4ItemItem48330.0043331690.1328336
5Mean26330.0023606940.1351943
6Unclaimed722810.064805670.2
In [70]:
cum.best.svd.ii = cum.table.enhance(cum.marginal.good(test.errors,
                                                      use=c("FunkSVD", "ItemItem")))
cum.best.svd.ii
removing algorithm FunkSVD
removing algorithm ItemItem
removing algorithm Lucene
removing algorithm UserUser
removing algorithm Mean
Out[70]:
AlgorithmGoodFracCumFrac
1FunkSVD1116250.10008070.1000807
2ItemItem170960.015327920.1154086
3Lucene124220.011137310.1265459
4UserUser70130.0062877120.1328336
5Mean26330.0023606940.1351943
6Unclaimed722810.064805670.2

And user-user after FunkSVD.

In [71]:
cum.best.svd.uu = cum.table.enhance(cum.marginal.good(test.errors,
                                                      use=c("FunkSVD", "UserUser")))
cum.best.svd.uu
removing algorithm FunkSVD
removing algorithm UserUser
removing algorithm Lucene
removing algorithm ItemItem
removing algorithm Mean
Out[71]:
AlgorithmGoodFracCumFrac
1FunkSVD1116250.10008070.1000807
2UserUser195500.017528130.1176088
3Lucene121480.010891650.1285005
4ItemItem48330.0043331690.1328336
5Mean26330.0023606940.1351943
6Unclaimed722810.064805670.2

Probe Switching Hybrid

We're now going to look at the probe switching hybrid, which uses our probe ratings to pick the best predictor for each user.

First we need to identify the best algorithms for each user on the probe users. We'll start by summarizing the probe errors.

In [72]:
probe.user.errors = probe.preds %>%
    group_by(User, Algorithm) %>%
    summarise(MAE = mean(abs(Error)),
              RMSE = sqrt(mean(Error * Error)),
              Correct = sum(abs(Error) <= 0.5))

Now pick the best for each user.

In [73]:
probe.user.best.rmse = summarise(group_by(probe.user.errors, User),
                                 Algorithm=Algorithm[which.min(RMSE)],
                                 RMSE=min(RMSE))
probe.user.best.correct = summarise(group_by(probe.user.errors, User),
                                    Algorithm=Algorithm[which.max(Correct)],
                                    Correct=max(Correct))

Now we will merge these results with the user-best results.

In [74]:
user.best.rmse.merged = mutate(merge(user.best, probe.user.best.rmse, 
                                     by='User', suffixes=c('.user', '.probe')),
                               Agree = Algorithm.user == Algorithm.probe)
user.best.correct.merged = mutate(merge(user.best.correct, probe.user.best.correct, 
                                        by='User', suffixes=c('.user', '.probe')),
                                  Agree = Algorithm.user == Algorithm.probe)
In [75]:
head(user.best.rmse.merged)
Out[75]:
UserAlgorithm.userRMSE.userAlgorithm.probeRMSE.probeAgree
15UserUser0.9506982UserUser0.9506982TRUE
27FunkSVD0.6961736FunkSVD0.6961736TRUE
38ItemItem0.5141469ItemItem0.5141469TRUE
410FunkSVD0.595932FunkSVD0.595932TRUE
511FunkSVD0.7442435FunkSVD0.7442435TRUE
612UserUser0.4745755UserUser0.4745755TRUE

Probe Linear Hybrid

We'll now train a linear hybrid of our algorithms, using the probe ratings.

In [76]:
norm.predictions = mutate(predictions,
                          UserUser = UserUser - Mean,
                          ItemItem = ItemItem - Mean,
                          Lucene = Lucene - Mean,
                          FunkSVD = FunkSVD - Mean)
blend.model = lm(Rating ~ Mean + UserUser + ItemItem + Lucene + FunkSVD,
                 # get the probe predictions
                 filter.pairs(norm.predictions, FALSE))
summary(blend.model)
Joining by: c("User", "Item")
Out[76]:
Call:
lm(formula = Rating ~ Mean + UserUser + ItemItem + Lucene + FunkSVD, 
    data = filter.pairs(norm.predictions, FALSE))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1363 -0.4488  0.0588  0.5164  3.9281 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.168266   0.011360   14.81   <2e-16 ***
Mean        0.946768   0.003154  300.16   <2e-16 ***
UserUser    0.253858   0.007251   35.01   <2e-16 ***
ItemItem    0.409410   0.007410   55.25   <2e-16 ***
Lucene      0.104320   0.004987   20.92   <2e-16 ***
FunkSVD     0.561280   0.007849   71.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7939 on 223064 degrees of freedom
Multiple R-squared:  0.4328,	Adjusted R-squared:  0.4328 
F-statistic: 3.405e+04 on 5 and 223064 DF,  p-value: < 2.2e-16

And use that to generate predictions.

In [77]:
blend.preds = with(new.env(), {
  table = filter.pairs(norm.predictions)
  message("Making ", nrow(table), " predictions")
  pvec = predict(blend.model, table)
  pvec = pmax(pvec, 0.5)
  pvec = pmin(pvec, 5)
  mutate(data.table(select(table, User, Item, Rating),
                    Prediction=pvec),
         Error = Rating - Prediction)
})
head(blend.preds)
Joining by: c("User", "Item")
Making 223070 predictions
Out[77]:
UserItemRatingPredictionError
155243.5561470.4438525
2532133.964859-0.9648588
3553253.1800121.819988
4556254.1388030.8611973
55119954.2428370.7571629
675044.163729-0.1637288

User Characteristics

For this part of the analysis, we will examine per-user characteristics and try to use them to predict various kinds of models.

First, we must summarize the user data from the training ratings.

In [78]:
user.info = train.ratings %>%
    group_by(User) %>%
    summarise(RatingCount = n(),
              MeanRating = mean(Rating),
              RatingVar = var(Rating)) %>%
    mutate(LogCount = log10(RatingCount))
user.count.all = nrow(user.info)
user.count.usable = nrow(user.best)

Predict Item-Item Best

If we ignore the FunkSVD algorithm, can we predict whether Item-Item will be the best algorithm?

Prepare the model training data:

In [79]:
user.best.nosvd = summarise(group_by(filter(errors.by.user, Algorithm != 'FunkSVD'),
                                     User),
                            BestAlgo=Algorithm[which.min(RMSE)],
                            BestRMSE = min(RMSE))
user.best.data = inner_join(user.info, user.best.nosvd) %>%
    mutate(IIBest = BestAlgo == 'ItemItem')
summary(user.best.data$IIBest)
Joining by: "User"
Out[79]:
   Mode   FALSE    TRUE    NA's 
logical   28062   16552       0 

Then build the model:

In [85]:
user.best.test = sample_frac(user.best.data, size=0.2)
setkey(user.best.test, User)
user.best.train = user.best.data[!user.best.test]
user.best.model = glm(IIBest ~ LogCount + RatingVar,
                      data=user.best.train,
                      family=binomial())
summary(user.best.model)
Out[85]:
Call:
glm(formula = IIBest ~ LogCount + RatingVar, family = binomial(), 
    data = user.best.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2641  -0.9686  -0.9193   1.3817   1.5446  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.23282    0.07059 -17.466  < 2e-16 ***
LogCount     0.23062    0.03148   7.326 2.37e-13 ***
RatingVar    0.24592    0.02502   9.829  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 47114  on 35690  degrees of freedom
Residual deviance: 46967  on 35688  degrees of freedom
AIC: 46973

Number of Fisher Scoring iterations: 4

Plot the performance (ROC) curve:

In [86]:
user.best.preds = prediction(predict(user.best.model, user.best.test), 
                             user.best.test$IIBest)
options(repr.plot.height=5)
plot(performance(user.best.preds, measure='tpr', x.measure='fpr'))

And compute the area under the curve:

In [87]:
user.best.auc = performance(user.best.preds, 'auc')
print(user.best.auc@y.values)
[[1]]
[1] 0.5320658

Predict Item-Item Better

Let's try to predict when item-item will be better than user-user for a particular user.

Again, prepare the data first.

In [88]:
user.wide.rmse = dcast(select(errors.by.user, User, Algorithm, RMSE),
                       User ~ Algorithm, value.var='RMSE')
user.wide.correct = dcast(select(errors.by.user, User, Algorithm, Correct),
                          User ~ Algorithm, value.var='Correct')
user.errors.wide = data.table(merge(user.wide.rmse, user.wide.correct, 
                                    by='User', suffixes=c('.RMSE', '.Correct')),
                              key='User')
user.ii.uu.data = mutate(user.info[user.errors.wide],
                         IIBest.RMSE = ItemItem.RMSE <= UserUser.RMSE,
                         IIBest.Correct = ItemItem.Correct >= UserUser.Correct)
summary(mutate(select(user.ii.uu.data, starts_with('IIBest.')),
               Agree = IIBest.RMSE == IIBest.Correct))
Out[88]:
 IIBest.RMSE     IIBest.Correct    Agree        
 Mode :logical   Mode :logical   Mode :logical  
 FALSE:18923     FALSE:7859      FALSE:17062    
 TRUE :25691     TRUE :36755     TRUE :27552    
 NA's :0         NA's :0         NA's :0        

Train the model, using user RMSE as the selection strategy:

In [89]:
user.ii.uu.test = sample_frac(user.ii.uu.data, size=0.2)
setkey(user.ii.uu.test, User)
user.ii.uu.train = user.ii.uu.data[!user.ii.uu.test]
user.ii.uu.model = glm(IIBest.RMSE ~ LogCount + RatingVar,
                      data=user.ii.uu.train,
                      family=binomial())
summary(user.ii.uu.model)
Out[89]:
Call:
glm(formula = IIBest.RMSE ~ LogCount + RatingVar, family = binomial(), 
    data = user.ii.uu.train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.438  -1.297   1.023   1.059   1.114  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.07538    0.06920  -1.089 0.276010    
LogCount     0.14137    0.03107   4.550 5.38e-06 ***
RatingVar    0.09025    0.02470   3.654 0.000258 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 48683  on 35690  degrees of freedom
Residual deviance: 48650  on 35688  degrees of freedom
AIC: 48656

Number of Fisher Scoring iterations: 4

ROC curve:

In [90]:
user.ii.uu.preds = prediction(predict(user.ii.uu.model, user.ii.uu.test), 
                              user.ii.uu.test$IIBest.RMSE)
plot(performance(user.ii.uu.preds, measure='tpr', x.measure='fpr'))

Area under the curve:

In [91]:
user.ii.uu.auc = performance(user.ii.uu.preds, 'auc')
print(user.ii.uu.auc@y.values)
[[1]]
[1] 0.5270254

Another model, using # Correct as the selection strategy.

In [92]:
user.ii.uu.cor.model = glm(IIBest.Correct ~ LogCount + MeanRating,
                           data=user.ii.uu.train,
                           family=binomial())
summary(user.ii.uu.cor.model)
Out[92]:
Call:
glm(formula = IIBest.Correct ~ LogCount + MeanRating, family = binomial(), 
    data = user.ii.uu.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4855   0.5125   0.5950   0.6557   0.9359  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.52705    0.17165  26.373  < 2e-16 ***
LogCount    -0.19149    0.04155  -4.608 4.06e-06 ***
MeanRating  -0.71322    0.03577 -19.941  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33287  on 35690  degrees of freedom
Residual deviance: 32872  on 35688  degrees of freedom
AIC: 32878

Number of Fisher Scoring iterations: 4

ROC curve:

In [93]:
user.ii.uu.cor.preds = prediction(predict(user.ii.uu.cor.model, user.ii.uu.test), 
                                  user.ii.uu.test$IIBest.Correct)
plot(performance(user.ii.uu.cor.preds, measure='tpr', x.measure='fpr'))

AUC:

In [94]:
user.ii.uu.cor.auc = performance(user.ii.uu.cor.preds, 'auc')
print(user.ii.uu.cor.auc@y.values)
[[1]]
[1] 0.5629184

Compute and Plot Errors

Now we're going to take the errors from our various models and pull them together into a single chart of errors.

Start with a helper function:

In [95]:
compute.metrics = function(tbl, thresh=0.5) {
  per.user = summarise(group_by(tbl, User, Algorithm),
                       RMSE=sqrt(mean(Error*Error)),
                       SSE=sum(Error*Error),
                       NCorrect = sum(abs(Error) <= thresh),
                       n=n())
  summarise(group_by(per.user, Algorithm),
            RMSE.ByUser = mean(RMSE),
            RMSE.Global = sqrt(sum(SSE) / sum(n)),
            Correct.ByUser = mean(NCorrect / n),
            Correct.Global = sum(NCorrect) / sum(n))
}

Summarize error of the individual algorithms:

In [96]:
algos.rmse = data.table(Family='Single',
                        compute.metrics(test.preds))
algos.rmse
Out[96]:
FamilyAlgorithmRMSE.ByUserRMSE.GlobalCorrect.ByUserCorrect.Global
1SingleFunkSVD0.7429460.8056410.50040350.5004035
2SingleItemItem0.74550470.81155750.50737880.5073788
3SingleLucene0.81058780.88068370.47246160.4724616
4SingleMean0.81505930.8812080.45202850.4520285
5SingleUserUser0.77182620.84007860.49128080.4912808

And of the linear blend:

In [97]:
blend.rmse = data.table(Family='Blend',
                        compute.metrics(mutate(blend.preds, Algorithm='Blend')))
blend.rmse
Out[97]:
FamilyAlgorithmRMSE.ByUserRMSE.GlobalCorrect.ByUserCorrect.Global
1BlendBlend0.73049890.7938230.51250730.5125073

Of our oracle hybrid that picks the best predictor for each individual prediciton:

In [98]:
best.rmse = data.table(Family='Oracle',
                       compute.metrics(mutate(best.preds, Algorithm='BestPred')))
best.rmse
Out[98]:
FamilyAlgorithmRMSE.ByUserRMSE.GlobalCorrect.ByUserCorrect.Global
1OracleBestPred0.55581370.62701640.67597170.6759717

Of each of our various per-user selection methods:

In [99]:
per.user.rmse = with(new.env(), {
  preds = rbind(mutate(merge(test.preds, select(user.best, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserBestRMSE'),
                mutate(merge(test.preds, select(probe.user.best.rmse, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneBestRMSE'),
                mutate(merge(test.preds, select(user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserMostRight'),
                mutate(merge(test.preds, select(probe.user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneMostRight'))
  data.table(Family='Per User', compute.metrics(preds))
})
per.user.rmse
Out[99]:
FamilyAlgorithmRMSE.ByUserRMSE.GlobalCorrect.ByUserCorrect.Global
1Per UserUserBestRMSE0.65560890.72003690.56176090.5617609
2Per UserTuneBestRMSE0.65560890.72003690.56176090.5617609
3Per UserUserMostRight0.72762370.79691770.58261080.5826108
4Per UserTuneMostRight0.71812360.78981310.62296590.6229659

Pull things together!

In [100]:
all.rmse = rbind(algos.rmse, blend.rmse, best.rmse, per.user.rmse) %>%
    mutate(Family=factor(Family, levels=c("Single", "Blend", "Per User", "Oracle"), ordered=TRUE))
all.metrics = melt(all.rmse, id.vars=c('Family', 'Algorithm'), variable.name='Metric') %>%
    mutate(Group = gsub("\\..*$", "", Metric))
all.metrics
Out[100]:
FamilyAlgorithmMetricvalueGroup
1SingleFunkSVDRMSE.ByUser0.742946RMSE
2SingleItemItemRMSE.ByUser0.7455047RMSE
3SingleLuceneRMSE.ByUser0.8105878RMSE
4SingleMeanRMSE.ByUser0.8150593RMSE
5SingleUserUserRMSE.ByUser0.7718262RMSE
6BlendBlendRMSE.ByUser0.7304989RMSE
7OracleBestPredRMSE.ByUser0.5558137RMSE
8Per UserUserBestRMSERMSE.ByUser0.6556089RMSE
9Per UserTuneBestRMSERMSE.ByUser0.6556089RMSE
10Per UserUserMostRightRMSE.ByUser0.7276237RMSE
11Per UserTuneMostRightRMSE.ByUser0.7181236RMSE
12SingleFunkSVDRMSE.Global0.805641RMSE
13SingleItemItemRMSE.Global0.8115575RMSE
14SingleLuceneRMSE.Global0.8806837RMSE
15SingleMeanRMSE.Global0.881208RMSE
16SingleUserUserRMSE.Global0.8400786RMSE
17BlendBlendRMSE.Global0.793823RMSE
18OracleBestPredRMSE.Global0.6270164RMSE
19Per UserUserBestRMSERMSE.Global0.7200369RMSE
20Per UserTuneBestRMSERMSE.Global0.7200369RMSE
21Per UserUserMostRightRMSE.Global0.7969177RMSE
22Per UserTuneMostRightRMSE.Global0.7898131RMSE
23SingleFunkSVDCorrect.ByUser0.5004035Correct
24SingleItemItemCorrect.ByUser0.5073788Correct
25SingleLuceneCorrect.ByUser0.4724616Correct
26SingleMeanCorrect.ByUser0.4520285Correct
27SingleUserUserCorrect.ByUser0.4912808Correct
28BlendBlendCorrect.ByUser0.5125073Correct
29OracleBestPredCorrect.ByUser0.6759717Correct
30Per UserUserBestRMSECorrect.ByUser0.5617609Correct
31Per UserTuneBestRMSECorrect.ByUser0.5617609Correct
32Per UserUserMostRightCorrect.ByUser0.5826108Correct
33Per UserTuneMostRightCorrect.ByUser0.6229659Correct
34SingleFunkSVDCorrect.Global0.5004035Correct
35SingleItemItemCorrect.Global0.5073788Correct
36SingleLuceneCorrect.Global0.4724616Correct
37SingleMeanCorrect.Global0.4520285Correct
38SingleUserUserCorrect.Global0.4912808Correct
39BlendBlendCorrect.Global0.5125073Correct
40OracleBestPredCorrect.Global0.6759717Correct
41Per UserUserBestRMSECorrect.Global0.5617609Correct
42Per UserTuneBestRMSECorrect.Global0.5617609Correct
43Per UserUserMostRightCorrect.Global0.5826108Correct
44Per UserTuneMostRightCorrect.Global0.6229659Correct

Let's draw our major plot:

In [101]:
ggplot(all.metrics %>% filter(Group == 'RMSE')) +
  aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
  geom_point() +
  facet_grid(Family ~ ., scales='free_y', space='free_y') +
  xlab(NULL) +
  theme(strip.text.y=element_text(angle=0),
        legend.position='bottom', legend.title=element_blank())

We can do the same thing with 'is correct' metrics:

In [102]:
ggplot(all.metrics %>% filter(Group == 'Correct')) +
  aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
  geom_point() +
  facet_grid(Family ~ ., scales='free_y', space='free_y') +
  xlab(NULL) +
  theme(strip.text.y=element_text(angle=0),
        legend.position='bottom', legend.title=element_blank())
In [ ]: