This notebook reproduces the analysis in ‘When Recommenders Fail’.
There is some blending of dplyr and data.table code here. I try to mostly use dplyr, but there are a few places where I do data.table joins directly.
Let us first load some libraries.
library(plyr)
library(data.table)
library(dplyr)
library(reshape2)
library(ggplot2)
library(ROCR)
library(lazyeval)
The following function samples n positions from a vector of length l, returning a logical vector with TRUE at the selected positions. It makes it easier to sample other data structures.
sample.true = function(n, l) {
  # logical vector of length l, TRUE at n randomly chosen positions
  result = vector(length=l)
  result[sample(1:l, n)] = TRUE
  result
}
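A quick illustration (not part of the original analysis): sampling 3 of 10 positions yields a length-10 logical vector containing exactly three TRUE values.
sample.true(3, 10)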
Another little function to check if a file is up to date.
file.current = function(dst, src) {
  # TRUE if dst exists and is at least as new as src (or src is missing)
  if (!file.exists(dst)) {
    FALSE
  } else if (!file.exists(src)) {
    TRUE
  } else {
    file.info(dst)$mtime >= file.info(src)$mtime
  }
}
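For example (an illustrative call, not in the original), checking the cache used in the next step: it returns FALSE if the cache is missing, TRUE if the cache exists but the source does not, and otherwise compares modification times. It is safe to run even if neither file exists.
file.current("build/predictions.Rdata", "build/individual.csv.gz")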
We now want to load the results of our LensKit experiment.
fn.cache = "build/predictions.Rdata"
fn.src = "build/individual.csv.gz"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  preds.tall = data.table(read.csv(gzfile(fn.src)))
  preds.tall = preds.tall[,list(User,Item,Rating,Algorithm,Prediction)]
  setkey(preds.tall, User, Item, Algorithm)
  preds.tall = mutate(preds.tall, Error = Rating - Prediction)
  users = summarize(group_by(preds.tall, User), n=length(unique(Item)))
  users.usable = filter(users, n >= 10)
  preds.tall = preds.tall[users.usable[,list(User)]]
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       preds.tall, users, users.usable)
}
algorithms = levels(preds.tall$Algorithm)
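As an optional sanity check (not in the original analysis), we can confirm that every algorithm produced predictions, and for how many users:
preds.tall %>% group_by(Algorithm) %>% summarise(NPreds = n(), NUsers = n_distinct(User))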
Let's pivot the prediction table to have a column per algorithm.
predictions = preds.tall %>%
select(-Error) %>%
dcast.data.table(User + Item + Rating ~ Algorithm, value.var='Prediction')
setkey(predictions, User, Item)
head(predictions)
We now extract the ratings from the predictions table to get the test ratings.
ratings = select(predictions, User, Item, Rating)
setkey(ratings, User, Item)
For some of our analysis, we need the entire rating history. So let's load it.
fn.cache = "build/ratings.Rdata"
fn.src = "data/ml-10m/ratings.dat"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  all.ratings = read.csv(pipe("sed -e 's/::/,/g' -e 's/,[[:digit:]]*$//' data/ml-10m/ratings.dat"),
                         header=FALSE, col.names=c("User", "Item", "Rating"))
  all.ratings = data.table(all.ratings, key=c('User', 'Item'))
  train.ratings = all.ratings[!ratings]
  rm(all.ratings)
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       train.ratings)
}
Now that the data is loaded, we need to do some processing of it.
First step: pulling apart probe ratings (for training hybrids) and test ratings (for testing everything).
LensKit produced a set of predictions of test items for each user. We pick 5 of those items for each user as probe items; the remaining items are the actual test items for the rest of the analysis.
To do this, we will first identify probe pairs: for each user, 5 random items. This will be stored in a frame test.pair.purpose.
test.pair.purpose = ratings %>% select(User, Item) %>% group_by(User) %>% mutate(IsTest=sample(n()) > 5)
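A quick optional check (not in the original): every usable user should end up with exactly 5 probe (non-test) pairs.
table(summarise(group_by(test.pair.purpose, User), NProbe = sum(!IsTest))$NProbe)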
To make this easier to use, we will create a function to filter a table down to test or probe ratings. If want.test is TRUE (the default), it keeps the test ratings; otherwise, the probe ratings.
filter.pairs = function(tbl, want.test=TRUE) {
  inner_join(tbl, test.pair.purpose) %>% filter(IsTest == want.test) %>% select(-IsTest)
}
We will then use filter.pairs to select the predictions for the probe pairs as the probe predictions, and the rest as the test predictions.
probe.preds = filter.pairs(preds.tall, FALSE)
test.preds = filter.pairs(preds.tall)
test.rating.count = nrow(test.preds)
dim(probe.preds)
dim(test.preds)
Now, we will convert the predictions table to an errors table. This will include both test and probe predictions.
errors.full = dcast.data.table(select(preds.tall, -Rating, -Prediction),
                               User + Item ~ Algorithm, value.var='Error')
head(errors.full)
test.errors = dcast.data.table(select(test.preds, -Rating, -Prediction),
                               User + Item ~ Algorithm, value.var='Error')
head(test.errors)
And we will process our test predictions to find out the algorithm that makes the best prediction for each test rating.
best.preds = test.preds %>%
group_by(User, Item) %>%
summarise(Algorithm=Algorithm[which.min(abs(Error))],
Prediction=Prediction[which.min(abs(Error))],
# Rating should be same value. We'll fail if not.
Rating=unique(Rating)) %>%
mutate(Error = Rating - Prediction)
setkey(best.preds, User, Item)
head(best.preds)
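Sanity check (optional, not in the original): best.preds should contain exactly one row per test (user, item) pair.
nrow(best.preds) == nrow(test.errors)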
Summarize these results:
best.algos.pred = best.preds %>%
group_by(Algorithm) %>%
summarise(Count=n()) %>%
mutate(By='Prediction')
best.algos.pred
Now, we will compute per-user error (rather than per item). We will then see how often each algorithm is best by user RMSE or by user # correct.
First, we group things by user.
errors.by.user = test.preds %>%
  group_by(User, Algorithm) %>%
  summarise(MAE = mean(abs(Error)),
            RMSE = sqrt(mean(Error*Error)),
            Correct = sum(abs(Error) <= 0.5),
            NPreds = n()) %>%
  mutate(FracCorrect = Correct / NPreds)
head(errors.by.user)
Next, we pick the best algorithm for each user by two different metrics (RMSE and # correct).
user.best = errors.by.user %>% group_by(User) %>%
summarise(Algorithm=Algorithm[which.min(RMSE)],
RMSE=min(RMSE))
user.best.correct = errors.by.user %>% group_by(User) %>%
summarise(Algorithm=Algorithm[which.max(Correct)],
Correct=max(Correct))
And add it to the best algorithms table.
best.algos = rbind(best.algos.pred %>% mutate(Frac=Count / sum(Count)),
user.best %>% group_by(Algorithm) %>%
summarise(Count=n()) %>% mutate(By='User RMSE', Frac=Count / sum(Count)),
user.best.correct %>% group_by(Algorithm) %>%
summarise(Count=n()) %>% mutate(By='User # Correct', Frac=Count / sum(Count)))
best.algos$By=as.factor(best.algos$By)
best.algos
First, what algorithms are best by each metric?
options(repr.plot.width=7, repr.plot.height=3)
ggplot(best.algos) +
aes(x=Algorithm, y=Frac) +
geom_bar(stat='identity') +
facet_wrap(~ By) +
theme(axis.text.x=element_text(angle=45, vjust=0.5)) +
ylab("Frac. of Preds/Users")
How correlated are the errors of our predictors?
errors.test = filter.pairs(errors.full)
error.cor.matrix = cor(select(errors.test, -User, -Item))
error.cor.matrix
How often is each algorithm correct?
algo.correct = summarise(group_by(test.preds, Algorithm),
N.Good.05 = sum(abs(Error) <= 0.5),
N.Good.07 = sum(abs(Error) <= 0.75),
N.Good.10 = sum(abs(Error) <= 1.0),
Good.05 = mean(abs(Error) <= 0.5),
Good.07 = mean(abs(Error) <= 0.75),
Good.10 = mean(abs(Error) <= 1.0))
correct.tall = melt(select(algo.correct, Algorithm, starts_with('Good')),
id.vars='Algorithm')
correct.tall = mutate(correct.tall, Thresh = c('ε ≤ 0.5', 'ε ≤ 0.75', 'ε ≤ 1.0')[as.integer(variable)])
ggplot(correct.tall) +
aes(x=Algorithm, y=value) +
geom_bar(stat='identity') +
facet_wrap(~ Thresh) +
ylab("Fraction Correct") + xlab(NULL) +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
Let's rank the algorithms by # correct at the 0.5 threshold.
algo.success = algo.correct %>%
select(Algorithm, N.Good.05, Good.05) %>%
arrange(-Good.05)
algo.success
How often do binary accuracy and RMSE pick the same 'best' algorithm for each user? Each panel represents users for which RMSE picked that algorithm to be best, and the bars indicate the fraction of those users for which '# Correct' picked each algorithm to be best.
user.picked = merge(user.best, user.best.correct, by='User',
suffixes=c('.RMSE', '.Correct'))
user.picked.summary = ddply(user.picked, .(Algorithm.RMSE), function(adf) {
result = summarise(group_by(adf, Algorithm.Correct), Count=n())
mutate(result, Frac=Count / sum(Count))
})
ggplot(user.picked.summary) +
aes(x=Algorithm.Correct, y=Frac) +
geom_bar(stat='identity') +
facet_grid(~ Algorithm.RMSE) +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))
Now, we want to analyze the marginal benefit: how much does one algorithm get correct, after accounting for the correct predictions from other algorithms?
First up: we need a function to identify the ratings that a particular algorithm got correct, and remove them so we can test the next algorithm(s).
remove.correct = function(err.tbl, algo, thresh=0.5) {
  message("removing algorithm ", algo)
  # build a formula selecting rows where the error of algo exceeds the threshold
  sel = interp(~abs(algo) > thresh, algo=as.name(as.character(algo)), thresh=thresh)
  filter_(err.tbl, sel)
}
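For example (an illustrative call, not in the original), removing the test ratings that ItemItem predicts within 0.5 stars leaves only the rows ItemItem missed:
dim(remove.correct(test.errors, 'ItemItem'))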
With this function, we are going to create a marginal improvement matrix: in turn, remove each algorithm (the rows). For each other algorithm, compute how many predictions the first algorithm (A) missed but the second algorithm (B) got correct.
algos.ordered = as.character(algo.success$Algorithm)
algo.factor = function(as) {
factor(as, levels=algos.ordered, ordered=TRUE)
}
marginal.table = ldply(algos.ordered, function(algo) {
  message("finding marginal improvements from ", algo)
  remaining = remove.correct(test.errors, algo)
  improve = remaining %>%
    select_(paste("-", algo, sep="")) %>%
    melt(id.vars=c("User", "Item"),
         variable.name="Algorithm", value.name="Error") %>%
    group_by(Algorithm) %>%
    summarise(Good = sum(abs(Error) <= 0.5))
  data.table(Removed=algo.factor(algo),
             Algorithm=algo.factor(improve$Algorithm),
             Good=improve$Good)
})
marginal.matrix = acast(marginal.table, Removed ~ Algorithm, value.var='Good')
marginal.matrix
Now, we are going to do this iteratively - remove the best algorithm's predictions, then the one that is the best on the remaining predictions, etc.
First, we need a function to perform this cumulative removal. This function takes several parameters:
tbl : The error table to work on
use : The algorithms to start with, if we want to start with something other than the best.
algos.left : The remaining algorithms (callers should never set this, used only for recursive calls).
thresh : The error threshold
cum.marginal.good = function(tbl, use=NULL, algos.left=algorithms, thresh=0.5) {
  if (length(algos.left) == 0) {
    # No algorithms left - finish up
    tbl.sum = summarise(group_by(tbl, User, Item), n=n())
    data.frame(Algorithm = 'Unclaimed', Good = nrow(tbl.sum))
  } else {
    # make formulas to summarize each algorithm's column
    formulas = lapply(algos.left, function(algo) {
      interp(~sum(abs(algo) <= thresh), algo=as.name(algo), thresh=thresh)
    })
    # pass it off to summarise_, getting a summary table
    algo.good = as.data.frame(do.call(summarise_, c(list(tbl), setNames(formulas, algos.left))))
    if (length(use) == 0) {
      cur.algo = with(melt(algo.good, id.vars=c()), variable[which.max(value)])
      use.next = NULL
    } else {
      cur.algo = use[1]
      use.next = tail(use, -1)
    }
    left.next = setdiff(algos.left, cur.algo)
    row = data.table(Algorithm = cur.algo,
                     Good = algo.good[[cur.algo]])
    tbl.next = remove.correct(tbl, cur.algo, thresh=thresh)
    rbind(row,
          cum.marginal.good(tbl.next, use=use.next, algos.left=left.next, thresh=thresh))
  }
}
A function to enhance the output with some additional statistics: the fraction of test ratings each algorithm claims, and the cumulative fraction claimed so far.
cum.table.enhance = function(tbl, ntotal = test.rating.count) {
  mutate(tbl, Frac = Good / ntotal, CumFrac = cumsum(Frac))
}
Now, run and pick the best!
cum.best = cum.table.enhance(cum.marginal.good(test.errors))
cum.best
Let's try removing Funk-SVD right after Item-Item.
cum.best.ii.svd = cum.table.enhance(cum.marginal.good(test.errors,
use=c("ItemItem", "FunkSVD")))
cum.best.ii.svd
Let's try with Funk-SVD first.
cum.best.svd = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD")))
cum.best.svd
cum.best.svd.ii = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD", "ItemItem")))
cum.best.svd.ii
And user-user after FunkSVD.
cum.best.svd.uu = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD", "UserUser")))
cum.best.svd.uu
We're now going to look at the probe switching hybrid, which uses our probe ratings to pick the best predictor for each user.
First we need to identify the best algorithm for each user on the probe ratings. We'll start by summarizing the probe errors.
probe.user.errors = probe.preds %>%
group_by(User, Algorithm) %>%
summarise(MAE = mean(abs(Error)),
RMSE = sqrt(mean(Error * Error)),
Correct = sum(abs(Error) <= 0.5))
Now pick the best for each user.
probe.user.best.rmse = summarise(group_by(probe.user.errors, User),
Algorithm=Algorithm[which.min(RMSE)],
RMSE=min(RMSE))
probe.user.best.correct = summarise(group_by(probe.user.errors, User),
Algorithm=Algorithm[which.max(Correct)],
Correct=max(Correct))
Now we will merge these results with the user-best results.
user.best.rmse.merged = mutate(merge(user.best, probe.user.best.rmse,
by='User', suffixes=c('.user', '.probe')),
Agree = Algorithm.user == Algorithm.probe)
user.best.correct.merged = mutate(merge(user.best.correct, probe.user.best.correct,
by='User', suffixes=c('.user', '.probe')),
Agree = Algorithm.user == Algorithm.probe)
head(user.best.rmse.merged)
We'll now train a linear hybrid of our algorithms, using the probe ratings. We first express each personalized algorithm's prediction as an offset from the Mean predictor.
norm.predictions = mutate(predictions,
UserUser = UserUser - Mean,
ItemItem = ItemItem - Mean,
Lucene = Lucene - Mean,
FunkSVD = FunkSVD - Mean)
blend.model = lm(Rating ~ Mean + UserUser + ItemItem + Lucene + FunkSVD,
# get the probe predictions
filter.pairs(norm.predictions, FALSE))
summary(blend.model)
And use that model to generate predictions for the test pairs, clamping them to the valid rating range.
blend.preds = with(new.env(), {
  table = filter.pairs(norm.predictions)
  message("Making ", nrow(table), " predictions")
  pvec = predict(blend.model, table)
  pvec = pmax(pvec, 0.5)
  pvec = pmin(pvec, 5)
  mutate(data.table(select(table, User, Item, Rating),
                    Prediction=pvec),
         Error = Rating - Prediction)
})
head(blend.preds)
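As a rough check (the full comparison across methods comes later), the overall RMSE of the blend on the test pairs:
sqrt(mean(blend.preds$Error^2))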
For this part of the analysis, we will examine per-user characteristics and try to use them, with various kinds of models, to predict which algorithm will work best for each user.
First, we must summarize the user data from the training ratings.
user.info = train.ratings %>%
group_by(User) %>%
summarise(RatingCount = n(),
MeanRating = mean(Rating),
RatingVar = var(Rating)) %>%
mutate(LogCount = log10(RatingCount))
user.count.all = nrow(user.info)
user.count.usable = nrow(user.best)
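Before modeling, it can be useful to glance at the distributions of these per-user features (an optional step, not in the original):
summary(select(user.info, RatingCount, MeanRating, RatingVar, LogCount))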
If we ignore the FunkSVD algorithm, can we predict whether Item-Item will be the best algorithm?
Prepare the model training data:
user.best.nosvd = summarise(group_by(filter(errors.by.user, Algorithm != 'FunkSVD'),
User),
BestAlgo=Algorithm[which.min(RMSE)],
BestRMSE = min(RMSE))
user.best.data = inner_join(user.info, user.best.nosvd) %>%
mutate(IIBest = BestAlgo == 'ItemItem')
summary(user.best.data$IIBest)
Then build the model:
user.best.test = sample_frac(user.best.data, size=0.2)
setkey(user.best.test, User)
user.best.train = user.best.data[!user.best.test]
user.best.model = glm(IIBest ~ LogCount + RatingVar,
data=user.best.train,
family=binomial())
summary(user.best.model)
Plot the performance (ROC) curve:
user.best.preds = prediction(predict(user.best.model, user.best.test),
user.best.test$IIBest)
options(repr.plot.height=5)
plot(performance(user.best.preds, measure='tpr', x.measure='fpr'))
And compute the area under the curve:
user.best.auc = performance(user.best.preds, 'auc')
print(user.best.auc@y.values)
Let's try to predict when item-item will be better than user-user for a particular user.
Again, prepare the data first.
user.wide.rmse = dcast(select(errors.by.user, User, Algorithm, RMSE),
User ~ Algorithm, value.var='RMSE')
user.wide.correct = dcast(select(errors.by.user, User, Algorithm, Correct),
User ~ Algorithm, value.var='Correct')
user.errors.wide = data.table(merge(user.wide.rmse, user.wide.correct,
by='User', suffixes=c('.RMSE', '.Correct')),
key='User')
user.ii.uu.data = mutate(user.info[user.errors.wide],
IIBest.RMSE = ItemItem.RMSE <= UserUser.RMSE,
IIBest.Correct = ItemItem.Correct >= UserUser.Correct)
summary(mutate(select(user.ii.uu.data, starts_with('IIBest.')),
Agree = IIBest.RMSE == IIBest.Correct))
Train the model, using user RMSE as the selection strategy:
user.ii.uu.test = sample_frac(user.ii.uu.data, size=0.2)
setkey(user.ii.uu.test, User)
user.ii.uu.train = user.ii.uu.data[!user.ii.uu.test]
user.ii.uu.model = glm(IIBest.RMSE ~ LogCount + RatingVar,
data=user.ii.uu.train,
family=binomial())
summary(user.ii.uu.model)
ROC curve:
user.ii.uu.preds = prediction(predict(user.ii.uu.model, user.ii.uu.test),
user.ii.uu.test$IIBest.RMSE)
plot(performance(user.ii.uu.preds, measure='tpr', x.measure='fpr'))
Area under the curve:
user.ii.uu.auc = performance(user.ii.uu.preds, 'auc')
print(user.ii.uu.auc@y.values)
Another model, using # Correct as the selection strategy.
user.ii.uu.cor.model = glm(IIBest.Correct ~ LogCount + MeanRating,
data=user.ii.uu.train,
family=binomial())
summary(user.ii.uu.cor.model)
ROC curve:
user.ii.uu.cor.preds = prediction(predict(user.ii.uu.cor.model, user.ii.uu.test),
user.ii.uu.test$IIBest.Correct)
plot(performance(user.ii.uu.cor.preds, measure='tpr', x.measure='fpr'))
AUC:
user.ii.uu.cor.auc = performance(user.ii.uu.cor.preds, 'auc')
print(user.ii.uu.cor.auc@y.values)
Now we're going to take the errors from our various models and pull them together into a single chart of errors.
Start with a helper function that computes RMSE and fraction-correct, both averaged per user and pooled globally:
compute.metrics = function(tbl, thresh=0.5) {
  per.user = summarise(group_by(tbl, User, Algorithm),
                       RMSE=sqrt(mean(Error*Error)),
                       SSE=sum(Error*Error),
                       NCorrect = sum(abs(Error) <= thresh),
                       n=n())
  summarise(group_by(per.user, Algorithm),
            RMSE.ByUser = mean(RMSE),
            RMSE.Global = sqrt(sum(SSE) / sum(n)),
            Correct.ByUser = mean(NCorrect / n),
            Correct.Global = sum(NCorrect) / sum(n))
}
Summarize error of the individual algorithms:
algos.rmse = data.table(Family='Single',
compute.metrics(test.preds))
algos.rmse
And of the linear blend:
blend.rmse = data.table(Family='Blend',
compute.metrics(mutate(blend.preds, Algorithm='Blend')))
blend.rmse
Of our oracle hybrid that picks the best predictor for each individual prediction:
best.rmse = data.table(Family='Oracle',
compute.metrics(mutate(best.preds, Algorithm='BestPred')))
best.rmse
Of each of our various per-user selection methods:
per.user.rmse = with(new.env(), {
  preds = rbind(mutate(merge(test.preds, select(user.best, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserBestRMSE'),
                mutate(merge(test.preds, select(probe.user.best.rmse, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneBestRMSE'),
                mutate(merge(test.preds, select(user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserMostRight'),
                mutate(merge(test.preds, select(probe.user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneMostRight'))
  data.table(Family='Per User', compute.metrics(preds))
})
per.user.rmse
Pull things together!
all.rmse = rbind(algos.rmse, blend.rmse, best.rmse, per.user.rmse) %>%
mutate(Family=factor(Family, levels=c("Single", "Blend", "Per User", "Oracle"), ordered=TRUE))
all.metrics = melt(all.rmse, id.vars=c('Family', 'Algorithm'), variable.name='Metric') %>%
mutate(Group = gsub("\\..*$", "", Metric))
all.metrics
Let's draw our major plot:
ggplot(all.metrics %>% filter(Group == 'RMSE')) +
aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
geom_point() +
facet_grid(Family ~ ., scales='free_y', space='free_y') +
xlab(NULL) +
theme(strip.text.y=element_text(angle=0),
legend.position='bottom', legend.title=element_blank())
We can do the same thing with 'is correct' metrics:
ggplot(all.metrics %>% filter(Group == 'Correct')) +
aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
geom_point() +
facet_grid(Family ~ ., scales='free_y', space='free_y') +
xlab(NULL) +
theme(strip.text.y=element_text(angle=0),
legend.position='bottom', legend.title=element_blank())