This notebook reproduces the analysis in ‘When Recommenders Fail’.
There is some blending of dplyr and data.table code here. I try to mostly use dplyr, but there are a few places where I do data.table joins directly.
Let us first load some libraries.
library(plyr)
library(data.table)
library(dplyr)
library(reshape2)
library(ggplot2)
library(ROCR)
library(lazyeval)
The following function samples n positions from a vector of length l, returning a logical vector with TRUE at the selected positions. It makes it easier to sample other data structures.
sample.true = function(n, l) {
  # logical vector of length l, TRUE at n randomly chosen positions
  result = vector(length=l)
  result[sample(1:l, n)] = TRUE
  result
}
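A quick illustration (not part of the original analysis): sampling 3 of 10 positions yields a length-10 logical vector containing exactly three TRUE values.
sample.true(3, 10)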
Another little function to check if a file is up to date.
file.current = function(dst, src) {
  # TRUE if dst exists and is at least as new as src (or src is missing)
  if (!file.exists(dst)) {
    FALSE
  } else if (!file.exists(src)) {
    TRUE
  } else {
    file.info(dst)$mtime >= file.info(src)$mtime
  }
}
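For example (an illustrative call, not in the original), checking the cache used in the next step: it returns FALSE if the cache is missing, TRUE if the cache exists but the source does not, and otherwise compares modification times. It is safe to run even if neither file exists.
file.current("build/predictions.Rdata", "build/individual.csv.gz")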
We now want to load the results of our LensKit experiment.
fn.cache = "build/predictions.Rdata"
fn.src = "build/individual.csv.gz"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  preds.tall = data.table(read.csv(gzfile(fn.src)))
  preds.tall = preds.tall[,list(User,Item,Rating,Algorithm,Prediction)]
  setkey(preds.tall, User, Item, Algorithm)
  preds.tall = mutate(preds.tall, Error = Rating - Prediction)
  users = summarize(group_by(preds.tall, User), n=length(unique(Item)))
  users.usable = filter(users, n >= 10)
  preds.tall = preds.tall[users.usable[,list(User)]]
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       preds.tall, users, users.usable)
}
algorithms = levels(preds.tall$Algorithm)
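As an optional sanity check (not in the original analysis), we can confirm that every algorithm produced predictions, and for how many users:
preds.tall %>% group_by(Algorithm) %>% summarise(NPreds = n(), NUsers = n_distinct(User))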
Let's pivot the prediction table to have a column per algorithm.
predictions = preds.tall %>%
select(-Error) %>%
dcast.data.table(User + Item + Rating ~ Algorithm, value.var='Prediction')
setkey(predictions, User, Item)
head(predictions)
We now extract the ratings from the predictions table to get the test ratings.
ratings = select(predictions, User, Item, Rating)
setkey(ratings, User, Item)
For some of our analysis, we need the entire rating history. So let's load it.
fn.cache = "build/ratings.Rdata"
fn.src = "data/ml-10m/ratings.dat"
if (file.current(fn.cache, fn.src)) {
  message("loading cache file ", fn.cache)
  load(fn.cache)
} else {
  message("loading src file ", fn.src)
  all.ratings = read.csv(pipe("sed -e 's/::/,/g' -e 's/,[[:digit:]]*$//' data/ml-10m/ratings.dat"),
                         header=FALSE, col.names=c("User", "Item", "Rating"))
  all.ratings = data.table(all.ratings, key=c('User', 'Item'))
  train.ratings = all.ratings[!ratings]
  rm(all.ratings)
  message("writing cache file ", fn.cache)
  save(file=fn.cache, compress='xz',
       train.ratings)
}
Now that the data is loaded, we need to do some processing of it.
First step: pulling apart probe ratings (for training hybrids) and test ratings (for testing everything).
LensKit produced a set of predictions of test items for each user. We pick 5 of those items for each user as probe items; the remaining items are the actual test items for the rest of the analysis.
To do this, we will first identify probe pairs: for each user, 5 random items. This will be stored in a frame test.pair.purpose.
test.pair.purpose = ratings %>% select(User, Item) %>% group_by(User) %>% mutate(IsTest=sample(n()) > 5)
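A quick optional check (not in the original): every usable user should end up with exactly 5 probe (non-test) pairs.
table(summarise(group_by(test.pair.purpose, User), NProbe = sum(!IsTest))$NProbe)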
To make this easier to use, we will create a function to filter a table down to test or probe ratings. If want.test is TRUE (the default), it keeps the test ratings; otherwise, the probe ratings.
filter.pairs = function(tbl, want.test=TRUE) {
  inner_join(tbl, test.pair.purpose) %>% filter(IsTest == want.test) %>% select(-IsTest)
}
We will then use filter.pairs to select the predictions for the probe pairs as the probe predictions, and the rest as the test predictions.
probe.preds = filter.pairs(preds.tall, FALSE)
test.preds = filter.pairs(preds.tall)
test.rating.count = nrow(test.preds)
dim(probe.preds)
dim(test.preds)
Now, we will convert the predictions table to an errors table. This will include both test and probe predictions.
errors.full = dcast.data.table(select(preds.tall, -Rating, -Prediction),
                               User + Item ~ Algorithm, value.var='Error')
head(errors.full)
test.errors = dcast.data.table(select(test.preds, -Rating, -Prediction),
                               User + Item ~ Algorithm, value.var='Error')
head(test.errors)
And we will process our test predictions to find out the algorithm that makes the best prediction for each test rating.
best.preds = test.preds %>%
group_by(User, Item) %>%
summarise(Algorithm=Algorithm[which.min(abs(Error))],
Prediction=Prediction[which.min(abs(Error))],
# Rating should be same value. We'll fail if not.
Rating=unique(Rating)) %>%
mutate(Error = Rating - Prediction)
setkey(best.preds, User, Item)
head(best.preds)
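Sanity check (optional, not in the original): best.preds should contain exactly one row per test (user, item) pair.
nrow(best.preds) == nrow(test.errors)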
Summarize these results:
best.algos.pred = best.preds %>%
group_by(Algorithm) %>%
summarise(Count=n()) %>%
mutate(By='Prediction')
best.algos.pred
Now, we will compute per-user error (rather than per item). We will then see how often each algorithm is best by user RMSE or by user # correct.
First, we group things by user.
errors.by.user = test.preds %>%
  group_by(User, Algorithm) %>%
  summarise(MAE = mean(abs(Error)),
            RMSE = sqrt(mean(Error*Error)),
            Correct = sum(abs(Error) <= 0.5),
            NPreds = n()) %>%
  mutate(FracCorrect = Correct / NPreds)
head(errors.by.user)
Next, we pick the best algorithm for each user by two different metrics (RMSE and # correct).
user.best = errors.by.user %>% group_by(User) %>%
summarise(Algorithm=Algorithm[which.min(RMSE)],
RMSE=min(RMSE))
user.best.correct = errors.by.user %>% group_by(User) %>%
summarise(Algorithm=Algorithm[which.max(Correct)],
Correct=max(Correct))
And add it to the best algorithms table.
best.algos = rbind(best.algos.pred %>% mutate(Frac=Count / sum(Count)),
user.best %>% group_by(Algorithm) %>%
summarise(Count=n()) %>% mutate(By='User RMSE', Frac=Count / sum(Count)),
user.best.correct %>% group_by(Algorithm) %>%
summarise(Count=n()) %>% mutate(By='User # Correct', Frac=Count / sum(Count)))
best.algos$By=as.factor(best.algos$By)
best.algos
First, what algorithms are best by each metric?
options(repr.plot.width=7, repr.plot.height=3)
ggplot(best.algos) +
aes(x=Algorithm, y=Frac) +
geom_bar(stat='identity') +
facet_wrap(~ By) +
theme(axis.text.x=element_text(angle=45, vjust=0.5)) +
ylab("Frac. of Preds/Users")
How correlated are the errors of our predictors?
errors.test = filter.pairs(errors.full)
error.cor.matrix = cor(select(errors.test, -User, -Item))
error.cor.matrix
How often is each algorithm correct?
algo.correct = summarise(group_by(test.preds, Algorithm),
N.Good.05 = sum(abs(Error) <= 0.5),
N.Good.07 = sum(abs(Error) <= 0.75),
N.Good.10 = sum(abs(Error) <= 1.0),
Good.05 = mean(abs(Error) <= 0.5),
Good.07 = mean(abs(Error) <= 0.75),
Good.10 = mean(abs(Error) <= 1.0))
correct.tall = melt(select(algo.correct, Algorithm, starts_with('Good')),
id.vars='Algorithm')
correct.tall = mutate(correct.tall, Thresh = c('ε ≤ 0.5', 'ε ≤ 0.75', 'ε ≤ 1.0')[as.integer(variable)])
ggplot(correct.tall) +
aes(x=Algorithm, y=value) +
geom_bar(stat='identity') +
facet_wrap(~ Thresh) +
ylab("Fraction Correct") + xlab(NULL) +
theme(axis.text.x=element_text(angle=45, vjust=0.5))
Let's rank the algorithms by # correct at the 0.5 threshold.
algo.success = algo.correct %>%
select(Algorithm, N.Good.05, Good.05) %>%
arrange(-Good.05)
algo.success
How often do binary accuracy and RMSE pick the same 'best' algorithm for each user? Each panel represents users for which RMSE picked that algorithm to be best, and the bars indicate the fraction of those users for which '# Correct' picked each algorithm to be best.
user.picked = merge(user.best, user.best.correct, by='User',
suffixes=c('.RMSE', '.Correct'))
user.picked.summary = ddply(user.picked, .(Algorithm.RMSE), function(adf) {
result = summarise(group_by(adf, Algorithm.Correct), Count=n())
mutate(result, Frac=Count / sum(Count))
})
ggplot(user.picked.summary) +
aes(x=Algorithm.Correct, y=Frac) +
geom_bar(stat='identity') +
facet_grid(~ Algorithm.RMSE) +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))
Now, we want to analyze the marginal benefit: how much does one algorithm get correct, after accounting for the correct predictions from other algorithms?
First up: we need a function to identify the ratings that a particular algorithm got correct, and remove them so we can test the next algorithm(s).
remove.correct = function(err.tbl, algo, thresh=0.5) {
  message("removing algorithm ", algo)
  # build a formula selecting rows where the error of algo exceeds the threshold
  sel = interp(~abs(algo) > thresh, algo=as.name(as.character(algo)), thresh=thresh)
  filter_(err.tbl, sel)
}
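For example (an illustrative call, not in the original), removing the test ratings that ItemItem predicts within 0.5 stars leaves only the rows ItemItem missed:
dim(remove.correct(test.errors, 'ItemItem'))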
With this function, we are going to create a marginal improvement matrix: in turn, remove each algorithm (the rows). For each other algorithm, compute how many predictions the first algorithm (A) missed but the second algorithm (B) got correct.
algos.ordered = as.character(algo.success$Algorithm)
algo.factor = function(as) {
factor(as, levels=algos.ordered, ordered=TRUE)
}
marginal.table = ldply(algos.ordered, function(algo) {
  message("finding marginal improvements from ", algo)
  remaining = remove.correct(test.errors, algo)
  improve = remaining %>%
    select_(paste("-", algo, sep="")) %>%
    melt(id.vars=c("User", "Item"),
         variable.name="Algorithm", value.name="Error") %>%
    group_by(Algorithm) %>%
    summarise(Good = sum(abs(Error) <= 0.5))
  data.table(Removed=algo.factor(algo),
             Algorithm=algo.factor(improve$Algorithm),
             Good=improve$Good)
})
marginal.matrix = acast(marginal.table, Removed ~ Algorithm, value.var='Good')
marginal.matrix
Now, we are going to do this iteratively - remove the best algorithm's predictions, then the one that is the best on the remaining predictions, etc.
First, we need a function to perform this cumulative removal. This function takes several parameters:
tbl : The error table to work on
use : The algorithms to start with, if we want to start with something other than the best.
algos.left : The remaining algorithms (callers should never set this, used only for recursive calls).
thresh : The error threshold
cum.marginal.good = function(tbl, use=NULL, algos.left=algorithms, thresh=0.5) {
  if (length(algos.left) == 0) {
    # No algorithms left - finish up
    tbl.sum = summarise(group_by(tbl, User, Item), n=n())
    data.frame(Algorithm = 'Unclaimed', Good = nrow(tbl.sum))
  } else {
    # make formulas to summarize each algorithm's column
    formulas = lapply(algos.left, function(algo) {
      interp(~sum(abs(algo) <= thresh), algo=as.name(algo), thresh=thresh)
    })
    # pass it off to summarise_, getting a summary table
    algo.good = as.data.frame(do.call(summarise_, c(list(tbl), setNames(formulas, algos.left))))
    if (length(use) == 0) {
      cur.algo = with(melt(algo.good, id.vars=c()), variable[which.max(value)])
      use.next = NULL
    } else {
      cur.algo = use[1]
      use.next = tail(use, -1)
    }
    left.next = setdiff(algos.left, cur.algo)
    row = data.table(Algorithm = cur.algo,
                     Good = algo.good[[cur.algo]])
    tbl.next = remove.correct(tbl, cur.algo, thresh=thresh)
    rbind(row,
          cum.marginal.good(tbl.next, use=use.next, algos.left=left.next, thresh=thresh))
  }
}
A function to enhance the output with some additional statistics: the fraction of test ratings each algorithm claims, and the cumulative fraction claimed so far.
cum.table.enhance = function(tbl, ntotal = test.rating.count) {
  mutate(tbl, Frac = Good / ntotal, CumFrac = cumsum(Frac))
}
Now, run and pick the best!
cum.best = cum.table.enhance(cum.marginal.good(test.errors))
cum.best
Let's try removing Funk-SVD right after Item-Item.
cum.best.ii.svd = cum.table.enhance(cum.marginal.good(test.errors,
use=c("ItemItem", "FunkSVD")))
cum.best.ii.svd
Let's try with Funk-SVD first.
cum.best.svd = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD")))
cum.best.svd
cum.best.svd.ii = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD", "ItemItem")))
cum.best.svd.ii
And user-user after FunkSVD.
cum.best.svd.uu = cum.table.enhance(cum.marginal.good(test.errors,
use=c("FunkSVD", "UserUser")))
cum.best.svd.uu
We're now going to look at the probe switching hybrid, which uses our probe ratings to pick the best predictor for each user.
First we need to identify the best algorithm for each user on the probe ratings. We'll start by summarizing the probe errors.
probe.user.errors = probe.preds %>%
group_by(User, Algorithm) %>%
summarise(MAE = mean(abs(Error)),
RMSE = sqrt(mean(Error * Error)),
Correct = sum(abs(Error) <= 0.5))
Now pick the best for each user.
probe.user.best.rmse = summarise(group_by(probe.user.errors, User),
Algorithm=Algorithm[which.min(RMSE)],
RMSE=min(RMSE))
probe.user.best.correct = summarise(group_by(probe.user.errors, User),
Algorithm=Algorithm[which.max(Correct)],
Correct=max(Correct))
Now we will merge these results with the user-best results.
user.best.rmse.merged = mutate(merge(user.best, probe.user.best.rmse,
by='User', suffixes=c('.user', '.probe')),
Agree = Algorithm.user == Algorithm.probe)
user.best.correct.merged = mutate(merge(user.best.correct, probe.user.best.correct,
by='User', suffixes=c('.user', '.probe')),
Agree = Algorithm.user == Algorithm.probe)
head(user.best.rmse.merged)
We'll now train a linear hybrid of our algorithms, using the probe ratings. We first express each personalized algorithm's prediction as an offset from the Mean predictor.
norm.predictions = mutate(predictions,
UserUser = UserUser - Mean,
ItemItem = ItemItem - Mean,
Lucene = Lucene - Mean,
FunkSVD = FunkSVD - Mean)
blend.model = lm(Rating ~ Mean + UserUser + ItemItem + Lucene + FunkSVD,
# get the probe predictions
filter.pairs(norm.predictions, FALSE))
summary(blend.model)
And use that model to generate predictions for the test pairs, clamping them to the valid rating range.
blend.preds = with(new.env(), {
  table = filter.pairs(norm.predictions)
  message("Making ", nrow(table), " predictions")
  pvec = predict(blend.model, table)
  pvec = pmax(pvec, 0.5)
  pvec = pmin(pvec, 5)
  mutate(data.table(select(table, User, Item, Rating),
                    Prediction=pvec),
         Error = Rating - Prediction)
})
head(blend.preds)
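As a rough check (the full comparison across methods comes later), the overall RMSE of the blend on the test pairs:
sqrt(mean(blend.preds$Error^2))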
For this part of the analysis, we will examine per-user characteristics and try to use them, with various kinds of models, to predict which algorithm will work best for each user.
First, we must summarize the user data from the training ratings.
user.info = train.ratings %>%
group_by(User) %>%
summarise(RatingCount = n(),
MeanRating = mean(Rating),
RatingVar = var(Rating)) %>%
mutate(LogCount = log10(RatingCount))
user.count.all = nrow(user.info)
user.count.usable = nrow(user.best)
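Before modeling, it can be useful to glance at the distributions of these per-user features (an optional step, not in the original):
summary(select(user.info, RatingCount, MeanRating, RatingVar, LogCount))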
If we ignore the FunkSVD algorithm, can we predict whether Item-Item will be the best algorithm?
Prepare the model training data:
user.best.nosvd = summarise(group_by(filter(errors.by.user, Algorithm != 'FunkSVD'),
User),
BestAlgo=Algorithm[which.min(RMSE)],
BestRMSE = min(RMSE))
user.best.data = inner_join(user.info, user.best.nosvd) %>%
mutate(IIBest = BestAlgo == 'ItemItem')
summary(user.best.data$IIBest)
Then build the model:
user.best.test = sample_frac(user.best.data, size=0.2)
setkey(user.best.test, User)
user.best.train = user.best.data[!user.best.test]
user.best.model = glm(IIBest ~ LogCount + RatingVar,
data=user.best.train,
family=binomial())
summary(user.best.model)
Plot the performance (ROC) curve:
user.best.preds = prediction(predict(user.best.model, user.best.test),
user.best.test$IIBest)
options(repr.plot.height=5)
plot(performance(user.best.preds, measure='tpr', x.measure='fpr'))
And compute the area under the curve:
user.best.auc = performance(user.best.preds, 'auc')
print(user.best.auc@y.values)
Let's try to predict when item-item will be better than user-user for a particular user.
Again, prepare the data first.
user.wide.rmse = dcast(select(errors.by.user, User, Algorithm, RMSE),
User ~ Algorithm, value.var='RMSE')
user.wide.correct = dcast(select(errors.by.user, User, Algorithm, Correct),
User ~ Algorithm, value.var='Correct')
user.errors.wide = data.table(merge(user.wide.rmse, user.wide.correct,
by='User', suffixes=c('.RMSE', '.Correct')),
key='User')
user.ii.uu.data = mutate(user.info[user.errors.wide],
IIBest.RMSE = ItemItem.RMSE <= UserUser.RMSE,
IIBest.Correct = ItemItem.Correct >= UserUser.Correct)
summary(mutate(select(user.ii.uu.data, starts_with('IIBest.')),
Agree = IIBest.RMSE == IIBest.Correct))
Train the model, using user RMSE as the selection strategy:
user.ii.uu.test = sample_frac(user.ii.uu.data, size=0.2)
setkey(user.ii.uu.test, User)
user.ii.uu.train = user.ii.uu.data[!user.ii.uu.test]
user.ii.uu.model = glm(IIBest.RMSE ~ LogCount + RatingVar,
data=user.ii.uu.train,
family=binomial())
summary(user.ii.uu.model)
ROC curve:
user.ii.uu.preds = prediction(predict(user.ii.uu.model, user.ii.uu.test),
user.ii.uu.test$IIBest.RMSE)
plot(performance(user.ii.uu.preds, measure='tpr', x.measure='fpr'))
Area under the curve:
user.ii.uu.auc = performance(user.ii.uu.preds, 'auc')
print(user.ii.uu.auc@y.values)
Another model, using # Correct as the selection strategy.
user.ii.uu.cor.model = glm(IIBest.Correct ~ LogCount + MeanRating,
data=user.ii.uu.train,
family=binomial())
summary(user.ii.uu.cor.model)
ROC curve:
user.ii.uu.cor.preds = prediction(predict(user.ii.uu.cor.model, user.ii.uu.test),
user.ii.uu.test$IIBest.Correct)
plot(performance(user.ii.uu.cor.preds, measure='tpr', x.measure='fpr'))
AUC:
user.ii.uu.cor.auc = performance(user.ii.uu.cor.preds, 'auc')
print(user.ii.uu.cor.auc@y.values)
Now we're going to take the errors from our various models and pull them together into a single chart of errors.
Start with a helper function that computes RMSE and fraction-correct, both averaged per user and pooled globally:
compute.metrics = function(tbl, thresh=0.5) {
  per.user = summarise(group_by(tbl, User, Algorithm),
                       RMSE=sqrt(mean(Error*Error)),
                       SSE=sum(Error*Error),
                       NCorrect = sum(abs(Error) <= thresh),
                       n=n())
  summarise(group_by(per.user, Algorithm),
            RMSE.ByUser = mean(RMSE),
            RMSE.Global = sqrt(sum(SSE) / sum(n)),
            Correct.ByUser = mean(NCorrect / n),
            Correct.Global = sum(NCorrect) / sum(n))
}
Summarize error of the individual algorithms:
algos.rmse = data.table(Family='Single',
compute.metrics(test.preds))
algos.rmse
And of the linear blend:
blend.rmse = data.table(Family='Blend',
compute.metrics(mutate(blend.preds, Algorithm='Blend')))
blend.rmse
Of our oracle hybrid that picks the best predictor for each individual prediction:
best.rmse = data.table(Family='Oracle',
compute.metrics(mutate(best.preds, Algorithm='BestPred')))
best.rmse
Of each of our various per-user selection methods:
per.user.rmse = with(new.env(), {
  preds = rbind(mutate(merge(test.preds, select(user.best, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserBestRMSE'),
                mutate(merge(test.preds, select(probe.user.best.rmse, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneBestRMSE'),
                mutate(merge(test.preds, select(user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'UserMostRight'),
                mutate(merge(test.preds, select(probe.user.best.correct, User, Algorithm),
                             by=c('User', 'Algorithm')),
                       Algorithm = 'TuneMostRight'))
  data.table(Family='Per User', compute.metrics(preds))
})
per.user.rmse
Pull things together!
all.rmse = rbind(algos.rmse, blend.rmse, best.rmse, per.user.rmse) %>%
mutate(Family=factor(Family, levels=c("Single", "Blend", "Per User", "Oracle"), ordered=TRUE))
all.metrics = melt(all.rmse, id.vars=c('Family', 'Algorithm'), variable.name='Metric') %>%
mutate(Group = gsub("\\..*$", "", Metric))
all.metrics
Let's draw our major plot:
ggplot(all.metrics %>% filter(Group == 'RMSE')) +
aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
geom_point() +
facet_grid(Family ~ ., scales='free_y', space='free_y') +
xlab(NULL) +
theme(strip.text.y=element_text(angle=0),
legend.position='bottom', legend.title=element_blank())
We can do the same thing with 'is correct' metrics:
ggplot(all.metrics %>% filter(Group == 'Correct')) +
aes(x=value, y=Algorithm, color=Metric, shape=Metric) +
geom_point() +
facet_grid(Family ~ ., scales='free_y', space='free_y') +
xlab(NULL) +
theme(strip.text.y=element_text(angle=0),
legend.position='bottom', legend.title=element_blank())