Towards Recommender Engineering

Tools and Experiments for Identifying Recommender Differences

Michael Ekstrand
July 8, 2014

Dedication

This thesis is dedicated to the memory of John T. Riedl.

Recommender engineering is

understanding recommendation well enough to build recommenders from requirements.

#1tweetresearch

Different recommender algorithms have different behaviors. I try to figure out what those are.

… in relation to user needs

… so we can design effective recommenders

… because I'm curious

Contributions

Background
Tools
Experiments
User Study
Going Forward

Publications

Collaborators

Overview

Background
Tools
Experiments
User Study
Going Forward

Recommender Systems

GoodReads

Recommender Systems

Recommender architecture

recommending items to users

Recommender Approaches

Many algorithms:

  • Content-based filtering
  • Neighbor-based collaborative filtering [Resnick et al., 1994; Sarwar et al., 2001]
  • Matrix factorization [Deerwester et al., 1990; Sarwar et al., 2000; Chen et al., 2012]
  • Neural networks [Jennings & Higuchi, 1993; Salakhutdinov et al., 2007]
  • Graph search [Aggarwal et al., 1999; McNee et al., 2002]
  • Hybrids [Burke, 2002]
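
For concreteness, the neighbor-based approaches above are variants of a similarity-weighted average; the standard user-user prediction (as in Resnick et al. [1994]) is:

\[ p_{ui} = \mu_u + \frac{\sum_{v \in N(u;i)} \operatorname{sim}(u,v)\,(r_{vi} - \mu_v)}{\sum_{v \in N(u;i)} |\operatorname{sim}(u,v)|} \]

where \(N(u;i)\) is the set of \(u\)'s neighbors who have rated item \(i\) and \(\mu_u\) is \(u\)'s mean rating.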

Common R&D Practice

  1. Develop recommender tech (algorithm, UI, etc.)
  2. Test on particular data/use case
  3. See if it is better than baseline
  4. Publish or deploy

Evaluating Recommenders

Many measurements:

Common R&D Practice

  1. Develop recommender tech (algorithm, UI, etc.)
  2. Test on particular data/use case
  3. See if it is better than baseline
  4. Publish or deploy

Learned: is new tech better for target application?

Not learned: for what applications is new tech better? why?

Algorithms are Different

Human-Recommender Interaction [McNee, 2006]

Requirements for Engineering

To enable recommender engineering, we need to understand:

All of this needs to be reproducible, validated, and generalizable.

Prior Work

Most prior work designs for, or makes use of, a single property. We still need:

My work

LensKit
  • enables reproducible research on a wide variety of algorithms

Offline experiments
  • validate LensKit
  • tune algorithm configurations
  • demonstrate algorithm differences

User study
  • obtain user judgments of algorithm differences

Overview

Background
Tools
Experiments
User Study
Going Forward

LensKit Features

Build
  • APIs for recommendation tasks
  • implementations of common algorithms
  • integrates with databases and web apps

Research
  • numerous evaluation metrics
  • cross-validation
  • flexible, reconfigurable algorithms

Learn
  • open-source code
  • study production-grade implementations

LensKit Algorithms

Algorithm Architecture

Principle: build algorithms from reusable, reconfigurable components.

Benefits

Grapht

Java-based dependency injector to configure and manipulate algorithms.
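
To make the bind-style configuration idea concrete, here is a small, self-contained Java sketch of the idiom. It uses a toy injector and invented class names for illustration; it is not Grapht's or LensKit's actual API.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Supplier;

    // Hypothetical illustration of the principle behind the design:
    // algorithms are assembled from components bound to interfaces,
    // so swapping a binding reconfigures the algorithm.
    public class BindingSketch {
        // A toy "injector": maps component interfaces to factories.
        static class Config {
            private final Map<Class<?>, Supplier<?>> bindings = new HashMap<>();
            <T> void bind(Class<T> iface, Supplier<? extends T> factory) {
                bindings.put(iface, factory);
            }
            @SuppressWarnings("unchecked")
            <T> T get(Class<T> iface) {
                return (T) bindings.get(iface).get();
            }
        }

        // Reusable component interfaces (names are illustrative only).
        interface Normalizer { double normalize(double rating, double userMean); }
        interface ItemScorer { double score(long user, long item); }

        static class MeanCenteringNormalizer implements Normalizer {
            public double normalize(double rating, double userMean) {
                return rating - userMean;
            }
        }

        // A scorer that depends on whatever Normalizer happens to be bound.
        static class ItemItemScorer implements ItemScorer {
            private final Normalizer norm;
            ItemItemScorer(Normalizer norm) { this.norm = norm; }
            public double score(long user, long item) {
                // Real neighborhood math omitted; just show the dependency in use.
                return 3.5 + norm.normalize(4.0, 3.5);
            }
        }

        public static void main(String[] args) {
            Config config = new Config();
            // Changing either binding reconfigures the algorithm without
            // touching any component code.
            config.bind(Normalizer.class, MeanCenteringNormalizer::new);
            config.bind(ItemScorer.class,
                    () -> new ItemItemScorer(config.get(Normalizer.class)));
            System.out.println(config.get(ItemScorer.class).score(42, 7));
        }
    }

The point is the two bindings in main: swapping either one changes the assembled algorithm while every component stays reusable.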

Evaluator

Outcomes of LensKit

Overview

Background
Tools
Experiments
User Study
Going Forward

Tuning Algorithms

Revisit Herlocker et al. [2002], Sarwar et al. [2001]

Tuning Results

User-User
  • Mostly consistent w/ Herlocker
  • Cosine similarity over normalized data works well and should not have been disregarded.
Item-Item
  • Reproduced Sarwar's results
  • Item-mean normalization outperforms previous strategies
  • Developed tuning strategy
FunkSVD
  • Small data sets need more training
  • Parameters highly entangled — strategy elusive

Item-Item Strategy

  1. Use item-user mean for unpredictable items
  2. Start with item mean centering for similarity normalization
  3. With full model, find neighborhood size by hill-climbing
  4. Decrease model size for size/quality tradeoff
  5. Try item-user mean normalization; see if it is better
     • If yes, check some nearby neighborhood sizes
  6. If desired, add slight similarity damping
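
A rough Java sketch of the hill-climbing in step 3, assuming a hypothetical evaluateRMSE function that stands in for training and cross-validating a candidate configuration:

    import java.util.function.IntToDoubleFunction;

    // Hypothetical sketch of the hill-climbing step: grow the neighborhood
    // size while cross-validated RMSE keeps improving; stop at the first
    // size that does not help.
    public class NeighborhoodTuning {
        static int tuneNeighborhoodSize(IntToDoubleFunction evaluateRMSE) {
            int size = 10;                       // starting neighborhood size
            int step = 10;                       // growth per iteration
            double best = evaluateRMSE.applyAsDouble(size);
            while (true) {
                double next = evaluateRMSE.applyAsDouble(size + step);
                if (next < best) {               // lower RMSE is better
                    best = next;
                    size += step;
                } else {
                    return size;                 // no further improvement
                }
            }
        }

        public static void main(String[] args) {
            // Dummy stand-in evaluator with a minimum near k = 40.
            IntToDoubleFunction fakeEval = k -> Math.abs(k - 40) / 100.0 + 0.85;
            System.out.println("best k = " + tuneNeighborhoodSize(fakeEval));
        }
    }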

Rank-Based Evaluation

Ratings convey order, not degree. Does this matter?

Answers

  • Tunings consistent for nDCG (rank consistency)

  • MRR (a top-\(N\) metric) changes drastically

    • Different optimal normalization on sparse data set
    • But top-\(N\) metrics are ill-behaved
  • Can't rule out effect, but more sophistication needed.
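
For reference, the two metrics in one common formulation (gain and discount choices vary by implementation; \(r_{u,i_k}\) is the rating of the item ranked \(k\) for user \(u\), and IDCG is the DCG of the ideal ordering):

\[ \mathrm{nDCG}(u) = \frac{\mathrm{DCG}(u)}{\mathrm{IDCG}(u)}, \qquad \mathrm{DCG}(u) = \sum_{k=1}^{N} \frac{r_{u,i_k}}{\log_2(k+1)} \]

\[ \mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{k_u}, \qquad k_u = \text{rank of the first relevant item for } u \]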

When Recommenders Fail

Short paper, RecSys 2012; ML-10M data

Counting mispredictions (\(|p - r| > 0.5\)) gives a different picture than prediction error does.

Consider per-user fraction correct and RMSE:

Also:

Marginal Correct Predictions

Q1: Which algorithm has most successes (\(\epsilon \le 0.5\))?

Q\(n+1\): Which has most successes where \(1 … n\) failed?

Algorithm     # Good    % Good   Cum. % Good
ItemItem      859,600   53.0      53.0
UserUser      131,356    8.1      61.1
Lucene         69,375    4.3      65.4
FunkSVD        44,960    2.8      68.2
Mean           16,470    1.0      69.2
Unexplained   498,850   30.8     100.0
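
The counting behind this table amounts to a greedy sweep: each test prediction is credited to the first algorithm, in the order the analysis settled on, that comes within 0.5 stars of the held-out rating; the rest are unexplained. A hypothetical Java sketch with toy data (not the actual analysis code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of the marginal-success count: each test prediction
    // is credited to the first algorithm in the ordering that predicts within
    // 0.5 stars of the actual rating; leftovers are "unexplained".
    public class MarginalSuccess {
        public static void main(String[] args) {
            // predictions[a][j] = algorithm a's prediction for test case j
            // (NaN if it cannot predict); actual[j] = the held-out rating.
            String[] algos = {"ItemItem", "UserUser", "Lucene", "FunkSVD", "Mean"};
            double[][] predictions = {
                {4.2, 2.0, 3.0}, {3.9, 4.4, 1.0}, {3.0, 4.6, 2.2},
                {2.5, 3.5, 2.4}, {3.5, 3.5, 3.5}
            };
            double[] actual = {4.0, 4.5, 5.0};

            Map<String, Integer> credit = new LinkedHashMap<>();
            int unexplained = 0;
            for (int j = 0; j < actual.length; j++) {
                String winner = null;
                for (int a = 0; a < algos.length && winner == null; a++) {
                    double p = predictions[a][j];
                    if (!Double.isNaN(p) && Math.abs(p - actual[j]) <= 0.5) {
                        winner = algos[a];
                    }
                }
                if (winner == null) unexplained++;
                else credit.merge(winner, 1, Integer::sum);
            }
            System.out.println(credit + ", unexplained: " + unexplained);
        }
    }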

Offline Experiment Lessons

Overview

Background
Tools
Experiments
User Study
Going Forward

User Study

Goal: identify user-perceptible differences.

RQ1

How do user-perceptible differences affect choice of algorithm?

RQ2

What differences do users perceive between algorithms?

RQ3

How do objective metrics relate to subjective perceptions?

Context: MovieLens

Algorithms

Three well-known algorithms for recommendation: item-item CF, user-user CF, and SVD (matrix factorization).

Each user was assigned 2 of the 3 algorithms.

Predictions

Predicted ratings influence list perception.

To control for this, we used 3 prediction treatments:

Each user was assigned 1 condition.

We found no effect of condition.

Survey Design

Example Questions

Diversity

Which list has a more varied selection of movies?

Satisfaction

Which recommender would better help you find movies to watch?

Novelty

Which list has more movies you do not expect?

Analysis Features

Joint evaluation
  • users compare 2 lists
  • judgment-making differs from separate evaluation
  • enables more subtle distinctions
  • but is harder to interpret

Factor analysis
  • 22 questions measure 5 factors
  • more robust than single questions
  • a structural equation model tests relationships between factors
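
In outline, such a model has two layers; as a generic sketch (not the exact parameterization fitted here), each survey response \(y_j\) loads on one latent factor, and the latent factors are then regressed on one another:

\[ y_j = \lambda_j \xi_{f(j)} + \epsilon_j \qquad \text{(measurement model)} \]
\[ \xi = B\xi + \zeta \qquad \text{(structural model)} \]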

Hypothesized Model

Response Summary

582 users completed the survey

Condition (A v. B)   N     Pick A   Pick B   % Pick B
I-I v. U-U           201   144      57       28.4%
I-I v. SVD           198   101      97       49.0%
SVD v. U-U           183   136      47       25.7%

The I-I v. U-U and SVD v. U-U differences are significant (\(p < 0.001\), \(H_0: b/n = 0.5\)); I-I v. SVD is not.
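
The test here is presumably the standard two-sided binomial (sign) test of the pick proportion: under \(H_0\), the number of B picks \(b\) out of \(n\) follows \(\mathrm{Binomial}(n, \tfrac{1}{2})\), giving

\[ p \approx 2\,\Pr\!\big(B \le \min(b, n-b)\big), \qquad B \sim \mathrm{Binomial}(n, \tfrac{1}{2}) \]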

Question Responses

Measurement Model

[Figure: SEM path diagram relating the latent factors Novelty, Diversity, Satisfaction, First Impression, and Choice, plus the objective Obscurity, Similarity, and Accuracy ratios; edges are labeled with estimated path coefficients and standard errors.]

Differences from Hypothesis

[Figure: SEM path diagram relating the latent factors Novelty, Diversity, Satisfaction, First Impression, and Choice, plus the objective Obscurity, Similarity, and Accuracy ratios; edges are labeled with estimated path coefficients and standard errors.]

RQ1: Factors of Choice

[Figure: SEM path diagram relating the latent factors Novelty, Diversity, Satisfaction, First Impression, and Choice, plus the objective Obscurity, Similarity, and Accuracy ratios; edges are labeled with estimated path coefficients and standard errors.]

Choice: Satisfaction

[Figure: SEM path diagram relating Novelty, Diversity, Satisfaction, First Impression, and Choice; edges are labeled with estimated path coefficients and standard errors.]

Satisfaction positively affects impression and choice.

Choice: Diversity

[Figure: SEM path diagram relating Novelty, Diversity, Satisfaction, First Impression, and Choice; edges are labeled with estimated path coefficients and standard errors.]

Diversity positively influences satisfaction.

Choice: Novelty

[Figure: SEM path diagram relating Novelty, Diversity, Satisfaction, First Impression, and Choice; edges are labeled with estimated path coefficients and standard errors.]

Novelty hurts satisfaction and choice/preference.

Choice: Novelty (cont.)

[Figure: SEM path diagram relating Novelty, Diversity, Satisfaction, First Impression, and Choice; edges are labeled with estimated path coefficients and standard errors.]

Novelty improves diversity (slightly).

Choice: Novelty (cont.)

[Figure: SEM path diagram relating Novelty, Diversity, Satisfaction, First Impression, and Choice; edges are labeled with estimated path coefficients and standard errors.]

Novelty has a direct negative impact on first impression.

Implications

RQ2: Algorithm Differences

Measurement Model

[Figure: SEM path diagram relating the latent factors Novelty, Diversity, Satisfaction, First Impression, and Choice, plus the objective Obscurity, Similarity, and Accuracy ratios; edges are labeled with estimated path coefficients and standard errors.]

RQ2: Algorithm Differences

Choice in Pseudo-Experiments

Baseline    Tested      % Tested > Baseline
Item-Item   SVD         48.99
Item-Item   User-User   28.36
SVD         Item-Item   51.01
SVD         User-User   25.68
User-User   Item-Item   71.64
User-User   SVD         74.32

RQ3: Objective Properties

Measure objective features of lists:

Novelty
  • obscurity (popularity rank)

Diversity
  • intra-list similarity (Ziegler)
  • similarity metric: cosine over the tag genome (Vig)
  • also tried rating vectors and latent feature vectors

Accuracy/Satisfaction
  • RMSE over the last 5 ratings

Relativize: take the log ratio of the two lists' values
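
Concretely, for a recommendation list \(L\), the Ziegler-style intra-list similarity and its relativized form look roughly like this (the exact normalization used in the study is an assumption; the obscurity and accuracy ratios are formed analogously from their respective measures):

\[ \mathrm{ILS}(L) = \frac{1}{|L|\,(|L|-1)} \sum_{i \in L} \sum_{\substack{j \in L \\ j \ne i}} \operatorname{sim}(i,j), \qquad \text{sim. ratio} = \log \frac{\mathrm{ILS}(L_A)}{\mathrm{ILS}(L_B)} \]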

Property Distributions

Model with Objectives

[Figure: SEM path diagram relating the latent factors Novelty, Diversity, Satisfaction, First Impression, and Choice, plus the objective Obscurity, Similarity, and Accuracy ratios; edges are labeled with estimated path coefficients and standard errors.]

Summary

Results and Expectations

Commonly-held offline beliefs:

Perceptual results (here and elsewhere):

Remaining Questions

Overview

Background
Tools
Experiments
User Study
Going Forward

Contributions

To promote recommender engineering:

Future Work

Thank you!

Also thanks to

  • Excellent collaborators
  • Jennifer for supporting me
  • GroupLens Research for amazing support
  • NSF for funding research
  • The Noun Project for great icons

    • ‘Document’ by Rob Gill
    • ‘Test Tube’ by Olivier Guin
    • ‘Experiment’ by Icons Pusher
    • ‘Future’ by Megan Sheehan

Correlation Functions

\[ \begin{aligned} \operatorname{Cor}(u,v) & = \frac{\sum_i (r_{ui} - \mu_u)(r_{vi} - \mu_v)}{\sqrt{\sum_i (r_{ui} - \mu_u)^2}\,\sqrt{\sum_i (r_{vi} - \mu_v)^2}} \\ & = \frac{(\vec{r_u} - \mu_u) \cdot (\vec{r_v} - \mu_v)}{\|\vec{r_u} - \mu_u\|\,\|\vec{r_v} - \mu_v\|} \\ & = \cos(u, v) \end{aligned} \]

The difference lies in the summation domain: Pearson correlation is computed over the items both users have rated, while cosine similarity is typically computed over all items, treating missing ratings as zero.