User Perception of Differences in Recommender Algorithms

October 8, 2014
Michael D. Ekstrand
Texas State University
Max Harper
GroupLens Research, University of Minnesota
Martijn C. Willemsen
Eindhoven University of Technology
Joseph A. Konstan
GroupLens Research, University of Minnesota

TL;DR

Goal: identify user-perceptible differences between recommenders.

Why?

‘Being accurate is not enough’ [McNee et al. 2006]

Get data on

  • what users want
  • how algorithms differ

that can be used to calibrate metrics

Research Questions

RQ1
How do subjective properties affect choice of recommendations?
RQ2
What differences do users perceive between lists of recommendations produced by different algorithms?
RQ3
How do objective metrics relate to subjective perceptions?

Context: MovieLens

Algorithms

Three well-known recommendation algorithms:

  • item-item collaborative filtering (I-I)
  • user-user collaborative filtering (U-U)
  • SVD (matrix factorization)

Each user was assigned 2 of the 3 algorithms.

Predictions

Predicted ratings can influence how users perceive a list.

To control for this, 3 prediction-display treatments were used.

Each user was assigned 1 condition.

No effect of the prediction condition was found.
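
For concreteness, a minimal sketch of this kind of between-subjects assignment (hypothetical names and structure; the actual experiment code and the names of the prediction treatments are not given in these slides):

```python
import random

# Hypothetical sketch: the slides only say "2 of 3 algorithms" and
# "1 of 3 prediction treatments"; treatment names here are placeholders.
ALGORITHMS = ["item-item", "user-user", "svd"]
PREDICTION_CONDITIONS = ["treatment-1", "treatment-2", "treatment-3"]

def assign_user(user_id, rng=random):
    """Give one user a random ordered pair of algorithms (2 of 3)
    and a single prediction-display treatment (1 of 3)."""
    algo_a, algo_b = rng.sample(ALGORITHMS, 2)
    prediction = rng.choice(PREDICTION_CONDITIONS)
    return {"user": user_id,
            "list_a": algo_a,
            "list_b": algo_b,
            "prediction": prediction}

# example usage
assignments = [assign_user(u) for u in (101, 102, 103)]
```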

Survey Design

  1. Initial ‘which do you like better?’

  2. 22 questions

    • ‘Which list has more movies that you find appealing?’
    • Responses on a scale from ‘much more A than B’ to ‘much more B than A’
    • Targeting 5 concepts (novelty, diversity, accuracy, satisfaction, ‘understands me’)
  3. Forced-choice selection of which list to use in the future

  4. Free-form text field
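
As a small illustration (not the study's actual coding scheme), the bipolar comparative responses could be mapped to a signed score before analysis:

```python
# Hypothetical numeric coding of the bipolar response scale; the exact
# wording of the middle options and the coding used in the study may differ.
RESPONSE_SCALE = {
    "much more A than B":     -2,
    "somewhat more A than B": -1,
    "about the same":          0,
    "somewhat more B than A":  1,
    "much more B than A":      2,
}

def code_response(label: str) -> int:
    """Map a survey response label to a signed score (negative = list A)."""
    return RESPONSE_SCALE[label]
```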

Hypothesized Model

Example Questions

Diversity
Which list has a more varied selection of movies?
Satisfaction
Which recommender would better help you find movies to watch?
Novelty
Which list has more movies you do not expect?

Analysis features

Joint evaluation
  • users compare 2 lists side by side
  • judgment-making differs from separate (single-list) evaluation
  • enables more subtle distinctions
  • but hard to interpret

Factor analysis
  • 22 questions measure 5 latent factors
  • more robust than single questions
  • a structural equation model tests relationships between factors
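
A minimal sketch of this analysis pipeline, assuming the Python semopy package and hypothetical indicator column names (the real model uses all 22 items and 5 factors):

```python
import pandas as pd
import semopy  # SEM library for Python (assumed available)

# Hypothetical model specification in lavaan-style syntax. The structural
# relations mirror those discussed on the following slides; indicator
# names (nov1, div1, ...) are placeholders for the 22 survey items.
MODEL_DESC = """
# measurement model: latent factor =~ observed survey items
Novelty      =~ nov1 + nov2 + nov3
Diversity    =~ div1 + div2 + div3
Satisfaction =~ sat1 + sat2 + sat3

# structural model: regressions among factors, impression, and choice
Diversity    ~ Novelty
Satisfaction ~ Novelty + Diversity
FirstImpression ~ Satisfaction + Novelty
Choice       ~ Satisfaction + Novelty
"""

def fit_sem(responses: pd.DataFrame) -> pd.DataFrame:
    """Fit the structural equation model to per-user responses and
    return a table of path estimates with standard errors."""
    model = semopy.Model(MODEL_DESC)
    model.fit(responses)
    return model.inspect()
```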

Response Summary

582 users completed the survey.

Condition (A v. B)    N     Pick A   Pick B   % Pick B
I-I v. U-U            201   144      57       28.4%
I-I v. SVD            198   101      97       49.0%
SVD v. U-U            183   136      47       25.7%

Significant deviations from H0: b/n = 0.5 (p < 0.001) for I-I v. U-U and SVD v. U-U; the I-I v. SVD split is not significant.
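
The test above can be reproduced from the table with an exact binomial test, e.g. with SciPy (illustrative; the study's own analysis code may differ):

```python
from scipy.stats import binomtest

# (condition, N, number picking B) from the response-summary table above
conditions = [
    ("I-I v. U-U", 201, 57),
    ("I-I v. SVD", 198, 97),
    ("SVD v. U-U", 183, 47),
]

for name, n, picks_b in conditions:
    result = binomtest(picks_b, n, p=0.5)  # H0: B is picked half the time
    print(f"{name}: {picks_b}/{n} picked B, p = {result.pvalue:.3g}")
```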

Measurement Model

[Figure: structural equation model relating the objective ratios (obscurity, similarity, accuracy), the latent factors Novelty, Diversity, Accuracy, Satisfaction, and ‘Understands Me’, First Impression, and Choice; estimated path coefficients with standard errors are shown on the edges.]

Differences from Hypothesis

[Figure: same structural equation model as on the Measurement Model slide.]

RQ1: Factors of Choice

[Figure: same structural equation model as on the Measurement Model slide.]

Choice: Satisfaction

[Figure: path model subset relating Novelty, Diversity, Satisfaction, First Impression, and Choice.]

Satisfaction positively affects impression and choice.

Choice: Diversity

[Figure: same path model subset as above.]

Diversity positively influences satisfaction.

Choice: Novelty

[Figure: same path model subset as above.]

Novelty hurts satisfaction and choice/preference.

Choice: Novelty (cont.)

[Figure: same path model subset as above.]

Novelty improves diversity (slightly).

Choice: Novelty (cont.)

[Figure: same path model subset as above.]

Novelty has direct negative impact on first impression.

Implications

RQ2: Algorithm Differences

Baseline     Tested      % of users picking Tested over Baseline
Item-Item    SVD         48.99%
             User-User   28.36%
SVD          Item-Item   51.01%
             User-User   25.68%
User-User    Item-Item   71.64%
             SVD         74.32%

RQ2 Summary

RQ3: Objective Properties

Measure objective features of lists:

Novelty
  obscurity (popularity rank)
Diversity
  intra-list similarity (Ziegler)
  • similarity metric: cosine over the tag genome (Vig)
  • also tried rating vectors and latent feature vectors
Accuracy/Satisfaction
  RMSE over the user's last 5 ratings

Relativize: take log ratio of two lists' values
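
A minimal sketch of these list-level measures and the log-ratio relativization, assuming per-item popularity ranks and tag-genome vectors are available (names are illustrative, not the study's code):

```python
import numpy as np
from itertools import combinations

def obscurity(items, popularity_rank):
    """Novelty proxy: mean popularity rank of the recommended items
    (a higher rank number = a more obscure movie)."""
    return float(np.mean([popularity_rank[i] for i in items]))

def intra_list_similarity(items, tag_genome):
    """Ziegler-style intra-list similarity: mean pairwise cosine between
    the items' tag-genome vectors (Vig et al.); higher = less diverse."""
    sims = []
    for a, b in combinations(items, 2):
        va, vb = tag_genome[a], tag_genome[b]
        sims.append(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return float(np.mean(sims))

def rmse_recent(predicted, actual):
    """Accuracy proxy: RMSE of an algorithm's predictions on the user's
    most recent ratings (the slides use the last 5)."""
    err = np.asarray(predicted) - np.asarray(actual)
    return float(np.sqrt(np.mean(err ** 2)))

def log_ratio(value_a, value_b):
    """Relativize a property by taking the log ratio of list A's value
    to list B's value."""
    return float(np.log(value_a / value_b))
```

Each property is computed per list, then the pair of lists is compared via the log ratio before entering the model.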

Property Distributions

Model with Objectives

[Figure: full structural equation model including the objective ratios, as on the Measurement Model slide.]

Summary

Refining Expectations

Commonly-held offline beliefs:

Perceptual results (here and elsewhere):

Outstanding Questions

Questions?

This work was funded by NSF grants IIS 08-08692 and 10-17697.