Towards Recommender Engineering
Tools and Experiments for Identifying Recommender Differences
Michael Ekstrand
July 8, 2014
Dedication
This thesis is dedicated to the memory of John T. Riedl.
Recommender engineering is
- designing and building a recommender system
- for a particular application
- from well-understood principles of algorithm behavior, application needs, and domain properties.
Goal: understand recommendation well enough to build recommenders from requirements.
Contributions
Publications
LensKit (RecSys 2011)
CF Survey (Foundations and Trends in HCI)
When Recommenders Fail (RecSys 2012, short)
User study (RecSys 2014)
Grapht (in preparation)
Parameter tuning (in preparation)
Collaborators
- John Riedl
- Joseph Konstan
- Michael Ludwig
- Jack Kolb
- Max Harper
- Lingfei He
- Martijn Willemsen (TU Eindhoven)
Overview
Recommender Systems
recommending items to users
Recommender Approaches
Many algorithms:
- Content-based filtering
- Neighbor-based collaborative filtering [Resnick et al., 1994; Sarwar et al., 2001]
- Matrix factorization [Deerwester et al., 1990; Sarwar et al., 2000; Chen et al., 2012]
- Neural networks [Jennings & Higuchi, 1993; Salakhutdinov et al., 2007]
- Graph search [Aggarwal et al., 1999; McNee et al., 2002]
- Hybrids [Burke, 2002]
Common R&D Practice
- Develop recommender tech (algorithm, UI, etc.)
- Test on particular data/use case
- See if it is better than baseline
- Publish or deploy
Evaluating Recommenders
Many measurements:
- ML/IR-style data set experiments [Herlocker et al., 2004; Gunawardana & Shani, 2009]
- User studies [McNee et al., 2006; Pu et al., 2011; Knijnenburg et al., 2012]
- A/B testing [Kohavi]
  - Engagement metrics
  - Business metrics
Common R&D Practice
- Develop recommender tech (algorithm, UI, etc.)
- Test on particular data/use case
- See if it is better than baseline
- Publish or deploy
Learned: is the new tech better for the target application?
Not learned: for which applications is the new tech better, and why?
Algorithms are Different
- Algorithms perform differently
- No reason to believe one size fits all
- Quantitatively similar algorithms can have qualitatively different results [McNee, 2006]
- Different algorithms make different errors
- Opportunity to tailor system to task
- Doesn't even count other system differences!
Human-Recommender Interaction [McNee 2006]
Requirements for Engineering
To enable recommender engineering, we need to understand:
- Algorithm behaviors and characteristics
- Relevant properties of domains, use cases, etc. (applications)
- How these interactions affect suitability
All of this needs to be reproducible, validated, and generalizable.
Prior Work
- Diversity [Ziegler et al., 2005; Vargas & Castells, 2011; Bollen et al., 2010]
- Novelty [Zhang & Hurley, 2008; Vargas & Castells, 2011]
- Stability [Burke 2002; Adomavicius & Zhang, 2012]
Most work designs for or uses a property. Still need:
- what properties do algorithms have?
- what do users find useful (in many contexts)?
My work
- LensKit
  - enables reproducible research on a wide variety of algorithms
- Offline experiments
  - validate LensKit
  - tune algorithm configurations
  - demonstrate algorithm differences
- User study
  - obtain user judgments of algorithm differences
LensKit
An open-source toolkit for building, researching, and learning about recommender systems.
LensKit Features
- build
  - APIs for recommendation tasks
  - implementations of common algorithms
  - integrates with databases, web apps
- research
  - numerous evaluation metrics
  - cross-validation
  - flexible, reconfigurable algorithms
- learn
  - open-source code
  - study production-grade implementations
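To make the "build" side concrete, here is a minimal sketch of configuring and using a LensKit recommender. It assumes the 2.x-era Java API (LenskitConfiguration, LenskitRecommenderEngine, ItemRecommender, ScoredId); exact class and method names may differ between LensKit versions, and the rating data source (DAO) binding is omitted for brevity.

```java
// Minimal sketch, assuming the LensKit 2.x-era API; names may differ by version.
import org.grouplens.lenskit.ItemRecommender;
import org.grouplens.lenskit.ItemScorer;
import org.grouplens.lenskit.Recommender;
import org.grouplens.lenskit.core.LenskitConfiguration;
import org.grouplens.lenskit.core.LenskitRecommenderEngine;
import org.grouplens.lenskit.knn.item.ItemItemScorer;
import org.grouplens.lenskit.scored.ScoredId;

import java.util.List;

public class RecommendExample {
    public static void main(String[] args) throws Exception {
        LenskitConfiguration config = new LenskitConfiguration();
        // Choose the core scoring algorithm: item-item CF in this sketch.
        config.bind(ItemScorer.class).to(ItemItemScorer.class);
        // A data source (EventDAO) binding is also required; it is omitted here
        // because its setup depends on how the ratings are stored.

        // Build a shareable engine (trained model) and create a recommender.
        LenskitRecommenderEngine engine = LenskitRecommenderEngine.build(config);
        Recommender rec = engine.createRecommender();
        ItemRecommender itemRec = rec.getItemRecommender();

        // Produce a top-10 recommendation list for user 42.
        List<ScoredId> recs = itemRec.recommend(42, 10);
        for (ScoredId item : recs) {
            System.out.printf("item %d: %.2f%n", item.getId(), item.getScore());
        }
    }
}
```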
LensKit Algorithms
- Simple means (global, user, item, user-item)
- User-user CF
- Item-item CF
- Biased matrix factorization (FunkSVD)
- Slope-One
- OrdRec (wrapper)
Algorithm Architecture
Principle: build algorithms from reusable, reconfigurable components.
Benefits
- Reproduce many configurations
- Try new ideas by replacing one piece
- Reuse pieces in new algorithms
Grapht
Java-based dependency injector to configure and manipulate algorithms.
- Algorithms are first-class objects
  - Detect shared components (pre-build models, reuse parts in evaluation)
  - Draw diagrams
- Context-sensitive policy: components can be arbitrarily composed
- Extensive defaulting capabilities
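As an illustration of component-based configuration and Grapht's context-sensitive bindings, here is a configuration sketch in the style of LensKit's 2.x API. The specific normalizer, baseline, and parameter class names are recalled from that era and may differ in other versions.

```java
// Sketch of component-level configuration with context-sensitive bindings,
// assuming the LensKit 2.x-era API built on Grapht.
import org.grouplens.lenskit.ItemScorer;
import org.grouplens.lenskit.baseline.BaselineScorer;
import org.grouplens.lenskit.baseline.ItemMeanRatingItemScorer;
import org.grouplens.lenskit.baseline.UserMeanBaseline;
import org.grouplens.lenskit.baseline.UserMeanItemScorer;
import org.grouplens.lenskit.core.LenskitConfiguration;
import org.grouplens.lenskit.knn.NeighborhoodSize;
import org.grouplens.lenskit.knn.item.ItemItemScorer;
import org.grouplens.lenskit.transform.normalize.BaselineSubtractingUserVectorNormalizer;
import org.grouplens.lenskit.transform.normalize.UserVectorNormalizer;

public class ItemItemConfig {
    public static LenskitConfiguration configure() {
        LenskitConfiguration config = new LenskitConfiguration();
        // Top-level scorer: item-item collaborative filtering.
        config.bind(ItemScorer.class).to(ItemItemScorer.class);
        // Fallback baseline for unpredictable items: user-item mean.
        config.bind(BaselineScorer.class, ItemScorer.class)
              .to(UserMeanItemScorer.class);
        config.bind(UserMeanBaseline.class, ItemScorer.class)
              .to(ItemMeanRatingItemScorer.class);
        // Context-sensitive binding: only *within* the normalizer, use the
        // item-mean baseline, leaving the global baseline untouched.
        config.within(UserVectorNormalizer.class)
              .bind(BaselineScorer.class, ItemScorer.class)
              .to(ItemMeanRatingItemScorer.class);
        config.bind(UserVectorNormalizer.class)
              .to(BaselineSubtractingUserVectorNormalizer.class);
        // Tunable parameter: neighborhood size.
        config.set(NeighborhoodSize.class).to(30);
        return config;
    }
}
```

Swapping one component (the scorer, the normalizer, or the baseline) produces a different algorithm configuration without touching the rest of the graph.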
Outcomes of LensKit
- Public, open-source software, v. 2.1 coming soon
- Basis for Coursera MOOC on recommender systems (~1000 students)
- Produced research results in this thesis (and several papers)
- Algorithmic analysis for interface research (Kluver et al., 2012; Nguyen et al., 2013)
- Implementation behind MovieLens and BookLens
- Used for CHI/CSCW Confer system
- Gradually gaining more users
Overview
Tuning Algorithms
Revisit Herlocker et al. [2002], Sarwar et al. [2001]
- Measure prediction accuracy of different configurations
Evaluate algorithm options
- similarity function
- neighborhood size
- data normalization
- etc.
Look for systematic tuning strategies
Partially published in RecSys 2011
Tuning Results
- User-User
  - Mostly consistent with Herlocker
  - Cosine similarity over normalized data works well and should not have been disregarded
- Item-Item
  - Reproduced Sarwar's results
  - Item-mean normalization outperforms previous strategies
  - Developed a tuning strategy
- FunkSVD
  - Small data sets need more training
  - Parameters are highly entangled; a strategy remains elusive
Item-Item Strategy
- Use the item-user mean for unpredictable items
- Start with item-mean centering for similarity normalization
- With the full model, find the neighborhood size by hill-climbing (see the sketch below)
- Decrease model size for a size/quality tradeoff
- Try item-user mean normalization and see if it is better
  - If yes, check some nearby neighborhood sizes
- If desired, add slight similarity damping
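The hill-climbing step of this strategy can be sketched as a simple search over neighborhood sizes. The Evaluator interface and its evaluateRMSE method are hypothetical stand-ins for a cross-validated offline evaluation run; the step size and bounds are illustrative only.

```java
// Sketch of the hill-climbing step: with the full similarity model, grow the
// neighborhood size while RMSE keeps improving.
public class NeighborhoodSizeSearch {
    interface Evaluator {
        double evaluateRMSE(int neighborhoodSize); // hypothetical evaluation hook
    }

    /** Climb neighborhood size in fixed steps until RMSE stops improving. */
    public static int hillClimb(Evaluator eval, int start, int step, int max) {
        int best = start;
        double bestRmse = eval.evaluateRMSE(start);
        for (int k = start + step; k <= max; k += step) {
            double rmse = eval.evaluateRMSE(k);
            if (rmse < bestRmse) {
                best = k;
                bestRmse = rmse;
            } else {
                break; // RMSE got worse; stop climbing
            }
        }
        return best;
    }
}
```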
Rank-Based Evaluation
Ratings convey order, not degree. Does this matter?
- Do different tunings arise for rank-based metrics?
- Top-\(N\) metrics?
Answers
- Tunings consistent for nDCG (rank consistency)
- MRR (top-\(N\)) changes drastically
  - Different optimal normalization on the sparse data set
  - But top-\(N\) metrics are ill-behaved
Can't rule out effect, but more sophistication needed.
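For reference, the two rank-based metrics can be sketched as follows. These are textbook formulations (per-list nDCG with log2 discounting, and the reciprocal rank that MRR averages over users), not LensKit's exact metric implementations.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative rank-based metrics: graded relevance (e.g., ratings) for nDCG,
// binary relevance for reciprocal rank.
public class RankMetrics {
    /** DCG with log2 discounting over relevance values in ranked order. */
    static double dcg(List<Double> relevance) {
        double sum = 0;
        for (int i = 0; i < relevance.size(); i++) {
            sum += relevance.get(i) / (Math.log(i + 2) / Math.log(2)); // rank i+1
        }
        return sum;
    }

    /** nDCG: DCG of the list divided by DCG of its ideal (descending) ordering. */
    static double ndcg(List<Double> ranked) {
        List<Double> ideal = new ArrayList<>(ranked);
        ideal.sort(Comparator.reverseOrder());
        double idcg = dcg(ideal);
        return idcg > 0 ? dcg(ranked) / idcg : 0;
    }

    /** Reciprocal rank of the first relevant item; MRR averages this over users. */
    static double reciprocalRank(List<Boolean> relevant) {
        for (int i = 0; i < relevant.size(); i++) {
            if (relevant.get(i)) {
                return 1.0 / (i + 1);
            }
        }
        return 0;
    }
}
```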
When Recommenders Fail
Short paper, RecSys 2012; ML-10M data
Counting mispredictions (\(|p - r| > 0.5\)) gives a different picture than prediction error.
Consider per-user fraction correct and RMSE:
- Correlation is 0.41
- Agreement on best algorithm: 32.1%
- Rank-consistent for overall performance
Also:
- Different algorithms make different mistakes
- Different users have different best algorithms
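A sketch of the two per-user measures being compared, using the 0.5-star threshold above; the parallel-array layout of predicted and actual ratings is assumed for illustration.

```java
// Per-user accuracy measures: RMSE and the fraction of "correct" predictions,
// counting a prediction as correct when |p - r| <= 0.5.
public class UserAccuracy {
    /** Root mean squared error over parallel arrays of predicted and actual ratings. */
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double err = predicted[i] - actual[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / predicted.length);
    }

    /** Fraction of predictions within 0.5 stars of the actual rating. */
    static double fractionCorrect(double[] predicted, double[] actual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (Math.abs(predicted[i] - actual[i]) <= 0.5) {
                correct++;
            }
        }
        return (double) correct / predicted.length;
    }
}
```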
Marginal Correct Predictions
Q1: Which algorithm has the most successes (\(\epsilon \le 0.5\))?
Q\(n+1\): Which has the most successes where algorithms \(1 \dots n\) failed?
| Algorithm   | Correct Predictions | %    | Cumulative % |
|-------------|---------------------|------|--------------|
| ItemItem    | 859,600             | 53.0 | 53.0         |
| UserUser    | 131,356             | 8.1  | 61.1         |
| Lucene      | 69,375              | 4.3  | 65.4         |
| FunkSVD     | 44,960              | 2.8  | 68.2         |
| Mean        | 16,470              | 1.0  | 69.2         |
| Unexplained | 498,850             | 30.8 | 100.0        |
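The attribution behind this table can be sketched as follows: each test rating is credited to the first algorithm, in a fixed priority order, that predicts it within 0.5 stars; ratings no algorithm gets right are "unexplained". The map-based input layout is an assumption for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Marginal-success attribution: earlier algorithms in the map's iteration order
// take priority; a rating is credited to the first one that predicts it correctly.
public class MarginalSuccess {
    static Map<String, Integer> attribute(LinkedHashMap<String, double[]> algoPredictions,
                                          double[] actual) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String algo : algoPredictions.keySet()) {
            counts.put(algo, 0);
        }
        counts.put("Unexplained", 0);

        for (int i = 0; i < actual.length; i++) {
            String credited = "Unexplained";
            for (Map.Entry<String, double[]> e : algoPredictions.entrySet()) {
                if (Math.abs(e.getValue()[i] - actual[i]) <= 0.5) {
                    credited = e.getKey();
                    break; // earlier algorithms take priority
                }
            }
            counts.put(credited, counts.get(credited) + 1);
        }
        return counts;
    }
}
```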
Offline Experiment Lessons
User Study
Goal: identify user-perceptible differences.
- RQ1: How do user-perceptible differences affect choice of algorithm?
- RQ2: What differences do users perceive between algorithms?
- RQ3: How do objective metrics relate to subjective perceptions?
Context: MovieLens
- Movie recommendation service & community
- 2500–3000 unique users/month
- Extensive tagging features
- Launching new version this summer
- Experiment deployed as intro to beta access
Algorithms
Three well-known algorithms for recommendation:
- User-user CF
- Item-item CF
- Biased matrix factorization (FunkSVD)
- All restricted to 2500 most popular movies
Each user assigned 2 algorithms
Predictions
Predicted ratings influence list perception.
To control, 3 prediction treatments:
- Standard raw predictions (0.5–5 stars)
- No predictions
- Normalized predictions
Each user assigned 1 condition
No effect of condition.
Example Questions
- Diversity: Which list has a more varied selection of movies?
- Satisfaction: Which recommender would better help you find movies to watch?
- Novelty: Which list has more movies you do not expect?
Analysis Features
- joint evaluation
  - users compare 2 lists
  - judgment-making differs from separate evaluation
  - enables more subtle distinctions
  - hard to interpret
- factor analysis
  - 22 questions measure 5 factors
  - more robust than single questions
  - structural equation model tests relationships
Hypothesized Model
Response Summary
582 users completed
| Condition  | N   | Chose First | Chose Second | % Second  |
|------------|-----|-------------|--------------|-----------|
| I-I v. U-U | 201 | 144         | 57           | **28.4%** |
| I-I v. SVD | 198 | 101         | 97           | 49.0%     |
| SVD v. U-U | 183 | 136         | 47           | **25.7%** |
bold is significant (\(p < 0.001\), \(H_0: b/n = 0.5\))
Measurement Model
- Multi-level linear model
- Direction comes from theory
- Condition variables omitted for simplicity
Differences from Hypothesis
- No Accuracy, Understands Me
- Edge from Novelty to Diversity
RQ1: Factors of Choice
Choice: Satisfaction
Satisfaction positively affects impression and choice.
Choice: Diversity
Diversity positively influences satisfaction.
- Satisfaction mediates diversity's impact on preference
Choice: Novelty
Novelty hurts satisfaction and choice/preference.
Choice: Novelty (cont.)
Novelty improves diversity (slightly).
- outweighed by negative satisfaction effect
Choice: Novelty (cont.)
Novelty has direct negative impact on first impression.
- Also seems stronger overall, but difficult to assess
Implications
Novelty boosts diversity, but hurts algorithm impression
- In context of choosing an algorithm
Negative impact of novelty diminishes with close scrutiny
- Can recommender get less conservative as users gain experience?
Diversity has positive impact on user satisfaction
Diversity does not trade off with perceived accuracy
RQ2: Algorithm Differences
Measurement Model
RQ2: Algorithm Differences
- User-user more novel than either SVD or item-item
- User-user more diverse than SVD
- Item-item slightly more diverse than SVD (but diversity didn't affect satisfaction)
- User-user's excessive novelty decreases for experienced (many ratings) users
- Users choose SVD and item-item in roughly equal measure
- Results consistent with raw responses
Choice in Pseudo-Experiments
| Baseline  | Tested    | % Tested > Baseline |
|-----------|-----------|---------------------|
| Item-Item | SVD       | 48.99               |
| Item-Item | User-User | 28.36               |
| SVD       | Item-Item | 51.01               |
| SVD       | User-User | 25.68               |
| User-User | Item-Item | 71.64               |
| User-User | SVD       | 74.32               |
RQ3: Objective Properties
Measure objective features of the lists:
- Novelty: obscurity (popularity rank)
- Diversity: intra-list similarity [Ziegler et al., 2005]
  - Similarity metric: cosine over the tag genome [Vig]
  - Also tried rating vectors, latent feature vectors
- Accuracy/Satisfaction: RMSE over the last 5 ratings
Relativize: take the log ratio of the two lists' values (see the sketch below)
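A sketch of how these objective measures can be computed; the tag-genome vectors are assumed inputs, and this is a generic formulation (mean pairwise cosine similarity and a log ratio), not the study's exact code.

```java
import java.util.List;

// Objective list measures: intra-list similarity as mean pairwise cosine similarity
// between item feature vectors, and log-ratio relativization between two lists.
public class ListMetrics {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Mean pairwise similarity between all item pairs in a list. */
    static double intraListSimilarity(List<double[]> itemVectors) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < itemVectors.size(); i++) {
            for (int j = i + 1; j < itemVectors.size(); j++) {
                sum += cosine(itemVectors.get(i), itemVectors.get(j));
                pairs++;
            }
        }
        return pairs > 0 ? sum / pairs : 0;
    }

    /** Relativize a per-list measure as the log ratio of the two lists' values. */
    static double logRatio(double listA, double listB) {
        return Math.log(listA / listB);
    }
}
```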
Property Distributions
- Obscurity & similarity metrics consistent with RQ2 results
Model with Objectives
- Each metric predicts feature
- Metric effect entirely mediated
Summary
Novelty has complex, largely negative effect
- Exact use case likely matters
- Complements McNee's notion of trust-building
Diversity is important, mildly influenced by novelty.
- Tag genome measures perceptible diversity best, but advantage is small.
User-user loses (likely due to obscure recommendations), but users are split between item-item and SVD.
Consistent responses, reanalysis, and objective metrics
Results and Expectations
Commonly-held offline beliefs:
- Novelty is good
- Diversity and accuracy trade off
Perceptual results (here and elsewhere):
- Novelty is complex
- Diversity and accuracy both achievable
Contributions
To promote recommender engineering:
- LensKit toolkit for reproducible recommender research
- New advances in configuring and tuning collaborative filtering algorithms
- Demonstrated differences between algorithms in terms of correct predictions
- Demonstrated user-perceptible differences between recommender algorithms, and factors important to user choice
Thank you!
Also thanks to
Correlation Functions
\[
\begin{aligned}
\operatorname{cor}(u,v) & = \frac{\sum_i (r_{ui} - \mu_u) (r_{vi} - \mu_v)}{
  \sqrt{\sum_i (r_{ui} - \mu_u)^2}
  \sqrt{\sum_i (r_{vi} - \mu_v)^2}} \\
& = \frac{(\vec{r_u} - \mu_u) \cdot (\vec{r_v} - \mu_v)}
  {\|\vec{r_u} - \mu_u\| \, \|\vec{r_v} - \mu_v\|} \\
& = \cos(u, v)
\end{aligned}
\]
The difference is in the summation domain:
- Pearson: items rated by both users
- Cosine: union of items; missing ratings contribute a zero offset from the mean
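A sketch contrasting the two summation domains over sparse rating vectors (item-id to rating maps); this is a generic illustration, not LensKit's similarity implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// After mean-centering, Pearson sums only over co-rated items, while cosine sums
// over the union of rated items (missing ratings contribute a zero offset).
public class UserSimilarity {
    static double mean(Map<Long, Double> ratings) {
        double sum = 0;
        for (double r : ratings.values()) {
            sum += r;
        }
        return sum / ratings.size();
    }

    /** Pearson correlation: sum over items rated by both users. */
    static double pearson(Map<Long, Double> u, Map<Long, Double> v) {
        double mu = mean(u), mv = mean(v);
        double dot = 0, nu = 0, nv = 0;
        for (Map.Entry<Long, Double> e : u.entrySet()) {
            if (v.containsKey(e.getKey())) {
                double du = e.getValue() - mu;
                double dv = v.get(e.getKey()) - mv;
                dot += du * dv;
                nu += du * du;
                nv += dv * dv;
            }
        }
        return dot / (Math.sqrt(nu) * Math.sqrt(nv));
    }

    /** Cosine over mean-centered vectors: sum over the union of rated items. */
    static double centeredCosine(Map<Long, Double> u, Map<Long, Double> v) {
        double mu = mean(u), mv = mean(v);
        Set<Long> items = new HashSet<>(u.keySet());
        items.addAll(v.keySet());
        double dot = 0, nu = 0, nv = 0;
        for (long item : items) {
            double du = u.containsKey(item) ? u.get(item) - mu : 0; // missing = 0 offset
            double dv = v.containsKey(item) ? v.get(item) - mv : 0;
            dot += du * dv;
            nu += du * du;
            nv += dv * dv;
        }
        return dot / (Math.sqrt(nu) * Math.sqrt(nv));
    }
}
```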