Recommender Engineering

Software and Experiments on Recommender Systems

Michael Ekstrand
September 12, 2014


Different recommender algorithms have different behaviors. I try to figure out what those are.

… in relation to user needs

… so we can design effective recommenders

… because I'm curious

Broader Goal

To help people find things.

  • human-computer interaction
  • information retrieval
  • machine learning
  • artificial intelligence
  • recommender systems

Recommender Systems



Recommender architecture

recommending items to users

Recommender Approaches

Many algorithms:

  • Content-based filtering
  • Neighbor-based collaborative filtering [Resnick et al., 1994; Sarwar et al., 2001]
  • Matrix factorization [Deerwester et al., 1990; Sarwar et al., 2000; Chen et al., 2012]
  • Neural networks [Jennings & Higuchi, 1993; Salakhutdinov et al., 2007]
  • Graph search [Aggarwal et al., 1999; McNee et al., 2002]
  • Hybrids [Burke, 2002]

Key Question

Does recommendation work?

Does this recommender work?

What recommender works ‘better’?




Research Overview

I try to answer these questions with several tools:

all from a human-computer interaction perspective

LensKit Features

APIs for recommendation tasks
implementations of common algorithms
integrates with databases, web apps
measure recommender behavior on many data sets
drive user studies and other advanced experiments
flexible, reconfigurable algorithms
open-source code
study production-grade implementations

LensKit Project

LensKit Algorithms

Algorithm Architecture

Principle: build algorithms from reusable, reconfigurable components.



Java-based dependency injector to configure and manipulate algorithms.
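LensKit's injector itself is Java, but the idea is language-independent: components declare their dependencies abstractly, and "configuring an algorithm" means choosing which concrete component to bind at each point. A minimal sketch with hypothetical component names (not the LensKit API):

```python
class CosineSimilarity:
    """One interchangeable similarity component."""
    def sim(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

class PearsonSimilarity:
    """Another similarity component: mean-center, then cosine."""
    def sim(self, a, b):
        ma = sum(a) / len(a)
        mb = sum(b) / len(b)
        return CosineSimilarity().sim([x - ma for x in a],
                                      [y - mb for y in b])

class ItemItemScorer:
    """The scorer depends on an abstract similarity, injected at build time."""
    def __init__(self, similarity):
        self.similarity = similarity

# "Configuration" is just deciding which components get wired together:
scorer = ItemItemScorer(PearsonSimilarity())
```

Swapping the similarity function requires no change to the scorer itself, which is what makes systematic algorithm comparisons practical.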

Offline Experiments

Take a data set

and see what the recommender does

Use previously collected data to estimate a recommender's usefulness.
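A minimal offline experiment can be sketched as: hold out part of the collected ratings, predict them from the rest, and measure the error. Illustrative names only, not the LensKit evaluator API:

```python
import random
from math import sqrt

def holdout_rmse(ratings, predict, test_frac=0.2, seed=42):
    """Hold out a fraction of (user, item, rating) triples and
    measure prediction error (RMSE) on the held-out part."""
    rng = random.Random(seed)
    data = list(ratings)
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    train, test = data[:cut], data[cut:]
    sq_err = sum((predict(train, u, i) - r) ** 2 for u, i, r in test)
    return sqrt(sq_err / len(test))

# A trivial baseline "recommender": predict the training mean rating.
def global_mean(train, user, item):
    return sum(r for _, _, r in train) / len(train)

data = [("u1", "A", 4.0), ("u1", "B", 3.0), ("u2", "A", 5.0),
        ("u2", "C", 2.0), ("u3", "B", 4.0)]
error = holdout_rmse(data, global_mean)
```

Real evaluations use cross-validation and per-user holdouts, but the train/test/measure loop is the same.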

Offline Architecture

Compare and Measure

Tuning Algorithms

Example Output

Testing different variants.

Tuning Research

Previous work:

Future work:

New Ways of Measuring

Need new ways of measuring recommender behavior:

When Recommenders Fail

Short paper, RecSys 2012; ML-10M data

Counting mispredictions (|p − r| > 0.5) gives a different picture than prediction error.

Consider per-user fraction correct and RMSE:
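The distinction shows up even on toy data: an algorithm with a few large errors can have worse RMSE yet more "correct" predictions than one that is uniformly mediocre. A sketch with hypothetical (prediction, rating) pairs:

```python
from math import sqrt

def rmse(pairs):
    """Root-mean-squared error over (prediction, rating) pairs."""
    return sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def frac_correct(pairs, eps=0.5):
    """Fraction of predictions within eps of the true rating."""
    return sum(abs(p - r) <= eps for p, r in pairs) / len(pairs)

# Hypothetical per-user results for two algorithms:
# A is usually close but occasionally far off; B is uniformly off by 0.7.
a = [(4.1, 4.0), (3.2, 3.0), (2.1, 2.0), (1.0, 5.0)]
b = [(3.3, 4.0), (3.7, 3.0), (2.7, 2.0), (4.3, 5.0)]
```

Here B wins on RMSE while A wins on fraction correct, so the two metrics can rank algorithms differently.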


Marginal Correct Predictions

Q1: Which algorithm has the most successes (ϵ ≤ 0.5)?

Q(n+1): Which has the most successes where algorithms 1…n failed?

Algorithm      # Good   % Good   Cum. % Good
ItemItem      859,600     53.0          53.0
UserUser      131,356      8.1          61.1
Lucene         69,375      4.3          65.4
FunkSVD        44,960      2.8          68.2
Mean           16,470      1.0          69.2
Unexplained   498,850     30.8         100.0
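The cumulative column reflects a greedy marginal analysis: each algorithm is credited only with the predictions it gets right that every earlier algorithm missed. A minimal sketch of that counting, on hypothetical error data:

```python
def marginal_successes(errors, order, eps=0.5):
    """errors: {alg: [abs prediction error per test case]}, lists aligned.
    Credit each algorithm only with cases unexplained so far."""
    n = len(next(iter(errors.values())))
    unexplained = set(range(n))
    counts = {}
    for alg in order:
        good = {i for i in unexplained if errors[alg][i] <= eps}
        counts[alg] = len(good)
        unexplained -= good
    counts["Unexplained"] = len(unexplained)
    return counts

# Hypothetical absolute errors for two algorithms on four predictions:
errs = {
    "ItemItem": [0.2, 0.8, 0.4, 0.9],
    "UserUser": [0.6, 0.3, 0.7, 0.9],
}
counts = marginal_successes(errs, ["ItemItem", "UserUser"])
```

ItemItem is credited with cases 0 and 2; UserUser then picks up case 1; case 3 stays unexplained.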

Future Work

User-Based Research

Offline evaluation has problems:

Answers: sort-of, maybe, and yes.

User Study

Goal: identify user-perceptible differences.

How do user-perceptible differences affect choice of algorithm?
What differences do users perceive between algorithms?
How do objective metrics relate to subjective perceptions?

Context: MovieLens


Three well-known algorithms for recommendation:

  • item-item collaborative filtering (I-I)
  • user-user collaborative filtering (U-U)
  • matrix factorization (SVD)

Each user was assigned 2 of the 3 algorithms.

Survey Design

Example Questions

Which list has a more varied selection of movies?
Which recommender would better help you find movies to watch?
Which list has more movies you do not expect?

Analysis features

  • joint evaluation
      ◦ users compare 2 lists
      ◦ judgment-making differs from separate evaluation
      ◦ enables more subtle distinctions
      ◦ but harder to interpret
  • factor analysis
      ◦ 22 questions measure 5 factors
      ◦ more robust than single questions
      ◦ structural equation model tests relationships between factors

Hypothesized Model

Response Summary

582 users completed the survey

Condition (A v. B)    N   Pick A   Pick B   % Pick B
I-I v. U-U          201      144       57      28.4%
I-I v. SVD          198      101       97      49.0%
SVD v. U-U          183      136       47      25.7%

I-I v. U-U and SVD v. U-U are significant (p < 0.001, H₀: b/n = 0.5)
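The significance test compares each condition's pick counts against H₀: b/n = 0.5. A quick sketch, assuming a two-sided exact binomial test (the original analysis may have differed in detail):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided exact binomial p-value for H0: success prob = p,
    summing all outcomes at most as likely as the observed one."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    obs = pmf[k]
    return min(1.0, sum(q for q in pmf if q <= obs + 1e-12))

# Pick-B counts from the table above:
p_ii_uu = binom_two_sided_p(57, 201)    # I-I v. U-U
p_ii_svd = binom_two_sided_p(97, 198)   # I-I v. SVD
p_svd_uu = binom_two_sided_p(47, 183)   # SVD v. U-U
```

The I-I v. SVD condition (97 of 198) sits near 50% and is far from significant, while the other two conditions are extreme enough to reject H₀.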

Question Responses

Measurement Model

[Figure: fitted measurement model relating the factors Novelty, Diversity, First Impression, and Satisfaction to Choice, with the objective obscurity, similarity, and accuracy ratios as indicators; path coefficients shown with standard errors.]

Differences from Hypothesis

[Figure: the same fitted model, highlighting the paths that differ from the hypothesized model.]


Results and Expectations

Commonly-held offline beliefs:

Perceptual results (here and elsewhere):

What We've Done

Collected user feedback

that validates some results

and challenges others

To do: validate more metrics

What You Can Do


Background Reading

If you want to do some reading in prep:

Things We'll be Working On

How to Get Involved