Estimating Error and Bias in Offline Evaluation Results
Offline evaluation protocols for recommender systems are intended to estimate users’ satisfaction with recommendations using static data from prior user interactions. These evaluations allow researchers and production developers to carry out first-pass estimates of the likely performance of a new system and weed out bad ideas before presenting them to users. However, offline evaluations cannot accurately assess novel, relevant recommendations, because the most novel recommendations items that were previously unknown to the user; such items are missing from the historical data, so they cannot be judged as relevant. A breakthrough that reliably produces novel, relevant recommendations would score poorly with current offline evaluation techniques.
While the existence of this problem is noted in the literature, its extent is not well-understood. We present a simulation study to estimate the error that such missing data causes in commonly-used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.