Evaluating Recommenders with Distributions


Michael D. Ekstrand, Ben Carterette, and Fernando Diaz. 2021. Evaluating Recommenders with Distributions. In Proceedings of the RecSys 2021 Workshop on Perspectives on the Evaluation of Recommender Systems (RecSys '21). Cited 2 times.

Author order determined randomly. All authors contributed equally to this work.


Current practice for evaluating recommender systems typically focuses on point estimates of effectiveness or utility, often compared with statistical hypothesis tests, and sometimes combined with additional metrics for considerations such as diversity and novelty. In this viewpoint talk, we will argue that RecSys researchers and practitioners need to look beyond point estimates to the distribution of system effects across items, stakeholders, and runs. We ground this argument in multi-stakeholder recommendation, recent developments in measuring provider exposure, and results from information retrieval, statistics, and artificial intelligence documenting the limitations of point estimates for properties of interest.

We believe several distributions need consideration, among others: the marginal distribution of recommender utility within each stakeholder class, both individually and across subgroups; the distribution of differences in utility or performance, at least when paired observations are available; the difference between the distributions of utility or performance for the systems under comparison and, when available, between the system and its ideal; and the distribution of impact over repeated runs (e.g. with stochastic policies), rather than only single-shot rankings.
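As a minimal sketch of the first two distributions above, the following Python snippet compares two systems whose mean utility is the same but whose per-subgroup and paired-difference distributions differ. The user IDs, subgroup labels, utility values, and function names are all hypothetical and chosen only for illustration; they are not results from the talk.

```python
from statistics import mean

# Hypothetical per-user utilities (e.g. nDCG) for two systems and a made-up
# subgroup label per user. Illustrative values only, not real results.
utility_a = {"u1": 0.9, "u2": 0.8, "u3": 0.1, "u4": 0.2}
utility_b = {"u1": 0.6, "u2": 0.5, "u3": 0.5, "u4": 0.4}
subgroup = {"u1": "g1", "u2": "g1", "u3": "g2", "u4": "g2"}

def group_means(utility, subgroup):
    """Mean utility within each subgroup."""
    grouped = {}
    for user, value in utility.items():
        grouped.setdefault(subgroup[user], []).append(value)
    return {g: mean(vals) for g, vals in grouped.items()}

def paired_differences(a, b):
    """Per-user difference a - b for users observed under both systems."""
    return {user: a[user] - b[user] for user in a if user in b}

overall_a = mean(utility_a.values())
overall_b = mean(utility_b.values())
groups_a = group_means(utility_a, subgroup)
groups_b = group_means(utility_b, subgroup)
diffs = paired_differences(utility_a, utility_b)

print(overall_a, overall_b)    # the point estimates coincide (up to rounding)
print(groups_a, groups_b)      # the subgroup means do not
print(sorted(diffs.values()))  # paired differences have both signs
```

In this toy data the overall means match, yet subgroup g2 fares much worse under system A, and the paired differences are split between gains and losses. A comparison based only on the mean would hide both facts.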

These distributions also arise from multiple sources, including the set of test users; uncertainty in models of users, intents, and relevance; and stochasticity in ranking policy, whether introduced deliberately or as the emergent effect of re-training models with random components.
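To illustrate how stochasticity in the ranking policy gives rise to a distribution over repeated runs, here is a small sketch. It assumes a Plackett-Luce style sampling policy, which is only one possible choice and is not prescribed by the talk; the item scores and the function names `sample_ranking` and `top_k_exposure` are hypothetical.

```python
import random

def sample_ranking(scores, rng):
    """Sample one ranking by repeatedly drawing an item with probability
    proportional to its score (an assumed, illustrative stochastic policy)."""
    remaining = dict(scores)
    ranking = []
    while remaining:
        items = list(remaining)
        weights = [remaining[i] for i in items]
        pick = rng.choices(items, weights=weights, k=1)[0]
        ranking.append(pick)
        del remaining[pick]
    return ranking

def top_k_exposure(scores, k, runs, seed=0):
    """Fraction of repeated runs in which each item lands in the top k."""
    rng = random.Random(seed)
    counts = {item: 0 for item in scores}
    for _ in range(runs):
        for item in sample_ranking(scores, rng)[:k]:
            counts[item] += 1
    return {item: c / runs for item, c in counts.items()}

# Hypothetical positive item scores.
scores = {"a": 3.0, "b": 2.0, "c": 1.0}
exposure = top_k_exposure(scores, k=1, runs=1000)
print(exposure)
```

A single deterministic ranking by score would always place item "a" first. Over repeated runs of the stochastic policy, the top slot is instead shared among the items, which is the kind of effect a single-shot evaluation cannot show.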

Examining distributions, through graphical comparison and metrics that capture critical aspects of effectiveness distributions beyond the typical mean, will help recommender system evaluation move beyond treating users, producers, and other stakeholders as interchangeable, and is a vital part of ensuring that system improvements don’t leave some participants behind or treat their experience as expendable for the sake of an overall aggregate.
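As one possible example of metrics that look beyond the mean, the sketch below computes the mean utility of the worst-off participants and a Gini coefficient. These two summaries are our own illustrative picks rather than metrics named in the talk, and the utility values and function names are hypothetical.

```python
from statistics import mean

def worst_fraction_mean(values, fraction=0.25):
    """Mean utility of the worst-off `fraction` of participants
    (at least one participant is always included)."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))
    return mean(ordered[:k])

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(ordered))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Two hypothetical utility distributions with the same mean.
even = [0.5, 0.5, 0.5, 0.5]
skewed = [0.9, 0.8, 0.1, 0.2]

for name, values in (("even", even), ("skewed", skewed)):
    print(name, mean(values), worst_fraction_mean(values), gini(values))
```

Both lists have the same mean, but the worst-off quarter and the Gini coefficient separate them clearly. Summaries like these complement graphical comparison and do not replace it.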

In order to make sure recommendation is good for everyone it affects, we need to look beyond mean or aggregate utility and consider its many distributions.
