# Similarity Functions in Item-Item CF

The core of an item-item collaborative filter is the *item similarity function*: a function $s(i,j): \mathcal{I}\times\mathcal{I} \to [-1,1]$ over pairs of items that measures how similar those items are. Common choices are vector similarity functions over the vectors of users’ ratings of each item, such as the cosine similarity or Pearson correlation.

Early on, Sarwar et al. tested a few choices:

- The *Pearson correlation* from statistics: $\frac{\sum{(r_{ui} - \mu_i) (r_{uj} - \mu_j)}}{\sqrt{\sum{(r_{ui} - \mu_i)^2}} \sqrt{\sum{(r_{uj} - \mu_j)^2}}}$
- The *cosine similarity* between raw rating vectors: $\frac{\vec{r_i}\cdot\vec{r_j}}{\|\vec{r_i}\|_2\|\vec{r_j}\|_2} = \frac{\sum{r_{ui} r_{uj}}}{\sqrt{\sum{r_{ui}^2}} \sqrt{\sum{r_{uj}^2}}}$
- The *adjusted cosine similarity* between vectors normalized by subtracting each user’s mean rating: $\frac{\vec{\hat{r}_i}\cdot\vec{\hat{r}_j}}{\|\vec{\hat{r}_i}\|_2\|\vec{\hat{r}_j}\|_2} = \frac{\sum{(r_{ui} - \mu_u) (r_{uj} - \mu_u)}}{\sqrt{\sum{(r_{ui} - \mu_u)^2}} \sqrt{\sum{(r_{uj} - \mu_u)^2}}}$
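The three variants can be sketched in Python. This is a minimal illustration of the formulas, not LensKit's implementation; it assumes a small dense NumPy matrix with NaN marking unrated items, and computes each similarity over the users who rated both items:

```python
import numpy as np

def _common(ri, rj):
    """Restrict two item rating vectors to the users who rated both."""
    mask = ~np.isnan(ri) & ~np.isnan(rj)
    return ri[mask], rj[mask], mask

def cosine(ri, rj):
    vi, vj, _ = _common(ri, rj)
    return vi.dot(vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

def pearson(ri, rj):
    # subtract each item's mean rating (here, over the common users,
    # as in the standard statistical definition)
    vi, vj, _ = _common(ri, rj)
    vi = vi - vi.mean()
    vj = vj - vj.mean()
    return vi.dot(vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

def adjusted_cosine(ri, rj, user_means):
    # subtract each *user's* mean rating instead
    vi, vj, mask = _common(ri, rj)
    vi = vi - user_means[mask]
    vj = vj - user_means[mask]
    return vi.dot(vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))

# toy matrix: rows are users, columns are items, NaN = unrated
R = np.array([[4.0, 5.0, 1.0],
              [5.0, 4.0, np.nan],
              [1.0, 2.0, 5.0],
              [2.0, np.nan, 4.0]])
user_means = np.nanmean(R, axis=1)  # mu_u for each user

print(cosine(R[:, 0], R[:, 1]))
print(pearson(R[:, 0], R[:, 1]))
print(adjusted_cosine(R[:, 0], R[:, 1], user_means))
```

The only structural difference between `pearson` and `adjusted_cosine` is which mean gets subtracted, which is the point the rest of this post turns on.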

They found adjusted cosine to work better, and so far as I know, it has been the dominant similarity function for rating-based item-item CF systems.

In our early LensKit work, we demonstrated LensKit’s algorithmic flexibility by testing many options and combinations, including several that were not considered by Sarwar and other authors. This included adjusting the cosine by the user mean, the item mean, and the personalized mean. We found that subtracting the item mean worked better than subtracting the user mean (on standard offline accuracy measures).

At the time, this was surprising. The user-mean-centered cosine similarity corrects for the obvious problem: users rating on different parts of the rating scale. Why was it more useful to adjust by the item’s average rating?

The answer dawned on me while trying to explain the result to my recommender systems students this spring. As I noted before on user-user CF, if we ignore missing observations, computing the cosine over two mean-centered vectors is identical to taking their Pearson correlation. This can be seen in the similarity between the Pearson and adjusted cosine formulas: the only clear difference (omitting the range of the summations) is which mean is subtracted from each rating. This may be intuitively obvious to many, but it is a point that has been missing from much of the CF literature. Further, when we treat missing values in the mean-centered vectors as 0 (rather than omitting them, as is typically done when computing the Pearson correlation), the resulting similarity function is biased towards 0 for vectors with relatively few common entries: the norms in the denominator cover all of each item's ratings, while the dot product only accumulates over the shared ones. This is exactly the behavior we want, as it avoids over-emphasizing similarities computed over little data, and it does so in a more natural fashion than significance weighting.
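A small numerical sketch of that damping effect, with toy vectors standing in for mean-centered rating residuals (the data and setup here are mine, not from any of the experiments discussed): with only two common raters, Pearson over just the co-rated entries is always a perfect ±1, while the zero-filled cosine stays small because its norms cover every rated entry.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users = 50

# Mean-centered rating residuals for two items, 0 where a user has no
# rating. Item i is rated by users 0-19, item j by users 18-37, so they
# share only two common raters (users 18 and 19).
vi = np.zeros(n_users)
vj = np.zeros(n_users)
vi[:20] = rng.normal(size=20)
vj[18:38] = rng.normal(size=20)
common = (vi != 0) & (vj != 0)

def zero_filled_cosine(a, b):
    """Cosine over the full vectors; the norms include every rated entry,
    not just the co-rated ones, pulling low-overlap pairs toward 0."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def overlap_pearson(a, b, mask):
    """Pearson computed only over the co-rated entries."""
    x, y = a[mask], b[mask]
    x, y = x - x.mean(), y - y.mean()
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

damped = zero_filled_cosine(vi, vj)
plain = overlap_pearson(vi, vj, common)
print(f"overlap Pearson: {plain:+.3f}, zero-filled cosine: {damped:+.3f}")
```

The two-point Pearson is degenerate (any two points correlate perfectly), which is precisely the overconfidence the zero-filled version suppresses.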

This is nice, but when we center by subtracting the user’s mean, we aren’t really mean-centering the vectors — we are normalizing them by some other value. The resulting concept loses the statistical meaning of a correlation.

When we subtract the item’s mean rating, however, we *do* mean-center the values, resulting in a similarity that is again a Pearson correlation.

I suspect that this is what is happening to produce these results:

- The natural self-damping of the zero-filled cosine outweighs its statistical invalidity, so adjusted cosine (self-damping for low co-rating counts) works better than Pearson (statistically valid but undamped).
- Computing a Pearson correlation with self-damping (the cosine over item-mean-centered vectors) is the best of both worlds: it is both damped for item pairs with relatively few common ratings and statistically meaningful.
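That combined similarity can be sketched as follows. This is my own minimal version, assuming a dense NumPy matrix with NaN for missing ratings, and is not LensKit's actual implementation: center each column by its item mean, zero-fill the gaps, and take cosines between columns.

```python
import numpy as np

def item_mean_centered_sim(R):
    """Cosine similarity between item-mean-centered, zero-filled columns.

    Item-mean centering keeps the Pearson-correlation interpretation,
    while the zero-filled norms damp item pairs with few common ratings.
    """
    item_means = np.nanmean(R, axis=0)            # mu_i for each item
    C = np.where(np.isnan(R), 0.0, R - item_means)  # centered, zero-filled
    norms = np.linalg.norm(C, axis=0)
    return (C.T @ C) / np.outer(norms, norms)

# toy matrix: rows are users, columns are items, NaN = unrated
R = np.array([[4.0, 5.0, 1.0],
              [5.0, 4.0, np.nan],
              [1.0, 2.0, 5.0],
              [2.0, np.nan, 4.0]])
S = item_mean_centered_sim(R)
print(S)
```

A production implementation would work with sparse matrices and only keep each item's top-k neighbors, but the centering-then-zero-filling step is the part that matters for the argument above.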

We had the results, and I believed them (MovieLens has been using item-mean-centered similarity for its item-item CF for some time now), but now I also have a theoretical understanding of why they are the way they are.