Mathematical Notation for Recommender Systems
Over the years of teaching and research, I have gradually standardized the notation that I use for describing the math of recommender systems. This is the notation that I use in my classes, Joe Konstan and I have adopted for our MOOC, and that I use in most of my research papers. (And thanks to Joe for helping revise it to its current form.)
If you haven’t already settled on a notation, perhaps you would consider adopting this one. I also welcome feedback on improving it.
I have tried to strike a balance between clarity and clutter. I slightly overload the meaning of some symbols; in particular, I am loose with distinctions between sets and matrices, because it is generally clear from context which is being invoked; I do not overload external referents, however. I also have tried to keep this notation so that it can be hand-written, making it more useful in teaching but meaning that I cannot rely as much on typography to distinguish different objects (e.g. separating \(U\) and \(\mathcal{U}\) would be questionable).
Input Data
Our input data, for collaborative filtering, consists of:
- \(U\)
- The set of users in the system or data set.
- \(I\)
- The set of items in the system or data set.
- \(R\)
- The set of ratings in the data set. \(R\) can be used as either a set of rating observations or as a (partially observed) \(|U| \times |I|\) matrix; context makes it clear which meaning is intended.
Within each of these sets, we can refer to individual entries:
- \(u, v \in U\)
- I use \(u\) and \(v\) as variables referencing individual users. If I need more than two, then I use numeric subscripts \(u_1\), \(u_2\), etc.
- \(i, j \in I\)
- I use \(i\) and \(j\) as variables referencing individual items. Again, for more than two, I use numeric subscripts. This does mean that \(i\) is not available as a counter variable, but I do not find that to be too much of a difficulty in practice.
- \(r_{ui} \in R\)
- An individual rating value, the rating user \(u\) gave for item \(i\). In an implicit feedback setting, this takes on whatever value you are using as the ‘rating’: 1/0, a play count, etc.
One advantage of always using \(u,v\) for users and \(i,j\) for items is that meaning is clear from looking at a variable or subscripted variable. It also allows the following subset notations:
- \(I_u \subset I\)
- The set of items rated by user \(u\).
- \(U_i \subset U\)
- The set of users who have rated or purchased item \(i\).
- \(R_u \subset R\)
- The set of ratings given by user \(u\)
- \(R_i \subset R\)
- The set of items for item \(i\)
- \(\vec{r}_u\)
- User \(u\)’s rating vector, an \(|I|\)-dimensional vector with missing values for unrated items.
- \(\vec{r}_i\)
- Item \(i\)’s rating vector, a \(|U|\)-dimensional vector with missing values for users who have not rated the item.
Scores and Similarities
With this notation, we can write things like the user-user rating prediction formula:
\[\hat r_{ui} = s(i;u) = \frac{\sum_{v \in N(u,i)} w(u,v)(r_{vi}-\bar r_v)}{\sum_{v \in N(u,i)} |w(u,v)|} + \bar r_u\]
\(N(u;i)\) is the neighborhood for user \(u\) for the purpose of scoring item \(i\), and \(\bar r_u\) is user \(u\)’s average rating. I explicitly state the ranges of my summations for two reasons, even if they are implicitly clear for experienced recsys researchers: it is a common tripping point for students, and it is a place where subtle implementation differences get buried. I find that it is not overly cumbersome with this notation.
Matrix Factorization
For basic decomposition of the ratings matrix, I usually use \(P\) and \(Q\) these days:
\[R \approx PQ^{\mathrm{T}}\]
We then have user vectors \(\vec p_u\) and item vectors \(\vec q_i\). I like using the transpose notation for the factorization, so that latent features are on the columns of both \(P\) and \(Q\).
Advanced decompositions that involve more than two matrices will need to find additional matrices; one unfortunate tradeoff of using \(U\) for the set of users is that it is no longer available for a matrix.