Collaborative filtering via graph signal processing

This paper develops new designs for recommender systems inspired by recent advances in graph signal processing. Recommender systems aim to predict unknown ratings by exploiting the information revealed in a subset of observed user-item ratings. Leveraging the notions of graph frequency and graph filters, we demonstrate that a common collaborative filtering method, k-nearest neighbors, can be modeled as a specific band-stop graph filter on networks describing similarities between users or items. This interpretation paves the way to new methods for enhanced rating prediction. For collaborative filtering, we develop more general band-stop graph filters. The performance of our algorithms is assessed on the MovieLens-100k dataset, showing that our designs reduce the root mean squared error (up to a 6.20% improvement) compared to the one incurred by the benchmark collaborative filtering approach.


I. INTRODUCTION
The widespread deployment of Internet technologies has generated a massive enrollment of online customers in web services, propelling the need for recommender systems (RS) that assist customers in making decisions. Succinctly, RS are algorithms that collect information about how users of a particular service rate different items. The collected information is then used, along with additional sources of exogenous information, to provide customers with recommendations for the unrated items [1], [2].
Research on RS includes the so-called content filtering approach, which starts by defining a set of features that characterize users and items and then uses those to perform predictions on the unrated items [1], [2]. It also includes the collaborative filtering (CoFi) approach, which relies mostly on past user behavior and carries out predictions without defining an a priori set of features. Although CoFi comes with certain disadvantages (in particular when rating new products or users), it typically requires fewer assumptions than content filtering and yields superior performance in real datasets [2]. As a result, it has emerged as the central approach for RS. A common technique to design CoFi algorithms is nearest neighborhood methods (NNM), which work under the assumption that users who are similar tend to give similar ratings for the same product, and which proceed in two phases. First, using a pre-specified similarity metric, a similarity score is computed for each pair of users. Second, the unknown ratings for a particular user are obtained by combining the ratings that similar users have given to the unrated items. To avoid overfitting and simplify computations, only a subset of the users (the ones who are more similar and have rated the item) is considered. A similar approach can be used to compute a similarity score among items, giving rise to the so-called item-based collaborative approaches.
The goal in this paper is to reinterpret CoFi algorithms using tools from graph signal processing (SP). In simple words, graph SP addresses the problem of analyzing and extracting information from data defined not in regular domains such as time or space, but on more irregular domains that can be conveniently represented by a graph. The tacit assumption is that the network structure defines a notion of proximity or dependence among the nodes of the graph [3], [4], which must be leveraged when generalizing classical SP algorithms to process signals defined in more irregular graph domains. The theory and applications of graph SP are growing rapidly [5]-[10]. This paper designs new and more general schemes and, equally relevant, unveils important connections between CoFi and graph SP. More precisely, we show that NNM can be viewed as algorithms that obtain the ratings by processing the available information with a graph filter. This interpretation not only provides a better understanding of the differences and similarities between both approaches, but also opens the door to the design of more advanced algorithms leading to better recommendation accuracy. In short, the contributions of this paper are: (a) to demonstrate how the CoFi approaches based on NNM can be viewed from a graph SP perspective; (b) to exploit this interpretation to design more general algorithms for NNM; and (c) to show that the proposed methods produce a significant improvement for the MovieLens-100k dataset [11].

Work in this paper is supported by the Spanish MINECO grants No TEC2013-41604-R and TEC2016-75361-R, and the USA NSF CCF-1217963. W. Huang and A. Ribeiro are with the Dept. of Electrical and Systems Eng., Univ. of Pennsylvania. A. G. Marques is with the Dept. of Signal Theory and Comms., King Juan Carlos Univ.

II. FUNDAMENTALS OF COFI AND GRAPH SP
We start by introducing the basic notation and formulating the CoFi problem. We then describe the NNM method and review the graph SP tools used in the following sections.
Consider an RS setup with $U$ users indexed by $u$, and $I$ items indexed by $i$. The rating that user $u$ has given to item $i$ is represented as $X_{ui}$. For mathematical convenience, such ratings can be collected either into the rating matrix $X \in \mathbb{R}^{U \times I}$, or into the rating vector $x = \mathrm{vec}(X) \in \mathbb{R}^{UI}$. Additionally, vectors $x_u = [X_{u1}, \ldots, X_{uI}]^T \in \mathbb{R}^{I}$ represent the ratings by the $u$-th user. To account for the fact that not all ratings are available, let $\mathcal{S}$ denote the set of indexes that identify user-item pairs whose rating is known. Similarly, $\mathcal{S}_u$ denotes the set containing the indexes of the items that user $u$ has rated. We can then use $x_{\mathcal{S}} \in \mathbb{R}^{|\mathcal{S}|}$ to denote a vector containing the known ratings. The problem of interest is as follows: given the ratings $x_{\mathcal{S}}$ for the user-item pairs in $\mathcal{S}$, estimate the full rating vector $x$ (matrix $X$).
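The notation above can be made concrete with a small NumPy sketch. The toy rating matrix and the convention of marking unknown ratings with `np.nan` are assumptions made purely for illustration; they are not part of the paper.

```python
import numpy as np

# Toy rating matrix X with U = 3 users and I = 4 items (hypothetical values).
# np.nan marks a rating that is not available.
X = np.array([
    [5.0, np.nan, 3.0, 1.0],
    [4.0, 2.0, np.nan, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

known = ~np.isnan(X)  # boolean mask of observed entries

# The set S of (user, item) pairs whose rating is known, in row-major order.
S = [(int(u), int(i)) for u, i in zip(*np.nonzero(known))]

# S_u: indexes of the items rated by each user u.
S_u = [np.flatnonzero(known[u]) for u in range(X.shape[0])]

# x_S: vector with the |S| known ratings, in the same order as S.
x_S = X[known]
```

The prediction task is then to fill in the `np.nan` entries of `X` using only `x_S`.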

A. CoFi via NNM
As explained in the introduction, NNM builds on the assumption that if a pair of users $u$ and $v$ have similar taste, then their ratings $X_{ui}$ and $X_{vi}$ for a particular item $i$ are going to be similar as well. To formulate this rigorously, we start with user-based NNM, and let $B \in \mathbb{R}^{U \times U}$ be a matrix whose entry $B_{uv}$ denotes the similarity between users $u$ and $v$. Given the particularities of a CoFi setup, $B_{uv}$ has to be computed using a metric that takes into account only the ratings available in $x_{\mathcal{S}}$. Define the set $\mathcal{S}_{uv}$ as the intersection of $\mathcal{S}_u$ and $\mathcal{S}_v$, i.e., the set of items that have been rated by both $u$ and $v$. A common choice to compute the similarity score is to first find the sample covariances
$$\Sigma^U_{uv} := \frac{1}{|\mathcal{S}_{uv}|} \sum_{i \in \mathcal{S}_{uv}} (X_{ui} - \mu_{uv})(X_{vi} - \mu_{vu}), \tag{1}$$
with $\mu_{uv} := \sum_{i \in \mathcal{S}_{uv}} X_{ui} / |\mathcal{S}_{uv}|$. Note that the previous covariances and means are found using only the items that were commonly rated by $u$ and $v$. The similarity score would then be found by simply setting $B_{uv} = \Sigma^U_{uv}$. In the context of RS, a more common approach is to use Pearson correlations,
$$B_{uv} = \frac{\Sigma^U_{uv}}{\sqrt{\Sigma^U_{uu}\, \Sigma^U_{vv}}} \quad \text{for } u \neq v, \tag{2}$$
and $B_{uu} = 0$. The main idea behind NNM is that when predicting the rating $X_{ui}$, only the ratings $X_{vi}$ from users $v$ that are very similar to $u$ should be used. To do so, denote by $\mathcal{K}_{ui}$ the set of $k$ users who are the most similar to $u$ (largest values of $B_{uv}$) and have rated item $i$. Leveraging these definitions, the unknown ratings are finally predicted as
$$\hat{X}_{ui} = \mu_u + \frac{\sum_{v \in \mathcal{K}_{ui}} B_{uv} (X_{vi} - \mu_v)}{\sum_{v \in \mathcal{K}_{ui}} B_{uv}}, \tag{3}$$
where $\mu_u = \sum_{i \in \mathcal{S}_u} X_{ui} / |\mathcal{S}_u|$. At an intuitive level, the subtraction of $\mu_v$ and the addition of $\mu_u$ account for the fact that some users may be more generous than others.
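A minimal sketch of user-based NNM follows, to make the two phases (similarity computation, then neighbor averaging) concrete. The function names, the toy data, the use of `np.nan` for unknown ratings, and the choice to normalize the correlation by the standard deviations measured on the co-rated items are all assumptions of this illustration, not details confirmed by the paper.

```python
import numpy as np

def user_similarity(X):
    """Pearson-type similarity between users, using co-rated items only.

    X is a U x I array with np.nan for unknown ratings. Returns a U x U
    symmetric matrix B with zero diagonal (B_uu = 0).
    """
    U = X.shape[0]
    known = ~np.isnan(X)
    B = np.zeros((U, U))
    for u in range(U):
        for v in range(u + 1, U):
            common = known[u] & known[v]               # the set S_uv
            if common.sum() < 2:
                continue                               # too little overlap: B_uv = 0
            xu = X[u, common] - X[u, common].mean()    # X_ui - mu_uv
            xv = X[v, common] - X[v, common].mean()    # X_vi - mu_vu
            denom = np.sqrt((xu @ xu) * (xv @ xv))
            if denom > 0:
                B[u, v] = B[v, u] = (xu @ xv) / denom
    return B

def predict_knn(X, B, u, i, k):
    """Predict X[u, i] from the k most similar users who rated item i."""
    known = ~np.isnan(X)
    mu = np.array([X[w, known[w]].mean() for w in range(X.shape[0])])
    cand = [v for v in range(X.shape[0]) if v != u and known[v, i]]
    K = sorted(cand, key=lambda v: B[u, v], reverse=True)[:k]   # the set K_ui
    den = sum(B[u, v] for v in K)
    if not K or den == 0:
        return float(mu[u])                            # fallback: user mean only
    num = sum(B[u, v] * (X[v, i] - mu[v]) for v in K)
    return float(mu[u] + num / den)

# Hypothetical toy data: user 0 has not rated item 3.
X = np.array([[5.0, 4.0, 1.0, np.nan],
              [4.0, 3.0, 1.0, 2.0],
              [1.0, 2.0, 5.0, 4.0]])
B = user_similarity(X)
x_hat_03 = predict_knn(X, B, u=0, i=3, k=1)
```

In this toy case user 1 is the single nearest neighbor of user 0, so the prediction is user 0's mean rating shifted by how far user 1's rating of item 3 sits from user 1's own mean.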

B. Graph SP
Consider a directed graph $\mathcal{G}$ with a set of $N$ nodes or vertices $\mathcal{N}$ and a set of links $\mathcal{E}$, such that if node $n$ is connected to $m$, then $(n, m) \in \mathcal{E}$. For any given graph we define the adjacency matrix $A$ as a sparse $N \times N$ matrix with non-zero elements $A_{mn}$ if and only if $(n, m) \in \mathcal{E}$. The value of $A_{mn}$ captures the strength of the connection from $n$ to $m$.
Fig. 1: CoFi as graph filters. The ratings for each item can be considered as graph signals on a network that depends on the item. For each specific item, edges starting from users who have not rated it are removed. Then, given a specific user $u$, for all the edges coming into $u$, only the ones with the $k$ highest edge weights are kept. Proper normalization is then applied to make each $B_i$ right stochastic.

The focus of graph SP is on graph signals defined on the set of nodes $\mathcal{N}$. Formally, each of these signals can be represented as a vector $z \in \mathbb{R}^N$ where the $n$-th element represents the value of the signal at node $n$. To facilitate the connections with NNM, in this work we choose the adjacency matrix $A$ as the graph shift operator $S$; however, our results can be easily generalized to other choices such as Laplacians [3]. We assume $S$ is diagonalizable, so that $S = V \Lambda V^{-1}$, where $\Lambda$ is a diagonal matrix collecting the eigenvalues $\lambda_p$ and the columns $v_p$ of $V$ are the corresponding eigenvectors. Graph filters are a particular class of linear graph-signal operators that can be represented as matrix polynomials of $S$ [4]
$$H := \sum_{l=0}^{L-1} h_l S^l.$$
For a given input $z$, the output of the filter is simply $y = Hz$. The filter coefficients are collected into $h := [h_0, \ldots, h_{L-1}]^T$, with $L - 1$ denoting the filter degree.
The eigendecomposition of $S$ is used to define the frequency representation of graph signals and filters.
Definition 1: For a signal $z \in \mathbb{R}^N$ and a graph shift operator $S = V \Lambda V^{-1} \in \mathbb{R}^{N \times N}$, the vectors
$$\tilde{z} = V^{-1} z \quad \text{and} \quad z = V \tilde{z}$$
form a Graph Fourier Transform (GFT) pair [3], [4].
The GFT encodes a notion of variability for graph signals akin to the one that the Fourier transform encodes for temporal signals [4], [12]. Specifically, the smaller the distance between $\lambda_p$ and $|\lambda_{\max}|$ in the complex spectrum, the lower the frequency it represents. This idea is based on defining the total variation of a graph signal $z$ as $\mathrm{TV}(z) = \| z - S z / \lambda_{\max}(S) \|_1$, with smoothness being associated with small values of TV. Then, given a $(\lambda_p, v_p)$ pair, one has that $\mathrm{TV}(v_p) = |1 - \lambda_p / \lambda_{\max}(S)| \, \|v_p\|_1$, which provides an intuitive way to order the different frequencies.
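The graph SP tools reviewed in this subsection can be sketched in a few lines of NumPy. The example graph (a directed cycle with four nodes), the filter taps, and the function names are hypothetical; taking $\lambda_{\max}$ as the largest eigenvalue magnitude is also an assumption of the sketch.

```python
import numpy as np

def graph_filter(S, h):
    """Return H = sum_l h[l] * S^l for taps h = [h_0, ..., h_{L-1}]."""
    N = S.shape[0]
    H = np.zeros((N, N))
    S_pow = np.eye(N)                  # S^0
    for h_l in h:
        H = H + h_l * S_pow
        S_pow = S_pow @ S              # next power of the shift
    return H

def total_variation(S, z):
    """TV(z) = || z - S z / lambda_max ||_1 (lambda_max: largest magnitude)."""
    lam_max = np.max(np.abs(np.linalg.eigvals(S)))
    return float(np.sum(np.abs(z - S @ z / lam_max)))

# Directed cycle on N = 4 nodes: node n points to node n + 1 (mod N).
N = 4
S = np.zeros((N, N))
for n in range(N):
    S[(n + 1) % N, n] = 1.0

h = [0.5, 0.3, 0.2]                    # hypothetical filter taps
z = np.array([1.0, 2.0, 0.0, -1.0])    # hypothetical graph signal
H = graph_filter(S, h)

# GFT: z_tilde = V^{-1} z, computed with a linear solve instead of an inverse.
lam, V = np.linalg.eig(S)
z_tilde = np.linalg.solve(V, z)
y_tilde = np.linalg.solve(V, H @ z)

# Frequency response of the filter: the polynomial evaluated at each eigenvalue.
h_tilde = sum(h_l * lam ** l for l, h_l in enumerate(h))

tv_constant = total_variation(S, np.ones(N))
tv_alternating = total_variation(S, np.array([1.0, -1.0, 1.0, -1.0]))
```

Filtering in the vertex domain (`H @ z`) corresponds to an entry-wise product `h_tilde * z_tilde` in the frequency domain, and the constant signal has a much smaller total variation than the alternating one.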

III. COFI FROM A GRAPH SP PERSPECTIVE
In this section, we show that if the ratings $x$ are viewed as graph signals defined on user-to-user networks, then NNM predicts signals $\hat{x}$ that are bandlimited in the frequency domain of those networks, that is, signals that can be expressed as a combination of a few eigenvectors of the graph shift operator. This viewpoint allows us to develop more general algorithms with better performance. To that end, let us focus on the generation of $\hat{x}_i$, i.e., the predicted ratings for item $i$, using the ratings from other users and the similarities among them.
The first step is to define the input graph signal, denoted as $\check{x}_i \in \mathbb{R}^U$, whose entries are
$$[\check{x}_i]_u = \begin{cases} X_{ui} - \mu_u & \text{if } i \in \mathcal{S}_u, \\ 0 & \text{otherwise.} \end{cases}$$
Since the bias has been removed, setting the unknown ratings to zero assigns a neutral preference for the item. The second step is to construct the user-similarity network, which will serve as graph shift operator. To this end, we start with the matrix $B$ whose entries are given in (2). Then, in order to account for the fact that users who did not rate $i$ should not be considered when predicting ratings for $i$, we remove any edges starting from $v$ if $X_{vi}$ is unknown. This implies that the similarity network, which will be denoted as $B_i$, depends on the particular item $i$. The final steps are to keep only the edges corresponding to the $k$ most similar users and normalize each row so that the resultant matrix is right stochastic [cf. the denominator in (3)]. Mathematically, this implies that the matrix $B_i \in \mathbb{R}^{U \times U}$ is defined as
$$[B_i]_{uv} = \begin{cases} B_{uv} \Big/ \sum_{v' \in \mathcal{K}_{ui}} B_{uv'} & \text{if } v \in \mathcal{K}_{ui}, \\ 0 & \text{otherwise,} \end{cases}$$
where we recall that $\mathcal{K}_{ui}$ contains the $k$ users that are most similar to $u$ and have rated item $i$. An example of this procedure using the MovieLens-100k dataset is illustrated in Figure 1, where the top network represents the original $B$ and the subsequent plots represent $B_i$ for several items. Once the graph signal $\check{x}_i$ and the graph shift operator $B_i$ are defined, the predicted ratings are simply given by
$$\hat{x}_i = B_i \check{x}_i; \tag{7}$$
cf. (3). In words, the estimated ratings are obtained after applying the graph filter $H = B_i$ to the input signal $\check{x}_i$.
We now analyze the behavior of (7) in the frequency domain, to conclude that $H = B_i$ acts as a band-stop graph filter. Given an item $i$, consider the eigendecomposition of the user-similarity network $S = B_i = V \Lambda V^{-1}$. Denote the GFT of the known input signal as $\tilde{\check{x}}_i = V^{-1} \check{x}_i$, and the GFT of the predicted rating as $\tilde{\hat{x}}_i = V^{-1} \hat{x}_i$.
The two GFTs are related via
$$\tilde{\hat{x}}_i = V^{-1} \hat{x}_i = V^{-1} V \Lambda V^{-1} \check{x}_i = \Lambda \tilde{\check{x}}_i. \tag{8}$$
Therefore, the frequency response of the filter implementing NNM is $\mathrm{diag}(\tilde{h}) = \mathrm{diag}(\tilde{b}_i) = \Lambda$, and the $p$-th frequency coefficient of the predicted output is $[\tilde{\hat{x}}_i]_p = \lambda_p [\tilde{\check{x}}_i]_p$. Since $B_i$ is likely to be non-symmetric, $\lambda_p$ is expected to be a complex number. Remember that $\lambda_{\max}(B_i)$ is always 1 because of right stochasticity and that eigenvectors can be ordered according to $\mathrm{TV}(v_p) = |\lambda_p - 1|$; see [12] and the related discussion after Definition 1. As a result, smooth (low-frequency) eigenvectors are signals where $v_p - B_i v_p \approx 0$; i.e., full rating signals where users that are similar tend to agree.
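The construction of the item-specific network and the filtering step can be sketched as follows. The similarity matrix `B`, the toy ratings, and the function names are hypothetical, and the mean $\mu_u$ is added back after filtering so that the output matches the form of the nearest-neighbor prediction; this is a sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

def item_graph(B, rated_i, k):
    """Build B_i: for each user u keep the k largest-weight edges coming
    from users who rated item i, then normalize each row to sum to one."""
    U = B.shape[0]
    B_i = np.zeros((U, U))
    for u in range(U):
        cand = [v for v in range(U) if v != u and rated_i[v]]
        K = sorted(cand, key=lambda v: B[u, v], reverse=True)[:k]
        total = sum(B[u, v] for v in K)
        if K and total != 0:
            B_i[u, K] = B[u, K] / total
    return B_i

def input_signal(X, i):
    """Mean-removed ratings of item i, with zeros where the rating is unknown."""
    known = ~np.isnan(X)
    mu = np.array([X[u, known[u]].mean() for u in range(X.shape[0])])
    x_check = np.where(known[:, i], X[:, i] - mu, 0.0)
    return x_check, mu

# Hypothetical data: 4 users, 3 items; user 0 has not rated item 2.
X = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 2.0],
              [1.0, 5.0, 4.0],
              [np.nan, 4.0, 5.0]])
B = np.array([[0.0, 0.9, 0.1, 0.5],      # hypothetical similarity scores
              [0.9, 0.0, 0.2, 0.4],
              [0.1, 0.2, 0.0, 0.8],
              [0.5, 0.4, 0.8, 0.0]])
i, k = 2, 2
x_check, mu = input_signal(X, i)
B_i = item_graph(B, ~np.isnan(X[:, i]), k)
x_hat = mu + B_i @ x_check               # filter output plus the user means

# Frequency-domain check: V^{-1} (B_i x) equals Lambda V^{-1} x.
lam, V = np.linalg.eig(B_i)
lhs = np.linalg.solve(V, B_i @ x_check)
rhs = lam * np.linalg.solve(V, x_check)
```

For user 0 the two retained neighbors are users 1 and 3, so `x_hat[0]` is the weighted average of their mean-removed ratings added to user 0's mean, and `lhs` agrees with `rhs` as the frequency relation requires.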
To gain further intuition on the spectral behavior of (8), we examine the frequency response of $B_i$ for the MovieLens-100k dataset. Specifically, for each $B_i$, we order its eigenvalues according to $|\lambda_p - 1|$, and record the frequency response $\tilde{b}_i$ for low, middle, and high frequencies. The $I$ frequency responses obtained using this procedure are then averaged across $i$, giving rise to the single transfer function depicted in Figure 2 (a). To help visualization, the scale in the horizontal axis is not homogeneous and only the real part of the eigenvalues is shown (the imaginary part is very small). The main finding is that the frequency response is zero for more than 90% of the frequencies, implying that the predicted signal will be graph bandlimited. Another observation of interest is that the frequencies not rejected by the filter, and hence present in the predicted output, are the ones associated with the first eigenvectors (low values of $p$) and the last eigenvectors (high values of $p$). The first eigenvectors represent signals of small total variation, while the last ones are associated with signals of high variation. Since the diagonal elements of each matrix $B_i$ are zero, the sum of the eigenvalues is zero, with the eigenvalues associated with low frequencies being positive and those associated with signals of large total variation being negative. The low-pass components represent signals where similar users tend to have similar ratings, providing the big picture for the predicted rating. In contrast, the high-pass component focuses on the differences between users with similar taste for the particular item. With this interpretation one can see (8) as a filter that eliminates the irrelevant features (middle frequencies), smooths out the similar components (low frequencies) and preserves the discriminative features (high frequencies).
This band-stop behavior where both high and low graph frequencies are preserved is not uncommon in image processing (image de-noising and image sharpening, respectively) [13], and was also observed in brain signal analytics [8], [14].
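The ordering of frequencies used in this analysis can be illustrated with a short sketch. The matrix below is a hypothetical right-stochastic network with five users in which only users 0 and 1 rated the item; it is not taken from the dataset. One fact worth noting in this toy setting is that the columns of users who did not rate the item are zero, so the rank of the matrix is at most the number of raters and the remaining eigenvalues are zero. Whether this fully explains the behavior reported for MovieLens-100k is not claimed here.

```python
import numpy as np

def ordered_frequency_response(B_i):
    """Eigenvalues of B_i sorted from low to high graph frequency,
    i.e. by increasing |lambda_p - 1|."""
    lam = np.linalg.eigvals(B_i)
    order = np.argsort(np.abs(lam - 1.0))
    return lam[order]

# Hypothetical network: only users 0 and 1 rated the item, so only the
# first two columns are non-zero. Rows sum to one and the diagonal is zero.
B_i = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0, 0.0],
                [0.5, 0.5, 0.0, 0.0, 0.0],
                [0.7, 0.3, 0.0, 0.0, 0.0],
                [0.2, 0.8, 0.0, 0.0, 0.0]])

response = ordered_frequency_response(B_i)
num_rejected = int(np.sum(np.abs(response) < 1e-8))   # frequencies with zero response
```

Here the lowest frequency has a positive response, the highest frequency has a negative one, the middle frequencies are rejected, and the responses sum to zero because the diagonal of the matrix is zero.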

IV. ENHANCING COFI VIA GRAPH SP
Using definitions and tools from graph SP, the previous section demonstrated that the rating predictions generated by NNM can be understood as signals that are sparse in a graph frequency domain. In this section, we illustrate how these interpretations can be leveraged to design novel graph-SP-based CoFi methods with enhanced prediction performance.
As shown in Section III, user-based NNM predicts the ratings for item $i$ via $\hat{x}_i = B_i \check{x}_i$, which can be modeled as the implementation of a band-stop graph filter of order one. Our proposal here is, using $S = B_i$ as shift, to design other types of band-stop graph filters $H(S)$ to perform rating predictions. Consider first $H = B_i^2$, whose frequency response is $\mathrm{diag}(\tilde{h}) = \mathrm{diag}(\tilde{b}_i)^2 = \Lambda^2$. The fact that $B_i$ is a band-stop filter implies that many of the entries of its frequency response $\tilde{b}_i$ are zero. As a result, $B_i^2$ has a band-stop behavior too, and the same holds true for any positive power of $B_i$. Since all powers of $B_i$ are band-stop operators, the unknown ratings predicted with graph filters of the form
$$\hat{x}_i = H \check{x}_i, \quad \text{with } H = \sum_{l=0}^{L} h_l B_i^l, \tag{9}$$
will also give rise to bandlimited signals. Hence, predictions in (9) are generalizations of the traditional NNM in (7), which estimates $\hat{x}_i$ using a filter $H = B_i$ of order one. This graph-frequency interpretation can be complemented by understanding the effect of $B_i$ in the graph vertex domain. To do so, note that $B_i^0 \check{x}_i = \check{x}_i$ coincides with the original signal, $B_i \check{x}_i$ is an average of the ratings given by one-hop neighbors, $B_i^2 \check{x}_i$ is an average of elements in nodes that interact via intermediate common neighbors, and, in general, $B_i^l \check{x}_i$ describes interactions between $l$-hop neighbors. Therefore, on top of using the ratings of very similar users to make predictions, the powers $B_i^l$ in the graph filter in (9) also account for chains of users with similar taste, exploiting them to generate enhanced predictions.
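The vertex-domain reading of the higher-order filter can be sketched by computing the $l$-hop signals with repeated matrix-vector products, which avoids forming the matrix powers explicitly. The function names, the toy network, and the convention that the tap vector starts at $h_1$ are assumptions of this sketch.

```python
import numpy as np

def hop_features(B_i, x_check, L):
    """Return a U x L matrix whose l-th column is B_i^l x_check, l = 1..L."""
    feats = []
    z = np.array(x_check, dtype=float)
    for _ in range(L):
        z = B_i @ z                    # one more hop along the similarity network
        feats.append(z)
    return np.stack(feats, axis=1)

def filter_predict(B_i, x_check, h):
    """Apply the filter sum_{l=1}^{L} h_l B_i^l to x_check, with h = [h_1, ..., h_L]."""
    return hop_features(B_i, x_check, len(h)) @ np.asarray(h, dtype=float)

# Hypothetical right-stochastic network and mean-removed input signal.
B_i = np.array([[0.0, 0.5, 0.5],
                [1.0, 0.0, 0.0],
                [0.5, 0.5, 0.0]])
x_check = np.array([1.0, -1.0, 0.0])

one_hop = filter_predict(B_i, x_check, [1.0])        # classical NNM: h_1 = 1
two_hop = filter_predict(B_i, x_check, [0.0, 1.0])   # only the two-hop term
mixed = filter_predict(B_i, x_check, [0.7, 0.3])     # hypothetical taps
```

With `h = [1.0]` the output coincides with the order-one filter, while additional taps blend in information from users that are two or more hops away.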
In contrast to classical NNM, the filter coefficients $h$ are not known a priori, and therefore need to be learned from a training set. Besides, $h_0$ is irrelevant: since $B_i^0 \check{x}_i = \check{x}_i$ is zero at the entries to be predicted, that term would not be helpful in predictions. Then, the filter coefficients are found by solving
$$\min_{h} \sum_{(u,i) \in \mathcal{S}} \Big| X_{ui} - \mu_u - \Big[ \sum_{l=1}^{L} h_l B_i^l \check{x}_i \Big]_u \Big|^2 + r \| h \|_2^2, \tag{10}$$
where $r$ is a regularization parameter that can be tuned by cross-validation on the training set to avoid overfitting. Note that the formulation in (10) is a least-squares problem which, using the Moore-Penrose pseudo-inverse, admits a closed-form solution. If the value of $L$ is too large and a sparse vector of filter coefficients is desired, the regularizer $\| h \|_2^2$ can be either replaced or augmented with $\| h \|_1$.
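The closed-form solution mentioned above can be sketched as a ridge regression. The sketch assumes that the training data have been arranged into a matrix `Phi` with one row per known user-item pair $(u, i)$ and one column per tap, the entry being the $u$-th element of $B_i^l \check{x}_i$, and a vector `y` with the corresponding mean-removed ratings. That arrangement, the names, and the toy numbers are assumptions of this illustration.

```python
import numpy as np

def fit_filter_taps(Phi, y, r):
    """Regularized least squares: minimize ||y - Phi h||^2 + r ||h||^2.

    The pseudo-inverse keeps the expression valid when r = 0 and
    Phi^T Phi is singular.
    """
    L = Phi.shape[1]
    A = Phi.T @ Phi + r * np.eye(L)
    return np.linalg.pinv(A) @ Phi.T @ y

# Hypothetical features for three training pairs and L = 2 taps.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
y = Phi @ np.array([0.7, 0.3])         # noiseless targets from known taps

h_ls = fit_filter_taps(Phi, y, r=0.0)      # plain least squares
h_ridge = fit_filter_taps(Phi, y, r=0.5)   # shrunk towards zero
```

With `r = 0` the generating taps are recovered exactly in this noiseless toy case, and a positive `r` shrinks the norm of the solution.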

V. NUMERICAL EXPERIMENTS
In this section, we illustrate how our methods improve the rating accuracy in real data. For that purpose we use the MovieLens-100k dataset [11], which contains ratings from 943 users on 1,682 movies. The number of available ratings is 100,000, i.e., 6.3% of the total number of user-item pairs. We randomly select 100 ratings as the testing set $X_{ts}$, and use the rest as training set $X_{tr}$. The sets containing the indexes of the elements in $X_{ts}$ and $X_{tr}$ are denoted as $\mathcal{S}_{ts}$ and $\mathcal{S}_{tr}$, respectively. The networks and filter coefficients are trained only on the training set. As a performance metric we use the global root mean squared error (RMSE). User-based NNM is used as the benchmark algorithm. To get an estimate for the regularization constant $r$ used in (10), we perform cross-validation by breaking the ratings in the training set into three equally sized subsets.
Before we start to compare different approaches, the first task is to assess the best performance that one can achieve with the setup at hand. To this end, we use the networks $B_i$ learned on the training set $X_{tr}$ and learn the filter coefficients by solving the problem in (10) using the testing set $X_{ts}$, i.e.,
$$\min_{h} \sum_{(u,i) \in \mathcal{S}_{ts}} \Big| X_{ui} - \mu_u - \Big[ \sum_{l=1}^{L} h_l B_i^l \check{x}_i \Big]_u \Big|^2 + r \| h \|_2^2. \tag{11}$$
Since the coefficients above are biased towards the data in $X_{ts}$ and all other schemes will be trained using $X_{tr}$, the performance achieved by (11) on $X_{ts}$ will serve as a benchmark for all other schemes. The RMSE across ratings in the testing set $X_{ts}$ using the $h$ trained in (11) for different values of $L$ and $r$ is presented in Table I. There are several interesting observations. First, the RMSE for both $r = 0$ and $r = 0.5$ decreases as the number of filter taps $L$ increases from 1 to 6, remaining flat for $L > 6$. This seems to imply that considering chains of more than 6 users does not improve prediction performance (recall $U = 943$ and $k = 40$).
Second, the RMSE with large $L$ and $r = 0$ is around 0.80, which will be the value considered as the benchmark for algorithms that learn $h$ on $X_{tr}$ and test their performance on $X_{ts}$. Finally, the difference between the RMSE for $r = 0$ and $r = 0.5$ is around 0.05, which represents an increase of approximately 6%.
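The three-fold cross-validation used to tune the regularization constant can be sketched as follows. The feature matrix `Phi`, the targets `y`, the candidate grid, and the random seed are synthetic stand-ins chosen for illustration; the sketch does not reproduce the experiment reported in the tables.

```python
import numpy as np

def ridge_taps(Phi, y, r):
    """Closed-form regularized least-squares taps (pseudo-inverse for safety)."""
    L = Phi.shape[1]
    return np.linalg.pinv(Phi.T @ Phi + r * np.eye(L)) @ Phi.T @ y

def three_fold_cv(Phi, y, r_grid, seed=0):
    """Pick the r in r_grid with the lowest validation RMSE over three folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), 3)
    scores = []
    for r in r_grid:
        errs = []
        for j in range(3):
            val = folds[j]
            trn = np.concatenate([folds[m] for m in range(3) if m != j])
            h = ridge_taps(Phi[trn], y[trn], r)        # fit on two folds
            errs.append(np.mean((Phi[val] @ h - y[val]) ** 2))
        scores.append(float(np.sqrt(np.mean(errs))))   # RMSE across folds
    best = r_grid[int(np.argmin(scores))]
    return best, scores

# Synthetic stand-in for the training features and mean-removed ratings.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 3))
y = Phi @ np.array([0.6, 0.3, 0.1]) + 0.1 * rng.normal(size=30)

r_grid = [0.0, 0.5, 1.0]
best_r, scores = three_fold_cv(Phi, y, r_grid)
```

Each candidate value of `r` is scored on data that was held out when fitting the taps, and the value with the lowest validation error is kept.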
When we solve the actual problem in (10) with coefficients learned on the training set $X_{tr}$, we rely on the results in Table I to limit the maximum number of taps to 6. The RMSE on the testing set $X_{ts}$ for different values of $L$ and $r$ is presented in Table II. The main observations are: i) higher-order filters perform better than the traditional order-one NNM (user-based NNM attains an RMSE of 0.9116, while for $L = 6$ and $r = 0.5$ our method attains an error of 0.8551, which is 6.20% smaller); and ii) the prediction performance, especially for the case where $r = 0.5$, is not much worse than that shown in Table I, and the trends are also similar to those shown in Table I. Moreover, when proper regularization is applied, the optimal coefficients learned from the training set are also close to the coefficients learned from the testing set.
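The performance metric and the quoted relative improvement can be checked with a short sketch. The two RMSE values are the ones stated in the text; the function names and the small example vectors are hypothetical.

```python
import numpy as np

def rmse(pred, truth):
    """Global root mean squared error between predicted and true ratings."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def relative_reduction(baseline, new):
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline

# RMSE values reported in the text for user-based NNM and for L = 6, r = 0.5.
rmse_nnm = 0.9116
rmse_filter = 0.8551
improvement_pct = 100.0 * relative_reduction(rmse_nnm, rmse_filter)

# Hypothetical two-rating example for the metric itself.
example_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```

The relative reduction is the difference between the two errors divided by the benchmark error, which matches the 6.20% figure up to rounding.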
Another interesting observation is that the optimal coefficients learned from either the training set in (10) or the testing set in (11) tend to satisfy $\sum_{l=1}^{L} h_l \approx 1$, as illustrated in Figure 2 (c). Such a property does not seem to depend on the number of taps used in the filter. Recall that traditional user-based NNM can be considered as a specific band-stop graph filter with coefficients $h_1 = 1$ and $h_l = 0$ for any $l \neq 1$. Therefore, this supports the idea that traditional user-based NNM is the optimal design for a band-stop graph filter if only filters with one tap are considered.
To gain further insights, the frequency response of the user-based filter with $L = 6$ and $h$ learned on $X_{tr}$ is illustrated in Figure 2 (b). The frequency response is highly similar to that of user-based NNM in Figure 2 (a), since both of them are band-stop filters; however, there are two major differences. First, both the amplitude of the response and the range of low and high frequencies with a large response (in absolute value) increase. Second, the frequency response for high frequencies in Figure 2 (b) becomes positive, whereas the response for high frequencies in NNM is negative. The second point is potentially the reason why designing filters by training the filter coefficients can improve the RMSE by 6.20%. The two differences can be considered as the advantages of designing filters with a more flexible form than NNM.

VI. CONCLUSIONS
This paper exploited results from graph SP to propose new interpretations for RS methods. Our first contribution was to show that CoFi can be considered as a specific band-stop graph filter operating on the network describing similarities between users or items. Leveraging this, we then proposed a new method for RS using other types of band-stop graph filters. We also proposed a computationally efficient scheme to design the parameters that define our methods and assessed their performance on the MovieLens-100k dataset. The results showed that, compared to the benchmark approach, we reduced the RMSE by 6.20%. Relevant observations regarding how the networks are formed, as well as on the filter coefficients and the corresponding frequency response, were also discussed. Future work includes considering other types of graph filters and investigating matrix completion from a graph SP perspective.