1 Introduction

Recommender systems help users find relevant items among the large amount of information available in the age of information explosion. They are increasingly used in web services such as e-commerce, news sites, and music streaming platforms to increase the revenue of these sites [1, 2].

One of the most successful methods for recommender systems is matrix factorization [3], which decomposes the user-item rating matrix into a user latent matrix and an item latent matrix that share a set of latent factors. Probabilistic matrix factorization (PMF) [3] is a probabilistic formulation that works well on sparse datasets. To address the cold-start problem, models that add auxiliary information such as tags [4], text [5, 6], images [1, 7], and social relations [8, 9] to matrix factorization have been proposed.

With the advent of news sites and web media, and the development of review functions in services such as Twitter and Amazon, people have access to a large amount of text data. However, as the amount of data increases, users find it harder to reach the information they need. Therefore, among all the auxiliary information available, text data is becoming increasingly valuable for recommending suitable items to users.

ConvMF [5] is a representative method that uses text information in matrix factorization to improve recommendation accuracy. It extracts features from the text describing each item using a CNN and incorporates them into PMF. Here, the text describing an item is the item description written by each store and the reviews posted for the item. User reviews, on the other hand, directly reflect user characteristics. For example, when a user posts "This shirt is comfortable to wear", we can infer that this user is looking for clothes that are comfortable to the touch. ConvMF, however, cannot capture such user characteristics. Therefore, we propose Double-ConvMF, which extracts features not only from item description text but also from user description text using CNNs and integrates both into PMF. The contributions of this paper are as follows.

  • We propose Double-ConvMF, in which user reviews are treated as text representing user characteristics and integrated into ConvMF, and show that it achieves high accuracy on three real-world datasets.

  • We examine the effectiveness of item and user description texts by comparing the results not only with ConvMF, which uses only item description text, but also with a new model that uses only user description text.

2 Related work

A well-known text-based approach to recommendation extracts keywords from text describing items and user characteristics and uses them for recommendation. Beel et al. proposed a recommendation system using TF-IDF, which weights the frequency of a word in a document by its rarity across documents [10]. Musto et al. proposed a recommendation system using word2vec, a word embedding model [11].

In addition, with recent improvements in computing technology, many methods using deep learning have been introduced. Zhang et al. [12] extracted information from users’ purchase histories and reviews and used hierarchical RNNs that learn from both sources to improve accuracy.

Moreover, more recent methods apply convolutional neural networks (CNN) to text to capture the context before and after each word and improve recommendation accuracy. Wu et al. [13] used a CNN to capture the context of news article titles. Kim et al. [5] proposed ConvMF, which captures contextual information and incorporates it into PMF. ConvMF improves accuracy by applying a CNN to the item description text and using its output to adjust the prior distribution of items in PMF. In this paper, we additionally use user description text, which is not considered in ConvMF. By adjusting the prior distribution of users in PMF in the same way, we aim to improve recommendation accuracy and examine the effectiveness of using both item and user text.

A method similar to ours is BiConvMF by Liu et al. [14]. However, the datasets to which it was applied are limited, and no experiments were conducted focusing only on user text. In this paper, we also propose a method that applies document features only to the user side and conduct comparative experiments on three real-world datasets.

3 Method

In this section, we explain our proposed method, Double-ConvMF. First, we introduce the probabilistic model of PMF [3]. Then, we describe the probabilistic model and optimization method for simultaneously incorporating item and user description text into PMF.

3.1 PMF (probabilistic matrix factorization)

Salakhutdinov et al. [3] proposed PMF, a probabilistic matrix factorization method. PMF assumes N users, M items, a latent dimensionality D, and a rating matrix \(R\in \mathbb {R}^{N\times {M}}\) obtained from the users’ rating information. PMF factorizes R into a user matrix \(U\in \mathbb {R}^{D\times {N}} =\{u_1, u_2,..., u_N\}\) and an item matrix \(V\in \mathbb {R}^{D\times {M}}=\{v_1, v_2,..., v_M\}\). Let \(r_{ij}\) denote the rating given by user i to item j. The conditional distribution of R is then expressed by the following equation:

$$\begin{aligned} p(R|U,V,\sigma ^2)=\prod _{i}^{N}\prod _{j}^{M}N(r_{ij}|u^{T}_{i}v_{j},\sigma ^2)^{I_{ij}}, \end{aligned}$$
(1)

where \(N(x|\mu ,\sigma ^2)\) is the probability density function of the Gaussian distribution with mean \(\mu\) and variance \(\sigma ^2\), \(\sigma ^2\) is the variance of the Gaussian noise of R, and \(I_{ij}\) is the indicator function that is equal to 1 if user i rated item j and equal to 0 otherwise.

The optimal matrices U and V minimize the loss function \(\varepsilon\), as shown in the following:

$$\begin{aligned} \begin{aligned} min\;\varepsilon (U,V)=&\sum _{i}^{N}\sum _{j}^{M}\frac{I_{ij}}{2} (r_{ij}-u_{i}^{T}v_{j})^2\\ {}&+\frac{\lambda _U}{2} \sum _{i}^N \Vert u_i \Vert ^2 + \frac{\lambda _V}{2}\sum _{j}^M \Vert v_j \Vert ^2, \end{aligned} \end{aligned}$$
(2)

where \(\lambda _U\) and \(\lambda _V\) are the \(L_2\) regularization coefficients derived from the Gaussian noise of R, U, and V.
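As an illustration, the loss in (2) can be evaluated directly with NumPy. This is a minimal sketch under the shape conventions above (U is D×N, V is D×M); the function name `pmf_loss` and the toy values are our own and do not come from the paper.

```python
import numpy as np

def pmf_loss(R, I, U, V, lam_u, lam_v):
    """Evaluate the objective of Eq. (2).

    R, I: (N, M) rating and indicator matrices; U: (D, N); V: (D, M).
    """
    err = I * (R - U.T @ V)                  # residuals on observed entries only
    data_term = 0.5 * np.sum(err ** 2)
    reg_u = 0.5 * lam_u * np.sum(U ** 2)     # sum of squared norms of the u_i
    reg_v = 0.5 * lam_v * np.sum(V ** 2)     # sum of squared norms of the v_j
    return data_term + reg_u + reg_v

# Toy example (illustrative values only): 2 users, 2 items, D = 1.
R = np.array([[5.0, 0.0], [0.0, 3.0]])
I = np.array([[1.0, 0.0], [0.0, 1.0]])
U = np.ones((1, 2))
V = np.ones((1, 2))
loss = pmf_loss(R, I, U, V, lam_u=1.0, lam_v=1.0)
```

Only the two observed entries contribute to the data term, which is what the indicator \(I_{ij}\) expresses in (2).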

3.2 Double-ConvMF

Fig. 1 Graphical model of Double-ConvMF: the PMF part is in the center (blue), the ConvMF part on the right (green), and the part added by Double-ConvMF on the left (red)

Figure 1 shows an overview of the probabilistic model for Double-ConvMF. \(X=\{x_1,x_2,..., x_M\}\) is the set of description documents of items, and W is the weight vector of the item-side CNN architecture. \(X^+=\{x^+_1,x^+_2,..., x^+_N\}\) is the set of description documents of users, and \(W^+\) is the weight vector of the user-side CNN architecture. In PMF, R is generated from U, V, and \(\sigma\). In ConvMF, V is generated from X, W, and \(\sigma _V\), which represents the Gaussian noise. In this way, the prior distribution of V in PMF is adjusted by the item description text.

In the proposed Double-ConvMF, we additionally incorporate \(X^+\), \(W^+\), and \(\sigma _U\), which represents the Gaussian noise of U, into the matrix factorization. The user description text thereby adjusts the prior distribution of U, and we examine how effectively this improves recommendation accuracy.

In this paper, we use the CNN architecture proposed in ConvMF, which consists of four layers: embedding, convolution, pooling, and output. In the following, \(cnn(W,x_j)\) is the feature vector of item j obtained by applying the CNN architecture to the document vector \(x_j\) of item j, and \(cnn(W^+,x^+_i)\) is the feature vector of user i obtained by applying the CNN architecture to the document vector \(x^+_i\) of user i.
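To make the four layers concrete, the forward pass can be sketched in NumPy as follows. This is an illustrative sketch, not the configuration used in ConvMF or in our experiments: the single window size, the ReLU after the convolution, the single tanh output layer, and all names and shapes are assumptions made for the example.

```python
import numpy as np

def cnn_features(token_ids, emb, conv_w, conv_b, out_w, out_b):
    """Map one document to a D-dimensional feature vector.

    token_ids: (L,) integer word indices of the document
    emb:       (vocab, p) embedding table
    conv_w:    (n_filters, ws, p) filters spanning ws consecutive words
    conv_b:    (n_filters,) filter biases
    out_w:     (D, n_filters) output projection; out_b: (D,)
    """
    x = emb[token_ids]                                   # embedding layer: (L, p)
    n_filters, ws, _ = conv_w.shape
    n_windows = x.shape[0] - ws + 1
    windows = np.stack([x[t:t + ws] for t in range(n_windows)])  # (n_windows, ws, p)
    conv = np.einsum("twp,fwp->tf", windows, conv_w) + conv_b    # convolution layer
    conv = np.maximum(conv, 0.0)                         # ReLU (assumed activation)
    pooled = conv.max(axis=0)                            # max pooling over positions
    return np.tanh(out_w @ pooled + out_b)               # output layer: (D,)

# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
vocab, p, ws, n_filters, D = 20, 4, 3, 5, 2
emb = rng.normal(size=(vocab, p))
conv_w = rng.normal(size=(n_filters, ws, p))
conv_b = np.zeros(n_filters)
out_w = rng.normal(size=(D, n_filters))
out_b = np.zeros(D)
doc = rng.integers(0, vocab, size=7)
features = cnn_features(doc, emb, conv_w, conv_b, out_w, out_b)
```

The same function would serve both sides: item documents with one set of weights, and user documents with a separate set.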

When \(cnn(W,x_j)\) and \(cnn(W^+,x^+_i)\) are used, V and U in the PMF probability model can be expressed by the following prior distribution equations:

$$\begin{aligned}{} & {} p(V|W,X,\sigma ^2_V)=\prod _{j}^{M}N\left( v_{j}|cnn\left( W,x_j\right) ,\sigma ^2_VI_K\right) \end{aligned}$$
(3)
$$\begin{aligned}{} & {} p(U|W^+,X^+,\sigma ^2_U)=\prod _{i}^{N}N\left( u_{i}|cnn\left( W^+,x^+_i\right) ,\sigma ^2_UI_K\right) , \end{aligned}$$
(4)

where \(I_K\) represents the identity matrix. Using Eqs. (1), (3), and (4), the posterior distribution can be written as follows:

$$\begin{aligned} \begin{aligned}&p\left( U,V,W,W^+|R,X,X^+,\sigma ^2,\sigma ^2_U,\sigma ^2_V,\sigma ^2_W,\sigma ^2_{W^+}\right) \\ {}&\propto p\left( R|U,V,\sigma ^2\right) p\left( U|X^+,W^+,\sigma ^2_U\right) p\left( V|X,W,\sigma ^2_V\right) \\ {}&\quad p\left( W|\sigma ^2_W\right) p\left( W^+|\sigma ^2_{W^+}\right) . \end{aligned} \end{aligned}$$
(5)

Now, to optimize (5), we use the following maximum a posteriori estimation:

$$\begin{aligned} \begin{aligned}&\underset{U,V,W,W^+}{max}p\left( U,V,W,W^+|R,X,X^+,\sigma ^2,\sigma ^2_V,\sigma ^2_U,\sigma ^2_W,\sigma ^2_{W^+}\right) \\ {}&=\underset{U,V,W,W^+}{max}[p\left( R|U,V,\sigma ^2\right) p\left( U|X^+,W^+,\sigma _U^2\right) p\left( V|X,W,\sigma _V^2\right) \\ {}&\qquad \qquad \qquad p\left( W^+|\sigma ^2_{W^+}\right) p\left( W|\sigma ^2_W\right) ]. \end{aligned} \end{aligned}$$
(6)

By taking the negative logarithm in (6), we can reformulate it as follows:

$$\begin{aligned} \begin{aligned} min\;\varepsilon \left( U,V,W,W^+\right) =\sum _{i}^{N}\sum _{j}^{M}\frac{I_{ij}}{2}\left( r_{ij}-u^T_iv_j\right) ^2 \\+\frac{\lambda _U}{2}\sum ^{N}_{i}\Vert u_i-cnn\left( W^+,x^+_i\right) \Vert ^2 \\+\frac{\lambda _V}{2}\sum ^{M}_{j}\Vert v_j-cnn\left( W,x_j\right) \Vert ^2\\+\frac{\lambda _{W^+}}{2}\sum ^{|W^+_e|}_{e}\Vert W^+_e\Vert ^2+\frac{\lambda _{W}}{2}\sum ^{|W_d|}_{d}\Vert W_d\Vert ^2, \end{aligned} \end{aligned}$$
(7)

where \(\lambda _U\), \(\lambda _V\), \(\lambda _W\), and \(\lambda _{W^+}\) are the regularization coefficients derived from the Gaussian noise in U, V, W, and \(W^+\), respectively. Setting the partial derivatives of (7) with respect to \(u_i\) and \(v_j\) to zero yields the following equations:

$$\begin{aligned}{} & {} u_i=\left( VI_iV^T+\lambda _UI_K\right) ^{-1}\left( VR_i+\lambda _Ucnn\left( W^+,x^{+}_{i}\right) \right) \end{aligned}$$
(8)
$$\begin{aligned}{} & {} v_j=\left( UI_jU^T+\lambda _VI_K\right) ^{-1}\left( UR_j+\lambda _Vcnn\left( W,x_{j}\right) \right) . \end{aligned}$$
(9)

In (8), \(u_i\) is treated as the variable and all other quantities as constants; likewise, \(v_j\) is treated as the variable in (9). Here, \(I_i\) is a diagonal matrix whose diagonal components are the indicator vector \(\{I_{i1}, I_{i2},..., I_{iM}\}\), which indicates whether user i rated each item. Similarly, \(I_j\) is a diagonal matrix whose diagonal components are the indicator vector \(\{I_{1j}, I_{2j},..., I_{Nj}\}\). \(R_i\) is the rating vector \(\{r_{i1}, r_{i2},...,r_{iM}\}\), and \(R_j\) is the rating vector \(\{r_{1j}, r_{2j},...,r_{Nj}\}\). Based on (8) and (9), U and V are updated by stochastic gradient descent to obtain the optimal user matrix U and item matrix V.
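As a sanity check, the user update (8) can be evaluated numerically and the gradient of (7) with respect to \(u_i\) verified to vanish at the result. The sketch below assumes that unobserved entries of \(R_i\) contribute nothing (they are masked by the indicator vector); the function name and the random toy data are our own.

```python
import numpy as np

def update_user(V, r_i, ind_i, theta_i, lam_u):
    """Evaluate Eq. (8) for one user.

    V: (D, M) item matrix; r_i: (M,) ratings of user i;
    ind_i: (M,) 0/1 indicator of rated items; theta_i: (D,) stands for cnn(W+, x+_i).
    """
    D = V.shape[0]
    A = (V * ind_i) @ V.T + lam_u * np.eye(D)   # V I_i V^T + lambda_U I_K
    b = V @ (ind_i * r_i) + lam_u * theta_i     # V R_i + lambda_U cnn(W+, x+_i)
    return np.linalg.solve(A, b)                # solve instead of forming the inverse

# Toy data (illustrative only): D = 3 latent factors, M = 6 items.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 6))
r_i = rng.integers(1, 6, size=6).astype(float)
ind_i = np.array([1, 0, 1, 1, 0, 1], dtype=float)
theta_i = rng.normal(size=3)
u_i = update_user(V, r_i, ind_i, theta_i, lam_u=10.0)

# Gradient of (7) with respect to u_i, which should be zero at the solution.
grad = (V * ind_i) @ (V.T @ u_i - r_i) + 10.0 * (u_i - theta_i)
```

The item update (9) has the same form with the roles of U and V exchanged.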

However, W and \(W^+\) cannot be optimized in the same way as U and V because they are closely tied to the components of the CNN architecture, such as the max pooling layer and the nonlinear activation function. Therefore, we temporarily fix U and V and use error back-propagation to estimate W and \(W^+\).
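For reference, with U and V held fixed, the terms of (7) that depend on the CNN weights reduce to the following objectives. This is only a restatement of (7), written out under the assumption that each CNN is trained on its own terms; const collects everything that does not depend on the respective weights.

$$\begin{aligned} \varepsilon (W)&=\frac{\lambda _V}{2}\sum _{j}^{M}\Vert v_j-cnn\left( W,x_j\right) \Vert ^2+\frac{\lambda _W}{2}\sum _{d}^{|W_d|}\Vert W_d\Vert ^2+\text {const},\\ \varepsilon (W^+)&=\frac{\lambda _U}{2}\sum _{i}^{N}\Vert u_i-cnn\left( W^+,x^+_i\right) \Vert ^2+\frac{\lambda _{W^+}}{2}\sum _{e}^{|W^+_e|}\Vert W^+_e\Vert ^2+\text {const}. \end{aligned}$$

Each objective is a squared-error regression of the current latent vectors on the CNN outputs with \(L_2\) regularization, which is why back-propagation applies directly.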

4 Experiment

4.1 Goal, dataset, and experiment configuration

We compare the performance of the proposed and existing methods on three real-world datasets to check the feasibility of our proposed method. We first explain the datasets, the compared methods, and the evaluation metric. Then, we discuss the experimental results.

This experiment uses the Amazon dataset [7] for clothes, the Rakuten dataset [15] for women’s accessories, and the Yelp dataset for restaurants in British Columbia. In these datasets, user preferences are believed to be more likely to be reflected in reviews. The user description text is the set of reviews posted by each user. The item description text for Rakuten is the description prepared by each store; for Amazon and Yelp, it is the concatenation of the reviews posted for the item.

Table 1 Dataset details

Ratings in each dataset take values from 1 to 5. As this experiment deals with text data, items without text representing the item and users without text representing the user were excluded from the datasets. In addition, users who rated only one item were excluded because their data could not be split into training and test data. Table 1 shows the statistics of each dataset after these steps. In the actual experiments, each dataset was randomly divided into training, validation, and test data in a ratio of 8:1:1.
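A random 8:1:1 split can be sketched as follows. This is only an illustration: the function name, the fixed seed, and the toy rating triples are our own, and whether the split should additionally be stratified per user is an assumption the sketch does not make.

```python
import random

def split_ratings(ratings, seed=0):
    """Shuffle rating triples and split them 8:1:1 into train, validation, test."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)     # local generator, reproducible
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]       # remainder goes to the test set
    return train, valid, test

# Toy data: (user, item, rating) triples with ratings from 1 to 5.
ratings = [(u, i, 1 + (u + i) % 5) for u in range(10) for i in range(10)]
train, valid, test = split_ratings(ratings)
```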

For each text, the following preprocessing was performed as in ConvMF [5]: (1) set the maximum length of raw documents to 300 words, (2) removed stop words, (3) calculated the TF-IDF score for each word, (4) removed corpus-specific stop words that have a document frequency higher than 0.5, (5) selected top 8000 distinct words as vocabulary, and (6) removed all non-vocabulary words from raw documents.
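The six steps can be sketched in plain Python as below. This is a simplified illustration rather than the preprocessing code used in ConvMF: the tiny stop-word list, whitespace tokenization, and ranking words by their maximum TF-IDF score over documents are all assumptions made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "and", "of", "this"}  # illustrative list only

def preprocess(docs, max_len=300, max_df=0.5, vocab_size=8000):
    # (1) truncate to max_len words, (2) remove stop words
    tokens = [[w for w in d.lower().split()[:max_len] if w not in STOP_WORDS]
              for d in docs]
    n_docs = len(tokens)
    df = Counter(w for doc in tokens for w in set(doc))   # document frequency
    # (3) TF-IDF per word; here a word's score is its maximum over documents
    score = {}
    for doc in tokens:
        for w, c in Counter(doc).items():
            tfidf = (c / len(doc)) * math.log(n_docs / df[w])
            score[w] = max(score.get(w, 0.0), tfidf)
    # (4) drop corpus-specific stop words (document frequency ratio above max_df)
    candidates = [w for w in score if df[w] / n_docs <= max_df]
    # (5) keep the vocab_size highest-scoring words (ties broken alphabetically)
    vocab = set(sorted(candidates, key=lambda w: (-score[w], w))[:vocab_size])
    # (6) remove non-vocabulary words from every document
    return [[w for w in doc if w in vocab] for doc in tokens], vocab

docs = [
    "this shirt is comfortable to wear",
    "the shirt is too small",
    "great shirt and friendly staff",
    "the staff is friendly",
]
cleaned, vocab = preprocess(docs)
```

In this toy corpus, "shirt" appears in three of four documents and is therefore removed by step (4).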

In this experiment, we adopted the root mean square error (RMSE) as the evaluation metric and averaged the results of five trials to ensure reliability.

$$\begin{aligned} RMSE=\sqrt{\frac{\sum ^{N,M}_{i,j}\left( r_{ij}-\hat{r_{ij}}\right) ^2}{ratings}}, \end{aligned}$$
(10)

where \(\hat{r_{ij}}\) is the predicted score of user i for item j, and ratings is the total number of scores.
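For clarity, the metric can be computed as in the following sketch, where the sum of squared differences between actual and predicted ratings runs over the rated entries only and is divided by their number. The indicator matrix `I`, the function name, and the toy values are assumptions of the example.

```python
import math
import numpy as np

def rmse(R, R_hat, I):
    """RMSE over the entries marked as rated in the 0/1 matrix I."""
    sq_err = I * (R - R_hat) ** 2
    return float(np.sqrt(sq_err.sum() / I.sum()))

# Toy example: three rated entries with errors 1, 0, and 2.
R = np.array([[5.0, 3.0], [4.0, 0.0]])
R_hat = np.array([[4.0, 3.0], [2.0, 1.0]])
I = np.array([[1.0, 1.0], [1.0, 0.0]])
value = rmse(R, R_hat, I)
```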

We compare Double-ConvMF with the following baselines:

  • PMF [3]: Probabilistic matrix factorization is a standard matrix factorization method that uses only users’ ratings.

  • ConvMF [5]: Convolutional matrix factorization is a method that extracts features from item description text using CNN and incorporates them into PMF.

  • Left-ConvMF: Left-ConvMF is a method that extracts features from user description text using CNN and incorporates them into PMF.

  • Double-ConvMF: Double-ConvMF is our proposed method that extracts features from item and user description text using CNN and incorporates them into PMF.

In addition to the above methods, we also compared ConvMF+, Left-ConvMF+, and Double-ConvMF+, in which the pre-trained word embedding model GloVe [16] is applied to each method.

Table 2 Overall test RMSE

To find the best values for \(\lambda _U\) and \(\lambda _V\), a grid search was conducted over the values {1, 25, 50, 75, 100}.

The other parameters were set as follows, following ConvMF [5]. The dimensionality D of the user matrix U and the item matrix V is set to 50. The maximum number of words in each document is set to 300. The dimensionality of the pre-trained word embedding model is set to 300. The dropout rate used in training the CNN architecture is set to 0.2.
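The search over the 25 combinations can be sketched as follows. The sketch assumes that the pair with the lowest validation RMSE is selected; `toy_validate` is a hypothetical stand-in for training a model and measuring its validation RMSE, not a result from our experiments.

```python
from itertools import product

GRID = [1, 25, 50, 75, 100]

def grid_search(validate, grid=GRID):
    """Return (best_rmse, lam_u, lam_v) over all pairs in the grid."""
    return min((validate(lam_u, lam_v), lam_u, lam_v)
               for lam_u, lam_v in product(grid, grid))

def toy_validate(lam_u, lam_v):
    # Hypothetical validation RMSE with its minimum at (50, 25); illustrative only.
    return 1.0 + ((lam_u - 50) ** 2 + (lam_v - 25) ** 2) * 1e-4

best_rmse, best_lam_u, best_lam_v = grid_search(toy_validate)
```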

4.2 Experimental results

Fig. 2 Parameter analysis of \(\lambda _U\) and \(\lambda _V\) on the Amazon dataset

Fig. 3 Parameter analysis of \(\lambda _U\) and \(\lambda _V\) on the Rakuten dataset

Fig. 4 Parameter analysis of \(\lambda _U\) and \(\lambda _V\) on the Yelp dataset

Table 2 shows the RMSE of each model. Bold values indicate the best result among all methods, including ours, and “Improve” is the percentage improvement of the better of Double-ConvMF and Double-ConvMF+ over the best of the compared methods. On every dataset, the best accuracy is obtained by the proposed Double-ConvMF or Double-ConvMF+, which suggests that incorporating both user and item description text into matrix factorization is effective. The Improve values are 12% for the Amazon dataset, 9.9% for the Rakuten dataset, and 8.8% for the Yelp dataset, indicating that the improvement rate increases as the dataset becomes sparser. This result suggests that our proposed method works particularly well on sparse datasets and helps mitigate the cold-start problem.
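As a small worked example, a percentage improvement in RMSE can be computed as the relative reduction with respect to the compared method. The formula is our reading of how such a percentage is usually defined, and the numbers below are made up for illustration; they are not values from Table 2.

```python
def improvement_pct(baseline_rmse, proposed_rmse):
    """Relative RMSE reduction in percent; negative if the proposed model is worse."""
    return 100.0 * (baseline_rmse - proposed_rmse) / baseline_rmse

example = improvement_pct(1.25, 1.10)   # hypothetical RMSE values
worse = improvement_pct(1.00, 1.01)     # a slightly worse model gives a negative value
```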

4.2.1 Influence of user and item description text

To examine the effect of applying the CNN architecture to the user side and the item side separately, we computed the improvement in RMSE of ConvMF and Left-ConvMF over PMF, a plain matrix factorization model, for each dataset. On the Amazon dataset, the improvement was 24.5% for ConvMF and 11.8% for Left-ConvMF; on the Rakuten dataset, 29.5% and 10.2%; and on the Yelp dataset, 12.5% and \(-\)0.82%. Overall, ConvMF, which applies the CNN architecture to the item side, improved more than Left-ConvMF, which applies it to the user side. This result suggests that item description text contributes more to accuracy than user description text. However, on all datasets, including the Yelp dataset where Left-ConvMF was worse than PMF, our proposed Double-ConvMF gave the best value, suggesting that item and user description texts complement each other.

4.2.2 Impact of pre-training model

We next discuss the effectiveness of the pre-trained model in Double-ConvMF. We use GloVe [16] as the pre-trained model to initialize the embedding layer of the CNN. As shown in Table 2, the improvement rate when changing from Double-ConvMF to Double-ConvMF+ is \(-\)1.3% for the Amazon dataset, 1.9% for the Rakuten dataset, and 0.27% for the Yelp dataset. This result indicates that the pre-trained model is more effective on datasets with more users than items. We believe the reason is that the contextual knowledge of the pre-trained model compensates for the small amount of data per user.

4.2.3 Parameter analysis

Figures 2, 3, and 4 show the relationship between \(\lambda _U\), \(\lambda _V\), and RMSE for each method on each dataset. For Double-ConvMF, the overall trend is that RMSE increases as both \(\lambda _U\) and \(\lambda _V\) become smaller, which suggests that U and V may have fallen into a local optimum when \(\lambda _U\) and \(\lambda _V\) were small. In contrast, for ConvMF and Left-ConvMF, except for ConvMF on Yelp, RMSE improved only when \(\lambda _U\) and \(\lambda _V\) were small and deteriorated in many other cases. These results suggest that our proposed method is more robust than existing methods. We believe the reason is that applying the CNN architecture to both the item and user sides optimizes U and V in a balanced manner. Nevertheless, care is needed to avoid locally optimal solutions.

Table 3 RMSE by number of reviews

4.2.4 Impact of number of reviews

Table 3 shows the RMSE of the proposed Double-ConvMF when users are grouped by their number of reviews in each dataset. The column “under2” represents users with two or fewer reviews, and the column “over5” represents users with five or more reviews. In the Rakuten dataset, users with two or fewer reviews were excluded during preprocessing, so the entries for this category are blank.

Overall, groups with four or more reviews, particularly the over5 group, tend to show better RMSE values than groups with fewer reviews. This suggests that as users contribute more reviews, our proposed model captures their preferences better, leading to higher accuracy.

However, the Rakuten dataset deviates from this general trend. This suggests that the effect may depend on the dataset, and further investigation is needed to understand the underlying factors.
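The grouping used for this kind of analysis can be sketched as follows. The labels for the intermediate groups (here simply "3" and "4"), the function names, and the toy predictions are assumptions of the example, not the exact layout of Table 3.

```python
import math
from collections import Counter, defaultdict

def bucket(n_reviews):
    """Assign a user to a group by number of reviews."""
    if n_reviews <= 2:
        return "under2"
    if n_reviews >= 5:
        return "over5"
    return str(n_reviews)            # assumed labels for the groups in between

def rmse_by_review_count(predictions, review_counts):
    """predictions: (user, rating, predicted) triples; review_counts: user -> count."""
    sq_err = defaultdict(float)
    n = Counter()
    for user, r, r_hat in predictions:
        group = bucket(review_counts[user])
        sq_err[group] += (r - r_hat) ** 2
        n[group] += 1
    return {group: math.sqrt(sq_err[group] / n[group]) for group in n}

# Toy data (illustrative only).
review_counts = {"a": 1, "b": 3, "c": 7}
predictions = [("a", 5, 3), ("b", 4, 4), ("c", 2, 3), ("c", 5, 4)]
by_group = rmse_by_review_count(predictions, review_counts)
```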

5 Conclusion

This paper proposed a method that captures contextual information from user and item description text using CNN and incorporates this information into matrix factorization. We also explained the algorithm for optimizing the objective derived from the prior distributions. Experiments on three datasets showed an improvement of 8.8% to 12%. This suggests that using user and item description text together is more effective than using item description text alone.

In the future, we would like to develop models that incorporate not only text but also other auxiliary information, such as images and social networks, into PMF.