Keywords

1 Introduction

Recently, an increasing number of news articles have been provided to us by various news platforms, such as Google newsFootnote 1 and Gunosy.Footnote 2 Since it is impossible for users to read all of them, personalized news recommendation systems have become an important research topic [25]. On behalf of users, personalized news recommendation systems make recommendations by utilizing several pieces of information, such as users and news articles. Personally recommended articles help users save time and improves their user experience.

Although several news recommendation methods have been proposed, most recent research focuses on deep learning techniques [8, 9, 11, 14, 19,20,21,22,23] and aims to achieve higher performance by obtaining distributed representations of both users and articles. For example, in Ref. [14], the authors propose the mechanism that uses denoising autoencoder and triplet loss to represent news articles. The representations of users are based on their browsing history and sessions. NRMS [21] applies word2vec [13] to news titles. The obtained word embeddings are further fed into a multihead self-attention mechanism to capture word relationships. A pretrained language model such as BERT is also used in news recommendations [4, 24].

However, to the best of our knowledge, recent research on news recommendations mentioned above deals with the single-domain setting, and there are not enough studies, especially on deep learning-based cross-domain news recommendations. In fact, there is a cross-domain situation where the data of multiple news platforms are available such that a news company deals with multiple news brands (e.g., Japanese news provider “Gunosy” has three brands named “Gunosy”, “Newspass”, and “Lucra”). In this setting, news articles might be more robustly represented compared with a single-domain setting, and users might show different characteristics in each domain. Therefore, we propose investigating the performance of deep learning-based methods under a cross-domain news recommendation setting.

In this paper, we propose pretraining the model by using all domains’ data and freezing news-related model parameters for fine-tuning. It is expected that pretraining with all domains’ data enables news embedding to be much more robust, and domain-specific user characteristics are expressed by fine-tuning. As a result of the experiments, the proposed framework worked well, especially in domains where the number of data points was small and increased improvement to 10 % compared with the single domain setting.

The contributions of this paper are as follows:

  • To the best of our knowledge, this is the first study in the cross-domain news recommendation field.

  • We find that using multiple domain data in the news recommendation brings better performance. Our results are useful for other researchers who would like to know the performance of deep models in this setting.

The remainder of this paper is constructed as follows: We describe the related works on news recommendation and state research questions in Sect. 2. The proposed framework to answer the research questions is described in Sect. 3. In Sect. 4, we evaluate the proposed framework. Finally, we conclude this paper and mention future works in Sect. 5.

2 Related Works

We can roughly classify the news recommendation methods into two approaches, “traditional methods” and “deep learning-based methods”. We introduce the representative methods in this section.

2.1 Traditional Methods

In the early stage of research on news recommendations, traditional collaborative filtering methods [1, 3, 5, 16] were representative. In those methods, the news that people with similar preferences liked (clicked) in the past is recommended to a user. However, since the user-user similarity and recommended items are defined based on articles’ ID, it is intrinsically difficult to recommend novel news articles, which is also known as the “cold start problem”. News recommendation is especially sensitive to this problem since news arrives continuously and users can easily change their preferences. To overcome the cold start problem, the features of news content have been proposed. For example, by using the TF-IDF (Term Frequency-Inverse Document Frequency)-like algorithm, some methods take the contents of news articles into consideration [2, 6, 7]. Since TF-IDF is a technique that can extract keywords from documents, it is utilized for creating a feature vector of news articles. Furthermore, popularity, categories, sentiment information, and news location are represented as features and incorporated into the model [10, 12, 15, 18]. However, these types of handcrafted features are usually not optimal in representing the semantic information encoded in news texts [22].

2.2 Deep Learning-Based Methods

With the surge of deep learning techniques, almost all recent recommendation models adopt neural networks. In particular, the content of news articles is captured by a deep neural network-based NLP (Natural Language Processing) technique, and users are represented by their browsing history of news articles in many cases. In Ref. [14], the authors propose the mechanism that uses DAE (Denoising Auto Encoder) and triplet loss to represent news articles. The representations of users are based on their browsing history and sessions, which are also considered in other research [8, 9]. To obtain a richer representation, CNNs (Convolutional Neural Networks) and attention mechanisms are applied to news content [19,20,21]. For example, in NRMS [21], it applies word2vec [13] to news titles. The obtained word embeddings are further fed into a multihead self-attention mechanism to capture word relationships. A pretrained language model such as BERT is also used in news recommendations [4, 24]. In addition to using a deep NLP model, some works predict user behavior, such as “active-time” and “satisfaction” [11, 23]. They learn the models in a multitask fashion and improve performance.

2.3 Question

To the best of our knowledge, recent research on news recommendations mentioned above addresses the single-domain setting, and there are not enough studies, especially on deep learning-based cross-domain news recommendations. Basically, the model should be learned by using a large amount of data with good quality in many machine learning scenarios. In this context, using multiple-domain data might compensate for the amount of data in a single domain. On the other hand, it is easy to assume that users’ features are different in each news platform. We would like to find a good way to tune news articles and user embeddings in a cross-domain setting.

Our interests are summarized as follows;

  1. 1.

    Is news embedding much more robust if we use multiple domain data?

  2. 2.

    Should we consider the characteristics of domain-specific users?

We aim to answer these questions through the proposed framework.

3 Proposed Framework

To clarify the above questions, we propose pretraining the model by using all domain data and freezing news-related model parameters for fine-tuning. It is expected that pretraining with all of the domain data enables news embedding to be more robust, and domain-specific user characteristics to be expressed by fine-tuning.

We show the overview of the proposed framework in Sect. 3.1, and a detailed explanation of the proposed architecture is described in Sect. 3.2.

3.1 Overview

We consider the cross-domain situation where the data of all domains (platforms) are available such that a news company deals with multiple news brands. The data included in news domains is denoted as \(\{T_i| 1 \le i \le N_{\mathrm {domain}}\}\), where \(N_{\mathrm {domain}}\) is the number of domains. \(T_i\) contains the information of user-news interactions, e.g., displayed news, clicked news, and timestamps. Please note that news is shared in each domain, but users are not shared in our assumption. Furthermore, \(T_i\) is divided into training, validation, and test sets. That is, \(T_i= T_{i}^{\mathrm {train}} \cup T_{i}^{\mathrm {val}} \cup T_{i}^{\mathrm {test}}\). Let \(S\) denote the data in which each \(T_i\) is combined. \(S^{\mathrm {train}}\) is denoted as \(S^{\mathrm {train}} = \bigcup _{i}T_{i}^{\mathrm {train}}\). The same holds on \(S^{\mathrm {val}}\) and \(S^{\mathrm {test}}\). In our framework, the model is first learned by using \(S^{\mathrm {train}}\) and \(S^{\mathrm {val}}\). Then, the model is fine-tuned by using \(T_{i}^{\mathrm {train}}\) and \(T_{i}^{\mathrm {val}}\) in each domain. Finally, test sets \(T_{i}^{\mathrm {test}}\) and \(S^{\mathrm {test}}\) are used for the evaluation. Figure 1 shows an example of data utilization flow.

Fig. 1.
figure 1

Overview of the data flow. This is an example of the case where \(N_{\mathrm {domain}}= 3\). First, S is used for learning the recommendation model. The model is then fine-tuned in each domain using \(T_1\), \(T_2\), and \(T_3\).

3.2 Architecture and Procedure

We follow the recent models’ architecture, which captures the contents of news articles by NLP, and users are represented by a set of browsed news embeddings. These types of methods are mentioned in Sect. 2. The following explanation accompanies Fig. 2. Let the user’s browsed articles and the recommendation candidate article denote \(\{D_i | 1 \le i \le H\}\) and \(D_{\mathrm {cand}}\), where H is the number of histories input to the model. Each of the browsed articles is input to news encoders whose projection is defined as f. The \(\mathbf {e_i}\), i-th output of the news encoder, is formulated as

$$\begin{aligned} \mathbf {e_i} = f(D_i; \theta _{\mathrm {news}}), \end{aligned}$$
(1)

where \(\theta _{\mathrm {news}}\) is the trainable parameter. Let \(\mathbf {u}\) denote an embedding of the user. \(\mathbf {u}\) is expressed as

$$\begin{aligned} \mathbf {u} = g_1(E; \theta _{1}), \end{aligned}$$
(2)

where \(g_1\), E, and \(\theta _{1}\) are the aggregator & conversion function, set of \(\mathbf {e_i}\), and trainable parameter, respectively. Similarly, the click prediction \(\hat{p}\) is expressed as

$$\begin{aligned} \hat{p} = g_2(\mathbf {u}, \mathbf {e_{\mathrm {cand}}}; \theta _2). \end{aligned}$$
(3)

The model should output higher probability when \(D_{\mathrm {cand}}\) is the article to be recommended, and vice versa. Since the number of news articles is generally too large, it is difficult to train the model by using all articles. To overcome that, most methods adopt a negative sampling strategy in the training phase. That is, \(\mathbf {u}\), \(D^+\) (clicked article), and \(D_{1}^-, D_{2}^-,...,D_{K}^-,\) (K-sampled non-clicked articles) are used to train the model. Although the loss calculation depends on the base model, for example, it can be formulated as

$$\begin{aligned} \mathrm {loss} = \sum _{i=0}^{K}\mathcal {L}(p_i, y_i), \end{aligned}$$
(4)

where \(y_0 = 1\), \(p_0\) is the predicted click probability of the positive sample, \(y_1, ..., y_k = 0\), \(p_1, ..., p_K\) are those of the negative sample, and \(\mathcal {L}\) is a loss function.

In the pretraining process, we use dataset S and train the above model parameters \(\theta _{\mathrm {news}}\), \(\theta _{1}\), and \(\theta _{2}\) by minimizing loss using optimizers such as SGD (Stochastic Gradient Descent). In fine-tuning, \(T_{i}^{\mathrm {train}}\) and \(T_{i}^{\mathrm {val}}\) in each domain are used, and \(\theta _{\mathrm {news}}\) is frozen. Thus, we have i models that are dedicated to each domain.

Fig. 2.
figure 2

Overview of the assumed model. Many works (e.g. [14, 19,20,21]) adopt this kind of architecture and we follow that. In this paper, we basically follow NRMS [21], architecture. Parameters in components of gray are fine-tuned in each domain, and other parameters are frozen.

4 Experiments

4.1 Base Model

In the following experiments, we select NRMS [21] as the base model and apply the proposed framework with the same hyper-parameter settings. This is because its architecture is simple and brings effective results. That model adopts an NLP-based news encoder, and user embeddings are based on their browsed history, which also suits our assumption. Please note that other models can also be used for our experiments.

4.2 Dataset

We use three datasets: “Gunosy”, “Newspass”, and “Lucra”, which are the names of news platforms run by a single company. Table 1 shows the details of dataset. As shown in Table 1, the news articles are partly shared in each dataset, and users are not shared. Further, Lucra targets female users. Thus, its user characteristics are supposed to be different from other domains. Since each article is written in Japanese, we used Japanese word embeddings (GloVe) made by asahi.com [17]. We believe they are suitable for the news recommendation task because they are made from newspaper articles. Within the shown period, the data of 31st July 2020 are only used as the test set, and the others are the training and validation sets. For the training phase, we try two negative sampling strategies based on two sample sources: “All articles” and “Impression”. When the source is “all articles”, negative samples for a user are randomly chosen from all unread articles in the training set. These datasets include impression data, which is a set of news articles displayed in users’ devices and also used as the negative sampling source by randomly choosing unread articles. The number of negative samples K is set to 4 by following NRMS.

Similarly, we make two types of evaluation data from test sets. In the first case, we randomly choose 200 negative samples from all articles in the test set for each user. In the second case, we use all negative samples in the impression data for each user. We denote “A” and “I” as negative sampling source (“All articles” and “Impression”, which are mentioned above). For example, (A, I) indicates that train set contains negative samples picked from All articles and test set contains negative samples picked from Impression.

Table 1. The number of users and articles in each dataset. G, N, and L indicate Gunosy, Newspass, and Lucra, respectively. Shared users and articles are represented by X \(\cap \) Y. We picked a maximum of 3,000 users who have a larger number of clicks from each domain and the articles they clicked were extracted.

4.3 Metrics

Following NRSM, we evaluate AUC (area under the ROC curve), MRR (mean reciprocal rank), and nDCG@k (normalized discounted cumulative gain at k). All take values between 0.0 to 1.0. When the value is 1.0, it indicates that the model is perfect.

  • AUC: AUC is a widely used metric and indicates overall performance. It takes a higher value when positive samples tend to rank higher than negative samples in the recommendation list.

  • MRR: When positive samples tend to hit at the top of the recommendation list, MRR takes a higher value. In contrast, the value hardly improves when positive samples hit the bottom of the list.

  • nDCG@k: For the top k news articles in the recommendation list, articles that should be highly recommended but appearing lower in a list are penalized. We evaluate the case of \(k=5\) and \(k=10\).

4.4 Performance

Overall Results. Table 2 shows the performance of the model in each dataset pair. As we can see from Table 2, learning with G+N+L and fine-tuning tend to achieve better results regardless of dataset and negative sampling sources. In many cases, the overall performance (AUC) and the quality of recommendation (MRR, nDCG) improved. This reflects the effectiveness of the proposed framework. Especially, AUC of Lucra is effectively improved compared with single domain setting. We consider this is because the number of articles in Lucra is relatively small and it targets female users. This is the just situation where the proposed framework seems to effectively work. Learning with G+N+L enables for news embeddings to be much more robust and the characteristics of users can be captured by domain-specific fine-tuning. On the other hand, the improvements in Gunosy and Newspass are relatively small. This is because Gunosy and Newspass share about 50% news articles and the user characteristics between them seem to be similar. In addition to that, the fact that Lucra’s articles are targeting female users might be another reason. Even if the information of Lucra’s article is reflected in the model, it does not impact on the recommendation results very much in those domains or works as noise in some cases.

Comparing negative sampling strategies, the models tested by “All articles” sampling (*, A) achieve better performance. In impression data, displayed news articles include the effect of recommendation system which have already been working. Thus, using impression data is more practical case and classification becomes more difficult. Although the results are better when training and testing adopt the same source, we cannot judge the superiority between training with All articles and training with Impression.

From the business perspective, this result implies that it is possible to transfer model trained by many articles to other platform dealing with similar news articles. Since it is unnecessary to share the user information in this framework, there is no privacy concern in providing the model. This is a useful merit and we consider there are many situation that the proposed framework can be applied in real setting.

Table 2. The performance on each dataset. G, N, and L are the same as Table 1. If the models are fine-tuned, the value on the column “FT” is True. NS indicates Negative Samples used in dataset. Test set results are shown. For fine-tuned models, the result of final epoch is shown. Bold values are the best results in the same NS-test pair.

Fine-Tuning Performance. Figure 3 shows the performance under fine-tuning. We only show the result of fine-tuning in Gunosy(A, A), Newspass(A, A), and Lucra(A, A) since the tendency is almost the same in other negative sampling pairs. In this experiment, although recommendation quality measures keep rising as epochs increases, we stop fine-tuning in epoch 20 since there is no large improvement.

As we can see from Fig. 3, the larger epochs become, the larger the metrics are. While the improvement is relatively large in Lucra, the results are saturated in other domains. This is the same reason mentioned in Sect. 4.4. We can say that domain-specific fine-tuning is effective in the domain whose user characteristics are different from pre-trained model and it is marginal in other domains. In practical use, domain-specific fine-tuning can be skipped in this kind of domains.

Fig. 3.
figure 3

The performance (AUC, MRR, nDCG@5, and nDCG@10) under fine-tuning. Gunosy(A, A), Newspass(A, A) and Lucra(A, A) are used as dataset.

5 Conclusion

In this paper, we described a simple model for news recommendation tasks in multiple domain settings that uses freezing parameters and fine-tuning. Through experiments using the proposed framework on a real-world dataset, we found that a learning model using multiple domain data is effective for obtaining robust news embeddings. Moreover, our empirical results imply that the characteristics of domain-specific users can be captured from a multiple domain model by fine-tuning the domain. In particular, the proposed framework is effective in domains whose number of data points is small. As a future work, we would like to try additional experiments by changing the base model and clarify the effective type of models for cross-domain news recommendations .