1 Introduction

Recommender systems play a crucial role in modern online applications like music/video websites and e-commerce platforms, recommending potentially interesting items to users based on their historical behavior. Various recommendation models have been developed, including Collaborative Filtering (CF) [1,2,3,4,5], which relies on user-item ratings, and some models utilize textual reviews [6,7,8,9]. Recently, the attention mechanism has been included in neural networks [10,11,12,13,14,15] to optimize recommendation performance by discriminating the contributions of different data features.

However, these recommendation models all suffer from the data sparsity problem. Data sparsity refers to the fact that users generally interact with only a limited number of items, which leaves relatively little information about each user and reduces the accuracy of recommendations. The problem is particularly severe in cold-start scenarios, where users have no prior interactions in a specific domain and recommendation models find it even harder to suggest suitable items for them.

To address these two problems, especially the cold-start problem, researchers have proposed Cross-Domain Recommendation [16,17,18,19,20,21,22]. Cross-domain recommendation recommends items in one domain (the target domain) to a user by learning and transferring the user’s historical behavior from another domain (the auxiliary domain). An example of music domain recommendation by referring to the movie domain is illustrated in Figure 1. Existing cross-domain recommendation approaches often consist of three steps.

First, feature vector representations of the users are obtained. Second, a mapping of overlapping users is established from the auxiliary domain (movie in Figure 1) to the target domain (music in the example). Third, based on the user mapping, the recommendation for a cold-start user in the target domain is achieved by transferring the user preferences from the auxiliary domain.

However, existing cross-domain recommendation approaches still face several challenges. The first is the incomplete utilization of review information. User reviews contain various information about users and items, yet some models fail to make use of information-rich user reviews (e.g., [17]) or underutilize them (e.g., [18]). Moreover, different reviews are not independent but related to each other. For example, a particularly harsh user may habitually write negative reviews. Among the reviews of a given item, it is therefore important to distinguish the reviews of such a user from those of other users and to process them separately. Another challenge is how to map the interest transformation between different domains. In cross-domain recommendation, the user’s interest is bifurcated into two facets: domain-shared and domain-specific interests. While the former can offer benefits across different domains, the latter is relevant only within a single domain, and applying it directly to another domain may result in adverse ‘negative transfer’ effects. Unfortunately, common methodologies often neglect this aspect, resulting in a conflation of interests across domains, which can subsequently diminish the accuracy of recommendations. Although recent advancements (e.g., [23,24,25,26,27]) have introduced techniques to disentangle these interests, they primarily capitalize on domain-shared interests to bolster overlapping users’ representations in both domains, often ignoring strategies for recommending to cold-start users.

Fig. 1 An example of cross-domain recommendation

Motivated as such, we propose an Extract-Map-Predict Neural Network Architecture (EMPNet) that exploits and differentiates user reviews for cross-domain recommendation.

EMPNet innovates and integrates multiple technologies to produce cross-domain recommendations. First of all, EMPNet considers the review information to increase data availability and diversity. When processing the review text information, EMPNet leverages Bidirectional Encoder Representations from Transformers (BERT) [28] and the Identity-Enhanced Multi-Head Attention Mechanism to improve the utilization of review information. Next, to improve the accuracy of the cross-domain user mapping, EMPNet employs a Domain Mapping Variational Autoencoder (DM-VAE). In particular, DM-VAE disentangles user feature vectors into domain-shared and domain-specific interests, transfers only the domain-specific interests across domains, and integrates the transfer results with the domain-shared interests to get the feature vector of cold-start users in the target domain.

Finally, to improve the prediction accuracy of cross-domain recommendation, EMPNet improves an attentional factorization machine (AFM) by adding to it three biases that represent the inherent features of users, items, and domains. The user feature bias represents the user’s scoring habits, the item feature bias captures the intrinsic qualities of the item, and the domain feature bias represents the overall rating of the domain.

We make the following major contributions:

  • We propose EMPNet, an innovative model that discerns the significance of reviews and disentangles user interests to enhance cross-domain recommendation performance.

  • We add the user ID to the review feature vector of multi-headed attention to distinguish users who write reviews of different quality and design a biased AFM for EMPNet by incorporating biases on user features, item features, and domain features to indicate their historical scoring preferences.

  • We propose the DM-VAE method, which disentangles user feature vectors into domain-shared and domain-specific interests. It then transfers the domain-specific interests from the auxiliary domain to the target domain, where they are merged with the domain-shared interests to form the user feature vector in the target domain.

  • We evaluate EMPNet on the Amazon dataset. The experimental results show that EMPNet clearly improves the accuracy of cross-domain recommendation.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 elaborates on our EMPNet model. Section 4 reports on experiments. Section 5 concludes the paper and discusses future research.

2 Related work

Single-domain recommendation models roughly fall into two categories.

Review-based recommendation models utilize textual reviews on items to improve recommendations. ALFM (Aspect-aware Latent Factor Model) [6] models user preferences and item features on review texts, and integrates them with latent factors learned from the user-item rating matrix. MTPR (Multi-Task Pairwise Ranking) [8] combines collaborative embedding and content embedding to address the cold-start problem in the multimedia recommendation. For normal items, both kinds of embedding are used. For cold-start items, collaborative embedding is replaced by a zero vector. DeepCoNN (Deep Cooperative Neural Networks) [9] uses two parallel CNNs to generate representations of user behaviors and item properties over their related reviews, followed by a Factorization Machine (FM) to predict the user-item rating.

Attention-based recommendation models further discriminate the importance of different data items. AFM (Attentional FM) [13] weights all different feature interactions according to their contribution to the result. KGAT (Knowledge Graph ATtention network) [14] embeds users, items, and item attributes from a knowledge graph. It uses an attention mechanism to compute the importance of graph neighbors. Note that none of the above recommendation models can resolve the cold-start problem.

Table 1 Feature comparison of recommendation models

Cross-domain recommendation models mitigate the cold-start problem by exploiting data from the auxiliary domain. \(\pi \)-Net (Parallel Information-sharing Network) [29] makes sequential recommendations simultaneously for two domains where the user behaviors are synchronously shared at each timestamp. Unlike \(\pi \)-Net, our EMPNet does not require shared accounts across domains. CPR (cross-domain paper recommendation) [30] is a cross-domain recommendation model for paper recommendation. CPR learns the probabilistic associations of paper content with the existing discipline classification. Then a user interest is represented as a probabilistic distribution over the target domain semantics. Finally, relevant papers are recommended to users according to user interest and paper content. EMCDR (Embedding and Mapping framework for Cross-Domain Recommendation) [17] captures non-linear mapping across different domains through an MLP embedding process. RC-DFM (Review and Content-based Deep Fusion Model) [18] uses an additional stacked denoising autoencoder (aSDAE) [33] to fuse review texts and item contents with the rating matrix in both domains. CATN (Cross-domain recommendation framework via Aspect Transfer Network) [31] extracts multiple aspects from review documents as well as auxiliary reviews of users, and learns inter-aspect correlations across domains through an attention mechanism. PTUPCDR (Personalized Transfer of User Preferences for Cross-domain Recommendation) [32] learns a meta-network to generate personalized bridge functions to achieve personalized preference transfer for each user. In recent years, articles applying disentangled representation learning to cross-domain recommendation have begun to emerge. SER [27] introduced a method that leverages user reviews for domain disentanglement, focusing on enhancing the performance of recommendation systems through text analysis and domain identification. However, SER only focuses on cross-domain recommendations without overlapping users and does not utilize information from overlapping users. DisenCDR (Disentangled Representations for CDR) [26] learns disentangled user representations through mutual information regularizers to distinguish between domain-shared and domain-specific information. This approach applies to the scenario of dual-domain boosting for shared users and does not apply to the cold-start problem.

Our EMPNet distinguishes itself from existing cross-domain recommendation approaches in two major aspects. First, we make full use of reviews and use an Identity-Enhanced Multi-Head Attention Mechanism to classify the importance of reviews. Second, most models directly map information from the auxiliary domain to the target domain without a finer delineation of representations across different domains. Some methods that employ disentangled representations have not fully harnessed the diverse information in cold-start scenarios, which involve data from both overlapping and cold-start users. We disentangle user representations and utilize overlapping users to learn cross-domain mappings for cold-start users.

Table 1 compares these recommendation models.

3 EMPNet

Table 2 lists the symbols frequently used in this paper. We utilize overlapping users \(U_o\) in the auxiliary domain \(D^A\) and target domain \(D^T\) to make cross-domain recommendations for cold-start users \(U_c\). We harness the textual content of reviews and associated review entities to construct feature vectors for both users and items. Specifically, an item’s feature vector \(\textbf{f}_i\) is generated from its aggregate historical reviews \(C_i = \{c_{i1},c_{i2},...,c_{ik_0}\}\) and the corresponding users \(U_i = \{u_{i1},u_{i2},...,u_{ik_0}\}\), where \(k_0\) is the number of reviews, while a user’s feature vector \(\textbf{f}_u\) is derived from their aggregate historical reviews \(C_u\) and the associated items \(I_u\). The key of our approach is to effectively transfer the feature vectors of cold-start users from an auxiliary domain to the target domain, thereby improving the accuracy of cross-domain recommendations.

Table 2 Notations
Fig. 2 Overall architecture of EMPNet

The architecture of EMPNet is shown in Figure 2, where symbols with the superscripts A and T represent data in auxiliary and target domains, respectively. The structure of EMPNet can be divided into three modules:

  • The feature extraction module employs BERT and the Identity-Enhanced Multi-Head Attention Mechanism to extract user (resp. item) feature vectors \(\textbf{f}_u\) (resp. \(\textbf{f}_i\)) from review texts.

  • The cross-domain user mapping module employs feature vectors from overlapping users \(\textbf{f}_{u_o}\) to train the DM-VAE, learning the cross-domain mappings from the auxiliary to the target domain, which is subsequently applied to cold-start users to derive their feature vectors in the target domain \(\textbf{f}_{u_c}^T\). Additionally, this module utilizes an MLP to determine the bias \(b_{u_c}^T\) for each of these cold-start users.

  • The prediction module calculates the user u’s rating of each item by combining the user feature vector \(\textbf{f}_u\) and the item feature vector \(\textbf{f}_i\). The item with the highest rating is recommended to the user u.

It is noteworthy that EMPNet supports both single-domain recommendation and cross-domain recommendation, and the former serves as the foundation for the latter. First, EMPNet combines the feature extraction module and the prediction module to enable single-domain recommendation for each domain. For cross-domain recommendation, EMPNet takes the intermediate results of single-domain recommendations from the target and auxiliary domains, i.e., the user feature vectors \(\textbf{f}_u\) and the item feature vectors \(\textbf{f}_i\), feeds them into the cross-domain user mapping module, and finally runs another prediction module to realize cross-domain recommendation.

Next, we describe these three modules in detail.

3.1 Feature extraction module

Users typically produce reviews when purchasing items. Such reviews encompass both user and item feature data. To this end, this module extracts the corresponding feature vectors from these reviews. This module is common to both auxiliary and target domains. Also, the process of this module is the same for users as for items. For simplicity, Figure 2 only illustrates the process for items. The process for obtaining item feature vectors is as follows.

To extract features from reviews, this module uses BERT, a foundational model for natural language processing tasks. It can convert a sequence of text into a fixed-size vector, capturing the contextual relationships between words. Given the review set \(C_i\) for an item i, this module converts each review \(c_{ik}\) into a review feature vector \(\textbf{v}_{ik}\in \mathbb {R}^{k_1}\) using BERT. Here, \(k_1 = 768\) denotes the dimension of the review feature vectors, and \(k_0\) denotes the number of reviews. The vector \(\textbf{v}_{ik}\) encapsulates the sentiment and content of the review, making it a valuable feature for our recommendation system.

Every review corresponds to a user. Some users may tend to be more critical in their ratings, resulting in lower ratings, while others may have a stronger preference for certain items. To account for this, we introduce a user encoding \(\textbf{u}_{ik}\in \mathbb {R}^{k_1}\) for each review. This encoding is a complex vector representing the user’s profile.

As different reviews contribute differently to the overall evaluation of an item, this module assigns weights to reviews through a multi-head attention mechanism, which we term the Identity-Enhanced Multi-Head Attention Mechanism. Following the method of adding positional encoding to the input embeddings in a previous work [34], we add the user encoding \(\textbf{u}_{ik}\) of the user who wrote the review to the review feature vector \(\textbf{v}_{ik}\), obtaining \(\textbf{o}_{ik}\in \mathbb {R}^{k_1}\). The Identity-Enhanced Multi-Head Attention Mechanism is defined as:

$$\begin{aligned} \textbf{o}_{ik}= & {} \textbf{v}_{ik}+\textbf{u}_{ik}\\ \nonumber \textbf{Q}=\textbf{K}= & {} \textbf{V}=\textbf{O}_{i}=\{\textbf{o}_{i1},\textbf{o}_{i2},...\textbf{o}_{ik_0}\}\\ \nonumber Attention(\textbf{Q},\textbf{K},\textbf{V})= & {} softmax(\frac{\textbf{QK}^T}{\sqrt{d_v}})\textbf{V}\\ \nonumber \textbf{head}_i= & {} Attention(\textbf{QW}_i^Q,\textbf{KW}_i^K,\textbf{VW}_i^V)\\ \nonumber MultiHead(\textbf{Q},\textbf{K},\textbf{V})= & {} Concat(\textbf{head}_1,...,\textbf{head}_h)\textbf{W}^O\\ \nonumber \textbf{O}^\prime _{i}=\{\textbf{o}^\prime _{i1},\textbf{o}^\prime _{i2},...\textbf{o}^\prime _{ik_0}\}= & {} MultiHead(\textbf{Q},\textbf{K},\textbf{V}) \end{aligned}$$
(1)

where \(\textbf{O}_{i}\in \mathbb {R}^{k_0 \times k_1}\) represents the set of reviews for item i; \(\textbf{Q}\in \mathbb {R}^{k_0 \times k_1}\), \(\textbf{K}\in \mathbb {R}^{k_0 \times k_1}\), and \(\textbf{V}\in \mathbb {R}^{k_0 \times k_1}\) represent the queries, keys, and values, respectively, all of which are set to the input \(\textbf{O}_{i}\); \(\textbf{o}^\prime _{ik}\in \mathbb {R}^{k_1}\) represents the weighted \(\textbf{o}_{ik}\); \(\textbf{W}_i^Q\in \mathbb {R}^{k_1 \times d_v}\), \(\textbf{W}_i^K\in \mathbb {R}^{k_1 \times d_v}\), \(\textbf{W}_i^V\in \mathbb {R}^{k_1 \times d_v}\), \(\textbf{W}^O\in \mathbb {R}^{hd_v \times k_1}\); h denotes the number of heads, and \(d_v = k_1/h\).
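To make Eq. 1 concrete, the following is a minimal pure-Python sketch of the Identity-Enhanced Multi-Head Attention Mechanism on toy dimensions. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `softmax`, `attention`, `identity_enhanced_mha`, `rand_matrix`), the random stand-ins for BERT review vectors and user encodings, and the toy sizes are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention with the 1 / sqrt(d_v) scaling of Eq. 1."""
    d_v = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_v) for s in row]) for row in scores]
    return matmul(weights, v)


def identity_enhanced_mha(reviews, users, w_q, w_k, w_v, w_o):
    """reviews, users: k0 x k1 matrices; w_q, w_k, w_v: one k1 x d_v matrix per head."""
    # o_ik = v_ik + u_ik: add the reviewer's encoding to each review vector.
    o = [[a + b for a, b in zip(r, u)] for r, u in zip(reviews, users)]
    # Q = K = V = O_i, projected separately for every head.
    heads = [attention(matmul(o, wq), matmul(o, wk), matmul(o, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the heads along the feature axis, then project with W^O.
    concat = [[x for head in heads for x in head[i]] for i in range(len(o))]
    return matmul(concat, w_o)


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a random matrix used as a stand-in for learned values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
k0, k1, h = 3, 4, 2                  # toy sizes; the paper uses k1 = 768
d_v = k1 // h
reviews = rand_matrix(k0, k1, rng)   # stand-ins for BERT review vectors v_ik
users = rand_matrix(k0, k1, rng)     # stand-ins for user encodings u_ik
w_q = [rand_matrix(k1, d_v, rng) for _ in range(h)]
w_k = [rand_matrix(k1, d_v, rng) for _ in range(h)]
w_v = [rand_matrix(k1, d_v, rng) for _ in range(h)]
w_o = rand_matrix(h * d_v, k1, rng)
out = identity_enhanced_mha(reviews, users, w_q, w_k, w_v, w_o)
print(len(out), len(out[0]))         # one weighted k1-dimensional vector per review
```

The output keeps one \(k_1\)-dimensional vector per review, matching \(\textbf{O}^\prime _{i}\) in Eq. 1. In practice the projections would be learned parameters of a deep learning framework rather than random lists.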

After obtaining the weighted feature vector \(\textbf{o}^\prime _{ik}\) by the Identity-Enhanced Multi-Head Attention Mechanism, this module feeds the weighted sum of the review features \(\textbf{f}^\prime _{i}\) to the MLP to obtain the item feature vector \(\textbf{f}_i\). The \(\textbf{f}_i\) is calculated as follows:

$$\begin{aligned}{} & {} \textbf{f}^\prime _{i}=\sum _{k=1}^{k_0}\textbf{o}^\prime _{ik}\\ \nonumber{} & {} \textbf{f}_{i}=f_{mlp}(\textbf{f}^\prime _{i}) \end{aligned}$$
(2)

where \(\textbf{f}^\prime _{i}\in \mathbb {R}^{k_1}\) and \(\textbf{f}_{i}\in \mathbb {R}^{k_f}\). Here, \(k_f\) denotes the dimensionality of feature vectors.

For a user u, the user feature vector \(\textbf{f}_u\) \(\in \mathbb {R}^{k_f}\) is obtained in the same way, except that the input includes an item set \(I_u\) and review set \(C_u\). The set \(I_u\) contains the items for which user u has written a review.

3.2 Prediction module

We then use the prediction module in both domains to accomplish single-domain recommendations. This module uses a biased AFM to predict a user’s rating for an item from their feature vectors. The biased AFM consists of five parts: the paired interaction part, the linear regression part, the average rating of a domain, the user feature bias, and the item feature bias.

The paired interaction part works as follows. A pair of user and item feature vectors \(\textbf{f}_u\) and \(\textbf{f}_i\) are concatenated to generate the rating feature vector \(\textbf{z}_{ui} \in \mathbb {R}^{n}\), where n is the sum of the dimensionalities of \(\textbf{f}_u\) and \(\textbf{f}_i\). The interaction result \(\textbf{p}_{kl}\) between each pair of components \(z_k\) and \(z_l\) in \(\textbf{z}_{ui}\) is calculated as

$$\begin{aligned} \textbf{p}_{kl}=(\textbf{v}_k \bigodot \textbf{v}_l)z_k z_l,\ \ \ \ z_k, z_l\in \textbf{z}_{ui} \end{aligned}$$
(3)

where \(\bigodot \) denotes the element-wise product of two vectors, \(\textbf{p}_{kl} \in \mathbb {R}^{k_2}\), \(\textbf{v}_k\) (\(\textbf{v}_l\)) \(\in \mathbb {R}^{k_2}\) is the weight vector of \(z_k\) (\(z_l\)), and \(k_2\) denotes the dimensionality of the weight vector.
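As a small worked illustration of Eq. 3, the sketch below computes \(\textbf{p}_{kl}\) for every pair of components of a toy \(\textbf{z}_{ui}\). The function name `pairwise_interactions` and the numeric values are hypothetical, and only pairs with \(k < l\) are enumerated, which is an assumption in line with the summation range used in Eq. 6.

```python
from itertools import combinations


def pairwise_interactions(z, v):
    """Return {(k, l): p_kl} with p_kl = (v_k * v_l elementwise) * z_k * z_l for k < l."""
    return {
        (k, l): [a * b * z[k] * z[l] for a, b in zip(v[k], v[l])]
        for k, l in combinations(range(len(z)), 2)
    }


# Toy rating feature vector z_ui (n = 3) and weight vectors v_k (k_2 = 2).
z = [1.0, 2.0, 3.0]
v = [[1.0, 0.0], [0.5, 2.0], [2.0, 1.0]]
p = pairwise_interactions(z, v)
for pair, vec in sorted(p.items()):
    print(pair, vec)
```

With n components there are n(n-1)/2 such interaction vectors, each of dimension \(k_2\).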

As the interaction result \(\textbf{p}_{kl}\) does not always contribute to the final result with the same significance, we use an attention mechanism to get the attention score for a \(\textbf{p}_{kl}\).

$$\begin{aligned} \textbf{a}_{kl}^\prime =\mathbf {h_p}^\textsf{T}ReLU\left( \textbf{W}_p\textbf{p}_{kl}+\textbf{b}_1\right) +b_2 \end{aligned}$$
(4)
$$\begin{aligned} \textbf{a}_{kl}=\frac{\exp {\left( \textbf{a}_{kl}^\prime \right) }}{\sum _{k\in n,l\in n} \exp {\left( \textbf{a}_{kl}^\prime \right) }\ \ } \end{aligned}$$
(5)

where \(\textbf{W}_p \in \mathbb {R}^{k_3 \times k_2}\) is the weight matrix of \(\textbf{p}_{kl}\). We have \(\textbf{b}_1\) \(\in \mathbb {R}^{k_3}\), \(b_2\) \(\in \mathbb {R}^{1}\) and \(\mathbf {h_p}\) \(\in \mathbb {R}^{k_3}\). We normalize \(\textbf{a}_{kl}^\prime \) to \(\textbf{a}_{kl}\). The final result of the paired interaction part is obtained as

$$\begin{aligned} \textbf{h}_p^\textsf{T}\sum \nolimits _{k=1}^{n}\sum \nolimits _{l=k+1}^{n}\textbf{a}_{kl}(\textbf{v}_k \bigodot \textbf{v}_l)z_k z_l \end{aligned}$$
(6)

where \(\textbf{h}_p^\textsf{T} \in \mathbb {R}^{k_2}\) is the weight of the paired interaction part.

The resultant formula of the biased AFM is

$$\begin{aligned} r_{ui}=\textbf{h}_p^\textsf{T}\sum _{k=1}^{n}\sum _{l=k+1}^{n}\textbf{a}_{kl}(\textbf{v}_k \bigodot \textbf{v}_l)z_kz_l+(\sum _{k=1}^{n}{w_kz_k}+b_z)+\mu +b_u+b_i \end{aligned}$$
(7)

where \(r_{ui}\) is user u’s predicted rating on item i, and \(w_k \in \mathbb {R}^{1}\) (resp., \(b_z\) \(\in \mathbb {R}^{1}\)) represents the weight (resp., bias) of the linear regression part. \(\mu \) is the average rating of a domain which serves as the feature of that domain and can be calculated directly. \(b_u\) is the user feature bias indicating a user’s scoring habits and \(b_i\) is the item feature bias representing the overall scoring situation of an item. Both \(b_u\) and \(b_i\) are subject to random initialization and undergo subsequent training to achieve optimal performance.
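The five parts of Eq. 7 can be assembled as in the following minimal sketch. It is illustrative only: the function name `biased_afm` and all numeric values are hypothetical. Two assumptions are made explicit. First, the attention projection of Eq. 4 and the output weight of Eq. 6 are given separate names (`h_att` and `p_out`) because they have different dimensions (\(k_3\) and \(k_2\)). Second, the softmax of Eq. 5 is taken over the pairs with \(k < l\).

```python
import math
from itertools import combinations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(x):
    return [max(0.0, t) for t in x]


def biased_afm(z, v, W_p, b1, h_att, b2, p_out, w, b_z, mu, b_u, b_i):
    """Predicted rating r_ui following Eqs. 3-7 (names are illustrative)."""
    pairs = list(combinations(range(len(z)), 2))
    # Eq. 3: interaction vector for every pair k < l.
    p = [[a * b * z[k] * z[l] for a, b in zip(v[k], v[l])] for k, l in pairs]
    # Eq. 4: unnormalized attention score of each interaction.
    raw = [dot(h_att, relu([dot(row, pk) + b for row, b in zip(W_p, b1)])) + b2
           for pk in p]
    # Eq. 5: softmax normalization (assumed to run over the k < l pairs).
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    a = [e / total for e in exps]
    # Eq. 6: attention-weighted sum of interactions, projected to a scalar.
    pooled = [sum(a_j * pk[d] for a_j, pk in zip(a, p)) for d in range(len(p_out))]
    interaction = dot(p_out, pooled)
    # Eq. 7: add the linear part, the domain average, and the two biases.
    linear = dot(w, z) + b_z
    return interaction + linear + mu + b_u + b_i


# Toy example: zero attention parameters give uniform attention weights.
r_ui = biased_afm(
    z=[1.0, 2.0, 3.0],
    v=[[1.0, 0.0], [0.5, 2.0], [2.0, 1.0]],
    W_p=[[0.0, 0.0], [0.0, 0.0]], b1=[0.0, 0.0], h_att=[0.0, 0.0], b2=0.0,
    p_out=[0.03, 0.0], w=[0.1, 0.1, 0.1], b_z=0.0,
    mu=3.5, b_u=-0.2, b_i=0.1,
)
print(round(r_ui, 4))
```

Separating the domain average \(\mu\) from the learned biases \(b_u\) and \(b_i\) means that the interaction and linear parts only need to model deviations from the typical rating in that domain.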

3.3 Cross-domain user mapping module

After the single-domain recommendation, we use its intermediate results, namely the user feature vectors and user biases, for the cross-domain recommendation. To tackle the cold-start problem, we propose the cross-domain user mapping module.

In the auxiliary domain, user feature vectors \(\textbf{f}_{u}^A\) are divided into those of overlapping users \(\textbf{f}_{u_o}^A\) and cold-start users \(\textbf{f}_{u_c}^A\). Likewise, user bias \(b_{u}^A\) is split into the bias of the overlapping user \(b_{u_o}^A\) and the cold-start user \(b_{u_c}^A\). In the target domain, user feature vectors \(\textbf{f}_{u}^T\) and biases \(b_{u}^T\) are the overlapping user’s feature vector \(\textbf{f}_{u_o}^T\) and bias \(b_{u_o}^T\). This module aims to learn the user feature vector \(\textbf{f}_{u_c}^T\) and bias \(b_{u_c}^T\) of a cold-start user in the target domain.

Mapping user feature vectors. This module employs the DM-VAE approach to map cold-start user feature vectors \(\textbf{f}_{u_c}\) from the auxiliary domain to the target domain. In each domain, DM-VAE independently trains a VAE, which comprises an encoder and a decoder, both constructed as MLPs. Taking an overlapping user in the auxiliary domain in Figure 2 as an example, we use the encoder to disentangle the user feature vector \(\textbf{f}_{u_o}^A\) into two sub-vectors, representing domain-shared interests \(\textbf{e}_{u_o}^S\) and domain-specific interests \(\textbf{e}_{u_o}^A\). The formula for this part is as follows:

$$\begin{aligned} \mu _1, \sigma _1, \mu _2, \sigma _2= & {} encoder^A (\textbf{f}_{u}^A) \\ \nonumber \textbf{e}_{u_o}^S= & {} reparam(\mu _1, \log (\sigma _1^2)) \\ \nonumber \textbf{e}_{u_o}^A= & {} reparam(\mu _2, \log (\sigma _2^2)) \\ \nonumber \hat{\textbf{f}}_{u}^A= & {} decoder^A (\textbf{e}_{u_o}^S, \textbf{e}_{u_o}^A) \end{aligned}$$
(8)

where the encoder first encodes the input vector into two sets of means and standard deviations \(\mu _1, \sigma _1, \mu _2, \sigma _2\). Then, using the reparameterization trick, we generate samples from these distributions, where one sample represents domain-shared interests \(\textbf{e}_{u_o}^S\) and the other represents domain-specific interests \(\textbf{e}_{u_o}^A\). These two interest vectors are then concatenated and processed through the decoder to reconstruct the original feature vector.
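The flow of Eq. 8 can be sketched in a few lines of pure Python. This is a toy illustration under assumptions, not the DM-VAE implementation: the encoder and decoder are single linear layers rather than MLPs, the encoder emits log-variances directly instead of \(\sigma \), and all names (`linear`, `reparam`, `encode`, `decode`, `rand_layer`) and sizes are hypothetical.

```python
import math
import random


def linear(x, W, b):
    """One dense layer without activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def reparam(mu, logvar, rng):
    """Reparameterization trick: mu + sigma * eps, eps ~ N(0, 1), sigma = exp(logvar / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def encode(f, enc, d):
    """Toy encoder whose 4*d outputs are split into (mu_1, logvar_1, mu_2, logvar_2)."""
    out = linear(f, enc["W"], enc["b"])
    return out[:d], out[d:2 * d], out[2 * d:3 * d], out[3 * d:]


def decode(e_shared, e_specific, dec):
    """Toy decoder applied to the concatenation of the two interest vectors."""
    return linear(e_shared + e_specific, dec["W"], dec["b"])


def rand_layer(n_out, n_in, rng):
    """Hypothetical helper: random weights standing in for trained parameters."""
    return {"W": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            "b": [0.0] * n_out}


rng = random.Random(0)
k_f, d = 4, 2                          # feature size and size of each interest vector
enc_a = rand_layer(4 * d, k_f, rng)    # encoder^A
dec_a = rand_layer(k_f, 2 * d, rng)    # decoder^A
f_u = [0.3, -0.1, 0.8, 0.5]            # toy user feature vector in the auxiliary domain

mu1, logvar1, mu2, logvar2 = encode(f_u, enc_a, d)
e_shared = reparam(mu1, logvar1, rng)      # domain-shared interests
e_specific = reparam(mu2, logvar2, rng)    # domain-specific interests
f_hat = decode(e_shared, e_specific, dec_a)
print(len(e_shared), len(e_specific), len(f_hat))
```

Sampling through `reparam` keeps the noise source separate from the parameters, which is what allows gradients to flow back to the encoder during training.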

To guarantee the effective disentanglement of domain-shared interests \(\textbf{e}_{u_o}^S\) and domain-specific interests \(\textbf{e}_{u_o}^A\), we utilize a pair of Kullback-Leibler (KL) divergence losses \(\text {KL}_{\text {shared}}^A\) and \(\text {KL}_{\text {specific}}^A\). Subsequently, a reconstruction loss \(\mathcal {L}_{\text {recon}}^A\) is applied to ensure that the output of the decoder is a close approximation of the original input. Each loss is defined as:

$$\begin{aligned} \begin{aligned} \text {KL}_{\text {shared}}^A&= \text {KL}(\mu _1, \sigma _1^2)\\ \text {KL}_{\text {specific}}^A&= \text {KL}(\mu _2, \sigma _2^2)\\ \mathcal {L}_{\text {recon}}^A&= \text {loss}_{\text {recon}}(\textbf{f}_{u}^A, \hat{\textbf{f}}_{u}^A) \end{aligned} \end{aligned}$$
(9)
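Eq. 9 does not spell out the form of the KL term or of the reconstruction loss. The sketch below assumes the common choices of a standard normal prior, for which the KL divergence of a diagonal Gaussian has the closed form \(-\frac{1}{2}\sum _j (1+\log \sigma _j^2-\mu _j^2-\sigma _j^2)\), and a mean squared error reconstruction loss. Both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + math.log(s) - m * m - s for m, s in zip(mu, var))


def mse(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Toy values for one user in the auxiliary domain.
kl_shared = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])     # matches the prior exactly
kl_specific = kl_to_standard_normal([1.0], [1.0])             # mean shifted by one
recon = mse([1.0, 2.0], [1.0, 4.0])
print(kl_shared, kl_specific, recon)
```

Under this assumption the KL term vanishes when the encoder output equals the prior and grows as the mean moves away from zero or the variance away from one.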

After training a VAE in each of the two domains, we obtain the domain-specific interests of the overlapping users \(\textbf{e}_{u_o}^A\) and \(\textbf{e}_{u_o}^T\) in both domains as well as their domain-shared interests \(\textbf{e}_{u_o}^S\). Subsequently, we train an MLP to learn how the domain-specific interests of users transition across domains:

$$\begin{aligned} \textbf{e}_{u_o}^T = f_{mlp}(\textbf{e}_{u_o}^A) \end{aligned}$$
(10)

We use a loss function \(\mathcal {L}_{\text {map}}\) to ensure that domain-specific interests are mapped from the auxiliary domain to the target domain. For the domain-shared interests, we employed a loss function \(\mathcal {L}_{\text {com}}\) to ensure that the domain-shared interests learned by the two VAEs are consistent.

Once trained with data from overlapping users, the DM-VAE can be applied to the feature vectors of cold-start users \(\textbf{f}_{u_c}\) as shown in Figure 2. Since cold-start users only have interactions in the auxiliary domain, we first decompose their feature vectors using the encoder of the auxiliary domain, obtaining their domain-specific interests \(\textbf{e}_{u_c}^A\) and domain-shared interests \(\textbf{e}_{u_c}^S\). Then, using the trained MLP, we map the domain-specific interests from the auxiliary domain \(\textbf{e}_{u_c}^A\) to the target domain \(\textbf{e}_{u_c}^T\). Finally, with the domain-specific interests \(\textbf{e}_{u_c}^T\) and domain-shared interests \(\textbf{e}_{u_c}^S\) of the cold-start users in the target domain, we use the decoder of the target domain to derive the feature vector of the cold-start users in the target domain \(\textbf{f}_{u_c}^T\).
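The three-step inference for a cold-start user (encode in the auxiliary domain, map the domain-specific part, decode in the target domain) can be summarized by the sketch below. The function `map_cold_start_user` and the three toy callables are hypothetical stand-ins for the trained auxiliary encoder, the mapping MLP of Eq. 10, and the target decoder. The sketch also assumes a deterministic encoder output at inference time, which the text does not specify.

```python
def map_cold_start_user(f_aux, encode_aux, map_specific, decode_target):
    """Derive a target-domain feature vector from an auxiliary-domain one."""
    # Step 1: disentangle with the auxiliary-domain encoder.
    e_shared, e_specific_aux = encode_aux(f_aux)
    # Step 2: transfer only the domain-specific interests to the target domain.
    e_specific_tgt = map_specific(e_specific_aux)
    # Step 3: merge with the shared interests using the target-domain decoder.
    return decode_target(e_shared, e_specific_tgt)


# Toy stand-ins (not trained models): split in halves, scale by 2, concatenate.
def toy_encode_aux(f):
    half = len(f) // 2
    return f[:half], f[half:]


def toy_map_specific(e):
    return [2.0 * x for x in e]


def toy_decode_target(e_shared, e_specific):
    return e_shared + e_specific


f_target = map_cold_start_user([1.0, 2.0, 3.0, 4.0],
                               toy_encode_aux, toy_map_specific, toy_decode_target)
print(f_target)
```

Passing the three components as callables keeps the pipeline independent of how each one is implemented, so the same function applies once the real encoder, MLP, and decoder are available.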

Mapping user bias. The user bias vector encapsulates the user’s intrinsic attributes. Given its relatively simple structure, we employ an MLP to learn the mapping of user bias from the auxiliary domain \(b_{u}^A\) to the target domain \(b_{u}^T\), with the loss denoted by \(\mathcal {L}_{\text {bias}}\). This mapping is learned from the biases of overlapping users \(b_{u_o}\) in the two domains. By applying it to the cold-start user’s bias in the auxiliary domain \(b_{u_c}^A\), we can deduce the cold-start user’s bias in the target domain \(b_{u_c}^T\).

Cross-domain recommendations. Upon acquiring the feature vectors and biases for cold-start users in the target domain, we can deploy the prediction module to generate cross-domain recommendations. The distinctive aspect of applying the prediction module in the cross-domain scenario, as opposed to the single-domain scenario, lies in the data it uses: the user feature vector and bias are those derived for the cold-start user in the target domain, while the item feature vector and bias are those already trained for the items in the target domain.

After all item ratings are computed for cold-start users, the items with the highest ratings will be recommended to them.

3.4 Model training

For single-domain recommendations, the feature extraction module and prediction module are trained end-to-end. The loss function is the loss of predicted ratings.

For cross-domain recommendations, we use the intermediate results of single-domain recommendations as input and train the cross-domain user mapping module with the feature vectors and biases of overlapping users. The loss function for cross-domain recommendation \(\mathcal {L}_{\text {cross}}\) is:

$$\begin{aligned} \mathcal {L}_{\text {recon}}= & {} \mathcal {L}_{\text {recon}}^A+\mathcal {L}_{\text {recon}}^T\\ \nonumber \mathcal {L}_{\text {KL}}= & {} \text {KL}_{\text {shared}}^A+\text {KL}_{\text {shared}}^T+\text {KL}_{\text {specific}}^A+\text {KL}_{\text {specific}}^T\\ \nonumber \mathcal {L}_{\text {cross}}= & {} \alpha \mathcal {L}_{\text {recon}} +\beta \mathcal {L}_{\text {KL}} +\gamma (\mathcal {L}_{\text {map}} +\mathcal {L}_{\text {com}})+\delta \mathcal {L}_{\text {bias}}+\epsilon \mathcal {L}_{\text {score}} \end{aligned}$$
(11)

where \(\mathcal {L}_{\text {recon}}\) represents the reconstruction loss of the VAEs, \(\mathcal {L}_{\text {KL}}\) represents the KL divergence loss, \(\mathcal {L}_{\text {score}}\) represents the loss of predicted ratings, and \(\alpha , \beta , \gamma , \delta \) and \(\epsilon \) represent the weights of the components.
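For readers who prefer code to notation, Eq. 11 amounts to the weighted sum in the following sketch. The function name `cross_domain_loss` and the numeric values are hypothetical; in particular, the weights shown are arbitrary and are not the values used in the experiments.

```python
def cross_domain_loss(recon_a, recon_t, kl_terms, l_map, l_com, l_bias, l_score,
                      alpha, beta, gamma, delta, epsilon):
    """Weighted combination of the loss components in Eq. 11."""
    l_recon = recon_a + recon_t      # reconstruction losses of both VAEs
    l_kl = sum(kl_terms)             # shared and specific KL terms of both domains
    return (alpha * l_recon + beta * l_kl + gamma * (l_map + l_com)
            + delta * l_bias + epsilon * l_score)


loss = cross_domain_loss(
    recon_a=0.2, recon_t=0.3,
    kl_terms=[0.1, 0.1, 0.1, 0.1],
    l_map=0.05, l_com=0.05, l_bias=0.02, l_score=1.0,
    alpha=2.0, beta=0.5, gamma=1.0, delta=1.0, epsilon=1.0,
)
print(round(loss, 4))
```

Note that \(\mathcal {L}_{\text {map}}\) and \(\mathcal {L}_{\text {com}}\) share the single weight \(\gamma \), so their relative importance is fixed by Eq. 11.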

In our experiments, we use Adam [35] as the optimizer for training. It minimizes the error between the predicted rating and the real rating. We apply dropout to the review feature vector in the feature extraction module and the paired interactions part in the prediction module. We also apply L2 regularization to the weight matrices in the two attention mechanisms. These measures help to avoid overfitting.

4 Experiments

4.1 Dataset and evaluation metrics

The Amazon dataset (see Footnote 1) contains users, items, and ratings/reviews on items, where each rating is coupled with a review.

From the total 21 item categories, we select the three largest pairs of categories for experiments, namely movie-music, movie-book, and book-music. As some items and users receive only small numbers of reviews, we preprocess the data as follows. In particular, we select all items with more than 20 reviews in each domain, and then select the overlapping users with more than 10 reviews.

Since excluding all other users affects the number of reviews on the selected items, we repeat the process. The statistics of the datasets are shown in Table 3.
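The repeated filtering described above can be sketched as follows. This is a simplified single-domain illustration: the function name `iterative_filter` is hypothetical, the restriction to users who overlap across a pair of domains is omitted, and repeating until nothing changes is an assumption, since the text only states that the process is repeated.

```python
from collections import Counter


def iterative_filter(reviews, min_item_reviews=20, min_user_reviews=10):
    """Keep items with more than min_item_reviews reviews and users with more than
    min_user_reviews reviews, repeating until the review set stops shrinking.

    reviews: list of (user, item) pairs, one per review.
    """
    current = list(reviews)
    while True:
        item_counts = Counter(item for _, item in current)
        kept = [(u, i) for u, i in current if item_counts[i] > min_item_reviews]
        user_counts = Counter(u for u, _ in kept)
        kept = [(u, i) for u, i in kept if user_counts[u] > min_user_reviews]
        if len(kept) == len(current):
            return kept
        current = kept


# Toy data with small thresholds so that the cascade is visible.
toy_reviews = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
               ("c", "x"), ("c", "z"), ("d", "z")]
filtered = iterative_filter(toy_reviews, min_item_reviews=1, min_user_reviews=1)
print(filtered)
```

In the toy data, removing user d drops item z below the threshold, which in turn drops user c. This cascade is the reason a single filtering pass is not sufficient.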

Table 3 Statistics of the datasets

Following [31, 32], we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) as the evaluation metrics of user-item rating prediction.

$$\begin{aligned} \text {RMSE}=\bigg ( \frac{1}{H}\sum \nolimits _{h=1}^{H}\left( r_h^\prime -r_h\right) ^2 \bigg )^{1/2} \end{aligned}$$
(12)
$$\begin{aligned} \text {MAE}=\frac{1}{H}\sum \nolimits _{h=1}^{H}{|r_h^\prime -r_h|} \end{aligned}$$
(13)

where H denotes the number of test ratings, with \(r_h^\prime \) and \(r_h\) denoting the predicted and actual ratings for the h-th instance, respectively.
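Eqs. 12 and 13 translate directly into code. The following minimal sketch uses hypothetical function names and toy ratings.

```python
import math


def rmse(predicted, actual):
    """Root Mean Square Error over H test ratings (Eq. 12)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean Absolute Error over H test ratings (Eq. 13)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


predicted = [4.0, 3.0, 5.0]
actual = [5.0, 3.0, 3.0]
print(rmse(predicted, actual), mae(predicted, actual))
```

Because RMSE squares the errors before averaging, it penalizes the single large error in this example more heavily than MAE does.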

4.2 Experimental setup

We implement our framework in PyTorch and run it on a GPU. For each experiment, we randomly select half of the users and remove their information in the target domain, designating them as cold-start users. Initially, we conduct single-domain experiments in each domain, referred to as EPNet (Extract-Predict). For these single-domain experiments, we partition the data into training, validation, and test sets with a ratio of 8:1:1 and employ five-fold cross-validation to ensure the reliability of the results. We utilize BERT for preprocessing review texts, with each input review limited to a maximum length of 512 tokens, resulting in an output review vector of 768 dimensions. Through grid search optimization, we set the learning rate, regularization parameter, number of multi-head attention heads, and dimensions of user/item feature vectors to 0.0005, 0.0001, 4, and 20, respectively. For the cross-domain recommendation parameters, we employ Bayesian optimization to determine the optimal parameters for each experimental group.
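The construction of cold-start users can be sketched as follows. The function names, the fixed seed, and the (user, item, rating) tuple layout are hypothetical choices made for illustration, not details taken from the experiments.

```python
import random


def split_cold_start(overlapping_users, seed=0):
    """Randomly designate half of the overlapping users as cold-start users."""
    users = sorted(overlapping_users)     # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(users)
    half = len(users) // 2
    return set(users[:half]), set(users[half:])


def hide_target_interactions(target_reviews, cold_start_users):
    """Remove the target-domain records of the cold-start users."""
    return [(u, i, r) for u, i, r in target_reviews if u not in cold_start_users]


users = ["u1", "u2", "u3", "u4", "u5", "u6"]
target_reviews = [(u, "item%d" % n, 4.0) for n, u in enumerate(users)]
cold, remaining = split_cold_start(users, seed=0)
visible = hide_target_interactions(target_reviews, cold)
print(len(cold), len(remaining), len(visible))
```

The hidden target-domain records of the cold-start users can then serve as ground truth when the cross-domain predictions are evaluated.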

Table 4 Performance of cross-domain recommendation on RMSE and MAE

4.3 Cross-domain recommendation performance results

We compare EMPNet with the following four alternatives.

  • EMCDR [17]: This model first applies matrix factorization to learn the latent factors, and then uses the MLP network to map the user latent factors from the auxiliary domain to the target domain.

  • R-DFM [18]: It is a simplified version of RC-DFM [18]. It merges ratings and reviews through an extended aSDAE to enhance the representation of users and items, and also adopts an MLP for cross-domain user mapping.

  • CATN [31]: This model aims to extract multiple aspects from per-user and per-item review documents as well as auxiliary reviews of users with similar interests, and learn inter-aspect correlations across domains through an attention mechanism.

  • PTUPCDR [32]: This model learns a meta-network fed by user feature embeddings to generate personalized bridge functions that achieve personalized preference transfer for each user.

The RMSE and MAE results are shown in Table 4. EMPNet outperforms all baselines in most cross-domain experiments, demonstrating the effectiveness of our proposed model. EMCDR performs worst, mainly because it does not use reviews and its use of ratings is simpler than that of the other baselines. R-DFM incorporates review information only as incidental content in the rating mechanism, so reviews are underutilized and the matrix factorization method it adopts for rating prediction is less effective. CATN extracts multiple aspects of users and items from the review documents for cross-domain transfer, making fuller use of the review data, so it performs better than R-DFM. On a majority of datasets, PTUPCDR is surpassed only by EMPNet, which can be attributed to its personalized preference transfer.

It is worth noting that in both the “Book to Music” and “Music to Book” experiments, CATN outperforms PTUPCDR, which may be attributed to the amount of data in these experiments. Table 3 shows that these two experiments have the least data. A likely reason is that the use of additional review data improves the results more significantly when data is scarce.

In the “Book to Movie” experiment, the performance improvement of EMPNet is relatively small, which is related to the characteristics of the data. Table 3 shows that this experiment has the largest gap between the number of items in the auxiliary domain and that in the target domain. EMPNet performs the same operations on users and items in multiple steps, whereas CATN and PTUPCDR focus on user-centric information extraction. Consequently, the advantage of EMPNet is less pronounced in the “Book to Movie” experiment. Conversely, the “Book to Music” experiment has the smallest gap in the number of items, and EMPNet achieves its most pronounced improvement there.

While EMPNet demonstrates significant improvements in RMSE, the improvements in MAE are less pronounced. This discrepancy may stem from RMSE’s greater sensitivity to large prediction errors, which our model’s optimization strategy may mitigate more effectively. Because our optimization objective is tailored to RMSE, this could also explain why the gains in average absolute error are smaller than those in squared error. We leave the exploration of alternative loss functions, which might yield a more balanced improvement across both metrics, to future work.
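The different sensitivity of the two metrics can be seen in a small constructed example. The error values below are hypothetical and chosen only for illustration; they are not taken from Table 4. Two sets of prediction errors have the same MAE, but the set with one large error has a higher RMSE, so a model that spreads out large errors can improve RMSE without changing MAE.

```python
import math
from typing import Sequence


def mae_from_errors(errors: Sequence[float]) -> float:
    """Mean of the absolute prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse_from_errors(errors: Sequence[float]) -> float:
    """Square root of the mean of the squared prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


if __name__ == "__main__":
    # Hypothetical errors (predicted minus actual rating) on four test ratings.
    concentrated = [0.0, 0.0, 0.0, 4.0]  # one large error
    spread = [1.0, 1.0, 1.0, 1.0]        # same total error, evenly spread

    print(mae_from_errors(concentrated), rmse_from_errors(concentrated))  # 1.0 2.0
    print(mae_from_errors(spread), rmse_from_errors(spread))              # 1.0 1.0
```

In this toy case, moving from the concentrated to the spread error pattern halves RMSE while leaving MAE unchanged, which is the kind of gap between the two metrics discussed above.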

Table 5 Performance of the ablation study on EPNet-ATN and EPNet-AFM (best result in bold)

4.4 Ablation study

As shown in Section 4.3, our proposed model outperforms the baselines. These improvements come from three innovations of our model: the Identity-Enhanced Multi-Head Attention Mechanism in the feature extraction module, DM-VAE in the cross-domain user mapping module, and the biased AFM in the prediction module.

In this section, we conduct an ablation study to demonstrate the importance of each of the three innovations. Because the feature extraction module and the prediction module together constitute a single-domain recommender, we directly test the efficacy of the Identity-Enhanced Multi-Head Attention Mechanism and the biased AFM in single-domain recommendation. The effectiveness of DM-VAE is evaluated in cross-domain recommendation. Specifically, we compare the proposed model with the following variants:

  • EPNet-ATN: It does not utilize the identity information of the reviews. Instead, it directly feeds the review vectors output by BERT into the multi-head attention mechanism in the feature extraction module.

  • EPNet-AFM: It replaces the biased AFM with the unbiased AFM in the prediction module.

  • EMPNet-MLP: It uses an ordinary MLP instead of DM-VAE in the cross-domain user mapping module of EMPNet.

Table 6 Performance of the ablation study on EMPNet-MLP (best result in bold)

The results of the ablation experiments are shown in Tables 5 and 6; each proposed design is effective in all experiments. Using the Identity-Enhanced Multi-Head Attention Mechanism improves the results, which indicates that adding the user or item information corresponding to each review helps identify valuable reviews. The results also verify the effectiveness of the biased AFM: without the biases representing inherent features of users, items, and domains, EPNet-AFM predicts ratings less accurately. Finally, EMPNet outperforms EMPNet-MLP. This can be attributed to DM-VAE disentangling the user feature vector into domain-shared and domain-specific interests. By mapping only the domain-specific interests while keeping the domain-shared interests unchanged, EMPNet achieves more accurate cross-domain interest transfer than directly mapping the entire user feature vector.
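The contrast between the two mapping strategies can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the DM-VAE itself: it assumes the user feature vector has already been disentangled so that the first `n_shared` entries hold the domain-shared interests and the rest hold the domain-specific interests, and it uses a hypothetical linear map in place of a learned network. All names (`linear_map`, `transfer_disentangled`, `transfer_full`) are introduced here for illustration only.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def linear_map(weights: Sequence[Sequence[float]], bias: Sequence[float]) -> Mapping:
    """Build a hypothetical linear mapping x -> W x + b (stand-in for a learned map)."""
    def apply(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return apply


def transfer_disentangled(user_vec: Sequence[float], n_shared: int,
                          specific_map: Mapping) -> Vector:
    """Keep the domain-shared part unchanged and map only the domain-specific part."""
    shared = list(user_vec[:n_shared])
    specific = list(user_vec[n_shared:])
    return shared + specific_map(specific)


def transfer_full(user_vec: Sequence[float], full_map: Mapping) -> Vector:
    """Map the entire user vector, in the spirit of the EMPNet-MLP variant."""
    return full_map(list(user_vec))


if __name__ == "__main__":
    # Hypothetical 4-dimensional auxiliary-domain user vector:
    # two shared dimensions followed by two domain-specific dimensions.
    aux_user = [0.2, -0.1, 0.5, 0.3]
    specific_map = linear_map(weights=[[0.0, 1.0], [1.0, 0.0]], bias=[0.0, 0.0])

    target_user = transfer_disentangled(aux_user, n_shared=2, specific_map=specific_map)
    print(target_user[:2])  # shared part is passed through: [0.2, -0.1]
    print(target_user[2:])  # specific part is remapped: [0.3, 0.5]
```

The point of the sketch is structural: in the disentangled strategy the shared dimensions reach the target domain untouched, whereas a full-vector mapping is free to alter them.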

5 Conclusion and future work

In this paper, we propose EMPNet, a cross-domain recommendation model. For feature extraction, EMPNet uses BERT and the Identity-Enhanced Multi-Head Attention Mechanism to distinguish the impact of user and item reviews of different quality on the ratings. For cross-domain user mapping, EMPNet employs DM-VAE to disentangle the user feature vector into domain-shared and domain-specific interests, facilitating the cross-domain transfer that derives the cold-start user’s feature vector in the target domain. For rating prediction, EMPNet considers and differentiates multiple kinds of biases that represent the inherent features of users, items, and domains. Experiments on real data verify the effectiveness of these designs and the superior performance of EMPNet.

Several directions exist for future work. First, incorporating input data from multiple, diverse auxiliary domains may further improve cross-domain recommendation. Second, combining conventional recommendation models with foundation models may help cross-domain recommendation. Third, using multi-modal reviews, such as images and videos, may also improve cross-domain recommendation.