1 Introduction

Recommender systems are information search and decision support tools that address the problem of information overload by generating personalized suggestions for items, e.g., products and services, that suit the specific user’s needs and constraints (Resnick and Varian 1997; Ricci et al. 2011; Jannach et al. 2010). Collaborative filtering (CF) is a well-known recommendation technique that exploits ratings for items given by a network of users. A CF system generates recommendations by analyzing the similarities and relationships between the users, which can be extracted by observing the users’ interactions with the items managed by the system (Koren and Bell 2011; Desrosiers and Karypis 2011). Some popular examples of successful CF recommender systems are Amazon, Netflix, iTunes and Last.fm.

CF can be implemented in several variants, such as user-based (Herlocker et al. 1999) and item-based (Linden et al. 2003) heuristics, and matrix factorization (MF) models (Koren and Bell 2011). However, regardless of the specific variant that is used, CF methods share a common limitation: the so-called new user cold-start problem, which occurs when a system cannot generate personalized and relevant recommendations for a user who has just registered in the system. Although many solutions have been proposed (Elahi et al. 2014b; Enrich et al. 2013; Hu and Pu 2011; Park and Chu 2009; Rashid et al. 2008; Tkalcic et al. 2011; Lika et al. 2014; Son 2014), this problem is still challenging, and there is no single solution that can be applied to every domain or situation. Indeed, as we shall show later, different approaches better suit specific situations, e.g., when the new user has entered either zero or only a few ratings.

In this paper we propose to address the new user problem by assuming that the system has information about the users’ personality. Such information is used to enhance the effectiveness of CF. In fact, research has already shown that, in certain domains, people with similar personality traits are likely to have similar preferences (Cantador et al. 2013; Chausson 2010; Rawlings and Ciancarelli 1997; Rentfrow and Gosling 2003), and that correlations between user preferences and personality traits can be used to improve personalized recommendations (Hu and Pu 2011; Tkalcic et al. 2011). Hence, the common gist of the methods proposed in this paper is to exploit user personality information in order to identify the most useful user preference information (ratings or “likes”) for the system to generate accurate recommendations for a new user.

More specifically, we present three novel methods to alleviate the new user problem in CF: (a) personality-based CF, which directly improves the recommendation prediction model by incorporating user personality information, (b) personality-based active learning (AL), which utilizes personality information for identifying additional and useful user preference data to be elicited in a target domain, and (c) personality-based cross-domain recommendation, which uses personality information to better exploit user preference data from auxiliary domains in order to compensate for the lack of user preference data in the target domain.

The first method exploits the users’ personality information in an MF CF model. We focus on MF since it is an accurate CF technique (Koren and Bell 2011). While the classical MF is trained exclusively on a set of ratings, our personality-based MF method can partially compensate for the lack of ratings with personality information. In CF it has already been shown that personality can help overcome the new user problem (Hu and Pu 2011; Tkalcic et al. 2011). Nonetheless, unlike our work, previous studies have investigated the incorporation of personality into CF heuristics instead of into MF models. In this paper, we extend the MF model (Hu et al. 2008) by incorporating additional latent feature vectors, which are related to user personality, and by performing a new training procedure based on the alternating least squares (ALS) technique. As we shall show, our method is beneficial in cold-start situations where there is no rating for the target user.

The second method is an AL technique that, by knowing the users’ personality, aims at finding and eliciting the most informative user ratings with a minimal number of requests. In general, in AL it has been shown that asking a user to provide ratings for a set of selected items can improve the accuracy of CF (Rubens et al. 2011; Elahi et al. 2012, 2014a, b; Rashid et al. 2002, 2008). Traditional AL methods need some pre-existing ratings in order to select even more ratings to collect from the user. Our method identifies the items to request the user to rate by considering the user personality, which may be easier to obtain than a bootstrapping set of ratings (Braunhofer et al. 2014b), and, since personality is not domain dependent, can be obtained once and then reused in several recommendation domains. Some existing AL approaches have already exploited user personality information (Elahi et al. 2013, 2014b; Braunhofer et al. 2014b). In contrast to previous work, our personality-based AL method is optimized for positive-only feedback (e.g., likes, click-through data, item consumption counts), which is more easily acquired by the system than ratings. We also provide experimental results on a relatively large dataset in several domains, and show that our personality-aware AL method is able to acquire more likes than a number of baseline methods for users completely new to the system.

Finally, the third method addresses the new user problem by enhancing cross-domain recommendation techniques with user personality information. Cross-domain recommender systems aim at improving their performance in a target domain by relying on information about the user’s preferences in a related source domain (Cantador and Cremonesi 2014). The goal of these systems is to transfer useful knowledge from auxiliary domains to the target domain in order to produce better rating predictions and item recommendations in the target domain. Recently, cross-domain recommendation techniques have been applied to address the new user problem, by leveraging the knowledge of common rating patterns (Gao et al. 2013; Li et al. 2009; Pan et al. 2010), shared social tags (Shi et al. 2011; Enrich et al. 2013), and linked semantic concepts (Kaminskas et al. 2014) in the source and target domains. To the best of our knowledge, our cross-domain recommendation method represents the first attempt to exploit user personality as a “bridge” to transfer user preference knowledge between domains, and address cold-start situations in the target domain.

Experiments on a large dataset in various application domains, namely movie, music and book recommendations, reveal that the proposed methods, based on the exploitation of user personality, allow CF to better tackle the new user problem. This is especially true in the case of completely new users with no rating history at all (i.e., users having no training ratings), who are typically provided with non-personalized suggestions based only on the popularity of the items. We show that these users can significantly benefit from the application of the proposed methods, which are able to generate personalized recommendations that improve the precision of the system by 6 to 94 %, depending on the domain. We also show, however, that this benefit vanishes once a sufficient number of training ratings for a user becomes available.

The remainder of the paper is structured as follows. In Sect. 2 we review the relevant state of the art, and in Sect. 3 we detail our methods. In Sect. 4 we describe the dataset and methodology used to evaluate the methods. Next, in Sect. 5 we report the results of the conducted evaluation and provide a comprehensive discussion comparing the methods and describing possible application scenarios for them. Finally, in Sect. 6 we provide some conclusions and lines of future research.

2 Related work

2.1 On the relationships between user personality and preferences

Personality is a predictable and quite stable factor that forms human behaviors. In psychology literature, personality is described as a “consistent behavior pattern and interpersonal processes originating within the individual” (Burger 2010), accounting for individual differences in people’s emotional, interpersonal, experiential, attitudinal and motivational styles (John and Srivastava 1999). Different models have been proposed to characterize and represent human personality. Among them, the five factor model (FFM; Costa and McCrae 1992) is considered one of the most comprehensive, and has been mostly used to build user profiles (Hu and Pu 2011). The FFM introduces five broad dimensions—called factors or traits, and commonly known as the Big Five—to describe an individual’s personality: openness (OPE), conscientiousness (COS), extraversion (EXT), agreeableness (AGR) and neuroticism (NEU). The OPE factor reflects a person’s tendency to intellectual curiosity, creativity and preference for novelty and variety of experiences. The COS factor reflects a person’s tendency to show self-discipline and aim for personal achievements, and have an organized (not spontaneous) and dependable behavior. The EXT factor reflects a person’s tendency to seek stimulation in the company of others, and put energy in finding positive emotions. The AGR factor reflects a person’s tendency to be kind, concerned, truthful and cooperative towards others. Finally, the NEU factor reflects a person’s tendency to experience unpleasant emotions, and refers to the degree of emotional stability and impulse control.

The measurement of the five factors is usually performed by assessing “items” that are self-descriptive sentences or adjectives, and are commonly presented to the subjects in the form of short questions. In this context, the international personality item pool (IPIP) is a publicly available collection of items for use in psychometric tests, and the 20–100 item IPIP proxy for Costa and McCrae’s commercial NEO PI-R test (IPIP-NEO, see Goldberg et al. 2006) is one of the most popular and widely accepted questionnaires to measure the Big Five in adult men and women without overt psychopathology. Alternatively, approaches exist that attempt to infer people’s personality factors implicitly, e.g., by mining user generated contents in social media (Farnadi et al. 2016) and analyzing social network structure (Lepri et al. 2016).

Personality influences how people make decisions (Nunes and Hu 2012), and it has also been shown that people with similar personality traits are likely to have similar tastes. For example, Rentfrow and Gosling (2003) investigated how music preferences are related to personality in terms of the FFM. They showed that “reflective” people with high OPE usually have preferences for jazz, blues and classical music, and “energetic” people with a high degree of EXT and AGR usually appreciate rap, hip-hop, funk and electronic music. Rawlings and Ciancarelli (1997) observed that OPE and EXT are the personality factors that best explain the variance in personal music preferences. They showed that people with high OPE tend to like diverse music styles, and people with high EXT are likely to have preferences for popular music. In the movie domain, Chausson (2010) presented a study showing that people open to experiences are likely to prefer comedy and fantasy movies, conscientious individuals are more inclined to enjoy action movies, and neurotic people tend to like romantic movies. Odic et al. (2013) explored the relations between personality factors and induced emotions while watching movies in different social contexts (e.g., watching a movie alone and with someone else), and observed different patterns in experienced emotions as functions of the EXT, AGR and NEU factors. More recently, Braunhofer et al. (2015) have shown that exploiting personality information in CF is more effective than exploiting demographic information of users, which is a more typical approach for dealing with the new user problem in recommender systems. In particular, they showed that exploiting even a single personality factor (out of the five factors) may lead to a considerable improvement in recommendation accuracy.

Extending the spectrum of analyzed domains, Rentfrow et al. (2011) studied the relations between personality factors and user preferences in several entertainment domains, namely movies, TV shows, books, magazines and music. They focused their study on five content categories: aesthetic, cerebral, communal, dark and thrilling. The authors observed positive and negative relations between such categories and some of the personality factors, e.g., they showed that aesthetic content relates positively with AGR and negatively with NEU, and that cerebral content correlates with EXT. Cantador et al. (2013) also considered several domains (movies, TV shows, books and music), and presented a preliminary study on the relations between personality types and entertainment preferences. Analyzing a large dataset of personality factor and genre preference user profiles, the authors extracted personality-based user stereotypes for each genre, and inferred association rules and similarities between types of personality of people with preferences for particular genres. Finally, in the multi-domain scenario of the Web, Kosinski et al. (2012) presented a study revealing meaningful psychological relations between user preferences and personality for certain websites and website categories.

We note that, as shown in Cantador et al. (2013), additional user characteristics, such as the user’s age and gender, and more fine-grained personality representations such as those based on personality facets, e.g., the imagination, artistic interests, and emotionality facets for the OPE factor, are likely to be of importance when discovering relationships between user preferences and personality. In the reviewed works and in this paper, such characteristics are not taken into consideration, and are left for future investigation. We develop and evaluate our methods based on the fact that certain relationships exist between user preferences and personality, which can benefit CF in cold-start situations, as done in previous work and shown in the next subsection.

2.2 Personality-based collaborative filtering

The existence of certain relationships between personality characteristics and user preferences has motivated earlier studies supporting the hypothesis that exploiting personality information in CF is beneficial. Tkalcic et al. (2011) applied and evaluated three user similarity metrics for heuristic-based CF: a typical rating-based similarity, a similarity based on the Euclidean distance with personality data (five factors), and a similarity based on a weighted Euclidean distance with personality data. Their results show that approaches using personality data may perform statistically equivalent to or better than rating-only based approaches, especially in cold-start situations. In her PhD dissertation, Nunes (2009) explored the use of a personality user profile composed of IPIP-NEO items and facets in addition to the Big Five factors, showing that fine-grained personality user profiles can achieve better recommendations. Following the findings of Rentfrow and Gosling (2003), Hu et al. (2009) and Hu and Pu (2011) presented a CF approach that leverages the correlations between personality types and music preferences: the similarity between two users is estimated by means of the Pearson’s correlation coefficient on the users’ five factor scores. Combining this approach with a rating-based CF technique, the authors showed significant improvements over the baseline of considering only ratings data. Finally, Roshchina (2012) presented an approach that extracts five factor profiles by analyzing hotel reviews written by users, and incorporates these profiles into a nearest neighbor algorithm to enhance personalized recommendations.
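To make the personality-based user similarity mentioned above more concrete, the following is a minimal sketch of Pearson’s correlation computed over two users’ five factor score vectors. The function name `pearson` and the toy profiles are hypothetical and only illustrate the general idea; they are not taken from the cited works, which may differ in details such as normalization or the handling of constant profiles.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson's correlation between two equally long score vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: a constant profile carries no similarity signal.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Toy five factor profiles in the order (OPE, COS, EXT, AGR, NEU).
# The values are invented for illustration only.
user_a = (2.3, 4.0, 3.6, 5.0, 1.2)
user_b = (1.0, 2.0, 3.0, 4.0, 5.0)
user_c = (5.0, 4.0, 3.0, 2.0, 1.0)

sim_same = pearson(user_a, user_a)      # identical profiles
sim_opposite = pearson(user_b, user_c)  # perfectly reversed profiles
```

In a neighborhood-based recommender, such a similarity would typically be used to select and weight the neighbors of a new user who has a personality profile but no ratings yet.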

It is worth noting that the above mentioned works on personality-aware CF make use of heuristic-based methods to compute user similarities and item rating estimations. Differently from them, in this paper we propose an MF CF model, which has been shown to be more effective than heuristic approaches, that integrates the users’ rating data and personality information. Moreover, with respect to previous work, the experimental study presented here has been conducted on much larger datasets composed of “likes”, which are positive-only evaluations, rather than ratings, in several domains. Specifically, as described in Sect. 4, our dataset consists of 159,551 users and 16,303 items in the movie, music and book domains, while the dataset considered in Hu et al. (2009) and Hu and Pu (2011) contains only 111 users, those in Nunes (2009) and Roshchina (2012) contain around 100 users, and that in Tkalcic et al. (2011) contains 52 users; all of these datasets contain only a very limited number of items.

Moreover, we shall illustrate a more diverse set of results. We will observe that the users’ personality is not equally useful in all the considered domains. For instance, the usage of personality in the movie and music domains yields higher precision compared to the book domain. This may be due to the strength of the correlation between people’s personality and the characteristics of the domain. Hence, personality may affect users’ decisions about which movie to watch or which song to listen to much more than their decisions about which book to read. We will also show that, while the personality information can significantly improve the recommendation precision when the user has not yet entered a single like, it is no longer effective once the user has entered a certain number of likes.

2.3 Active learning in collaborative filtering

AL in CF aims to actively acquire user preference data in order to improve the output of the recommendation process (Rubens et al. 2011; Elahi et al. 2012, 2014a, b). In two separate works, Rashid et al. (2002, 2008) proposed eight techniques that CF systems can use to acquire user preferences in the sign-up stage. These include: entropy, where items with the largest rating entropy are preferred; random; popularity; \(\log (popularity) \cdot entropy\), where items that are both popular and have diverse ratings are preferred; item–item personalized, where the items are proposed randomly until one rating is acquired, and then a recommender system is used to predict the ratings for items that the user is likely to have seen; IGCN, which builds a tree where each node is labeled by a particular item that the user is asked to rate; and Entropy0, which extends the entropy method by considering the missing value as a possible rating (category 0). They conducted offline and online simulations, and concluded that IGCN and Entropy0 perform the best in terms of accuracy.
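The idea behind Entropy0, i.e., treating a missing rating as an additional category 0 when computing the rating entropy of an item, can be sketched as follows. This is a simplified illustration under our own assumptions: it uses plain, unweighted Shannon entropy (base 2) and assumes that observed ratings are integers from 1 to 5; the original formulation may weight the categories differently. The function name `entropy0` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable


def entropy0(item_ratings: Dict[Hashable, int], n_users: int) -> float:
    """Rating entropy of one item, counting missing ratings as category 0.

    item_ratings maps each user who rated the item to a rating in 1..5;
    n_users is the total number of users in the system.
    """
    counts = Counter(item_ratings.values())
    counts[0] += n_users - len(item_ratings)  # users without a rating
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / n_users
            entropy -= p * math.log2(p)
    return entropy


# Toy data with four users in total (values invented for illustration).
popular_and_diverse = entropy0({"u1": 5, "u2": 1}, n_users=4)
never_rated = entropy0({}, n_users=4)
```

An item that nobody has rated gets a score of zero, so, unlike pure entropy, this score tends to favor items that are both rated by many users and rated diversely.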

In 2010 and 2011, Golbandi et al. proposed four strategies for rating elicitation in CF. The first method, GreedyExtend, selects the items that minimize the root mean square error (RMSE) of the rating prediction (on the training set). The second one, named Var, selects the items with the largest \(\sqrt{popularity} \cdot variance\), i.e., those with many and diverse ratings in the training set. The third one, called Coverage, selects the items with the largest coverage, which is defined as the total number of users who co-rated both the selected item and any other item. The fourth method, called Adaptive, is based on decision trees where each node is labeled with an item (movie). The node divides the users into three groups based on their ratings for that item: lovers, who rated the item high, haters, who rated the item low, and unknowns, who did not rate the item. Their results showed excellent performance of the Adaptive method in terms of reduction of RMSE compared with other strategies.

Among the previous strategies, we have chosen Entropy0 as baseline method, since it (or a similar variant) has shown excellent results in several works (Rashid et al. 2002, 2008; McNee et al. 2003; Carenini et al. 2003; Rubens and Sugiyama 2007; Mello et al. 2010). This method aims at balancing the quantity and quality of the acquired ratings, in the sense that it attempts to collect as many ratings as possible, but also takes their relative informativeness into account. It assigns higher scores to items that are both likely to be known by the users and informative about their preferences. We note that we could not use a decision tree based strategy as baseline, since such strategies rely on an RMSE reduction criterion computed on training datasets that contain ratings on a 1–5 Likert scale. In contrast, in this work we use unary rated datasets (a user expresses only a set of “likes”) and hence, RMSE is not an appropriate error measure. That is also why we cannot use traditional recommendation evaluation metrics based on rating prediction error, such as RMSE or MAE.

2.4 Cross-domain recommender systems

The proliferation of e-commerce sites and online social networks has led users to provide feedback and maintain profiles in multiple systems, reflecting a variety of their tastes and interests. Leveraging all the user preferences available in several systems or domains may be beneficial for generating more encompassing user models and better recommendations, e.g., through mitigating the cold-start and sparsity problems in a target domain, or enabling personalized cross-selling recommendations for items from multiple domains.

In this context, cross-domain recommender systems aim to generate or enhance personalized recommendations in a target domain by exploiting knowledge (mainly user preferences) from auxiliary source domains (Fernández-Tobías et al. 2012; Winoto and Tang 2008). This problem has been addressed from various perspectives in different research areas. It has been tackled by means of user preference aggregation and mediation strategies for the cross-system personalization problem in user modeling (Abel et al. 2013; Berkovsky et al. 2008; Shapira et al. 2013; Cantador et al. 2015), as a potential solution to mitigate the cold-start and sparsity problems in recommender systems (Cremonesi et al. 2011; Shi et al. 2011; Tiroshi et al. 2013; Enrich et al. 2013) and as a practical application of knowledge transfer in machine learning (Gao et al. 2013; Li et al. 2009; Pan et al. 2010).

We distinguish between two main types of cross-domain recommendation approaches: those that aggregate knowledge from various source domains into the target domain where recommendations are performed, and those that link or transfer knowledge between domains to support recommendations in the target domain. The knowledge aggregation methods merge user preferences (e.g., ratings, social tags, and semantic concepts; Abel et al. 2013), mediate user modeling data exploited by various recommender systems (e.g., user similarities and user neighborhoods; Shapira et al. 2013), and combine single-domain recommendations (e.g., rating estimations and rating probability distributions; Berkovsky et al. 2008). The knowledge linkage and transfer methods may relate domains by a common knowledge (e.g., item attributes, association rules, semantic networks, and inter-domain correlations; Cremonesi et al. 2011; Tiroshi et al. 2013), share implicit latent features that relate source and target domains (Pan et al. 2010; Shi et al. 2011), and exploit explicit or implicit rating patterns from source domains in the target domain (Gao et al. 2013; Li et al. 2009).

With respect to previous work, to the best of our knowledge, this paper presents the first attempts to exploit user personality for cross-domain recommendation. Our methods can be considered as aggregation cross-domain recommendation approaches that merge both user personality traits and preferences from various domains.

3 Personality-based recommendation methods for new users

As already mentioned, in this paper we address the new user cold-start problem in CF, which occurs when a recommender system is unable to accurately suggest items in a target domain to a user for whom no or few preference data are available in that domain. In order to illustrate this problem, we consider the typical input dataset used by CF, i.e., a sparse matrix that contains some users’ ratings (likes) for a certain number of items. Each row of this matrix contains the ratings given by a user, and each column contains the ratings received by an item. When a new user registers in the system, a new empty row is added to the rating matrix and the system is unable to generate any rating predictions for this user.

Fig. 1 Exploiting user personality and cross-domain preferences to address the new user cold-start problem. The users’ personality factors and preferences (likes) for items in two domains, a target domain (music) and an auxiliary source domain (movies), are considered. User \(u_1\) has no ratings in the target domain

The way in which we tackle the problem is depicted in Fig. 1. We distinguish between two main types of approaches: (i) directly asking the user to rate some items in order to gather some information about her preferences before computing any recommendation, and (ii) exploiting auxiliary information about the user’s preferences or other personal information that might be useful for the system to establish similarities between her and other users.

Both approaches have specific advantages and disadvantages. Preference elicitation approaches are more robust in the sense that they tackle the problem by directly acquiring the data the system ultimately needs. On the negative side, they require an initial effort from the user to rate some items when she registers, which may discourage her from using the system. Also, the system has to be careful when selecting the items to request the user to rate: the user should be familiar with them; otherwise she will not be able to rate them and her perception of the usefulness of the system could also be negatively affected. In contrast, approaches that exploit auxiliary information do not require the user to rate items. Nonetheless, a method to inject the additional user data into the CF framework has to be devised, and it may be hard to know in advance whether the auxiliary information will actually be helpful in cold-start situations. Furthermore, it is typically the case that the auxiliary information is explicitly requested from the user, thus requiring an initial effort from her as in preference elicitation strategies.

Our work is based on the hypothesis that information about the user’s personality is available and can be used to enhance both types of approaches for tackling the cold-start problem. First, we propose a novel extension of an MF technique for positive-only feedback (e.g., click-through data, browsing history, item consumption counts) that is capable of exploiting auxiliary personality information for recommendation. Second, we propose a personality-based MF algorithm for preference elicitation. Before recommending items, personality is exploited in the “likes” acquisition phase to more effectively select items that the user is likely to like. Finally, we further extend the proposed personality-based MF algorithm by integrating cross-domain user preferences. When little or no information about the user is available in a target domain (e.g., music), but her preferences in a different source domain are known (e.g., movies), we conjecture that personality information can be used to better leverage the auxiliary cross-domain data in order to provide recommendations in the target domain. We detail these three methods in the following subsections.

3.1 Personality-based matrix factorization for positive-only user feedback

In this section we describe the proposed MF method extended with personality information. First, let \(U,\,I\) be the sets of users and items registered in a system, respectively, and let \(\mathbf p _u \in {\mathbb {R}}^k,\,\mathbf q _i \in {\mathbb {R}}^k\) be latent feature vectors for user \(u \in U\) and item \(i \in I.\) In a simple MF model, the user u’s preference towards item i is estimated with the dot product of the user and item latent feature vectors:

$$\begin{aligned} \hat{r}_{ui} = \mathbf p _u \cdot \mathbf q _i. \end{aligned}$$
(1)

A list of recommended items for user u is generated by sorting the items in I by decreasing order of estimated preference, possibly ignoring those that the user has already rated.
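As a small illustration of Eq. 1 and of the ranking step just described, the following sketch computes the dot product of user and item latent feature vectors and returns the top-N unrated items. The helper names (`predict`, `recommend`) and the toy latent vectors are hypothetical; in practice the vectors would be learned from data rather than set by hand.

```python
from typing import Dict, List, Sequence, Set


def predict(p_u: Sequence[float], q_i: Sequence[float]) -> float:
    """Eq. 1: estimated preference as the dot product of p_u and q_i."""
    return sum(p * q for p, q in zip(p_u, q_i))


def recommend(
    p_u: Sequence[float],
    item_factors: Dict[str, Sequence[float]],
    already_rated: Set[str],
    n: int,
) -> List[str]:
    """Rank the items not yet rated by decreasing estimated preference."""
    candidates = [i for i in item_factors if i not in already_rated]
    candidates.sort(key=lambda i: predict(p_u, item_factors[i]), reverse=True)
    return candidates[:n]


# Toy latent factors with k = 2; the values are invented for illustration.
p_u = [0.9, 0.1]
item_factors = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}

top = recommend(p_u, item_factors, already_rated={"a"}, n=2)
```

Here item "a" has the highest estimated preference but is skipped because the user has already rated it.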

This method has been extensively studied in the literature, and it is known to yield inaccurate predictions in cold-start situations. When little information about the user is known, the learned parameter \(\mathbf p _u\) is unlikely to model properly the user’s latent preferences, and for users completely new to the system this method is simply unable to compute any rating prediction. In our adaptation, we overcome these limitations by introducing additional parameters to model the user’s personality.

Among the existing models for representing personality, in this work we focus on the five factor one. As explained in Sect. 2, in the FFM the personality of each user is described using five independent dimensions or factors, namely OPE, COS, EXT, AGR and NEU. A user’s personality profile is thus represented with a score for each factor, typically a real number in a range such as \([1,\,5].\)

In order to use this information we follow the same strategy as in Elahi et al. (2013) and map the five factors to a fixed set of Boolean attributes \({\mathcal {A}}.\) Specifically, let \(\mathbf u = (ope_u,\,cos_u,\,ext_u,\,agr_u,\,neu_u)\) be the vector representation of user u’s five factor scores. We first discretize each score by rounding it to the nearest integer, and then we map each score to a different attribute depending on its value and factor. Therefore, we consider 25 possible attributes, 5 for each personality factor: \(\lbrace ope_1,\,ope_2,\ldots ,ope_5,\,cos_1,\ldots , neu_5 \rbrace .\) For instance, a user with personality profile \(\mathbf u = (2.3,\,4.0,\, 3.6,\, 5.0,\,1.2)\) will be assigned the set of attributes \({\mathcal {A}}(u) = \lbrace ope_2,\,cos_4,\,ext_4,\, agr_5,\, neu_1 \rbrace .\) We considered and evaluated other personality factor discretization schemes and personality-based user profile representations, but they performed worse when incorporated into the MF model.
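The discretization described above can be sketched as follows. The function name `personality_attributes` and the string encoding of the attributes are hypothetical. The text does not specify how ties such as 2.5 are rounded, so this sketch assumes rounding half up, and it additionally clamps the result to the range 1 to 5 as a safety measure.

```python
import math
from typing import Sequence, Set

FACTORS = ("ope", "cos", "ext", "agr", "neu")


def personality_attributes(scores: Sequence[float]) -> Set[str]:
    """Map five factor scores in [1, 5] to the user's Boolean attributes."""
    if len(scores) != len(FACTORS):
        raise ValueError("expected one score per personality factor")
    attributes = set()
    for factor, score in zip(FACTORS, scores):
        # Assumption: ties are rounded half up (Python's built-in round()
        # would round half to even instead).
        level = math.floor(score + 0.5)
        level = min(5, max(1, level))  # assumption: clamp to the 1..5 scale
        attributes.add(f"{factor}_{level}")
    return attributes


# The example profile from the text.
attrs = personality_attributes((2.3, 4.0, 3.6, 5.0, 1.2))
```

For the example profile, the resulting set corresponds to the attributes \(ope_2,\,cos_4,\,ext_4,\,agr_5,\,neu_1\) given in the text.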

Once the user’s personality factor scores are transformed, we modify Eq. 1 as in Elahi et al. (2013) to take personality information into account when computing predictions. Specifically, they define a new additional latent feature vector \(\mathbf y _a \in {\mathbb {R}}^k\) for each attribute \(a \in {\mathcal {A}}.\) Now, the users are not only modeled in terms of their preferences, but also considering their personality attributes:

$$\begin{aligned} \hat{r}_{ui} = \mathbf q _i \cdot \left( \mathbf p _u + \sum _{a \in {\mathcal {A}}(u)} \mathbf y _a \right) . \end{aligned}$$
(2)

One important feature of this model is that it is capable of computing rating/like predictions even if the user is completely new to the system, making it ideal for cold-start situations. In such cases, the vector \(\mathbf p _u\) is ignored and ratings/likes are estimated only on the basis of attribute information.
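A minimal sketch of the prediction rule in Eq. 2, including the cold-start case, is given below. The function name `predict_with_personality` and the toy vectors are hypothetical. Ignoring \(\mathbf p _u\) for a new user is modeled here by treating it as a zero vector, which is our reading of the text rather than a detail stated in it.

```python
from typing import Dict, Iterable, Optional, Sequence


def predict_with_personality(
    q_i: Sequence[float],
    p_u: Optional[Sequence[float]],
    attributes: Iterable[str],
    y: Dict[str, Sequence[float]],
) -> float:
    """Eq. 2: q_i . (p_u + sum of y_a over the user's attributes).

    Pass p_u=None for a completely new user; the prediction then relies
    on the personality attribute vectors only (assumption: p_u = 0).
    """
    k = len(q_i)
    profile = list(p_u) if p_u is not None else [0.0] * k
    for a in attributes:
        for f in range(k):
            profile[f] += y[a][f]
    return sum(q * v for q, v in zip(q_i, profile))


# Toy parameters with k = 2; the values are invented for illustration.
q_i = [1.0, 2.0]
y = {"ope_2": [0.5, 0.0], "ext_4": [0.0, 0.25]}
user_attributes = ["ope_2", "ext_4"]

cold_score = predict_with_personality(q_i, None, user_attributes, y)
warm_score = predict_with_personality(q_i, [1.0, 1.0], user_attributes, y)
```

Because the attribute vectors are shared by all users with the same attribute, they can be learned from existing users and reused for a new user as soon as her personality profile is known.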

The prediction model defined in Eq. 2 is inspired by the well-known and widely used SVD++ model (Koren 2008). SVD++ incorporates implicit feedback by introducing latent feature vectors for items rated by the user, whereas in Eq. 2 the user’s profile is augmented with latent feature vectors that model her personality. Unlike Elahi et al. (2013) and Koren (2008), the method we propose in this paper is intended for the top-N recommendation task in the presence of positive-only user feedback (likes), rather than rating prediction. We argue that positive-only feedback is more common in real applications, where users are usually not inclined to explicitly evaluate the items. Click-through data, browsing history, or item consumption counts are instead more easily acquired by the system, without requiring any effort from the user. However, it must be taken into account that in this setting, information about the users’ dislikes is not available, and the fact that a user did not select a particular item might either indicate that the item is unknown to her or that she actually dislikes it.

In order to deal with the different nature of this kind of feedback, we follow the work of Hu et al. (2008), where the MF method was adapted for positive-only feedback. In their model, predictions are still computed using Eq. 1, but unlike standard MF, which learns the model parameters by considering only the observed ratings, Hu et al.’s method also takes into account the unobserved ones. They argue that in this case the commonly used stochastic gradient descent algorithm is no longer efficient, and propose an alternative optimization procedure based on ALS. In our personality-based model we incorporate the same learning approach but for a different prediction model, namely the one shown in Eq. 2. The resulting loss function penalizes prediction errors over all possible user–item pairs, not only those for which an interaction was observed, and includes the additional model parameters for the personality factors:

$$\begin{aligned} {\mathcal {L}}(\mathbf P ,\,\mathbf Q ,\,\mathbf Y ) = \sum _u \sum _i c_{ui} \left( x_{ui} - \hat{x}_{ui} \right) ^2 + \lambda \left( \Vert \mathbf P \Vert ^2 + \Vert \mathbf Q \Vert ^2 + \Vert \mathbf Y \Vert ^2 \right) . \end{aligned}$$
(3)

Here \(x_{ui} = 1\) if user u consumed (clicked, liked, purchased) item i, and \(x_{ui} = 0\) otherwise. \(\hat{x}_{ui}\) is the model’s prediction computed using Eq. 2. Each row of the matrices \(\mathbf P \in {\mathbb {R}}^{\vert U \vert \times k},\, \mathbf Q \in {\mathbb {R}}^{\vert I \vert \times k},\, \mathbf Y \in {\mathbb {R}}^{\vert A \vert \times k}\) contains the latent feature vector of a user, an item and an attribute, respectively. The confidence parameter \(c_{ui}\) controls how much the model penalizes mistakes in the prediction of \(x_{ui},\) and is set to \(c_{ui} = 1 + \alpha k_{ui}\) as proposed in Hu et al. (2008). \(k_{ui}\) represents user u’s feedback for item i, which is binary in the case of clicks and likes, or a positive number, e.g., for item consumption counts, and is set to \(k_{ui} = 0\) when no interaction was observed. The constant \(\alpha \) models the increase in confidence for observed feedback. Finally, the regularization parameter \(\lambda \in {\mathbb {R}}^{+}\) is used to prevent overfitting.
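
The loss in Eq. 3 can be evaluated directly on small dense matrices, as in the sketch below. The 0/1 indicator matrix `A` (with `A[u, a] = 1` when attribute a belongs to user u), the function names and the values of `alpha` and `lam` are assumptions introduced for illustration.

```python
import numpy as np

def confidence(K, alpha):
    """c_ui = 1 + alpha * k_ui, applied element-wise."""
    return 1.0 + alpha * K

def loss(X, K, P, Q, Y, A, alpha, lam):
    """Dense evaluation of Eq. 3 over all user-item pairs."""
    Z = P + A @ Y                       # row u: p_u + sum of y_a over A(u)
    X_hat = Z @ Q.T                     # predictions of Eq. 2 for every pair
    err = (confidence(K, alpha) * (X - X_hat) ** 2).sum()
    reg = lam * ((P ** 2).sum() + (Q ** 2).sum() + (Y ** 2).sum())
    return err + reg

# Toy data (assumed): 2 users, 2 items, 3 attributes, k = 2.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
K = X.copy()                            # binary feedback, so k_ui = x_ui
A = np.array([[1, 0, 1], [0, 1, 1]])
P, Q, Y = np.zeros((2, 2)), np.zeros((2, 2)), np.zeros((3, 2))

# With all parameters at zero only the two observed likes contribute,
# each with weight 1 + alpha.
zero_loss = loss(X, K, P, Q, Y, A, alpha=40.0, lam=0.1)
```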

The model parameters \(\mathbf P ,\, \mathbf Q \) and \(\mathbf Y \) are automatically learned by minimizing the loss function over all the user–item training pairs. We extend the method in Hu et al. (2008) and derive an ALS-based algorithm with an extra step for the additional \(\mathbf Y \) parameters of the minimization problem defined in Eq. 3. ALS is based on the observation that when all the parameters but one are fixed, (3) becomes a standard least-squares problem with a solution that can be explicitly computed. First, we fix \(\mathbf Q \) and \(\mathbf Y ,\) and solve the optimization problem analytically for each \(\mathbf p _u\) by setting the gradient to zero:

$$\begin{aligned} \mathbf p _u = \left( \mathbf Q ^T \mathbf C ^u \mathbf Q + \lambda \mathbf I \right) ^{-1} \mathbf Q ^T \mathbf C ^u \left( \mathbf x (u) - \mathbf Q \textstyle \sum \limits _{a \in A(u)} \mathbf y _a \right) , \end{aligned}$$
(4)

where \(\mathbf C ^u\) is a \(\vert I \vert \times \vert I \vert \) diagonal matrix such that \(\mathbf C _{ii}^u = c_{ui},\) and \(\mathbf x (u)\) is a column vector with all the \(x_{ui}\) values for user u. For simplicity, let \(\mathbf z _u = \mathbf p _u + \sum _{a \in A(u)} \mathbf y _a.\) We then fix \(\mathbf P \) and \(\mathbf Y \) and optimize each \(\mathbf q _i\) in a similar fashion:

$$\begin{aligned} \mathbf q _i = \left( \mathbf Z ^T \mathbf C ^i \mathbf Z + \lambda \mathbf I \right) ^{-1} \mathbf Z ^T \mathbf C ^i \mathbf x (i). \end{aligned}$$
(5)

Analogously, \(\mathbf C ^i\) is a \(\vert U \vert \times \vert U \vert \) diagonal matrix where \(\mathbf C _{uu}^i = c_{ui},\) \(\mathbf x (i)\) is a column vector with all the \(x_{ui}\) values for item i, and the matrix \(\mathbf Z \) contains the \(\mathbf z _u\) vectors as rows. Finally, we fix \(\mathbf P \) and \(\mathbf Q ,\) and optimize for each \(\mathbf y _a{:}\)

$$\begin{aligned} \mathbf y _a = \left[ \mathbf Q ^T \left( \sum _{u \in {\mathcal {U}}(a)} \mathbf C ^u \right) \mathbf Q + \lambda \mathbf I \right] ^{-1} \sum _{u \in {\mathcal {U}}(a)} \mathbf Q ^T \mathbf C ^u \left( \mathbf x (u) - \mathbf Q \mathbf z _{u \backslash a} \right) , \end{aligned}$$
(6)

where \({\mathcal {U}}(a) = \lbrace u \in U \mid a \in {\mathcal {A}}(u)\rbrace \) is the set of users that have attribute a, and \(\mathbf z _{u \backslash a} = \mathbf p _u + \sum _{b \in A(u), b \ne a} \mathbf y _b\) is defined as before but without including attribute a. Note that, unlike the case of the user and item vectors, re-computing an attribute vector depends on the current state of all the other attribute vectors through the \(\mathbf z _{u \backslash a}\) vector.

During training, we alternate between the three steps, fixing a different set of latent feature parameters each time. This process is repeated for a fixed number of iterations T, as depicted in Algorithm 1. Once the training stage is complete, we use the learned parameters to compute predictions for the test users using Eq. 2. For each user, we estimate the scores of all unknown items and sort them in descending order. The top ten items of the list are recommended to the user as the most likely to be relevant.

Algorithm 1
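
The three alternating updates of Eqs. 4–6 can be sketched with dense NumPy arrays as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it skips the sparsity optimization of Hu et al. (2008), represents \({\mathcal {A}}(u)\) by a hypothetical 0/1 indicator matrix `A`, and uses arbitrary values for k, \(\lambda \), \(\alpha \) and T. The toy data assume five levels per personality trait, which is consistent with 25 attributes in total and five per user.

```python
import numpy as np

def train_als(X, C, A, k=8, lam=0.1, T=10, seed=0):
    """Dense ALS for Eq. 3.

    X: |U| x |I| binary feedback, C: |U| x |I| confidences c_ui,
    A: |U| x |A| 0/1 matrix with A[u, a] = 1 iff attribute a is in A(u).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    n_attrs = A.shape[1]
    P = 0.01 * rng.standard_normal((n_users, k))
    Q = 0.01 * rng.standard_normal((n_items, k))
    Y = 0.01 * rng.standard_normal((n_attrs, k))
    reg = lam * np.eye(k)
    for _ in range(T):
        # P-step (Eq. 4): Q and Y are fixed.
        S = A @ Y  # row u holds the sum of y_a over A(u)
        for u in range(n_users):
            QtC = Q.T * C[u]  # Q^T C^u without building the diagonal matrix
            P[u] = np.linalg.solve(QtC @ Q + reg, QtC @ (X[u] - Q @ S[u]))
        # Q-step (Eq. 5): P and Y are fixed.
        Z = P + A @ Y
        for i in range(n_items):
            ZtC = Z.T * C[:, i]
            Q[i] = np.linalg.solve(ZtC @ Z + reg, ZtC @ X[:, i])
        # Y-step (Eq. 6): P and Q are fixed. Attributes are updated one
        # after another because each update reads the others' current state.
        for a in range(n_attrs):
            lhs, rhs = reg.copy(), np.zeros(k)
            for u in np.flatnonzero(A[:, a]):
                z_without_a = P[u] + A[u] @ Y - Y[a]
                QtC = Q.T * C[u]
                lhs += QtC @ Q
                rhs += QtC @ (X[u] - Q @ z_without_a)
            Y[a] = np.linalg.solve(lhs, rhs)
    return P, Q, Y

def top_n(Q, z_u, known_items, n=10):
    """Rank unknown items by Eq. 2 and return the n best item indices."""
    scores = Q @ z_u
    scores[np.asarray(list(known_items), dtype=int)] = -np.inf
    return np.argsort(-scores)[:n]

# Toy data (assumed): 30 users, 20 items, 25 attributes, 5 per user.
rng = np.random.default_rng(1)
n_users, n_items = 30, 20
X = (rng.random((n_users, n_items)) < 0.2).astype(float)
C = 1.0 + 40.0 * X  # c_ui = 1 + alpha * k_ui with binary k_ui
levels = rng.integers(0, 5, size=(n_users, 5))
A = np.zeros((n_users, 25))
for d in range(5):
    A[np.arange(n_users), 5 * d + levels[:, d]] = 1.0

P, Q, Y = train_als(X, C, A, k=4, T=5)
# Extreme cold start: a new user has no p_u, only attribute vectors.
new_user_attrs = [0, 7, 12, 18, 21]
recs = top_n(Q, Y[new_user_attrs].sum(axis=0), known_items=[], n=10)
```

Because every step solves its least-squares subproblem exactly, the value of Eq. 3 cannot increase from one iteration to the next, which is a convenient sanity check for any implementation.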

3.1.1 Computational complexity

Similarly to Hu et al.’s model, the complexity of our personality-aware MF method is \({\mathcal {O}}(k^3 \vert U \vert + k^2 \vert R_+ \vert )\) for the P-step and \({\mathcal {O}}(k^3 \vert I \vert + k^2 \vert R_+ \vert )\) for the Q-step, where \(\vert R_+ \vert \) is the total number of observed ratings. Here we have used an optimization described in Hu et al. (2008) to reduce the complexity from \(\vert U \vert \cdot \vert I \vert \) to \(\vert R_+ \vert \) terms. We refer the reader to that paper for more details. In these steps, the latent feature vectors can be easily computed in parallel within each step. The main computational cost lies in the Y-step, in which we have to iterate over the whole \({\mathcal {U}}(a)\) set for each attribute. Updating all the attribute vectors has complexity \({\mathcal {O}}(k^3 \vert A \vert + k^2 \vert A \vert (\vert I \vert + \vert R_+ \vert )),\) with the drawback that it cannot be parallelized, since the re-computation of each attribute vector depends on the current state of the others. We note, however, that the number of attributes \(\vert {\mathcal {A}} \vert = 25\) is small, and the overhead required by the additional latent features is acceptable, making the complexity of our algorithm comparable to that of standard ALS-based MF. Also, we consider exactly five attributes for each user, one for each dimension of the FFM, so \(\vert {\mathcal {A}}(u) \vert = 5,\) which keeps the computation of recommendations fast.
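
The optimization borrowed from Hu et al. (2008) rests on the fact that \(c_{ui} = 1\) for every unobserved pair, so that \(\mathbf Q ^T \mathbf C ^u \mathbf Q = \mathbf Q ^T \mathbf Q + \mathbf Q ^T (\mathbf C ^u - \mathbf I ) \mathbf Q ,\) where the second term only involves the items observed for user u. The following sketch checks this identity numerically; the sizes, the value of \(\alpha \) and the observed item indices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k, alpha = 50, 4, 40.0
Q = rng.standard_normal((n_items, k))

k_u = np.zeros(n_items)
k_u[[3, 17, 42]] = 1.0                 # three observed likes for user u
c_u = 1.0 + alpha * k_u

# Naive form: touches every item in the catalogue.
dense = (Q.T * c_u) @ Q

# Fast form: Q^T Q is computed once and shared by all users; the
# correction term only touches the user's observed items.
QtQ = Q.T @ Q
observed = np.flatnonzero(k_u)
Q_obs = Q[observed]
fast = QtQ + (Q_obs.T * (c_u[observed] - 1.0)) @ Q_obs

same = np.allclose(dense, fast)
```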

3.2 Personality-based active learning

In this section we describe the three AL methods that we have considered and compared in the experiments. The first one, personality-based binary-predicted, was originally proposed in Elahi et al. (2013) and Braunhofer et al. (2014a, b). It first transforms a given rating matrix into a binary matrix, by mapping null entries to 0 and non-null entries to 1. Hence, this new matrix models whether or not a user rated an item, regardless of the rating value (Koren 2008). Afterward, from this matrix it derives an extended MF model that profiles users not only in terms of their binary ratings, but also using their Big Five personality traits. By selecting the items with the highest predicted scores, this method attempts to identify what the user has most likely experienced, in order to maximize the likelihood that the user can provide the requested rating.

In this paper, we apply this strategy to our scenario, where positive-only user feedback is given. Thus, the transformation step becomes unnecessary and we can directly learn the personality-based MF model as in Eq. 2. This model is used to predict and assign a ratability score to each candidate item i (for each user u). Higher predicted scores indicate a higher probability that the target user has consumed (liked) item i, and hence may be able to rate it. This maximizes the chance that the selected items are actually familiar to the user, and hence ratable. Selecting items that are more familiar to the users results in a larger number of collected ratings.

The second method is binary-predicted (Elahi et al. 2011, 2014b). It is identical to the personality-based binary-predicted method except that users are modeled only in terms of their binary ratings, ignoring their Big Five personality traits, i.e., instead of using the extended personality-based MF model shown in Eq. 2, it adopts the standard one, as in Eq. 1.

Finally, we have considered Entropy0 (Rashid et al. 2008; Golbandi et al. 2010, 2011). This method measures the dispersion of the ratings for an item, and attempts to combine the effect of the popularity with the diversity of the ratings, which may provide more useful (discriminative) information about the user’s preferences. The Entropy0 score is computed by using the relative frequency of each of the two possible rating values, i.e., like (mapped to 1) and unknown (mapped to 0):

$$\begin{aligned} Entropy0(i) = {-} \sum _{r = 0}^1 p\left( x_{ui} = r\right) \log p\left( x_{ui} = r\right) , \end{aligned}$$
(7)

where \(p(x_{ui} = r)\) is the probability that the item rating \(x_{ui}\) is equal to r.

This method returns 0 for an item i if and only if its rating value is certain, i.e., if \(p(x_{ui} = 0) = 1\) or \(p(x_{ui} = 1) = 1.\) In contrast, it returns the maximum score for an item i if its two rating values are equally distributed, i.e., \(p(x_{ui} = 0) = \frac{1}{2}\) and \(p(x_{ui} = 1) = \frac{1}{2};\) in this case, \(Entropy0(i) = \log 2.\) Since an item is liked by only a small fraction of the users, the score computed by this method essentially measures the popularity of the item: entropy grows with the number of likes when the probability of being liked is small.
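
The behaviour described above can be reproduced with a few lines of Python. The sketch assumes that \(p(x_{ui} = 1)\) is estimated as the fraction of users who liked the item and that natural logarithms are used, so that the maximum equals \(\log 2;\) the function name and the toy counts are illustrative.

```python
import math

def entropy0(n_likes, n_users):
    """Eq. 7 for one item, from the number of users who liked it."""
    p_like = n_likes / n_users
    score = 0.0
    for p in (p_like, 1.0 - p_like):
        if p > 0.0:                    # the term p * log p vanishes at p = 0
            score -= p * math.log(p)
    return score

certain = entropy0(0, 100)             # nobody liked the item
balanced = entropy0(50, 100)           # likes and unknowns equally frequent
rare = entropy0(5, 100)
less_rare = entropy0(10, 100)          # more likes, still far below one half
```

For items liked by fewer than half of the users the score increases with the number of likes, which is why Entropy0 behaves like a popularity measure on this kind of data.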

3.3 Personality-based cross-domain collaborative filtering

In this section we present the third technical contribution of this paper, a cross-domain rating prediction method. We hypothesize that personality information can be exploited to enhance cross-domain recommendations by enriching user profiles not only with preferences from auxiliary domains but also by leveraging the Big Five scores of the user.

In order to understand the effect of personality on cross-domain recommendation, we first adapt the personality-based MF method proposed in Sect. 3.1 by replacing personality attributes with cross-domain ratings. Let \({\mathcal {S}}\) be a source domain, \({\mathcal {T}}\) the target domain, and \(I_{{\mathcal {S}}},\,I_{{\mathcal {T}}}\) their respective sets of items. We estimate the user u’s preference for item \(i \in I_{{\mathcal {T}}}\) as

$$\begin{aligned} \hat{x}_{ui} = \mathbf q _i \cdot \left( \mathbf p _u + \sum _{j \in I_{{\mathcal {S}}}(u)} \mathbf y _j \right) , \end{aligned}$$
(8)

where \(I_{{\mathcal {S}}}(u)\) is the set of items in the source domain for which user u expressed a preference. This method is a simple extension of SVD++ (Koren 2008) that expands the user’s latent representation in the target domain \(\mathbf p _u\) with latent feature vectors modelling the effect of user feedback in a source domain. Another difference lies in the training algorithm, which is here based on ALS instead of stochastic gradient descent, as described in Sect. 3.1. It is worth noticing that in order for this model to be successful the sets of users from the source and target domains must overlap. Even when there are users with data in both domains, the preferences from the source domain may not be relevant for recommendation in the target domain, which is another limitation of the approach. Intuitively, a user’s likes in an unrelated source domain such as restaurants are not indicative of her tastes in a target domain such as music.

Then, we combine both user personality and source domain preferences into a common set of user attributes. We aim to understand if personality information can be used to enhance cross-domain recommendations in the cold-start stage, or if, on the other hand, only cross-domain preferences are useful to improve the system performance. Specifically, we predict user preferences as follows:

$$\begin{aligned} \hat{x}_{ui} = \mathbf q _i \cdot \left( \mathbf p _u + \sum _{a \in {\mathcal {A}}(u)} \mathbf y _a + \sum _{j \in I_{{\mathcal {S}}}(u)} \mathbf y _j \right) . \end{aligned}$$
(9)

The above model is also trained using the ALS technique described in Sect. 3.1. Despite its simplicity, it is, to the best of our knowledge, the first attempt to enhance cross-domain MF with personality information. We note that, differently from the personality-based method that we have previously described, the number of parameters is here much larger, which has a direct impact on the complexity of the learning process. We are nevertheless interested in comparing the benefits of personality information and cross-domain preferences for new users, and thus use the same recommendation model for both.
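
A minimal sketch of the predictions in Eqs. 8 and 9 is given below. It assumes that attribute vectors and source-domain item vectors are stored in two separate arrays (`Y_attr` and `Y_src`, hypothetical names) and that a user is described by lists of indices into them; passing an empty attribute list reduces Eq. 9 to Eq. 8.

```python
import numpy as np

def predict_cross_domain(q_i, p_u, Y_attr, attrs_u, Y_src, src_items_u):
    """Eq. 9: q_i . (p_u + sum of y_a over A(u) + sum of y_j over I_S(u))."""
    attrs_u = np.asarray(attrs_u, dtype=int)
    src_items_u = np.asarray(src_items_u, dtype=int)
    z_u = p_u + Y_attr[attrs_u].sum(axis=0) + Y_src[src_items_u].sum(axis=0)
    return float(q_i @ z_u)

# Toy values (assumed for illustration only), k = 2.
q_i = np.array([1.0, 1.0])
p_u = np.zeros(2)                                  # new user in the target domain
Y_attr = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_src = np.array([[0.5, 0.5], [2.0, 2.0]])

score_eq9 = predict_cross_domain(q_i, p_u, Y_attr, [0], Y_src, [0, 1])
score_eq8 = predict_cross_domain(q_i, p_u, Y_attr, [], Y_src, [0, 1])
```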

4 Experimental evaluation

4.1 Dataset

The dataset used in our experiments is part of the database made publicly available in the myPersonality project (Footnote 6; Bachrach et al. 2012). myPersonality is a Facebook application with which users take psychometric tests and receive feedback on their personality factor scores. The users allow the application to record personal information from their Facebook profiles, such as demographic and geo-location data, likes, status updates, and friendship relations, among others. In particular, as of March 2015, the tool instantiated a database with 46 million Facebook likes of 220,000 users for 5.5 million items of diverse nature, including people (actors, musicians, politicians, sportsmen, writers, etc.), objects (movies, TV shows, songs, books, video games, etc.), organizations and events, together with the Big Five scores of 4 million users, collected using 20–336 item IPIP questionnaires.

Due to the size and complexity of the database, in this paper we restrict our study to a subset of it. Specifically, we selected the likes assigned to the items belonging to one of the following three domains: books, movies and music. To determine which items in the original database belong to each of such domains, we used Facebook item categorization data. Specifically, we manually identified certain categories for each domain, e.g., “Music genre”, “Musician/Band”, “Album” and “Song” for the music domain.

Such categories were not always assigned correctly. For instance, there were many music “Albums” annotated with the “Musician/Band” category. Moreover, the names of the items were not always correct, e.g., some of them contained misspellings, and they were often not written in a single, consistent form, e.g., they appeared as morphological deviations, such as science fiction, science-fiction, sci-fi and sf.

In order to address the above issues (checking misspellings, unifying morphological deviations, and rectifying categorizations), we performed a number of transformations that consolidated incorrect and duplicate items with correct ones, while exploiting external knowledge to set the items’ categories. Since it is outside the focus of this paper, we do not go into the details of these data transformations. We just mention that such operations were proposed in previous work (Szomszor et al. 2008; Cantador et al. 2010), and have been validated by automatically mapping the processed names of the items to the URIs of entities in DBpedia (Footnote 7; Lehmann et al. 2015; the Wikipedia ontology) via SPARQL (Footnote 8) queries; we discarded those items that could not be mapped to DBpedia entities. For instance, in the music domain, those items whose names were consolidated as mozart were mapped to http://dbpedia.org/page/Wolfgang_Amadeus_Mozart, and maintained as a single item in the final dataset.

The whole process was conducted on the 6500 most popular items in the dataset, i.e., the items with the highest numbers of likes. Note that this may favor the good performance of popularity-based recommendation methods, as we shall observe in the next section. The final dataset is described in Table 1. It consists of 5,027,593 likes from 159,551 users on 16,303 items. Its minimum, maximum and average (standard deviation) numbers of likes per user are 1, 164 and 3.87 (4.46) for books, 1, 741 and 13.02 (18.78) for movies and 1, 648 and 19.49 (28.80) for music. We note that in order to be able to evaluate the effectiveness of using personality on users with various degrees of coldness (i.e., with different numbers of likes), only users that entered a minimum of 20 likes were considered. After that, there were 1208 users in the book domain, out of which 1200 (99.34 %) and 1190 (98.51 %) had at least one preference in the movie and music domains, respectively; 26,951 users in the movie domain, out of which 23,826 (88.40 %) and 26,810 (99.48 %) also had preferences in the book and music domains, respectively; and finally, 43,702 users in the music domain, out of which 34,215 (78.29 %) and 43,134 (98.70 %) also had book and movie preferences, respectively.

Table 1 Statistics of the used dataset

4.2 Evaluation setting

The evaluation of the proposed methods (i.e., personality-based MF CF, personality-based AL as well as personality-based cross-domain recommendation) was conducted utilizing a modified user-based 5-fold cross-validation strategy, based on Kluver et al.’s methodology for cold-start evaluation (Kluver and Konstan 2014).

Our goal is to understand how the different methods perform as the number of observed likes in the target domain increases. First, we divide the set of users into five subsets of roughly equal size. In each cross-validation stage, we keep all the data from four of the groups in the training set. Then, for each user u in the fifth group—the test users—we randomly split her likes into three subsets, as depicted in Fig. 2: (i) a training set, initially empty and incrementally filled with u’s likes one by one to simulate different cold-start profile sizes, (ii) a candidate set containing the set of likes to be elicited by the AL strategies, and (iii) a testing set used to compute the performance metrics.

Fig. 2 Overview of evaluation setting in a given cross-validation fold. Each test user u’s data is split into training, candidate, and testing sets. Different cold-start profile sizes are simulated by sequentially adding likes to each u’s training set
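
The user-based splitting procedure can be sketched in plain Python as follows. The sizes of the three per-user subsets are not stated in this section, so the values used here (up to 10 training likes and 5 testing likes, with the remainder used as candidates) are assumptions, as are the function names and the random seeds.

```python
import random

def user_folds(user_ids, n_folds=5, seed=0):
    """Partition the users into n_folds groups of roughly equal size."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    return [ids[f::n_folds] for f in range(n_folds)]

def split_test_user(likes, n_train_max=10, n_test=5, seed=0):
    """Randomly split one test user's likes into three disjoint subsets."""
    items = list(likes)
    random.Random(seed).shuffle(items)
    train_pool = items[:n_train_max]                    # fed in one by one
    test = items[n_train_max:n_train_max + n_test]      # used for the metrics
    candidates = items[n_train_max + n_test:]           # elicitable by AL
    return train_pool, candidates, test

def cold_start_profiles(train_pool):
    """Yield the simulated training profiles of size 0, 1, ..., len(train_pool)."""
    for size in range(len(train_pool) + 1):
        yield train_pool[:size]

folds = user_folds(range(23))
train_pool, candidates, test = split_test_user(range(20))
profile_sizes = [len(p) for p in cold_start_profiles(train_pool)]
```

In each cross-validation stage one of the folds provides the test users, while all the likes of the users in the other four folds stay in the training set.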

The above procedure was modified for the cross-domain scenario by extending the training set with the full set of likes from the auxiliary domain, in order to obtain the actual training data for the predictive models. Similarly, this evaluation strategy was further modified to test the performance of the AL strategies. In particular, the evaluation of an AL method for a specific user profile size closely follows the evaluation approach proposed by Elahi et al. (2014b), and proceeds in the following way:

  1. The performance metrics are measured on the testing set, after training the rating prediction model (in our case, implicit MF) on the training set.

  2. For each user in the testing set:

    (a) Using the AL method, the top \(N = 5\) candidate items that are not yet in the training set are computed for rating elicitation.

    (b) Assign to the training set the user’s likes for these candidate items as found in the candidate set, if any.

  3. The performance metrics are measured again on the testing set, after having re-trained the rating prediction model on the new, extended training set.
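
The elicitation step of this procedure (step 2) can be sketched for a single test user as shown below. The scoring callback stands for whichever AL strategy is being evaluated, e.g., the predictions of Eq. 2 or the Entropy0 score; the function name, the toy catalogue and the toy scores are hypothetical.

```python
def al_round(score, train, candidate_likes, catalog, n=5):
    """One elicitation round for a single test user.

    score: callable mapping an item to a float (the AL strategy).
    train: set of likes currently in the user's training profile.
    candidate_likes: set of held-out likes that the user can reveal.
    """
    unseen = [item for item in catalog if item not in train]
    # Step 2a: request the n highest-scored items not yet in the profile.
    requested = sorted(unseen, key=score, reverse=True)[:n]
    # Step 2b: only requests that match a held-out like yield new data.
    acquired = [item for item in requested if item in candidate_likes]
    return requested, acquired, set(train) | set(acquired)

catalog = ["a", "b", "c", "d", "e", "f", "g"]
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4, "g": 0.3}
requested, acquired, new_train = al_round(
    toy_scores.get, train={"a"}, candidate_likes={"c", "g"}, catalog=catalog)
```

The number of acquired likes is `len(acquired)`, which can be at most n; the metrics of steps 1 and 3 are then computed before and after re-training on the extended profile.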

We adopted three widely used accuracy and ranking metrics for positive-only feedback (i.e., one-class CF; Yao et al. 2014): mean average precision (MAP), half-life utility (HLU) and mean percentage ranking (MPR). We also measured two metrics for assessing novelty and coverage: AveragePopularity and spread.

  • MAP measures the overall precision performance based on precision at different recall levels (Li et al. 2010). It is calculated as the mean of the average precision (AP) over all users in the test set. A larger MAP corresponds to a better recommendation performance.

  • HLU measures the utility of a recommendation list for a user, with the assumption that the likelihood that the user will view/choose a recommended item decays exponentially with the item’s ranking (Breese et al. 1998; Pan et al. 2008). A larger HLU corresponds to a better recommendation performance.

  • MPR estimates the user satisfaction of items in a ranked recommendation list, and is calculated as the mean of the percentile ranking of each test item within the ranked list of recommended items for each test user (Hu et al. 2008). It is expected that a randomly generated recommendation list would have a MPR of around 50 %. A smaller MPR corresponds to a better recommendation performance.

  • AveragePopularity measures how novel the recommendations (or items requested to be liked, as for AL) are to the user (Ziegler and McNee 2005). It is expected that users prefer lists containing more novel (less popular) items. However, if the presented items are too novel, then the user is unlikely to have any knowledge of them, and will not be able to understand or rate them. Therefore, moderate values indicate a better performance (Kluver and Konstan 2014).

  • Finally, spread is a metric of how well the recommender or AL method spreads its attention across many items (Kluver and Konstan 2014). It is assumed that algorithms with a good understanding of their users are able to provide different users with different items. However, as with novelty, one cannot expect to achieve a perfect spread (presenting each item an equal number of times) without making avoidably bad recommendations or unfulfillable rating requests. Hence, moderate values are preferable.
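
The following sketch shows one possible way to compute MAP, MPR, AveragePopularity and spread for positive-only feedback. Several details are assumptions rather than definitions taken from this section: average precision is normalized by the smaller of the number of relevant items and the cut-off, percentile ranks run from 0 % (top of the list) to 100 % (bottom), popularity is measured as the number of training likes, and spread is computed as the Shannon entropy of how often each item is shown.

```python
import math
from collections import Counter

def average_precision(ranked, relevant, n=10):
    """AP@n for one user; MAP is the mean of this value over the test users."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:n], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), n)
    return total / denom if denom else 0.0

def percentile_rank(ranked, test_items):
    """Mean percentile position of a user's test items; MPR averages it over users."""
    position = {item: idx for idx, item in enumerate(ranked)}
    last = max(len(ranked) - 1, 1)
    ranks = [100.0 * position[i] / last for i in test_items if i in position]
    return sum(ranks) / len(ranks) if ranks else float("nan")

def average_popularity(shown_lists, like_counts):
    """Mean number of training likes of all recommended (or requested) items."""
    shown = [item for items in shown_lists for item in items]
    return sum(like_counts.get(item, 0) for item in shown) / len(shown)

def spread(shown_lists):
    """Shannon entropy of the distribution of shown items across users."""
    counts = Counter(item for items in shown_lists for item in items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
pr = percentile_rank(["a", "b", "c", "d", "e"], ["a", "e"])
pop = average_popularity([["a", "b"], ["a"]], {"a": 10, "b": 4})
even = spread([["a", "b"], ["a", "b"]])
single = spread([["a"], ["a"]])
```

A method that shows every user the same single item has a spread of zero, while showing all items equally often gives the maximum value, the logarithm of the number of distinct items.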

In our experiments we observed equivalent behaviour of the methods in terms of MAP, HLU, and MPR. Hence, for brevity, we only report MAP values in the analysis presented in Sect. 5.

5 Experiment results

5.1 Evaluating personality-based matrix factorization

The goal of a first set of experiments was to understand if personality information can be used to improve the performance of MF for positive-only feedback in cold-start situations. Using the methodology described in Sect. 4.2, we computed HLU, MAP and MPR for different amounts of observed likes for items in the training set of the target domain. Specifically, we distinguish between two different scenarios:

  • Extreme cold-start, in which there are no likes at all from the active user, and recommendations are computed only on the basis of personality information.

  • Moderate cold-start, in which we assume that only one like is given, and incrementally evaluate the system performance with larger and larger profile size of the active user. We aim at understanding how the different methods behave as the amount of available ratings/likes increases.

We compare our proposed personality-based MF method (Personality MF), which computes predictions using Eq. (2), against Hu et al.’s method (iMF), which uses Eq. (1) and does not use any auxiliary information. We also include a non-personalized baseline that always recommends the most popular items (Most popular). Results in terms of MAP@10 for the extreme cold-start scenario are shown in Fig. 3, for the three domains available in our dataset. The results for HLU and MPR were very similar and we therefore do not report them here. We note that the small values obtained are due to the large item catalogues in our dataset. The set of possible candidate items to recommend for each test user is therefore also large, which leads to a low probability of matching a test item in the user’s recommendation list.

Fig. 3 MAP@10 in the extreme cold-start scenario

We see that in all cases Personality MF significantly outperforms iMF and the popularity-based baseline (Wilcoxon signed-rank test, \(\mathrm{p} < 0.05\)). Our personality-based method is especially beneficial in the books and music domains, where it achieves relative improvements of 64 and 94 %, respectively. The relationships between user personality and preferences seem to be stronger in these domains, although a more exhaustive analysis is required to confirm this. Nonetheless, we can conclude that personality information is highly beneficial in the extreme cold-start situation, and that it is able to mitigate the total absence of user ratings/likes and recommend relevant items.

Fig. 4 MAP@10 for different cold-start user profile sizes

In Fig. 4 we show the performance of the different methods as the profile size of the new user increases, again in terms of MAP@10. The Most popular baseline is clearly not a competitive approach now, and the personalized methods perform better and better as more ratings are available. We do not observe a difference in performance between iMF and Personality MF in any of the domains, indicating that personality information is not decisive once likes data can be exploited. Our results differ from those reported in Hu and Pu (2011), where it was shown that the user-based nearest neighbors method enhanced with personality clearly achieves better performance than using only the ratings, for users in a music recommender system with 2, 5, and 10 ratings. It is worth noting that in this work we report results on a different, larger dataset (43,702 vs. 111 users, see Table 1) composed of likes (positive-only feedback) instead of ratings. Also, we analyze the effects of integrating personality in the MF method, as opposed to user-based nearest neighbors, and evaluate the performance for users completely new to the system.

We conclude that, in terms of accuracy, personality proves useful for completely new users in the three analysed domains. In the remaining cases, iMF is competitive enough, and does not require any additional information. We argue, nonetheless, that the extreme cold-start is a critical stage of a recommender system; the system must keep the user engaged, and exploiting personality is a good option to find relevant items for the user. Also, once some likes are observed, more subtle relations between user preferences and personality could be unveiled by taking into account additional variables using more fine-grained representations of personality as suggested in Nunes (2009).

In addition to accuracy, we also analyzed the coverage and the ability of the different methods to provide novel recommendations. In Table 2 we show the average popularity and the spread of the recommended items by each method.

Table 2 Novelty and coverage of collaborative filtering methods in the cold-start

We observe similar behavior in all the considered domains: Personality MF and iMF on average recommend items with the same moderate popularity, except for completely new users. In that case, Personality MF recommends less novel items but still not simply the most popular ones—between 9.5 and 20 % less popular on average, compared to the baseline. In terms of coverage, the personalized methods recommend more varied items than the Most popular method, which always suggests the same set of items. We again see that without any available likes, personality-based MF approaches the behavior of the Most popular baseline. It is worth noting that in the extreme cold-start situation the coverage of iMF is similar to the Most popular method, while Personality MF is much better in that respect.

5.2 Evaluating personality-based active learning

In this section we present the experiment results for the proposed AL methods. We first illustrate the number of likes elicited by the methods. Then, we present their performance in terms of accuracy and ranking quality. Finally, we show their performance with respect to the novelty and coverage metrics.

5.2.1 Number of acquired likes

The number of acquired ratings is an important measure of the performance of an AL method. In fact, certain methods can elicit more ratings by better estimating which items the user has actually experienced and is therefore able to rate.

Figure 5 shows the average number of acquired likes for users having a profile size of 0 (i.e., completely cold users). We can observe that the best AL method for all the considered domains is the personality-based binary-predicted method, which is able to acquire, on average per user, 0.142, 0.165 and 0.153 likes in the books, movies and music domains, respectively (out of a maximum of 5). It outperforms the second-best method, Entropy0, which is able to elicit 0.125 (for books), 0.155 (for movies) and 0.147 (for music) likes. According to Wilcoxon signed-rank tests, the corresponding p-values were 0.10, 0.03 and 0.03 for books, movies and music, respectively; the difference is thus marginally significant for books and statistically significant for movies and music. This shows that by exploiting the user’s personality our proposed personality-based binary-predicted method can better estimate which items may have been experienced and liked by completely new users whose preferences are unknown.

Fig. 5 Average number of user likes acquired in the extreme cold-start scenario

Figure 6 illustrates the average number of acquired likes for new users having from 1 up to 10 likes in their profiles. The first observation is that the difference between the personality-based binary-predicted method and the standard binary-predicted method vanishes when at least one training like is available. As already observed in the comparison between Personality MF and iMF in Sect. 5.1, this suggests that at this stage the effects of personality exploitation in the underlying rating prediction model of the binary-predicted method are minor to non-existent. In any case, it is clear from the figure that by personalizing the rating elicitation process as done by both the personality-based and the binary-predicted methods we are able to obtain a considerably larger number of likes from the user; in all domains, the personalized AL methods systematically perform better than the Entropy0 and Random methods.

Fig. 6 Average number of user likes acquired for different cold-start user profile sizes

5.2.2 Accuracy and ranking performance

An AL method may not be able to elicit a large number of ratings, but those actually elicited can be very useful for improving future recommendations, either for the target user or for other users in the system. Therefore, it is essential to inspect how the accuracy and ranking quality of the generated recommendations changes based on the acquired ratings/likes.

Figure 7 shows the MAP results for completely new users (i.e., profile size = 0) when the cut-off level was set to 10 (i.e., MAP@10). Again, the results for HLU and MPR were equivalent, so we do not report them for brevity. It can be seen from the figure that the proposed personality-based binary-predicted method not only elicits the largest number of likes from users without any likes history, but also leads to the highest improvement in MAP compared to all other tested methods. The MAP of the recommender increased to 0.012, 0.015 and 0.009 for books, movies and music, respectively, after extending the set of training likes with the likes acquired via the personality-based binary-predicted method. The second-best AL method is Entropy0, which is able to achieve a MAP of 0.011 (for books), 0.015 (for movies) and 0.008 (for music). These differences in MAP between personality-based binary-predicted and Entropy0 are, however, not statistically significant (Wilcoxon signed-rank test), except for the music domain (p \(=\) 0.03). As expected, both the binary-predicted and Random AL methods fail to achieve any noteworthy improvements in terms of MAP.

Fig. 7 MAP@10 before and after applying AL in the extreme cold-start scenario

Finally, when considering users having between 1 and 10 likes in their profile, only minor improvements in terms of the system MAP were achieved by applying AL. Therefore, the AL methods seem to be of little use for improving the system accuracy and ranking quality when the users are already known to the system (i.e., they have at least one like).

5.2.3 Novelty and coverage

When evaluating an AL method, it is important to know not only how it affects the recommender system performance, but also how users would react to the system requests. For that reason, as we mentioned in Sect. 4.2, we propose to measure the AveragePopularity and the Spread of the items that the system asks users to rate.

Table 3 shows the average popularity and the spread of the items that each AL method asks users to like. As can be seen, in terms of average popularity, the baseline Entropy0 method requests users to provide ratings for the most popular items, followed, with a significant gap, by the personality-based binary-predicted and binary-predicted methods, and then Random, which obviously has the lowest average popularity and the largest spread. As stated earlier, even though users are able to rate many of the items presented by Entropy0, we expect that they will perceive these items as too popular and poorly adapted to their preferences and interests. On the other hand, the popularity of the items selected by Random is too low, which explains its low number of elicited likes. The personality-based binary-predicted and binary-predicted methods perform well, with the former also working for users with 0 training likes.

Table 3 Novelty and coverage of AL strategies in the cold-start

Table 3 also shows that, as expected, Entropy0 has the lowest spread, as it essentially requests users to rate a small set of (popular) items. Also, as expected, the best spread result is obtained by the Random method, in which every item in the catalog, regardless of whether it is very popular (ratable) or not at all popular (not ratable), has the same chance to be presented to the user. Again, personality-based binary-predicted (for all profile sizes) and binary-predicted (for profile sizes \(\ge 1\)) yield the best trade-off between high spread and high ratability.
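To make the two measures concrete, the sketch below computes them from per-user lists of requested items. It rests on assumptions that Sect. 4.2 may state differently: the popularity of an item is taken to be its number of training likes, and Spread is taken to be the Shannon entropy of the distribution of item requests, with the logarithm base left as a parameter. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def average_popularity(lists: List[List[str]], popularity: Dict[str, int]) -> float:
    """Mean popularity over every item occurrence in the per-user lists.
    Assumption: popularity[item] is the item's number of training likes."""
    items = [item for user_list in lists for item in user_list]
    if not items:
        return 0.0
    return sum(popularity.get(item, 0) for item in items) / len(items)


def spread(lists: List[List[str]], base: float = math.e) -> float:
    """Shannon entropy of how often each item is requested across all users.
    Assumption: this matches the Spread definition; the log base is a guess."""
    counts = Counter(item for user_list in lists for item in user_list)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())
```

Under this reading, a method that shows every user the same few items has low entropy (low Spread), while a method that draws items uniformly from the catalog approaches the maximum, which is consistent with the ordering of Entropy0 and Random described above.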

5.3 Evaluating personality-based cross-domain collaborative filtering

The goal of the last set of experiments is to test whether personality information can be exploited to further enhance cross-domain techniques in cold-start situations. We aim to validate our hypothesis that personality can be effectively combined with cross-domain user preferences in the MF framework to address the cold-start problem.

In the following we compare two families of MF methods. The first uses only cross-domain ratings as attributes and computes predictions using Eq. (8); we name these methods books, movies, and music, depending on the source domain. The second combines cross-domain ratings with personality as in Eq. (9); we refer to these methods by adding the +personality suffix. Note that these methods differ from those reported in Sect. 5.1, as they exploit information from a source domain. Figure 8 shows the performance of the different methods in the extreme cold-start scenario.

Fig. 8 MAP@10 of cross-domain methods in the extreme cold-start scenario. The x axis represents the target domain, and bars correspond to recommendation methods with different combinations of source domain and personality

In two cases out of three, combining personality information with cross-domain ratings further improves the performance when no preferences about the user are available in the target domain. Only in the books domain are the best results obtained using movie data alone. In this case, adding personality information does not improve the performance, but it is beneficial if the available auxiliary information consists of music ratings (13.2 % relative improvement over the cross-domain method without personality). When predicting movie preferences, we observe that cross-domain methods enhanced with personality information always achieve better performance. In fact, the overall best results are obtained by combining music preferences and personality (5 % improvement of Music + Pers. over Music), and if only book likes are available as auxiliary information, the accuracy can be further improved by considering personality (by 12.2 %). In the case of music recommendations, we observe a symmetrical trend, where the best results are achieved by combining personality with movie likes (16.7 % improvement of Movies + Pers. over Movies). On the other hand, adding book likes is clearly beneficial to the system, but in this case exploiting personality information yields only a minimal improvement.
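The percentages in this paragraph can be read as relative changes in MAP@10. The text says so explicitly for the 13.2 % figure; that the remaining percentages are computed the same way is an assumption. The subscripted symbols below are notation introduced here for illustration only.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{MAP}_{\text{source}+\text{personality}} - \mathrm{MAP}_{\text{source}}}
         {\mathrm{MAP}_{\text{source}}} \times 100\,\%
\]
```

For instance, with purely illustrative values that are not taken from the experiments, a MAP@10 of 0.0100 without personality and 0.0113 with personality would correspond to a relative improvement of 13 %. Because the baseline MAP values are small in the extreme cold-start scenario, large relative improvements can correspond to small absolute differences.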

Fig. 9 MAP@10 of cross-domain methods for different profile sizes

The results for the moderate cold-start scenario are shown in Fig. 9. Unlike in the extreme cold-start scenario, we cannot conclude that personality is beneficial for larger user profile sizes in the books domain. In the case of movies, we obtain small improvements by combining personality with music ratings, but the effect is the opposite when dealing with book ratings. Finally, when recommending music, we clearly see the advantages of combining personality with auxiliary movie ratings, which consistently gives the best overall results.

Regarding novelty, the average popularity of the items recommended by the cross-domain methods is very similar across methods and, as expected, much lower than that of the Most popular baseline (on average, 145–148 vs. 236 in the books domain, 3980–4089 vs. 6620 for movies, and 6622–6753 vs. 11,092 for music). In terms of coverage, the spread of the item distribution is again similar among cross-domain methods, whereas it is much lower for the baseline (on average, 6.23–6.25 vs. 3.45 in books, 6.40–6.48 vs. 3.45 in movies, and 6.78–6.88 vs. 3.44 in music).

5.4 Discussion

We now discuss the results achieved by all the tested methods to address the new user problem, focusing on the benefits of the personality-based methods that we propose in this work. In Fig. 10 we compare the MAP@10 values of the best-performing methods for the new user problem in the extreme cold-start situation, i.e., for users completely new to the system.

Fig. 10 Comparison of the best performing methods in the extreme cold-start situation

The Personality-based cross-domain (Personality CD) method is the best performing method in all the considered domains, as it uses more information than the MF and AL approaches. The boost in precision, especially in the movies and music domains, comes at the cost of collecting the auxiliary information and the time required to train the models. It can be a compelling approach if cross-domain ratings are available at the time of designing the target system (e.g., if the catalog of items is expanded with a new domain) and the goal is to optimize for precision regardless of the training complexity.

The Personality-based active learning (Personality AL) method is a good alternative when the new user situation is extreme, as the proposed preference elicitation process minimizes the effort required from the user. We argue that additional aspects such as the design of the user interface are of great importance here. Although the recommendation model has to be trained again after the rating acquisition, the complexity in this case is much lower than with cross-domain methods. Also, the improvements in terms of precision are notable in the books and music domains. In the case of movies, we see that additional elicited likes are needed for the baseline iMF recommender to achieve better performance, as users seem to favor popular movies. However, there is a clear trade-off between the effort required from the user and the gain in system performance.

When no auxiliary likes are available and there is no chance to ask the user to rate some items, the Personality-based matrix factorization (Personality MF) method effectively exploits personality information when the new user problem is extreme. The proposed Personality MF method is fast to train and provides better precision than the Most popular baseline and standard iMF, which is unable to compute meaningful recommendations. In the moderate new user situation, as more likes are available, we did not find significant improvements using personality with respect to iMF. We argue, nonetheless, that being able to provide recommendations for completely new users is a very desirable quality of RS that is worth the acquisition of personality information.

Our conclusions are summarized in Table 4. In the books domain, we see that personality-based AL is likely the best approach, as it offers good precision in the extreme new user situation and overall good novelty and coverage. The boost in performance achieved by cross-domain methods may not be worth the extra time required to train the model. For movies, personality-based cross-domain is a good candidate approach. It offers roughly twice the precision while maintaining good novelty and coverage, and is also able to provide better performance as the number of available likes grows (the best performing method being the one that uses auxiliary music likes). Finally, in the case of music recommendations, we find again that personality-based cross-domain CF is a compelling approach. However, due to the large number of likes in this domain (see Table 1), the training time is considerably longer than for the rest of the methods. Unless the extra precision is required, a good alternative is Personality MF, which is the second best in terms of precision and offers better novelty and coverage than AL in the extreme cold-start.

Table 4 Summary of the performance of the different methods

6 Conclusions and future work

In this paper we have presented three approaches to address the new user cold-start problem in CF, namely, personality-based MF, personality-based AL, and personality-based cross-domain recommendation. They can be used in different scenarios. For example, if knowledge from an auxiliary domain is available in addition to the target domain, the cross-domain solution could be the best. However, if such knowledge is not available, but the system can request the users to give more ratings, the AL solution may be preferable. If neither of these scenarios applies, we can simply incorporate the personality information in the model and improve the classical MF model.

These conclusions have been derived from a number of extensive offline experiments, which allowed us to compare our methods with state-of-the-art techniques. This has been done by designing and implementing an evaluation procedure that simulates user profile evolution and hence considers different new user cold-start situations: (a) the extreme situation, i.e., when there is not a single rating/like from the new users, and (b) the moderate situation, i.e., when there is at least one like provided by the new users. Moreover, we have conducted the evaluation in terms of several metrics, such as MAP (ranking quality), Average Popularity (novelty), and Spread (coverage).

In this work we were not concerned with the actual acquisition of user personality information, and always assumed that it was already available. However, before being able to use such information in the recommendation process, a system has to obtain it from the user. This may be done either explicitly, i.e., by requesting the users to fill in a personality questionnaire, or implicitly, i.e., by analyzing the users’ behavioral patterns during the interaction with the system (Kosinski et al. 2013). It has been observed that explicit personality acquisition is more accurate (Dunn et al. 2009). However, this comes at a cost: the users must explicitly provide additional (personality) information, which they may not be keen on doing. This is why one must weigh the accuracy improvement against its cost. Indeed, a topic for future work is the automatic inference of a user’s personality traits from auxiliary preferences or social network profiles. This is a challenging subject that is getting a lot of attention from the research community. An effective solution to this problem, combined with the exploitation of cross-domain ratings, can potentially reduce the effort required from the user, especially in AL approaches, in which ratings also have to be acquired.

The results presented in this paper clearly depend, as in any experimental study, on the chosen simulation setup, which can only partially reflect the real evolution of a recommender system. For example, in this work we assumed that a randomly chosen set of likes, among those that the user really gave to the system, represents all her known likes. However, this set obviously cannot represent all the true user likes; it contains only the likes expressed by the user while interacting with the system. In fact, many more items are likely to be of interest to the user, but they are not included in the dataset. This is a common problem of any offline evaluation of a recommender system, where the performance of the recommendation method is estimated on a test set that is never a good sample of the true user preferences. Therefore, it is necessary to further evaluate the proposed solutions with alternative evaluation methodologies, and in particular in a live user study.

In recommender systems, users are interested in seeing that the ratings they enter are immediately used in the recommendations generated by the system. However, a batch AL solution, such as the one we developed in this paper, chooses a set of items (rather than a single item), and presents them to the user to rate. The system is thus retrained only after the user submits the whole batch of ratings. In contrast, in sequential AL the items to be rated are selected one by one, by choosing each successive item on the basis of the ratings that the user provided for the previously requested items. For this reason, as an extension of our current batch AL method, we will implement a sequential (conversational) selection of items (Rubens et al. 2011).
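The difference between the two selection schemes can be sketched as follows. This is an illustration under stated assumptions, not the method implemented in this paper: the `score` callable is a hypothetical stand-in for whatever prediction model drives the selection, and recomputing it on the updated profile plays the role of retraining; `ask` is a hypothetical callback that returns the user's answer for one item.

```python
from typing import Callable, Dict, List

# Hypothetical types: a profile maps item -> liked?, a scorer ranks candidates.
Profile = Dict[str, bool]
Scorer = Callable[[Profile, str], float]


def batch_selection(candidates: List[str], profile: Profile,
                    score: Scorer, batch_size: int) -> List[str]:
    """Rank all candidates once with the current profile and return the top batch."""
    ranked = sorted(candidates, key=lambda item: score(profile, item), reverse=True)
    return ranked[:batch_size]


def sequential_selection(candidates: List[str], profile: Profile, score: Scorer,
                         ask: Callable[[str], bool], n_requests: int) -> List[str]:
    """Pick one item at a time, updating the profile after every answer."""
    profile = dict(profile)          # work on a copy of the known feedback
    remaining = list(candidates)
    asked: List[str] = []
    for _ in range(min(n_requests, len(remaining))):
        best = max(remaining, key=lambda item: score(profile, item))
        remaining.remove(best)
        profile[best] = ask(best)    # feedback influences the next choice
        asked.append(best)
    return asked
```

In the batch variant the user's answers cannot influence which items appear in the same batch, whereas in the sequential variant a negative answer can steer the next request towards different items, at the price of re-scoring the candidates after every answer.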

Finally, we stress again that in this paper we used a dataset collected from the popular Facebook social network. This dataset, similarly to other social network datasets, contains user preference data expressed as likes. Items that were not selected should not be automatically labeled as dislikes, as this set will surely contain many items that the user likes. That makes it difficult for the system to infer the users’ true preferences. One possibility to solve this problem is to train the system using only the explicit likes. However, further studies are needed in order to develop methods that make more effective use of this type of data.