1 Introduction and motivation

Opinions are central to almost all human activities and are key influencers of our behaviours [4]. When people need to make a decision, such as which restaurant is the best in the city or which candidate to vote for in the next election, they often listen to the opinions of others. With the development and spread of social networks, user-generated opinions are becoming richer and ubiquitously available. The automatic extraction and quantification of subjective information from natural language text is therefore a fundamental task of Sentiment Analysis. This research area has been extensively investigated with several approaches, such as lexicon-based [56] and learning-based methods [51], to deal with explicit opinions (see Example 1).

Example 1

[explicit, negative]: The latest menswear campaign is on fire: it’s disgusting!

Most works take textual content as the only source of information to infer sentiment [3, 22, 42, 63]. For instance, [22] presented the results of machine learning algorithms for classifying the sentiment of Twitter messages using distant supervision (i.e. emoticons are considered as ground truth for tweet labelling), while [3] explored the linguistic characteristics of how tweets are written and the meta-information of words for sentiment classification. Less work has been done on implicit opinions [70] (see Example 2).

Example 2

[implicit, positive]: “Saving Soldier Ryan”...I can’t wait to watch it!

An implicit opinion is a statement that implies a regular or comparative opinion, usually by expressing a desirable or undesirable fact. Example 2 shows that there is a good expectation about the movie, although it is not encoded in words. In such cases the textual features do not provide any explicit information about the intended sentiment, making its recognition difficult even for humans.

When dealing with both explicit and implicit opinions, we should take into account their real nature: they are expressed as texts interconnected by several heterogeneous relationships (e.g. a user-generated opinion can be linked to its author, it can be related to one or more topics, and it can be referenced by several readers). Considering the real characteristics of opinions in online social networks, there are two sources of information that can be used to address sentiment analysis tasks. Textual information can be exploited to model the “similarity” of opinions, while relational information can both complement and smooth the evidence given by the text. Considering only one of these two information sources may lead to biased sentiment analysis models that are unable to account for important structure in the data [66]. To this purpose, we propose a novel network modelling approach able to encapsulate the heterogeneity of the information sources (contents and relationships) by capturing two sociological processes underlying social network interactions: homophily and constructuralism.

The paper is organised as follows. In Section 2, a literature review on sentiment analysis is presented. In Section 3, homophily and constructuralism, the two sociological processes underlying social network interactions, are discussed. In Section 4, Approval Networks for modelling user interactions are introduced. In Section 5, two novel sentiment analysis models are presented that take advantage of Approval Networks at user level and aspect level. Finally, in Section 6 conclusions are drawn.

2 Literature review

During the last decade, sentiment analysis has mainly focused on well-formed text at different granularities, such as document level [1, 62, 69], sentence level [41, 68] and aspect level [16, 20, 72]. In recent years, informal text on social networks has become one of the major forms of online communication, enabling a quasi real-time diffusion of content provided by people and organisations. Although recent studies are moving towards informal text, most of them are based only on textual content [3, 19, 29, 42, 53]. In those cases, sentiment analysis rests on the assumption that user-generated opinions are independent and identically distributed (i.i.d.). These methods are aimed at handling the complex characteristics of natural language without considering the networked context of the data: a user-generated opinion is linked to its author and can be explicitly referenced by several readers. The debating and informal nature of user-generated opinions has therefore led to additional challenges for sentiment analysis tasks: it is impractical to exploit traditional supervised machine learning techniques that require annotated and non-relational training data.

Recently, some approaches [11, 13, 21, 37, 67] have been proposed that exploit structural network information for sentiment analysis purposes. In [21] a “friendship” network is used to simulate and analyse the diffusion of opinions across the network, while in [37] the hyperlink structure of blogs is exploited to track how users join and construct opinions and how some of the opinions around a specific topic spread. In [11], the authors exploited the network structure to group users with respect to their sentiment, without considering the textual content. Although these approaches provide a fundamental contribution to the sentiment analysis research field, their focus is on the structural properties of the social networks, disregarding the information embedded within the text. Recently, an investigation of emotional contagion in large social networks has been presented [13]. The authors modelled the emotions of users as dependent not only on endogenous and exogenous factors (e.g. a generally happy disposition or rainfall), but also on contagion within groups of friends.

Some recent investigations, which attempt to combine both text and network information for sentiment analysis purposes, focus on user-user relationships [55, 58, 61, 71]. Tan et al. [61] proposed a semi-supervised approach to predict user sentiment by introducing explicitly available undirected user-user relationships (“friendships”) into a text-based factor-graph model. In [58], Speriosu et al. proposed to enrich the content representation by including directed user-user relationships as additional features alongside the textual ones. The same kind of directed user-user social relationships (e.g. “following” and “follower” on Twitter) has been exploited in [55] and [71] to predict sentiment orientation by means of supervised and unsupervised relational approaches.

Although the above-mentioned approaches represent fertile ground for sentiment analysis in online social networks, they strongly assume that the explicitly available user-user relationships (friendships and following/follower connections) unconditionally represent sentiment agreement between connected users. However, this assumption does not reflect what happens in the real world, where two structurally connected users (e.g. friends) can have divergent opinions on a given topic. In order to better capture sentiment agreement among users, the following section presents and analyses two main sociological theories, finally converging to a novel network modelling approach for sentiment analysis purposes.

3 Homophily and constructuralism: two sociological processes

People with different characteristics (e.g., gender, race, age, class background, etc.) usually show very different personalities: educated people are tolerant, women are sensitive, and gang members are violent [48]. Since people generally have significant contact with others who tend to be like themselves, personal characteristics tend to converge. Homophily is the principle stating that contact among similar people occurs at a higher rate than among dissimilar people. Homophily implies that differences in terms of social characteristics translate into network distance, i.e. the number of relationships through which a piece of information must travel to connect two individuals [48].

The concept of homophily is very ancient. In the Rhetoric and the Nicomachean Ethics, Aristotle noted that people “love those who are like themselves” [2]. Plato observed in Phaedrus that “similarity begets friendship” [52]. However, social scientists began systematic observations of group formation and network ties only in the 1920s [6, 31, 64]. They noted that school children formed friendships and play groups at higher rates if they were similar in demographic characteristics. The classic and most famous work in sociology is [39], where the friendship process is studied. Its authors also quoted the proverbial expression of homophily, “birds of a feather flock together”, which is often used to summarise this sociological process. Researchers have studied homophily in relationships ranging from the strong ties of “discussing important matters” [43, 44] to the more circumscribed relationships of “knowing about” someone [23] or appearing with them in a public place [47].

In particular, [39] distinguished two types of homophily: status homophily, in which similarity is based on informal, formal, or ascribed status, and value homophily, which is based on values, attitudes, and beliefs [48]. Status homophily includes the major sociodemographic dimensions such as race, ethnicity, sex, or age, and acquired characteristics such as religion, education, occupation, or behaviour patterns. Value homophily includes the wide range of internal states presumed to shape our orientation towards future behaviours: attitude, belief, and value similarity lead to attraction and interaction [32]. Value homophily is the facet assumed in this paper, where dynamic interactions are preferred over static user attributes.

Besides homophily, [8] developed a sociological theory called constructuralism, whose core assumption is that people who share knowledge are more likely to interact (i.e. form ties). In particular, constructuralism argues that individual learning from interactions takes place on two levels. First, social interactions allow us to collect over time new knowledge that represents similarity among users better than static sociodemographic dimensions such as race, ethnicity, sex, or age (i.e. status homophily). Second, as humans receive and share knowledge with interaction partners, we “learn” a perception of what we expect them to know. Paired with the homophily assumption that people tend to interact with others similar to them, constructuralism explains how social relationships evolve via interactions as the knowledge that two actors share increases [35]. This approach to the coevolution of knowledge and social relationships has considerable explanatory power over the dynamics of social networks and has been shown to be an effective tool for social simulation [9, 24]. Since text does not always provide explicit or sufficient information about sentiment, early studies on sentiment classification [30, 61] overcame this limitation by exploiting the principle of homophily, usually modelled through friendships. However, measuring the similarity among users on the basis of constructuralism appears to be much more powerful than interpersonal influence within the friendship network [10, 36].

Considering friendship connections as a proxy for homophily is therefore a strong assumption: (1) being friends/followers does not necessarily mean agreeing on a particular topic (e.g. friends often hold opposite political views); (2) dynamic interactions are preferable to static attributes (value homophily): once a friendship is established in an online social network, it is rarely broken, and even when that happens it changes slowly over time; (3) social interactions allow us to collect over time new knowledge that represents similarity among users better than static attributes such as friendship (constructuralism).

For these reasons, we propose a novel paradigm called the Approval Network to jointly model homophily and constructuralism. The general idea behind the Approval Network is that a user who approves (e.g., through ‘likes’ on Facebook, ‘+1’ on Google+ or ‘retweets’ on Twitter) a given message is likely to hold the same opinion as its author, because an approval tool does not allow the user to add a comment against the original message. Thus, the main underlying principle is that approval relationships can be a strong indication of sentiment agreement between two users: the higher the number of approvals between two users on a given topic, the more likely their agreement on that topic.

For instance, information can spread on Twitter in the form of retweets, which are tweets that have been forwarded by a user to his or her followers. A retweet is identified by the pattern “RT @” followed by the name of the tweet’s author and the original tweet (e.g. John tweets “I like the new iPhone” and Mary retweets John’s tweet: “RT @John: I like the new iPhone”, i.e. John and Mary positively agree about the iPhone). While approving can simply be perceived as the act of copying and rebroadcasting, the practice contributes to a conversational ecology in which conversations are composed of a public interplay of voices that give rise to an emotional sense of shared conversational context [7].
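
As an illustration of how approval edges could be harvested from raw tweets, the following minimal Python sketch (our own, not part of the original system) detects the “RT @” pattern and accumulates the raw approval counts \(w_{i,j}\) and message counts \(k_{i}\) used in the next section; the tweet records and field names are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical tweet records: (author, text). Names and fields are illustrative only.
tweets = [
    ("John", "I like the new iPhone"),
    ("Mary", "RT @John: I like the new iPhone"),
]

RT_PATTERN = re.compile(r"^RT @(\w+):")

w = defaultdict(int)  # w[(i, j)] = number of j's messages approved (retweeted) by i
k = defaultdict(int)  # k[j]      = number of original messages posted by j on the topic

for author, text in tweets:
    match = RT_PATTERN.match(text)
    if match:
        original_author = match.group(1)
        w[(author, original_author)] += 1   # approval edge author -> original_author
    else:
        k[author] += 1                       # original message by author

print(dict(w))  # {('Mary', 'John'): 1}
print(dict(k))  # {'John': 1}
```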

4 Approval networks

Approving content is a widely adopted practice in online social networks, and agreement with the original message has been highlighted as one of the main practical motivations for doing so [7, 33]. In order to model the two underlying social theories and capture this practical behaviour of users, we formally define what we call Approval Networks.

Definition 1

Given a topic of interest q, a Directed Approval Graph is a quadruple \(\text {\textbf {DAG}}_{q} = \{V_{q},E_{q},\mathbf {X}_{q}^{V},\mathbf {X}_{q}^{E} \}\), where \(V_{q}=\{v_{1},\ldots,v_{n}\}\) represents the set of active users on q; \(E_{q}=\{(v_{i},v_{j})\,|\,v_{i},v_{j} \in V_{q}\}\) is the set of approval edges, each denoting that \(v_{i}\) approved messages of \(v_{j}\); \(\mathbf {X}_{q}^{E}=\{ w_{i,j}\,|\,(v_{i},v_{j}) \in E_{q}\}\) is the set of weights assigned to the approval edges, where \(w_{i,j}\) indicates how many messages of \(v_{j}\) on q were approved by \(v_{i}\); \(\mathbf {X}_{q}^{V}=\{k_{i}\,|\,v_{i} \in V_{q}\}\) is the set of coefficients related to nodes, where \(k_{i}\) represents the total number of messages of \(v_{i}\) on q.

Given a \(\text {DAG}_{q}\), we can define a network representation that takes into account the real usage of social network approval tools. In particular, we define an Augmented Directed Approval Graph as follows:

Definition 2

Given a \(\text {DAG}_{q} = \{V_{q},E_{q},\mathbf {X}_{q}^{V},\mathbf {X}_{q}^{E} \}\), an Augmented Directed Approval Graph is derived as the triple \(\text {\textbf {A-DAG}}_{q} = \{V_{q},E_{q},\mathbf {C}_{q}^{E} \}\), where \(\mathbf {C}_{q}^{E}=\{ c_{i,j} \}\) is the set of normalised weights of the approval edges, and \(c_{i,j}\) is computed as

$$\begin{array}{@{}rcl@{}} c_{i,j} = \frac{w_{i,j}}{{\max_{i}}\,\, w_{i,j}} \log_{2} \left( 1+\frac{w_{i,j}}{k_{j}} \right) \end{array} $$
(1)

The measure presented in (1) tries to capture the behaviour of users on social networks. First, a common characteristic of an approval network is that most users approve only one message of a target user, and very few approve two or more [38]; thus, a logarithmic scale is used instead of a linear one. Second, the number of approvals between two users does not necessarily indicate how much they agree on a particular topic: it can be influenced by the interest and originality of the target user’s messages. For example, a user A could completely agree with user B but approve only one or two of B’s messages simply because B’s messages lack originality. For this reason, the number of approvals from user A to B is normalised by the maximum number of approvals from any user connected to B. Finally, this representation penalises users who approve few messages of a particular target user when other users approve many posts of the same target user. A toy example of an A-DAG is reported in Figure 1.
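
A minimal sketch of this computation follows; the raw counts w and k are assumed to be stored as plain dictionaries, and the toy values are illustrative (they do not correspond to Figure 1).

```python
import math
from collections import defaultdict

def normalised_approval_weights(w, k):
    """Compute the A-DAG edge scores c_{i,j} from the raw approval counts
    w[(i, j)] and the message counts k[j], following (1):
        c_{i,j} = w_{i,j} / max_i w_{i,j} * log2(1 + w_{i,j} / k_j)
    where the max runs over all users i approving the same target j."""
    max_in = defaultdict(int)          # maximum number of approvals received by each target j
    for (i, j), count in w.items():
        max_in[j] = max(max_in[j], count)

    c = {}
    for (i, j), count in w.items():
        c[(i, j)] = (count / max_in[j]) * math.log2(1 + count / k[j])
    return c

# toy usage (values are illustrative)
w = {("A", "B"): 2, ("C", "B"): 1}
k = {"B": 4}
print(normalised_approval_weights(w, k))   # {('A', 'B'): ~0.585, ('C', 'B'): ~0.161}
```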

Figure 1

Example of an A-DAG\(_{q}\) modelling approval relations. White numbers represent the number of messages provided by each user. a The weight on each edge corresponds to the number of approvals that the source user makes with respect to the messages of the target user. b The score on each edge is the weight of the approval edge computed according to (1)

In order to take into account both the texts and the relationships available in online social networks, the A-DAG\(_{q}\) has been extended by defining a heterogeneous graph as a unified representation of both user-user and user-message relationships.

Definition 3

Given an A-DAG\(_{q}\), let \(M_{q}=\{m_{1},\ldots,m_{m}\}\) be the set of nodes representing messages about q and \({A_{q}^{M}}=\{(v_{i},m_{t})\,|\,v_{i} \in V_{q}, m_{t} \in M_{q}\}\) be the set of arcs connecting user \(v_{i}\) to message \(m_{t}\). A Heterogeneous Directed Approval Graph is the quintuple \(\textbf {H-DAG}_{q} = \{V_{q},E_{q},{C_{q}^{E}},M_{q},{A_{q}^{M}}\}\).

A graphical representation of an H-DAG\(_{q}\) is reported in Figure 2. In the following, the topic q is intended to be fixed and is therefore omitted.

Figure 2

H-DAG representing user-message and user-user relationships
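
For readers who prefer code to notation, the following sketch gives one possible in-memory container for an H-DAG (Definition 3); the field names are our own and the topic q is assumed fixed, as stated above.

```python
from dataclasses import dataclass, field

@dataclass
class HDAG:
    """Minimal container for a Heterogeneous Directed Approval Graph (Definition 3).
    Field names are illustrative; the topic q is fixed and omitted."""
    users: set = field(default_factory=set)             # V: active users
    approval_edges: dict = field(default_factory=dict)  # C^E: (v_i, v_j) -> c_{i,j}
    messages: dict = field(default_factory=dict)        # M: message id -> text
    authorship: set = field(default_factory=set)        # A^M: (v_i, m_t) user-message arcs

    def neighbours(self, v):
        """Users whose messages v has approved, i.e. N(v) in the user-user layer."""
        return {j for (i, j) in self.approval_edges if i == v}
```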

5 Sentiment analysis with approval networks

In the following sections, we present two tasks in which the proposed Heterogeneous Directed Approval Graph is exploited to define two novel sentiment analysis models. The first task concerns user-level sentiment analysis, addressed by a semi-supervised model, while the second concerns aspect-level sentiment analysis, solved by a fully unsupervised approach.

5.1 User-level sentiment analysis

Dealing with sentiment classification on social networks usually requires a fully supervised learning paradigm, where the sentiment orientation of users must be known a priori to derive suitable predictive models. However, this does not reflect the real setting of social networks, where the polarity on a given topic is explicitly available only for some users (black nodes), while for the others it must be derived from their posts and their relations with other users (white nodes). As black nodes we consider those users whose bio (description on Twitter) or name clearly states a positive or negative opinion about the topic ’Obama’. For instance, a positive user’s bio could read “I love football, TV series and Obama!” and/or the name could be “ObamaSupporter”.
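
The following sketch illustrates one possible implementation of the black-node heuristic just described; the cue-word lists are hypothetical placeholders, not the actual annotation rules used for the datasets.

```python
def seed_label(bio, name, topic="Obama"):
    """Heuristic selection of 'black' (seed) users: the bio or the screen name
    must clearly state a stance on the topic. Cue words are illustrative only."""
    text = f"{bio} {name}".lower()
    if topic.lower() not in text:
        return None                      # user is not clearly about the topic
    if any(cue in text for cue in ("love", "supporter", "vote for")):
        return "+"
    if any(cue in text for cue in ("hate", "against", "never")):
        return "-"
    return None                          # remains a 'white' (unlabelled) node

print(seed_label("I love football, TV series and Obama!", "jdoe"))   # '+'
print(seed_label("Photographer. Coffee lover.", "ObamaSupporter"))   # '+'
```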

In this context, a semi-supervised learning paradigm better represents the real setting. To this purpose, we introduce a semi-supervised sentiment learning approach named S2-LAN [54], able to deal with both text and the Approval Network: given a small proportion of users already labelled in terms of polarity, it predicts the sentiment of the remaining unlabelled users by directly combining textual information and the Approval Network.

5.1.1 Semi-supervised sentiment learning by approval network (S2-LAN)

Given an H-DAG denoted by ϕ, two vectors need to be introduced to tackle the sentiment classification problem at user level: a vector of labels \(\mathbf {L}^{V}=\{l(v_{i})\in \{+,-\}\,|\,v_{i} \in V\}\) that defines each user as either “positive” (+) or “negative” (-), and an analogous vector of labels \(\mathbf {L}^{M}=\{l_{v_{i}}(m_{t}) \in \{+,-\}\,|\,v_{i} \in V, m_{t} \in M\}\) that represents the polarity label of each message \(m_{t}\) written by user \(v_{i}\).

In our model, we assume that the sentiment label \(l(v_{i})\) of user \(v_{i}\) can be derived from the sentiment labels \(l_{v_{i}}(m_{t})\) of his/her messages and is influenced by the sentiment labels of the directly connected neighbours \(N(v_{i})\). The user-message and user-user (approval) relations are combined in the following probabilistic model:

$$\begin{array}{@{}rcl@{}} \log P(\mathbf{L}^{V} \mid \phi) &=& \sum\limits_{v_{i} \in V} \bigg[ \sum\limits_{m_{t} \in M} \sum\limits_{\alpha} \sum\limits_{\beta} \mu_{\alpha,\beta}\, f_{\alpha,\beta} (l(v_{i}),l_{v_{i}}(m_{t}))\\ && +\sum\limits_{\underset{(v_{i},v_{j})\in \phi}{v_{j} \in N(v_{i})}} \sum\limits_{\alpha} \sum\limits_{\beta} \lambda_{\alpha,\beta}\, g_{\alpha,\beta}(l(v_{i}),l(v_{j}))\bigg] -\log Z \end{array} $$
(2)

where α,β∈{+,−} denote the polarity labels, \(f_{\alpha,\beta}(\cdot,\cdot)\) and \(g_{\alpha,\beta}(\cdot,\cdot)\) are feature functions used to evaluate the user-message and the user-user relations respectively, and the weights \(\mu_{\alpha,\beta}\), \(\lambda_{\alpha,\beta}\) are parameters to be estimated. Z is a normalisation factor that makes \(P(\mathbf{L}^{V})\) a coherent probability distribution. For the estimation of μ and λ and the assignment of user sentiment labels that maximises \(\log P(\mathbf{L}^{V})\), refer to [54], where a modified version of the SampleRank algorithm [65] is presented.

User-message feature function

A user-message feature function evaluates whether the message polarity agrees (or disagrees) with the user sentiment. Formally, \(f_{\alpha ,\beta }(l(v_{i}),l_{v_{i}}(m_{t}))\) is defined as:

$$ f_{\alpha,\beta} (l(v_{i}),l_{v_{i}}(m_{t})) = \left\{ \begin{array}{ll} \frac{\rho_{T-black}}{|M_{v_{i}}|} & \quad l(v_{i})=\alpha, l_{v_{i}}(m_{t})=\beta, v_{i}\in \text{black}\\ \frac{\rho_{T-white}}{|M_{v_{i}}|} & \quad l(v_{i})=\alpha, l_{v_{i}}(m_{t})=\beta, v_{i} \in \text{white}\\ 0 & \quad otherwise \end{array} \right. $$
(3)

where “\(v_{i} \in\) black” means that user \(v_{i}\) is initially labelled (i.e. its polarity label is known a priori), and “\(v_{i} \in\) white” means that \(v_{i}\) is unlabelled (i.e. its polarity label is unknown a priori). The parameters \(\rho_{T-black}\) and \(\rho_{T-white}\) represent the different levels of confidence associated with black and white users, and \(M_{v_{i}} \subset M\) denotes the set of messages written by user \(v_{i}\). Since the user-message feature function \(f_{\alpha,\beta}\) assumes that every message \(m_{t} \in M\) has a polarity label, a sentiment classification methodology for messages is required. To this purpose, an ensemble method based on Bayesian Model Averaging (BMA) has been used [19, 53]. This choice is motivated by the fact that an ensemble of different models can be less sensitive to noise and can provide more accurate predictions than single learners [14].

User-user feature function

A user-user feature function evaluates whether the polarity of a given user agrees (or disagrees) with its neighbours’ sentiment. Given an H-DAG, \(g_{\alpha,\beta}(l(v_{i}),l(v_{j}))\) is formally defined as follows:

$$ g_{\alpha,\beta} (l(v_{i}),l(v_{j}))= \left\{ \begin{array}{ll} \frac{\rho_{neigh}\cdot c_{i,j}}{\sum\limits_{v_{k} \in N(v_{i})} c_{i,k}} & \quad l(v_{i})=\alpha, l(v_{j})=\beta\\ 0 & \quad otherwise \end{array} \right. $$
(4)

where \(\rho_{neigh}\) represents the confidence level of the relationships among users and \(c_{i,j}\) denotes the normalised weight of the approval edge \((v_{i},v_{j})\) in ϕ.
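
A minimal sketch of the two feature functions (3) and (4), together with the unnormalised log-score of (2), is reported below; it reuses the HDAG container sketched in Section 4 and only illustrates the scoring step, not the authors’ SampleRank implementation (the rho parameters and their default values are illustrative).

```python
def f_user_message(rho, n_messages):
    """Value of the user-message feature (3) for the matching (alpha, beta) pair:
    rho is rho_T-black or rho_T-white, n_messages is |M_{v_i}|; the feature is 0
    for non-matching label pairs, so only the matching pair contributes below."""
    return rho / n_messages

def g_user_user(c_ij, c_neigh_sum, rho_neigh):
    """Value of the user-user feature (4) for the matching (alpha, beta) pair:
    approval weight c_{i,j} normalised over v_i's neighbourhood."""
    return rho_neigh * c_ij / c_neigh_sum

def unnormalised_log_score(user_labels, msg_labels, hdag, black_users,
                           mu, lam, rho_black=1.0, rho_white=0.5, rho_neigh=1.0):
    """Unnormalised version of (2): sum of weighted feature values for a candidate
    assignment of user labels (log Z is not needed when comparing assignments,
    as in SampleRank-style inference). mu and lam map (alpha, beta) pairs to weights."""
    score = 0.0
    for v in hdag.users:
        v_msgs = [m for (u, m) in hdag.authorship if u == v]
        rho = rho_black if v in black_users else rho_white
        for m in v_msgs:                              # user-message contributions
            pair = (user_labels[v], msg_labels[m])
            score += mu[pair] * f_user_message(rho, len(v_msgs))
        neigh = hdag.neighbours(v)
        c_sum = sum(hdag.approval_edges[(v, j)] for j in neigh)
        if c_sum > 0:                                 # user-user (approval) contributions
            for j in neigh:
                pair = (user_labels[v], user_labels[j])
                score += lam[pair] * g_user_user(hdag.approval_edges[(v, j)],
                                                 c_sum, rho_neigh)
    return score
```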

5.1.2 Experiments

In this section, we present a case study to validate the semi-supervised model S2-LAN presented above. The experimental investigation is based on connections and messages obtained from Twitter and compares several approaches able to consider textual information, structural information and their combination.

Dataset

In order to evaluate the proposed model, two datasets have been considered:

  • The first dataset, named Obama, comprises 62 users posting about the President of the U.S. Barack Obama, whose tweets were monitored during the period 8-10 May 2013. The resulting tweets and authors were manually labelled as positive or negative by three annotators. This dataset also consists of 270 relationships and 160 posts related to the 62 users. In order to validate the proposed modelling in a more complex scenario, a larger dataset has been collected.

  • The second dataset, named Superman, comprises 3835 users posting about the movie “Man of Steel”, whose posts were monitored during the period 11-12 June 2013. Also in this case, the resulting tweets and authors were manually labelled. This dataset also consists of 4032 relationships and 5167 posts related to the 3835 users.

Some statistics about the considered datasets are reported in Table 1.

Table 1 Characteristics of the Twitter datasets

Compared models

In order to evaluate the proposed model, S2-LAN has been compared with traditional approaches based on text, i.e. Dictionary-based Classifier (DIC) [26], Naive Bayes (NB) [45], Maximum Entropy (ME) [46], Support Vector Machines (SVM) [12], Conditional Random Fields (CRF) [60] and Bayesian Model Averaging (BMA) [19].

The classical state-of-the-art measures for classification have been employed, i.e. Precision (P), Recall (R) and F1-measure (F1), distinguished into positive (+) and negative (-), together with the well-known global Accuracy (Acc) measure. Moreover, in order to show the importance of textual and social network information, S2-LAN has been investigated under the following experimental conditions:

  • S2-LAN (T+A): this corresponds to the proposed model in (2), where both text and approval relationships are evaluated by the user-message and user-user feature functions;

  • S2-LAN (T+F): this model estimates the likelihood reported in (2) considering text and friendship relationships (i.e. following/follower) in their respective feature functions;

  • S2-LAN (T): the model is estimated only evaluating the user-message feature functions based on text;

  • S2-LAN (A): the model is estimated only evaluating the user-user feature functions based on approval relationships;

  • S2-LAN (F): the model is estimated only evaluating the user-user feature functions based on friendship (i.e. following/follower) connections.

Considering that the estimation of the parameters λ and μ of the S2-LAN models is performed by the gradient-based approach named SampleRank, we fixed its maximum number of steps to 10000 and assumed that convergence is achieved when the results persist for 500 steps. Moreover, since SampleRank is based on a sampling function, we performed k={1,5,11,15,21,101} runs to obtain k predictions (votes) and took a majority vote among the k possible labels for each user. For each k, we performed 500 experiments to compute the average performance.
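
The majority-voting step over the k sampling runs can be sketched as follows; `predict_once` stands for one complete SampleRank inference run and is assumed to return a user-to-label dictionary.

```python
from collections import Counter, defaultdict

def majority_vote(predict_once, k):
    """Aggregate k runs of a sampling-based predictor (e.g. SampleRank inference)
    by per-user majority voting. `predict_once` is assumed to return a dict
    mapping each user to a predicted label in {'+', '-'}."""
    votes = defaultdict(Counter)
    for _ in range(k):
        for user, label in predict_once().items():
            votes[user][label] += 1
    # most_common(1) breaks ties arbitrarily, which suffices for odd k
    return {user: counts.most_common(1)[0][0] for user, counts in votes.items()}
```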

Results

The first evaluation relates to the ability of the classifiers to detect the polarity of tweets on both datasets. Comparing the state-of-the-art approaches, it emerges that BMA on Obama achieves 60.37 % Accuracy compared with 58.49 % obtained by the best single classifier (CRF), 58.04 % by SVM, 56.12 % by NB, 55.35 % by ME and 55.01 % by DIC. Analogous results can be observed on Superman, where BMA obtains 53.1 % accuracy, followed by SVM (52.6 %), CRF (52.0 %), NB (51.7 %), ME (51.3 %) and DIC (51.1 %). The experimental comparison between BMA and the other baseline classifiers shows that the adopted solution is particularly effective and efficient, thanks to its ability to define a strategic combination of different classifiers through an accurate and computationally efficient heuristic. According to these results, BMA has been selected as the text-based baseline at user level for the comparison with S2-LAN. To this purpose, the polarity of a user has been derived by aggregating BMA’s polarity predictions on tweets through a majority voting mechanism. For instance, if BMA detects three positive and two negative tweets for a given user, the final user label will be positive.

Tables 2 and 3 summarise the performance on Obama and Superman achieved by S2-LAN (T+A) with respect to the BMA-based heuristic. The reported results refer to the S2-LAN (T+A) configuration based on k=21, which represents a good trade-off between running time and accuracy. It can easily be noted that the approval relations included in S2-LAN (T+A) ensure a global improvement of 27 % over the text-only method on Obama and 14 % on Superman. Since BMA does not take into account any kind of relationship, the correct prediction for a user does not have any effect on adjoining users. When the network structure is considered, the prediction for each user has an impact on all the other nodes through a “propagation” effect, smoothing each predicted label according to the connected nodes.

Table 2 Performance on Obama: best baseline based on text (BMA) vs the proposed S2-LAN (T+A) based on relations and text
Table 3 Performance on Superman: best baseline based on text (BMA) vs the proposed S2-LAN (T+A) based on relations and text

In order to better understand the contribution that text and relationships provide to the proposed model, several additional results are reported in the following. We notice that the results achieved on Obama are considerably higher than those obtained on Superman. The main reason can be found in the characteristics of the two networks: the Obama network structure is denser and more compact than the Superman one. This can be deduced by comparing the average number of neighbours and the diameter of the two graphs: while the Obama dataset is characterised by an average number of edges per user equal to 7.1 and a diameter equal to 3, the Superman dataset has 2.1 and 5 respectively (see Table 1). In this scenario, a higher average number of neighbours positively affects the diffusion model because more information (from adjoining nodes) is available when processing each user. Additionally, a compact network with a low diameter, i.e. the longest distance in the graph, makes the propagation effect less sensitive to noise.
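
The two statistics used in this comparison can be computed, for example, with networkx as sketched below; treating the graph as undirected and restricting the diameter to the largest connected component are our assumptions, since the paper does not detail how the values were obtained.

```python
import networkx as nx

def network_statistics(approval_edges):
    """Average number of neighbours and diameter of an approval network,
    the two statistics used above to contrast the Obama and Superman datasets."""
    g = nx.DiGraph()
    g.add_edges_from(approval_edges)
    avg_neighbours = sum(dict(g.degree()).values()) / g.number_of_nodes()
    undirected = g.to_undirected()
    largest_cc = max(nx.connected_components(undirected), key=len)
    diameter = nx.diameter(undirected.subgraph(largest_cc))
    return avg_neighbours, diameter

# toy usage with illustrative edges
print(network_statistics([("A", "B"), ("C", "B"), ("C", "A")]))
```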

Several additional considerations can be derived when comparing the five variants of S2-LAN (see Tables 4 and 5). First of all, it is easy to note that in both datasets the model based only on text (S2-LAN (T)) performs extremely poorly. When considering only friendship or only approval relationships, S2-LAN (F) and S2-LAN (A) behave differently on the two datasets. While on Obama considering only the relational information produces a positive effect on user-level sentiment prediction, on the Superman dataset there is no significant improvement with respect to the textual information.

Table 4 Accuracy comparison: importance of textual and social network information on the Obama dataset
Table 5 Accuracy comparison: importance of textual and social network information on the Superman dataset

These two different behaviours can again be explained by analysing the dataset characteristics. While for Obama we can roughly estimate a number of edges per user close to 4.3, for the Superman dataset this estimate is about 1.05. According to these statistics, both friendships and approvals provide a good improvement with respect to text in the dense network, while no significant variation is observed in the sparse one. In this context, the denser the network, the higher the improvement.

Finally, looking at the results of the models that jointly consider text and relationships, i.e. S2-LAN (T+A) and S2-LAN (T+F), we can derive two considerations. First of all, both kinds of relationships (when combined with text) improve user-level sentiment predictions with respect to considering text only. Second, the proposed model S2-LAN (T+A) better combines the two ingredients, outperforming S2-LAN (T+F) on both datasets. This investigation not only confirms that the inclusion of relationships in predictive models, as suggested in other studies [17, 18, 57], improves recognition performance when dealing with non-propositional environments, but also that approval relationships capture the concepts of homophily and constructuralism better than simple friendships.

5.2 Aspect-level sentiment analysis

This section focuses on simultaneously extracting aspects and classifying sentiments from textual messages. Most works [42, 63] tackle sentiment analysis in social networks at document level. However, social network messages often contain highly diverse aspects in the same message (e.g., “#iOS7 is very good, but improvements on battery and screen are required”). The above-mentioned works would classify the entire message as globally positive, but the company Apple would probably be more interested in the opinions about the aspects ‘battery’ and ‘screen’ in order to understand how ‘iOS7’ could be improved. In order to deal with aspect-level sentiment analysis [27], two tasks need to be addressed:

  • Aspect Extraction: Aspect extraction consists of two sub-tasks: (1) extracting all aspect terms (e.g., ‘screen’) from the corpus, and (2) clustering aspect terms with similar meanings (e.g., group ‘screen’ and ‘display’ into one aspect category in the domain ‘iPhone’). Topic models, such as Latent Dirichlet Allocation (LDA) [5] and Probabilistic Latent Semantic Analysis (pLSA) [25], have been successfully applied to perform both sub-tasks simultaneously.

  • Sentiment Classification: Once aspects have been extracted, sentiment analysis techniques can be applied to discover the underlying sentiment orientations. In the previous example, the polarity on the aspects battery and screen is negative.

Several works dealing with joint sentiment classification and topic modelling have been proposed in the literature. Topic Sentiment Mixture (TSM) [49] separates topic and sentiment words using an extended pLSA model. Further models based on the LDA principle can be found in [40] and [34], where the Joint Sentiment/Topic (JST) model and the Aspect and Sentiment Unification Model (ASUM) have been proposed respectively. The main advantage of jointly modelling both tasks derives from the ability to reciprocally reduce their noise.

However, these techniques consider only textual information, disregarding the fact that relationships can provide useful information to infer the latent aspects and sentiment orientations. To address this issue, an unsupervised probabilistic model called NAS (Networked Aspect-Sentiment) is proposed as an extension of JST. With NAS, we intend to extract both aspects and sentiments from microblog messages by simultaneously incorporating all the information made available by the proposed approval networks.

5.2.1 Networked Aspect-Sentiment (NAS) model

JST (in its original form) is characterised by a good ability to jointly model aspects and sentiment orientations. However, it is not able to use the additional information given by the network structure underlying the user messages. Taking into account the relationships based on approvals can help the model maintain coherent sentiment labels for strongly connected users posting about the same topic. The main idea for overcoming the limitation of the JST model is to consider (in the generative process, and therefore during the inference phase) not only the sentiment label related to specific topic-message-word assignments, but also the sentiment labels of the messages provided by adjoining users.

Based on the above intuition, we propose a novel generative model called Networked Aspect-Sentiment (NAS), which drops the sentiment independence assumption that characterises the JST model. As depicted in Figure 3, several dependencies are modelled in NAS. In particular, analogously to the JST model, the topic z is conditioned on the sentiment-topic-message distribution 𝜃. However, concerning the sentiment label l, the NAS model considers not only its conditional dependency on the sentiment-message distribution π but also the dependencies on adjoining sentiment labels, modelled through the variable x (regulated by a sentiment-approval distribution 𝜖). The generation of each word w finally depends on its sentiment-topic-word distribution φ, the topic z and the sentiment label l. The entire process relies on the hyper-parameters α, β, γ to regulate the priors of the respective discrete distributions.

Figure 3

Graphical representation of NAS through plate notation. Nodes are random variables, edges are dependencies, and plates are replications. Only shaded nodes are observable

More formally, let us assume that we have a corpus composed of M messages, where each message m is a sequence of \(N_{m}\) words, i.e. \(m=(w_{1},w_{2},\ldots,w_{N_{m}})\), and each unique word \(w_{i}\) contributes to a vocabulary of V distinct terms. Let S be the number of sentiment labels and T the total number of topics.

The joint distribution of the NAS model can be written as:

$$\begin{array}{@{}rcl@{}} &&p(w,z,l,x,\varphi,\theta,\epsilon,\pi \mid\alpha,\beta,\gamma) =\\ &&{\kern20pt} =P(\varphi|\beta)P(\theta|\alpha)P(\pi|\gamma)P(\epsilon|\gamma)P(z|l,\theta)P(l|x,\pi)P(x|\epsilon)P(w|z,l,\varphi) \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}}&&={\prod}_{s=1}^{S}{\prod}_{t=1}^{T}P(\varphi_{st}|\beta) \times {\prod}_{s=1}^{S}{\prod}_{m=1}^{M}P(\theta_{sm}|\alpha) \times {\prod}_{m=1}^{M}P(\pi_{m}|\gamma) \\ &&{\kern6pt}\times {\prod}_{m^{\prime}=1}^{M}P(\epsilon_{m^{\prime}}|\gamma) \times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(z_{nm}|\theta_{sm}) \times {\prod}_{m^{\prime}=1}^{M}{\prod}_{n=1}^{N_{m^{\prime}}}P(x_{nm^{\prime}}|\epsilon_{m^{\prime}}) \\ &&{\kern6pt}\times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(w_{nm}|\varphi_{st}) \times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(l_{nm}|\pi_{m}) \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}}&&={\prod}_{s=1}^{S}{\prod}_{t=1}^{T}Dir(\varphi_{st}|\beta) \times {\prod}_{s=1}^{S}{\prod}_{m=1}^{M}Dir(\theta_{sm}|\alpha) \times {\prod}_{m=1}^{M}Dir(\pi_{m}|\gamma) \\ &&{\kern6pt}\times {\prod}_{m^{\prime}=1}^{M}Dir(\epsilon_{m^{\prime}}|\gamma) \times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}Disc(z_{nm}|\theta_{sm}) \times {\prod}_{m^{\prime}=1}^{M}{\prod}_{n=1}^{N_{m^{\prime}}}Disc(x_{nm^{\prime}}|\epsilon_{m^{\prime}}) \\ &&{\kern6pt}\times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}Disc(w_{nm}|\varphi_{st}) \times {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}Disc(l_{nm}|\pi_{m}) \end{array} $$
(7)

where D i r(⋅) and D i s c(⋅) denote Dirichlet and Discrete distributions respectively.

Given the n-th word of the m-th message, the collapsed sampler needs to compute the probability of topic \(z_{nm}\) and label \(l_{nm}\) being assigned to the word \(w_{nm}\), given all the other topic assignments (\(z_{nm}^{-}\)) and the sentiment assignments (\(l_{nm}^{-}\)) of all the other words, further conditioned on the sentiment realisations \(x_{nm^{\prime}}\) of the messages \(m^{\prime}\) of adjoining users. The sampler therefore needs to estimate the joint probability of topic and sentiment as follows:

$$ P(z_{nm},l_{nm}| z_{nm}^{-},l_{nm}^{-},w_{nm},x_{nm},\alpha,\beta,\gamma) $$
(8)

By definition of conditional probability, we can derive that (8) corresponds to:

$$ \frac{P(z_{nm},l_{nm}, z_{nm}^{-},l_{nm}^{-},w_{nm},x_{nm}|\alpha,\beta,\gamma)}{P(z_{nm}^{-},l_{nm}^{-},w_{nm},x_{nm}|\alpha,\beta,\gamma)} $$
(9)

Since the denominator does not depend on z n m and l n m , it can be removed obtaining:

$$ \propto P(z_{nm},l_{nm}, z_{nm}^{-},l_{nm}^{-},w_{nm},x_{nm}|\alpha,\beta,\gamma) $$
(10)

Now, considering that z n m together with \(z_{nm}^{-}\) is just z, and l n m together with \(l_{nm}^{-}\) corresponds to l, (10) can be written in a compact form as:

$$ P(z,l,w,x|\alpha,\beta,\gamma) $$
(11)

Using the rule of total probability, the topic distribution 𝜃, the sentiment distributions π and 𝜖, and the word distribution φ, can be integrated out as follows:

$$\begin{array}{@{}rcl@{}} &&P(z,l,w,x|\alpha,\beta,\gamma)\\ &&\quad= \int \int \int \int P(z,l,w,x,\varphi,\theta,\epsilon,\pi|\alpha,\beta,\gamma)\; d\varphi \; d\theta \; d\epsilon \; d\pi \end{array} $$
(12)

Expanding the integrand given the model defined in (5), we can obtain:

$$\begin{array}{@{}rcl@{}} &&{\kern5pt} P(z,l,w,x|\alpha,\beta,\gamma) \\ &&\quad=\int \int \int \int P(\varphi|\beta)P(\theta|\alpha)P(\pi|\gamma)P(\epsilon|\gamma)P(z|l,\theta)P(l|x,\pi)P(x|\epsilon)P(w|z,l,\varphi) \; d\varphi \; d\theta \; d\epsilon \; d\pi \\ &&\quad= \int P(\varphi|\beta)P(w|z,l,\varphi) d\varphi \times \int P(\theta|\alpha)P(z|l,\theta) d\theta \\ &&\quad\times \int P(\pi|\gamma) P(l|x,\pi) d\pi \times \int P(\epsilon|\gamma) P(x|\epsilon) d\epsilon \end{array} $$
(13)

Expanding out the terms according to the independence assumption in (6), we obtain the following form of the joint probability distribution:

$$\begin{array}{@{}rcl@{}} P(z,l,w,x|\alpha,\beta,\gamma)&=&\int {\prod}_{s=1}^{S}{\prod}_{t=1}^{T}P(\varphi_{st}|\beta) {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(w_{nm}|\varphi_{st_{nm}}) \; d\varphi \\ &\times & \int {\prod}_{s=1}^{S}{\prod}_{m=1}^{M}P(\theta_{sm}|\alpha) {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(z_{nm}|\theta_{sm_{t}}) \;d\theta \\ &\times & \int {\prod}_{m=1}^{M}P(\pi_{m}|\gamma) {\prod}_{m=1}^{M}{\prod}_{n=1}^{N_{m}}P(l_{nm}|\pi_{m_{s}}) \; d\pi \\ &\times & \int {\prod}_{m^{\prime}=1}^{M}P(\epsilon_{m^{\prime}}|\gamma) {\prod}_{m^{\prime}=1}^{M}{\prod}_{n=1}^{N_{m}}P(x_{nm^{\prime}}|\epsilon_{m^{\prime}}) \; d\epsilon \end{array} $$
(14)

Expanding out the Dirichlet priors and the discrete distributions according to their definitions, we can derive:

$$\begin{array}{@{}rcl@{}} P(z,l,w,x|\alpha,\beta,\gamma)&=&\Bigg(\frac{\Gamma({\sum}_{i=1}^{V}\beta_{i})}{{\prod}_{i=1}^{V}{\Gamma}(\beta_{i})}\Bigg)^{S T} \times \quad {\prod}_{t=1}^{T}{\prod}_{s=1}^{S} \frac{{\prod}_{i=1}^{V}{\Gamma}(N_{its}+\beta)}{\Gamma(N_{ts}+V\beta)}\quad \\ &\times& \Bigg(\frac{\Gamma({\sum}_{i=1}^{T}\alpha_{i})}{{\prod}_{i=1}^{T}{\Gamma}(\alpha_{i})}\Bigg)^{SM} \times \quad {\prod}_{s=1}^{S}{\prod}_{m=1}^{M} \frac{{\prod}_{t=1}^{T}{\Gamma}(N_{tsm}+\alpha)}{\Gamma(N_{sm}+T\alpha)}\quad \\ &\times&\Bigg(\frac{\Gamma({\sum}_{i=1}^{S}\gamma_{i})}{{\prod}_{i=1}^{S}{\Gamma}(\gamma_{i})}\Bigg)^{M} \times \quad {\prod}_{m=1}^{M} \frac{{\prod}_{s=1}^{S}{\Gamma}(N_{sm}+\gamma)}{\Gamma(N_{m}+S\gamma)}\quad \\ &\times&\Bigg(\frac{\Gamma({\sum}_{i=1}^{S}\gamma_{i})}{{\prod}_{i=1}^{S}{\Gamma}(\gamma_{i})}\Bigg)^{M} \times \quad {\prod}_{m^{\prime}=1}^{M} \frac{{\prod}_{s=1}^{S}{\Gamma}(N_{sm^{\prime}}+\gamma)}{\Gamma(N_{m^{\prime}}+S\gamma)}\quad \end{array} $$
(15)

A sample obtained from the Markov chain of the Gibbs sampler can be used to approximate the distribution of words in topic and sentiment labels, obtaining therefore the following joint probability distribution:

$$\begin{array}{@{}rcl@{}} &&p(z,l,w,x|\alpha,\beta,\gamma) \propto\\ &&{\kern70pt} \overbrace{\frac{{N_{its}}+\beta}{N_{ts}+V\beta} \; \times \frac{N_{tsm}+\alpha}{N_{sm}+T\alpha} \; \times \frac{N_{sm}+\gamma}{N_{m}+S\gamma}}^{\text{\textbf{A} (i.\,e. JST)}} \;\times \\ &&{\kern118pt}\overbrace{\frac{\sum\limits_{(v_{i},v_{j}) \in E} c_{i,j} \; \overbrace{\frac{1}{|M(v_{j})|} \sum\limits_{m^{\prime} \in M(v_{j})} \frac{N_{sm^{\prime}}+\gamma}{N_{m}^{\prime}+S\gamma}}^{\textbf{C}}} {\sum\limits_{(v_{i},v_{j}) \in E} c_{i,j}}}^{\text{\textbf{B} (i.\,e. Approval Network)}} \end{array} $$
(16)

The last constituent (B) highlights the contribution provided by the Approval Network. The rationale behind B is to flatten the relationships between the messages of adjoining users (m′) by aggregating their sentiment labels s through their average (C). This contribution is weighted by the edge coefficients \(c_{i,j}\) encoded in the H-DAG, which represent the strength of the existing relationships between users.
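
A sketch of the resulting collapsed Gibbs update is reported below: for each word it multiplies the JST factor (part A of (16)) by the approval factor (part B), where part C is assumed to be precomputed per neighbour. Variable names and the bookkeeping of counts are illustrative; this is not the authors’ implementation.

```python
import numpy as np

def sample_topic_sentiment(word, n_its, n_ts, n_tsm, n_sm, n_m,
                           neigh_sentiment, c_weights, alpha, beta, gamma, V, T, S):
    """Collapsed Gibbs step for one word of message m, following (16).
    Count arrays, with the current word excluded, are maintained by the caller:
      n_its[s, t, word]      occurrences of `word` under sentiment s and topic t
      n_ts[s, t]             words under sentiment s and topic t
      n_tsm[s, t]            words of m under sentiment s and topic t
      n_sm[s]                words of m under sentiment s
      n_m                    words of m
      neigh_sentiment[j][s]  precomputed part C for neighbour j, i.e. the average of
                             (N_{sm'} + gamma) / (N_{m'} + S*gamma) over j's messages
      c_weights[j]           approval weight c_{i,j} towards neighbour j
    Returns a (sentiment, topic) pair drawn from the resulting distribution."""
    p = np.zeros((S, T))
    for s in range(S):
        if c_weights:  # part B: approval-weighted sentiment of adjoining users
            approval = (sum(c * neigh_sentiment[j][s] for j, c in c_weights.items())
                        / sum(c_weights.values()))
        else:
            approval = 1.0  # no approved neighbours: reduces to plain JST
        for t in range(T):
            jst = ((n_its[s, t, word] + beta) / (n_ts[s, t] + V * beta)
                   * (n_tsm[s, t] + alpha) / (n_sm[s] + T * alpha)
                   * (n_sm[s] + gamma) / (n_m + S * gamma))
            p[s, t] = jst * approval
    p /= p.sum()
    idx = np.random.choice(S * T, p=p.ravel())
    return divmod(idx, T)  # (sentiment, topic) assignment for this word
```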

5.2.2 Experiments

In this section, we present experimental results on real data to demonstrate that the inclusion of the approval network in the proposed NAS model allows it to outperform the state-of-the-art baseline methods on both the sentiment classification and aspect extraction tasks.

Dataset and Settings

Two datasets have been collected from Twitter by monitoring users (tweets and retweets) posting about “iOS7” and “Lone Survivor”. iOS7 contains positive, negative and neutral tweets, while Lone Survivor is composed of positive and neutral tweets. In Table 6, some basic statistics are reported for both datasets.

Table 6 Characteristics of the Twitter datasets

In order to transform microblog messages into a more canonical language, URLs, mention tags (@), hashtags (#) and retweet (RT) symbols have been removed, and misspelled tokens have been automatically corrected using Google’s Spell Checker API. Emoticons, initialisms for emphatic expressions (e.g., ‘ROFL’, ‘LMAOL’, ‘ahahah’, ‘eheh’, ...) and onomatopoeic expressions (e.g., ‘bleh’, ‘wow’, ...) have been replaced by POS_EXP, NEU_EXP and NEG_EXP, according to their sentiment.
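
A rough sketch of this normalisation step is shown below; the expression lists are small illustrative samples and the spell-checking step is omitted, so it should be read as an approximation of the pipeline rather than the exact one used in the experiments.

```python
import re

# Replacement tokens follow the paper; the expression lists are illustrative samples.
POSITIVE_EXPR = {":)", ":-)", ":D", "ahahah", "eheh", "rofl", "lmao", "wow"}
NEGATIVE_EXPR = {":(", ":-(", "bleh", "meh"}

def normalise_tweet(text):
    """Drop URLs, RT markers, mentions and hashtag symbols, then map emphatic or
    onomatopoeic expressions to sentiment tokens. Spell checking is omitted."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"@\w+:?", " ", text)         # mentions (and the trailing colon)
    text = text.replace("#", " ")               # keep hashtag words, drop the symbol
    tokens = []
    for tok in text.split():
        low = tok.lower()
        if low in POSITIVE_EXPR:
            tokens.append("POS_EXP")
        elif low in NEGATIVE_EXPR:
            tokens.append("NEG_EXP")
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalise_tweet("RT @John: #iOS7 looks great :) http://t.co/xyz"))
# -> "iOS7 looks great POS_EXP"
```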

For a direct comparison of the proposed model with the state-of-the-art, a comparative analysis has been performed with the following approaches:

  • DIC: the first baseline for sentiment classification purposes is the dictionary-based classifier, where the overall message polarity is determined by first checking whether each term belongs to the positive, negative or neutral lexicon [28] and then using the majority voting strategy;

  • ASUM: this model, based on the Aspect Sentiment Unification Model presented in [34], computes the joint probability P(z,l,w) using 1000 Gibbs iterations with 100 burn-in iterations, disregarding any relational information encoded by the social network.

  • JST: it corresponds to the Joint Sentiment Topic model presented in [40] and works analogously to the ASUM model for sampling the joint probability P(z,l,w). No relational information is included.

  • TSM: this model, based on the Topic Sentiment Model presented in [49], computes the joint probability P(z,l,w) using the Kullback-Leibler divergence measure over the sentiment coverage in order to determine the model convergence. Also in this case, no relational information is included.

  • NAS-A: it corresponds to the model presented in Section 5.2.1, where the joint probability P(z,l,w,x) is estimated using 1000 Gibbs iterations with 100 burn-in iterations. This model includes the relational information encoded by the Approval Network.

  • NAS-F: this approach works similarly to NAS-A, but it considers the Following/Follower relationships instead of the approval ones when computing the joint probability distribution.

Supported by a detailed investigation of the main aspects in the data, the number of aspects has been set to T=10 for the iOS7 dataset and T=5 for Lone Survivor. Concerning the sentiment associated with each aspect, the SentiWordNet resource [15] has been exploited for all the investigated models. This allows us to automatically derive a dynamic and context-dependent sentiment labelling.

Results on Sentiment classification

In order to evaluate the performance of the considered models for the sentiment classification task, F1-measure (which aggregates both Precision and Recall) has been computed over all the sentiments.

Focusing on iOS7 (Figure 4a), we can note the outperforming results of NAS-A with respect to the other approaches. The first consideration relates to the direct comparison of NAS-A with the popular baseline DIC. As expected, DIC achieves a low F1-measure (0.41 on iOS7 and 0.43 on Lone Survivor). Conversely, NAS-A achieves valuable improvements, showing a macro-average F1-measure of 0.63 on iOS7 and 0.77 on Lone Survivor. Looking at the performance of NAS-F, it can be noted that although it provides a small relative gain with respect to JST, TSM and ASUM, the use of friendship relationships is not able to outperform the NAS-A model. NAS-A achieves the best results on iOS7, leading to a gain of 23 % compared to ASUM, 19 % compared to JST and NAS-F, and 28 % compared to TSM. Similar results can be observed in Figure 4b on Lone Survivor, where NAS-A achieves a gain of 29 % over ASUM, 12 % over JST and NAS-F, and 42 % over TSM.

Figure 4

Results on sentiment classification

The outperforming results of the proposed model are due to several reasons. First, NAS-A (and NAS-F) overcomes the ASUM assumption that the sentiments of all the words of a given sentence must be consistent with each other. Second, it works well with positive, negative and neutral opinions because it is able to deal with the co-occurrence of different sentiment seed words. For instance, consider the sentence “I bought the new iPhone and the screen is very nice.”, where a neutral word (‘new’) co-occurs with a positive word (‘nice’) in the same sentence. Finally, the proposed model shows that including the information encoded by the approval network helps to tackle those situations where words are not sufficient or their sentiment orientations contradict the overall message polarity. Thanks to its ability to take into account the sentiment of adjoining users, the proposed model is capable of detecting positive (or negative) messages even when none of their words are positive (or negative). This ability can be noted by looking at the examples reported in Table 7, where the tweets are correctly classified only by NAS-A.

Table 7 Examples of tweets whose sentiment is correctly captured by the NAS model

For instance, the tweet ‘iOS 7 looks like a child’s coloring book!!’ is correctly classified as negative by NAS-A, even if it does not contain negative words. Conversely, the tweet ‘If Lone Survivor didn’t change your life, you’re fucking insane’ has been classified by NAS-A as positive towards Lone Survivor, even if it contains only neutral and negative words. In these two examples, NAS-A correctly classifies them as negative/positive because their authors retweet other users who emit negative/positive tweets.

Results on Aspect Discovery

In order to automatically measure the quality of the aspects extracted by the considered models, the Topic Coherence measure [50] has been adopted. This measure depends only on the corpus, without using any external resource, and is computed from the co-document frequency of the top topical terms. Note that, given the initial number of aspects T and sentiments S, NAS-A, NAS-F, JST and ASUM produce T×S aspects. Since TSM performs the worst in sentiment classification and outputs only T aspects, its Topic Coherence has not been computed, because the measure is sensitive to the number of aspects [50]. Focusing on Figure 5, we can note that the inclusion of the network through NAS-A leads to significant improvements on both datasets. ASUM achieves the lowest performance, followed by JST and finally by NAS-A.
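
Assuming the co-document-frequency (UMass-style) formulation of Topic Coherence [50], the measure can be computed for a single topic as sketched below; the toy corpus is illustrative.

```python
import math
from itertools import combinations

def topic_coherence(top_words, documents, eps=1.0):
    """Co-document-frequency coherence of one topic (UMass-style formulation,
    assumed here to correspond to the measure of [50]): sum over ordered word
    pairs of log((D(w_i, w_j) + eps) / D(w_i)), where D counts the documents
    containing the word(s) and `top_words` is ordered by topic probability."""
    doc_sets = [set(doc) for doc in documents]
    def d(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for (wi, wj) in combinations(top_words, 2):  # wi ranked above wj
        score += math.log((d(wi, wj) + eps) / max(d(wi), 1))
    return score

docs = [["battery", "life", "short"], ["battery", "life", "drains"], ["screen", "nice"]]
print(topic_coherence(["battery", "life", "drains"], docs))  # ~0.405 on this toy corpus
```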

Figure 5

Comparison of Topic Coherence

Even if Topic Coherence is a good measure to compare the aspects extracted by different models, it does not provide a clear idea of their pros and cons. For this reason, we discuss some examples reported in Tables 8 and 9 related to two main aspects, i.e. Battery Life and Security. The aspect Security is only weakly identified by ASUM, through the words ‘privacy’ and ‘unlocked’, because the model tends to relate nouns to neutral aspects and adjectives to positive/negative aspects, making their characterisation very difficult. Moreover, inconsistent adjectives for positive and negative aspects (e.g., ‘love’ for a negative aspect) and several unrelated words are chosen by ASUM.

Table 8 Aspects related to “Battery Life” in iOS7
Table 9 Aspects related to “Security” in iOS7

Conversely, JST, NAS-F and NAS-A achieve a good balance of adjectives and nouns for positive and negative aspects, which allows us to easily characterise each aspect and simultaneously understand its perceived sentiment. Thanks to a manual investigation of the data, we have discovered that “Battery Life” is negatively perceived because of its short duration, while “Gaming” is positively perceived thanks to the introduction of controller support and kits. Unlike neutral opinions, positive and negative opinions are mostly identifiable by the presence of opinionated words (e.g., adjectives). According to this consideration, only NAS-A and NAS-F lead to the correct identification of the negative aspect Battery Life. Conversely, JST has a clearer tendency to exclude opinionated words, which generally leads the model to infer a neutral sentiment.

As a general remark, the experimental comparison suggests that the introduction of network information may not only help the sentiment classification step, but also improve the ability to identify aspects. A more important observation concerns NAS-A when compared to NAS-F. The proposed model based on approval relationships better captures the aspects underlying the microblog posts thanks to its ability to model the interest of users in the same topic.

A final remark on the proposed NAS model concerns its ability to deal with short and noisy text. The fact that social network text is composed of few words poses considerable problems when applying traditional topic/sentiment models, which typically suffer from data sparsity when estimating robust word co-occurrence statistics over short and ill-formed text. The proposed model reduces the negative impact of short and noisy text thanks to the contagion of connected users: the relational information embedded in the generative model compensates for the biased statistics obtained when texts are considered independently.

6 Conclusion

Text does not always provide explicit or sufficient information about the sentiment orientation of a short message. Early studies have tried to overcome this limitation by modelling user-user similarity through friendships. However, being friends does not necessarily mean agreeing on all issues. Supported by the theory behind two sociological processes (homophily and constructuralism), this paper has proposed to model user-user relationships through Approval Networks. To capture the real heterogeneity of social network data, a Heterogeneous Directed Approval Graph (H-DAG) has been presented to model both textual contents and user relationships. Two novel sentiment analysis models based on the H-DAG have been introduced, confirming that the inclusion of approval relationships in predictive models leads to significant improvements in effectiveness.