Motivation

User-friendly Web 2.0 technologies encourage the general public to actively participate in the creation of Web content. Blogs, social networks and message boards reach out to a global community of Web users. These online texts present personal experience and convey the sentiments and emotions of their authors. Such emotion-rich posts are known to be important in setting interaction patterns in online discussions, as emotion-rich text has a strong influence on the attitudes and behavioral intentions of the discussion participants [1]. Studies of online sentiments and opinions can help in understanding the sentiments and opinions of the public at large. Such understanding is especially important for the development of public policies whose success greatly depends on public support, e.g., education, health care, housing and infrastructure. The study of affect and social aspects in online communication is a preliminary step toward the creation of affective dialogue systems, in which text-based system–user communication is used to model, generate and present different affective and social interaction scenarios [2].

Effective implementation of healthcare policies relies on an understanding of the opinions expressed by the general public. Major healthcare initiatives, such as vaccination during pandemics and the incorporation of healthy choices into everyday lifestyles, are examples of policies that require such understanding to be implemented successfully. As online media becomes the main medium for posting and exchanging information, analysis of this online data can contribute to studies of the general public's opinions on health-related matters. Users of online communities dedicated to specific medical conditions can be exposed to materials in which about 90 % of the text is dedicated to patient experience [3]. Analysis of health information posted online contributes to identifying the sources of information, its dissemination and its possible impact on the general public [1, 35]. Although empirical evidence strongly supports the importance of emotions in health-related messages [6], there are few studies of the relationship between subjective language and online discussions of personal health [7].

We focus on sentiments in medical forum discourse. It has been shown that sentiments expressed by a forum participant affect the sentiments in messages posted by other participants in the same discussion thread [8, 9]. In this study, we aimed to identify the most common sentiments expressed in individual posts and the most common pairs and triads of sentiments appearing in forum discussions. We applied our analysis to data collected from an in vitro fertilization (IVF) medical forum. This forum is designed to bring together women who use IVF treatments in the hope of conceiving. As a result, women constitute 95 % of the forum participants and post almost 99 % of the messages, although there are occasional messages posted by men. To give a glimpse of the emotionally charged data, we provide an example of four consecutive messages from an embryo transfer discussion:

Alice:

Jane—whats going on??

Jane:

We have our appt. Wednesday!! EEE!!!

Beth:

Good luck on your transfer! Grow embies grow!!!!

Jane:

The transfer went well—my RE did it himself which was comforting. 2 embies (grade 1 but slow in development) so I am not holding my breath for a positive. This really was my worst cycle yet; it was the antagonist protocol which is supposed to be great when you are over 40 but not so much for me!!

In our sentiment analysis, we applied a twofold approach. First, our goal was to identify a set of sentiment categories that represents the full spectrum of emotions appearing in the discussions and, at the same time, is compact and distinct enough to be used in a machine learning (ML) empirical study. Next, we compared the domain-specific lexicon HealthAffect and the general sentiment lexicons SentiWordNet, MPQA, SenticNet3, SentiStrength and DepecheMood according to their ability to represent messages in sentiment classification. In those experiments, we used multi-class ML classification to automatically recognize sentiment categories in four multi-class classification problems; the messages were represented by HealthAffect and the general sentiment lexicons.

The following results were obtained. We identified the dominant sentiments as encouragement, gratitude, confusion, facts and endorsement. A total of 1438 messages were annotated manually by two annotators with strong inter-annotator agreement: Fleiss kappa = 0.737 when the posts were annotated in the context of a discussion and Fleiss kappa = 0.763 when the posts were annotated as individual entities. Our empirical evidence shows that HealthAffect provides more reliable sentiment classification than the other lexicons: messages represented by HealthAffect were classified with up to a 22 % improvement in F-score over the benchmark classification obtained with the SentiWordNet representation.

The article is organized as follows: Section “Related Work” presents relevant work in sentiment analysis, section “Data Set” introduces the data set, section “Data Annotation” describes the annotation scheme and its results, section “Correspondence Analysis for Sentiment Sequences” presents the correspondence analysis and results on sentiment sequences, section “Automated Sentiment Recognition” describes sentiment classification experiments, and section “Discussion and Future Work” discusses the results. Preliminary results of this work appeared in [10].

Related Work

Sentiment Analysis

The availability of emotion-rich text has helped to promote the study of sentiments from a boutique science into the mainstream of text data mining. Extraction and analysis of sentiments, opinions, attitudes, emotions, perceptions and intentions is one of the most asked-for types of text analysis, as was pointed out in Seth Grimes' Text Analytics Report 2014. Sentiments and opinions are analyzed in consumer-written product reviews [11], political discussions [12] and forums and blogs [13, 14]. Text analysis of user-written online messages has been motivated both by the demand for such studies and by easy access to online data [15, 16].

In sentiment analysis, ML methods, affective lexicons and natural language processing (NLP) tools are used to classify text units (e.g., words, sentences, paragraphs) into sentiment categories [17]. The choice of text unit depends on the goal of the study. Our goal is the identification of sentiments in communication units; since a message is the core unit of forum communication [14], we use the message as our text unit.

Most sentiment analysis research concentrates on the polarity of discussions, e.g., positive and negative sentiments [13, 18]. A few studies have worked on the distinct universal emotions (anger, fear, enjoyment, sadness and disgust) [19] or on dynamic, evolving sets of sentiments [20–22]. We analyzed the sentiments that appeared in forum messages and created a set of sentiment labels most appropriate for health-related online discussions.

Reliable annotation is essential for a thorough analysis of text, although human errors and bias can be introduced during the annotation process [13]. Multiple annotations of topic-specific opinions in blogs were evaluated in [23]. The authors computed agreement among seven manual annotators for five classification categories, including positive, negative and mixed opinions and non-opinionated and non-relevant categories. Annotation agreement achieved on messages gathered from a medical forum was evaluated in [24]. Multiple annotations were used to categorize tweets into positive, negative and neutral sentiments in [25]. Analysis of eight Twitter data sets released into the public domain was presented in [18]; that work also presents the STS-Gold Twitter set of positive, neutral and negative tweets, for which annotation agreement among three annotators reached a Fleiss kappa of 0.765. The merits of reader-centric and author-centric annotation models were discussed in [26]. In our current work, we apply the reader-centric annotation model and report the Fleiss kappa obtained after the evaluation of our inter-annotator agreement.

Accurate sentiment classification relies on lexical sources of semantic information. Sentiment research often uses lexicons in which words are assigned to opinion, sentiment and emotion categories. However, in independent studies [24, 27], the authors showed that the sentiment categories of SentiWordNet, WordNetAffect and the subjectivity lexicon are not fully representative of health-related emotions. As it is nearly impossible to create a lexicon for every domain, various techniques have been proposed for lexicon adaptation, e.g., a feature ensemble model that learns a new labeling function through feature reweighting [28], contextualized sentiment lexicons in which ambiguous terms are identified and linked to their corresponding polarity [29], and reevaluation of objective sentiment words from SentiWordNet to improve the performance of word-of-mouth sentiment classification [30]. We use HealthAffect, a domain-specific lexicon, to automatically classify sentiments. A preliminary, much smaller version of the lexicon was introduced in [24]. In the current work, we repopulate the lexicon and apply manual filtering to prevent over-fitting the data.

Sentiment propagation is an emerging area in sentiment analysis. Although the relationship between consecutive sentiments is a popular subject of fine-grained discourse analysis [31], it has only recently made inroads into text mining. Subjective information posted by a user may affect the subjectivity of posts written by other users [8]. Tsai et al. [32] used a two-step approach to evaluate sentiment propagation among related commonsense concepts. Correlations between emotions expressed in consecutive posts were studied in [16, 33, 34]. Until now, health-related sentiment classification has focused on individual messages. Our current work identifies the most common sentiment transitions in pairs and triads of consecutive posts. Studies of sentiment transitions are important if we want to better understand the emotional and cognitive processes of human interactions [35].

Concept-Level Sentiments

Our approach is reminiscent of concept-level sentiment analysis [36]. In the analysis of data, we retrieve and aggregate subjective information about different aspects of IVF treatment. Such information is directly linked with the basic IVF concepts and features and, thus, cannot be identified through a keyword search or the use of general lexical resources.

Another technique associated with concept-level analysis is correspondence analysis, a multivariate technique for analyzing matrices of data. Its implementation in the R programming language is described by Baayen [37]. The technique of correspondence analysis discovers whether groups of words tend to occur in the same messages as each other. Such groups are called “factors,” and they are ordered according to their importance in terms of how much of the variation between the messages they explain. The idea for such a representation comes from work by Stanley and Meyer [38], who used another matrix analysis technique called Factor Analysis to plot students’ ratings of their emotional states on various occasions on a two-dimensional graph. Stanley and Meyer call the discovered axes (and hence constructs) for representing affective experiences “affective space.” We applied correspondence analysis in our study of sentiments.

The ConceptNet knowledge base represents contextual and pragmatic information expressed in texts as a graph of concept nodes connected by twenty types of semantic relations [39]. The source has been used in several text analysis studies [32, 40]; for example, important concepts from ConceptNet were selected and redundant concepts eliminated using the Minimum Redundancy Maximum Relevance feature selection technique [40]. At the same time, large semantic sources exhibit the "curse of dimensionality": the bigger a semantic network is, the more difficult it becomes to process and to obtain the required knowledge from it. In our current study, we work with a domain-dependent sentiment lexicon, without building a semantic network.

Reproductive Technologies and Sentiments

Reproductive technologies are hotly debated in modern society. These highly spirited debates are due in part to the multitude of issues connected with the technologies. For example, the most popular reproductive technology, IVF, is linked to an uncertain chance of live birth and to discussions of the health of the babies born, ongoing pregnancies, clinical pregnancies, miscarriages, multiple pregnancies, implantation rate, cryopreservation rate, embryo quality and fertilization rate [41], as well as age, obesity, the risk of breast cancer and the overall financial costs to society [42]. The complexity of the problem causes the technology's recipients to seek information, advice and guidance not only from medical professionals but also from peers. The peer connection is increasingly made online, through social media [43].

A meta-study of 19 studies on reproductive technologies published in 1999–2009 listed several reasons for the use of medical forums: (a) information seeking, i.e., learning about the psychological, physical and social aspects of available treatments and evaluations of alternative treatments, and (b) emotional support, i.e., anonymous communication, immediate and constant community access, and easy contact with peers [43]. A survey of online infertility support groups showed that empathy and shared personal experience constituted 45.5 % of the content, gratitude 12.5 %, and recognized friendship with other members 9.9 %, whereas the provision of information and advice and requests for information or advice took up 15.9 and 6.8 %, respectively [9].

Sentiment analysis often connects its subjects with specific online media (e.g., sentiments on consumer goods are studied on Amazon.com). Health-related emotions are studied on Twitter [25, 44] and on online public forums [9, 17]. Sentic PROMs (patient-reported outcome measures) analyze semi-structured texts and aggregate the input data [45–47]. This system complements the highly structured tool used to monitor patient outcomes in cases where patients express their opinions and feelings in free text. In our work, we continue the study of online forum data. In forum discussions, patients do not restrict themselves to giving feedback about hospitals and health services but freely express their opinions, sentiments and attitudes and actively exchange them with each other. Our results can be applied to studies of patient opinions in which differentiating between subjective (seeking opinions, emotions and other private states) and non-subjective (seeking factual information) messages is not a trivial task [14].

Data Set

Forums dedicated to specific medical conditions and health-related problems promote the sharing of personal experience and disclosure of the emotional state of forum participants [2]. We collected data from the IVF.ca Web site dedicated to reproductive technologies. The Web site belongs to an infertility outreach resource community created by prospective, current and past IVF patients. The IVF.ca Web site includes the following forums: Cycle Friends, Expert Panel, Trying to Conceive, Socialize, In Our Hearts, Pregnancy, Parenting and Administration. Every forum hosts a few sub-forums, e.g., the Cycle Friends forum has six sub-forums: Introductions, IVF/FET/IUI Cycle Buddies, IVF Ages 35+, Waiting Lounge, Donor and Surrogacy Buddies and Adoption Buddies. In every sub-forum, new topics are initiated by the forum participants. Depending on the interest among participants, a different number of messages is associated with each topic; e.g., Human growth hormone and what to expect has 120 messages posted since Oct 2012, while Over 40 and pregnant or trying to be has 3455 messages posted since May 2010.

We wanted the forum to represent a variety of discussions and to contain a manageable number of topics and messages. The IVF Ages 35+ sub-forum satisfied both requirements: it had 510 topics and 16,388 messages, with an average message length of 128 words. Figure 1 illustrates the distribution of posts among the forum topics.

Fig. 1: Number of posts per topic in the IVF Ages 35+ sub-forum

Among those 510 topics, 340 contained fewer than ten posts. These short topics often consisted of one initial request and a couple of replies and were deemed too short to form a good discussion. We also excluded topics containing more than 20 posts. This left 80 topics, with an average of 17 messages per topic, for manual analysis by two annotators.

The topics usually had the following structure:

  (a) A participant started the theme with a post. The initial post usually contained some information about the participant's problem, expressed worry, concern or uncertainty, and requested help from the other forum participants.

  (b) The following posts either (i) provided the requested information by describing similar stories or knowledge about treatment procedures, drugs, doctors and clinics, or (ii) supplied moral support through compassion, encouragement, wishing all the best, good luck, etc.

  (c) The participant who started the topic often thanked the other contributors and expressed appreciation for their help and support.

We wanted to identify which sentiments prevail in the forum messages. Our goal was to identify a set of sentiment categories that represents the full spectrum of emotions appearing in the discussions and, at the same time, is compact and distinct enough to be used in an ML empirical study.

Data Annotation

Annotation of subjectivity can be centered either on the perception of a reader [20] or on the author of a text [26]. In the current work, we opted for the reader perception model and asked annotators to analyze a topic's sentiment as it was addressed to the other forum participants. The data annotation was carried out by master's students as practical work for the course Semantic Interpretation of Text. The students had already completed courses on computational linguistics and natural language processing, and most annotators already had experience in sentiment and opinion annotation. Each annotator independently annotated a set of topics, and each message was annotated by two annotators.

We used 292 randomly selected posts to verify whether the messages were self-evident for sentiment annotation or required additional context. The annotators reported that the posts were long enough to convey emotions and that in most cases there was no need for wider context.

We applied an annotation scheme that had been used successfully in [24]. In [9], the authors showed that most posts referred to sharing personal experiences, provision of information or advice, expressions of gratitude/friendship, chat, requests for information and expressions of universality (e.g., "we're all in this together"). Hypothesizing that binary sentiment categories (e.g., positive and negative polarity) would be too general and could not adequately cover the emotions expressed in health-related messages, we intended to build a set of sentiments that

  1. contains sentiment categories specific to posts from medical forums, and

  2. makes automated sentiment detection feasible and reliable.

This was the first phase of the annotation process. We used a bottom-up approach to build the set: we asked the annotators to read several topic discussions and describe the sentiments expressed by the forum participants and the sentiment propagation within these discussions.

We instructed the annotators not to mark descriptions of symptoms and diseases as subjective; in many cases, these appear in a post as objective information for other forum participants who have encountered similar issues. In such cases, only the author's sentiments toward the other participants should be taken into consideration. For example, "I have had a few days now with heartburn/reflux—could be stress, a little achy tummy/pelvic and a tired aching back. More waiting, but getting more hopeful" is a description of symptoms and should not be annotated as subjective. In contrast, "I hope your visit with us infertilies is short and sweet and you get that baby soon!!!" exposes the author's sentiment toward another person. It should be mentioned that the posts were usually long enough to express several sentiments; however, the annotators were requested to mark each message with one sentiment category.

After collecting the results of the initial annotation, we merged and summarized the annotations. This resulted in 35 sentiment types, which we arranged into three groups:

  • confusion, which included worry, concern, doubt, impatience, uncertainty, sadness, anger, embarrassment, hopelessness, dissatisfaction and dislike;

  • encouragement, which included cheering, support, hope, happiness, enthusiasm, excitement, optimism; and

  • gratitude, which included thankfulness.

A special group of sentiments comprised expressions of compassion, sorrow and pity. According to the WordNetAffect classification, these sentiments should be considered negative. However, in the context of health discussions, these emotional expressions appeared in conjunction with moral support and encouragement; hence, we treated them as part of encouragement.

Not all posts had emotional content. Posts presenting only factual information were marked as facts. Some posts contained factual information together with strong emotional expressions; those expressions almost always conveyed encouragement ("hope this helps," "I wish you all the best," "good luck"). Such posts were labeled endorsement. Note that the final categories did not include openly negative sentiments. We considered confusion a non-positive label, encouragement and gratitude positive labels, and facts and endorsement neutral.

Posts that both annotators labeled with the same label were assigned to that category; 1256 posts were assigned a class label in this way. Posts labeled with two different sentiment labels were marked as ambiguous; 182 posts fell into this category.

We evaluated agreement between the annotators using Fleiss kappa [48], a measure of agreement for multi-class manual labeling:

$$\text{Fleiss kappa} = (P - P_{\text{class}})/(1 - P_{\text{class}})$$

where $P$ is the average observed agreement per class and $P_{\text{class}}$ is the average agreement per class that would be obtained by chance.
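For reference, the computation can be sketched in a few lines of Python. This is the standard Fleiss kappa formulation over a post-by-category count matrix; the toy ratings below are hypothetical, and we assume every post is rated by the same number of annotators (here, two):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Standard Fleiss kappa.

    counts: (n_items, n_categories) matrix; counts[i, j] is the number
    of annotators who assigned category j to item i. Every item must be
    rated by the same number of annotators.
    """
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]  # annotators per item (constant)
    # Observed agreement per item, averaged over all items.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_obs = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_chance = np.square(p_j).sum()
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical toy example: 4 posts, 2 annotators, 3 categories
# (encouragement, gratitude, confusion).
ratings = np.array([[2, 0, 0],
                    [1, 1, 0],
                    [0, 2, 0],
                    [0, 0, 2]])
print(round(fleiss_kappa(ratings), 3))
```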

Despite the challenging data, we obtained Fleiss kappa = 0.737, which indicates strong agreement between annotators [23]. This value was obtained on the 80 annotated topics. Agreement for the randomly extracted posts was calculated separately in order to verify whether annotating individual posts was more difficult than annotating post sequences. Contrary to our expectations, the resulting Fleiss kappa = 0.763 was slightly higher than when the posts were annotated in the context of discussions. The final distribution of posts among sentiment classes is presented in Table 1.

Table 1 Class distribution of the IVF posts

Correspondence Analysis for Sentiment Sequences

We applied correspondence analysis [37] to recognize affective groups among the most frequent words in the data. We used the messages from the ART_over_35 topic, omitting only the very short ones. The messages are numbered in the order in which they appear in the discussion. As input, we produced a matrix whose columns corresponded to the 500 most frequent words in the ART_over_35 text collection and whose rows each corresponded to one individual message. Since we were mainly interested in sentiment words, this original matrix was reduced by retaining only the columns corresponding to the 41 words conveying sentiments, such as "best," "better" and "congratulations." Of these, 28 words were indicative of sentiment categories and appear in HealthAffect (e.g., able, against, interested, recommended, risk), while 13 words were not indicative of specific categories and thus do not appear in HealthAffect (e.g., "avoid," "luxury").

As described in section "Concept-Level Sentiments," correspondence analysis groups co-occurring words into "factors," ordered by how much of the variation between the messages they explain. The graph below (Fig. 2) was produced by correspondence analysis and shows to what extent each word and each message is related to the two main factors. Only those words significantly associated with the factors (p < 0.1) are shown. The group of words making up the first factor explains 24.5 % of the variation between the posts, while the group making up the second factor explains 12.3 %. The identified words occur together in three main groups: concern, support and good will, and desire to know. These groups form the affective author-centric space of the topic and can be considered representative of the affective space of the IVF discussion [38].
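The study used the R routines described by Baayen [37]. Purely as an illustration of the underlying computation, the following numpy sketch performs correspondence analysis via SVD of the standardized residual matrix; the input matrix is a random hypothetical stand-in for the message-by-word counts, and messages with no retained words are assumed to have been dropped:

```python
import numpy as np

def correspondence_analysis(N: np.ndarray):
    """Minimal correspondence analysis of a message-by-word count matrix.

    Returns row (message) and column (word) coordinates on the principal
    axes ("factors") plus the share of variation each factor explains.
    """
    P = N / N.sum()                    # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals from the independence model.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]   # principal coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    explained = s**2 / np.sum(s**2)    # inertia per factor
    return row_coords, col_coords, explained

# Hypothetical stand-in for the 35-message-by-41-word count matrix.
N = np.random.default_rng(0).poisson(1.0, size=(35, 41)) + 0.0
rows, cols, explained = correspondence_analysis(N)
print("Variation explained by Factors 1 and 2:", explained[:2])
```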

Fig. 2: Correspondence analysis of sentiments

The graph shows that the top left quadrant contains words which occur together in messages expressing concern for the future, as in "I don't feel able to handle the negative pressure." The top right quadrant contains words which appear in messages of support and good will, such as "successful," "luck" and "good." Finally, the lower left quadrant contains words found in messages expressing a desire to know: "interested," "confusion," "success" and "like" (as in "I'd like to know the chances of success"). Most of the early messages (1 to 17) are in the topic-opening "desire to know" quadrant, apart from a short exchange of anxious messages (10, 12 and 13). Then comes a series of encouraging messages (18 to 27), while the last few messages are more neutral, scoring about 0 on Factor 2 and slightly negative on Factor 1. Although this is not apparent from the graph, they correspond to messages where people looked back on their own experiences of IVF in a neutral, unemotional way.

Note that there are no significant words in the fourth quadrant: we show only the words significantly associated with the factors (p < 0.1), and none of these fall there. Although messages 19 and 22 are in the fourth quadrant, what matters most is that they score highly on Factor 1 (i.e., they lie toward the right-hand side of the graph). The last few messages (29–35) are not in the fourth quadrant but appear about halfway up (very close to the horizontal axis), mostly on the left. In our case, the poles of the affective space were positive–negative for Factor 1 and question–response for Factor 2.

To further identify sentiments that reinforced themselves and sentiments that were likely to trigger changes, we computed the distribution of sentiment pairs and triads in consecutive messages. We found that the most frequent sequences consisted mostly of facts and/or encouragement: 39.5 % in total. These two categories were the most likely to propagate through subsequent messages. The most frequent change was from endorsement to facts (6.1 % in total). Approximately 10 % of sentiment pairs were facts and/or encouragement followed by gratitude. Confusion was followed by facts or encouragement in 80 % of cases. The most frequent triad containing confusion was confusion, facts, facts. This sentiment transition shows a high level of support among the forum participants. Other, less frequent sequences appeared when a new participant added her post to the flow. Tables 2 and 3 list the results. Figure 3 shows the most frequent pairs of sentiments: the node size corresponds to the proportion of the sentiment in the data, and the line weight to the proportion of the transition.
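Such pair and triad distributions can be computed by a simple sliding count over the per-thread label sequences; a minimal Python sketch with hypothetical thread data:

```python
from collections import Counter

def sentiment_ngrams(threads, n=2):
    """Count n-grams (pairs, triads) of sentiment labels over
    consecutive posts, within each discussion thread separately."""
    counts = Counter()
    for labels in threads:              # one label sequence per thread
        for i in range(len(labels) - n + 1):
            counts[tuple(labels[i:i + n])] += 1
    return counts

# Hypothetical threads: sentiment labels of consecutive posts.
threads = [["confusion", "facts", "facts", "encouragement", "gratitude"],
           ["facts", "endorsement", "facts", "gratitude"]]
pairs = sentiment_ngrams(threads, n=2)
triads = sentiment_ngrams(threads, n=3)
total = sum(pairs.values())
for seq, cnt in pairs.most_common(3):
    print(seq, f"{100 * cnt / total:.1f} %")
```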

Table 2 Most frequent sequences of two sentiments and their occurrence in the data
Table 3 Most frequent triads of sentiments and their occurrences in the data
Fig. 3: Most frequent pairs of sentiments

Our next goal was to find a method that reliably identifies sentiments in a large number of forum texts. The method had to be general enough to accommodate the diversity of natural language expressions appearing in the forum data and exhaustive enough to recognize the opinions expressed toward IVF treatment. We also wanted the method to be based on a compact set of features, thus avoiding the pitfalls of a high-dimensional feature space in text representation.

In this work, we concentrate on the classification of individual messages. Sentiment classification of pairs and triads of messages is left for future work.

Automated Sentiment Recognition

The first stage of our study established that the forum messages belong to five sentimental and neutral categories. For automated sentiment classification, we tested the multi-categorical SentiWordNet [49] and the MPQA subjectivity lexicon [50], which recognizes only positive and negative polarity of its terms. We also tested several recently released lexicons with sentiment information: SentiStrength, sentiment analysis software [51] that contains a list of English words expressing emotions; SenticNet 3 [21], a knowledge base containing information about the semantics and sentics associated with multi-word expressions; and DepecheMood [22], which contains more than 37,500 terms assigned numerical values representing degrees of eight sentiment categories: afraid, amused, angry, annoyed, dont_care, happy, inspired and sad. For every lexicon mentioned here, we created a set of features to represent our data in the ML experiments. We used a different procedure for each set, based on the characteristics of the lexicon:

  1) SentiWordNet was created by assigning three sentiment scores to each synset of WordNet: positivity, negativity and objectivity. Every synset in SentiWordNet carries all three scores simultaneously. Thus, there were positive terms, with negative and objective scores equal to zero and a positive score greater than zero; negative terms, with positive and objective scores equal to zero and a negative score greater than zero; objective terms, with positive and negative scores equal to zero and an objective score greater than zero; neutral terms, with all scores equal to zero; and ambiguous terms, with several scores greater than zero. We selected only the positive and negative synsets and searched for the presence of each of their terms in our texts. Only unambiguously positive or negative terms present in the texts were used as features in the experiments (see the selection sketch after Table 4). We also used SentiWordNet as the benchmark representation for the comparison of empirical results.

  2) The MPQA lexicon assigns opinion clues to words; the initial list of subjectivity clues from [52] was expanded in [50] with positive and negative word lists from the General Inquirer. Each clue has the following structure: type=strongsubj len=1 word1=abuse pos1=verb stemmed1=y priorpolarity=negative, where priorpolarity can be positive, negative, both or neutral. As in the previous case, we selected only words with positive or negative polarity clues and matched them against the words appearing in our texts.

  3) SenticNet is a publicly available semantic resource for concept-level sentiment analysis. It associates polarity scores with ConceptNet concepts, which are represented as words and multi-word expressions. The version we downloaded contained an XML file with more than 13,000 concepts, each associated with five characteristics: pleasantness, attention, sensitivity, aptitude and polarity. We used only the polarity attribute and extracted the terms with nonzero polarity that were present in our texts.

  4) SentiStrength is sentiment analysis (opinion mining) software whose simplified version is free for academic research. The downloadable version contains Java code and lexicons in an editable textual format. We used the lexicon with a polarity score associated with each word. Some terms in SentiStrength are stemmed; when comparing these terms with the words from our texts, we searched for all words matching the stem. Thus, one stem could correspond to several words in our list of features.

  5) DepecheMood is a high-coverage lexicon of approximately 37,500 terms annotated with emotion scores. The lexicon was crowdsourced from rappler.com news articles: Rappler's mood meter, a small interface, offers readers the opportunity to click on the emotion that a given news article made them feel. Numerous votes were collected, and a document-by-emotion matrix was built and then transformed into a word–emotion matrix, e.g., concerned - 0.129322883 AFRAID, 0.100615215 AMUSED, 0.170474974 ANGRY, 0.161903853 ANNOYED, 0.120271172 DONT_CARE, 0.108064155 HAPPY, 0.098734566 INSPIRED and 0.110613182 SAD.

  6) We also used the domain-specific lexicon HealthAffect introduced in [24]. To build the lexicon, we adapted the pointwise mutual information (PMI) approach [53]:

$$\text{PMI}(\text{word}_1, \text{word}_2) = \log_{2}\left(\frac{p(\text{word}_1 \,\&\, \text{word}_2)}{p(\text{word}_1)\,p(\text{word}_2)}\right)$$

The initial candidates consisted of the unigrams, bigrams and trigrams of words with frequency ≥ 5 appearing in unambiguously annotated posts (i.e., we omitted the posts marked as ambiguous). This formed the list of candidates for inclusion in the HealthAffect lexicon. Next, for each class and each candidate, we calculated PMI(candidate, class) as

$$\text{PMI}(\text{candidate}, \text{class}) = \log_{2}\left(\frac{p(\text{candidate in class})}{p(\text{candidate})\,p(\text{class})}\right)$$

Next, we calculated semantic orientation (SO) for each candidate and for each class as

$$\text{SO}(\text{candidate}, \text{class}) = \text{PMI}(\text{candidate}, \text{class}) - \sum_{\text{other\_class}} \text{PMI}(\text{candidate}, \text{other\_class})$$

where other_class ranges over all the classes except the class for which SO is calculated. After all possible SO values were computed, each HealthAffect candidate was assigned the class corresponding to its maximum SO and was considered an indicator of that class. To avoid over-fitting, we manually reviewed the candidates and filtered out conversation-specific terms (i.e., personal and brand names, geolocations, dates) and non-relevant elements such as stop words and their combinations (since_then, that_was_the, to_do_it, so_you). Table 4 presents all the described lexicons.
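A minimal Python sketch of this candidate-scoring procedure follows. The input format is hypothetical, and the smoothing constant is our own addition to avoid a zero-probability logarithm; it is not part of the original procedure:

```python
import math
from collections import Counter

def build_healthaffect(posts, min_freq=5):
    """Sketch of the PMI / semantic-orientation candidate selection.

    posts: iterable of (ngrams, class_label) pairs, where ngrams lists
    the unigrams, bigrams and trigrams of one unambiguously annotated
    post. Returns {candidate: best_class} before the manual filtering.
    """
    joint, cand, cls = Counter(), Counter(), Counter()
    total = 0
    for ngrams, label in posts:
        for g in ngrams:
            joint[(g, label)] += 1
            cand[g] += 1
            cls[label] += 1
            total += 1
    classes = list(cls)

    def pmi(g, c):
        # Add-0.5 smoothing avoids log(0) for unseen (candidate, class)
        # pairs; an implementation choice, not part of the paper.
        p_joint = (joint[(g, c)] + 0.5) / (total + 0.5 * len(classes))
        return math.log2(p_joint / ((cand[g] / total) * (cls[c] / total)))

    lexicon = {}
    for g, freq in cand.items():
        if freq < min_freq:        # frequency threshold from the paper
            continue
        so = {c: pmi(g, c) - sum(pmi(g, o) for o in classes if o != c)
              for c in classes}
        lexicon[g] = max(so, key=so.get)   # class with maximum SO
    return lexicon
```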

Table 4 Total number of terms and the number of extracted terms for the six lexicons
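For the general lexicons, feature extraction reduces to term selection and matching. As an illustration of the unambiguous-polarity filter described in item 1 above, with simplified hypothetical score entries standing in for SentiWordNet synsets:

```python
def select_polar_terms(entries):
    """Keep only unambiguously positive or negative lexicon terms.

    entries: {term: (pos_score, neg_score, obj_score)} -- a simplified,
    hypothetical stand-in for SentiWordNet synset scores.
    """
    selected = {}
    for term, (pos, neg, obj) in entries.items():
        if pos > 0 and neg == 0 and obj == 0:
            selected[term] = "positive"
        elif neg > 0 and pos == 0 and obj == 0:
            selected[term] = "negative"
        # ambiguous, neutral and objective terms are skipped
    return selected

print(select_polar_terms({"good": (0.75, 0.0, 0.0),
                          "risk": (0.0, 0.5, 0.0),
                          "table": (0.0, 0.0, 1.0),
                          "mixed": (0.25, 0.25, 0.5)}))
```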

In the ML experiments that follow, the extracted terms are used as features to represent the messages. Classification performance was evaluated on four multi-class classification problems:

  • 6-class classification where all 1438 posts were classified into 6 classes, including ambiguous.

  • 5-class classification where 1269 unambiguous posts were classified into 5 classes.

  • 4-class classification where all 1269 unambiguous posts were classified into encouragement, gratitude, confusion and neutral (i.e., facts and endorsement).

  • 3-class classification of 1269 unambiguous posts into positive (encouragement, gratitude), negative (confusion) and neutral (facts, endorsement).

As is common in multi-class classification problems, the sentiment categories were unequally represented in the data, e.g., 34 % for the largest category versus approximately 10 % for the small categories in the 6-class and 5-class problems. We considered this distribution not skewed enough to invoke the undersampling and oversampling techniques used on more imbalanced data [54]. Although in the 4-class and 3-class problems the imbalance increased to 52 % for the largest category versus 10.3 % for the smallest, we opted to keep the same learning setting for a direct comparison of the learning results.

We applied Naïve Bayes (NB), NBText (DMNBText), NBMultinomial, SVM, Decision Trees and KNN from the WEKA toolkit, and considered the number of individual posts sufficient for tenfold cross-validation. To select the best classifier, we used standard metrics of text classification performance, computing multi-class versions of Precision (P), Recall (R), balanced F-score (F) and Area Under the Curve (AUC):

$$\begin{aligned} {\text{Precision}} & = {\text{tp}}/({\text{tp}} + {\text{fp}}) \\ {\text{Recall}} & = {\text{tp}}/({\text{tp}} + {\text{fn}}) \\ F{\text{score}} & = 2{\text{tp}}/(2{\text{tp}} + {\text{fn}} + {\text{fp}}) \\ {\text{AUC}} & = (1/2)\left( {{\text{tp}}/({\text{tp}} + {\text{fn}}) + {\text{tn}}/({\text{tn}} + {\text{fp}})} \right) \\ \end{aligned}$$

where tp denotes correctly recognized positive examples, tn correctly recognized negative examples, fp negative examples recognized as positive, and fn positive examples recognized as negative. Although the Matthews correlation coefficient [55] can work well in multi-class optimization, we opted for the F-score and AUC, as these are the more commonly used measures in our discipline.
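These definitions transcribe directly into code; a small helper computing the per-class values from one-vs-rest contingency counts (the example counts are hypothetical):

```python
def per_class_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Per-class P, R, F and AUC from one-vs-rest counts, as defined above."""
    return {
        "P": tp / (tp + fp),
        "R": tp / (tp + fn),
        "F": 2 * tp / (2 * tp + fn + fp),
        "AUC": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

# Hypothetical counts for one class, e.g., gratitude vs. the rest.
print(per_class_metrics(tp=120, tn=1000, fp=30, fn=40))
```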

We ran tenfold cross-validation for each classifier, computed the measures listed above and assessed classification by F-score. WEKA computes the performance measures for each class individually and reports the weighted average of the measures for the overall results; weights are assigned according to the number of instances with each class label. The best F-score and the corresponding other measures for each class are reported in Tables 5, 6, 7 and 8.
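Our experiments were run in WEKA. Purely as an illustration of the setup (features restricted to a lexicon's terms, a multinomial Naive Bayes learner, tenfold cross-validation and weighted averaging), here is a hypothetical scikit-learn sketch with toy stand-in data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data; in the study, the matrix is built from
# the 1438 annotated posts and the term lists of Table 4.
posts = ["good luck with your transfer",
         "thank you all so much",
         "my appt is on wednesday"] * 10
labels = ["encouragement", "gratitude", "facts"] * 10
lexicon_terms = ["good luck", "thank", "hope", "appt", "wednesday"]

# Features are restricted to the lexicon's terms; ngram_range (1, 3)
# covers the multi-word entries some lexicons contain.
vectorizer = CountVectorizer(vocabulary=lexicon_terms, ngram_range=(1, 3))
X = vectorizer.fit_transform(posts)

# MultinomialNB stands in for WEKA's NBMultinomial; tenfold CV as above.
pred = cross_val_predict(MultinomialNB(), X, labels, cv=10)
print(f1_score(labels, pred, average="weighted"))  # weighted F-score
print(classification_report(labels, pred))         # per-class P, R, F
```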

Table 5 Best F-score and corresponding precision, recall and AUC of each class for the 6-class problem
Table 6 Best F-score and corresponding precision, recall and AUC of each class for the 5-class problem
Table 7 Best F-score and corresponding precision, recall and AUC of each class for the 4-class problem
Table 8 Best F-score and corresponding precision, recall and AUC of each class for the 3-class problem

The results reported above show considerable consistency: the DMNBText and NBMultinomial algorithms outperformed the other algorithms in sentiment classification, with the exception of endorsement classification in the 6-class and 5-class problems, where SVM was the best; among the lexicons, HealthAffect and DepecheMood provided the best classification of individual classes:

  • The highest precision occurred for gratitude/positive, except in the 4-class problem, where it was the second best. When misclassified, gratitude was commonly labeled as encouragement. Posts in the gratitude class tend to be the shortest and to contain only words of gratitude and appreciation of others' help; as they usually carry no further information, there was less chance of their being misclassified.

  • The highest recall occurred for facts/neutral, the biggest class in the data, in all four problems. However, precision for this class was uneven and depended on the structure of the other classes.

The best overall results appear in Tables 9, 10, 11 and 12. We report DMNBText and NBMultinomial, as they achieved the best and second-best results. To put the empirical evidence in perspective, we used the majority-class baseline and designated SentiWordNet as the benchmark representation; SentiWordNet is commonly used in other studies, making future comparison of the results feasible. In the tables, the best results for each metric are in bold, the second-best results in bold italic, and the benchmark results in italic.

Table 9 Classification results for 6 classes, the baseline F-score = 0.171
Table 10 Classification results for 5 classes, the baseline F-score = 0.215
Table 11 Classification results for 4 classes, the baseline F-score = 0.353
Table 12 Classification results for 3 classes, the baseline F-score = 0.353

The overall classification results improved as we decreased the number of sentiment categories, since uncertainty was reduced for the algorithms. The F-score obtained on the HealthAffect features was the best in all the experiments. At the same time, the results provided by DepecheMood were better than those provided by the remaining lexicons. We hypothesize that the critical characteristic of DepecheMood is its ability to recognize several sentiments, not only positive and negative ones.

Discussion and Future Work

We have presented the results of sentiment recognition in messages posted on a medical forum. Sentiment analysis of online medical discussions differs considerably from polarity studies of consumer-written product reviews, financial blogs and political discussions. While in many cases positive and negative sentiment categories are sufficient, such a dichotomy is not adequate for medical forums. We formulated our medical sentiment analysis as a multi-class classification problem in which posts were classified into encouragement, gratitude, confusion, facts and endorsement. We ran four multi-class sentiment classification problems on which we compared the performance of ML algorithms and the ability of sentiment lexicons to represent the data. We have shown that Naïve Bayes Text and Naïve Bayes Multinomial provide reliable sentiment classification both for each class individually and overall. In all four problems, the domain-based HealthAffect provided a higher F-score than DepecheMood, SentiWordNet, MPQA, SenticNet3 and SentiStrength, while DepecheMood provided a higher F-score than the other general sentiment lexicons.

Although sentiment annotation is highly subjective, we obtained strong inter-annotator agreement between two independent annotators (Fleiss kappa = 0.737 for posts annotated in the context of discussions and Fleiss kappa = 0.763 for posts annotated as separate instances). These kappa values demonstrate an adequate selection of sentiment classes and appropriate annotation guidelines. However, many posts contained more than one sentiment, in most cases mixed with some factual information. Possible solutions would be (a) to allow multiple annotations for each post and (b) to annotate every sentence of a post.

In the current work, we identified message sequences in order to reveal patterns of sentiment interaction. Manual analysis of a data sample showed that topics contained coherent discourse; some unexpected shifts in the discourse flow were introduced by new participants joining the discussion. In future work, we may include information about a post's author in the sentiment interaction analysis. This information is also important for the analysis of influence, when one participant answers another directly, in many cases quoting the post being answered. Identifying sentiment propagation among related semantic concepts is another avenue for future work.

We plan to use the results obtained in this study for the analysis of discussions related to other highly debated healthcare policies. One future possibility is to construct a Markov model of the sentiment sequences; however, any online discussion contains random shifts and alternations in discourse that complicate the application of such a model.

In the future, we aim to annotate more text, enhance and refine HealthAffect, and use it to achieve reliable automated sentiment recognition across a wide spectrum of sentiments related to healthcare issues.