1 Introduction

Research on medical concepts increasingly depends on concept classification and sentiment analysis. Progress is hampered by the absence of domain-specific lexicons and by the small number of researchers working in this area. A further difficulty is extracting semantic relations and knowledge-dependent features from the healthcare domain, chiefly because general-purpose medical lexicons do not record characteristics such as category and sentiment. Over the years, experts have built annotated resources such as GENIA and PennBioIE to address these issues; the primary need is to build, or rebuild, structured and unstructured corpus versions. Ontology-based methods have also been combined with linguistic and machine learning (ML) techniques [1] to extract healthcare concepts together with their syntactic and semantic characteristics [2]. Recent work has developed two systems for extracting semantic relations from the healthcare domain: the first performs tokenization and assigns concepts to groups; the second recognizes the sentiments of healthcare concepts and their contexts. A healthcare concept is a word or phrase whose entities, knowledge, and data belong to healthcare attributes. A recognized context is of two types: (i) medical and (ii) non-medical. Negation handling and the removal of stop words or sentences are applied when identifying the context. Consider two examples, "regular headache" and "uncontrolled jerking": each is labelled medical or non-medical depending on whether a healthcare concept is present. A headache can be a symptom of early-stage cancer, so it is a medical context accompanied by medical concepts. In our work, every word or phrase of the corpus is treated as a context; a sentence such as "Orange is good or bad" is a non-medical context because it contains no medical concepts.

Categorization and sentiment recognition systems are used to extract concepts and their contexts. In the categorization system, the extracted concepts are divided into five classes: (i) disease, (ii) symptoms, (iii) drugs, (iv) human anatomy, and (v) miscellaneous medical terms (unidentified concepts, abbreviated MMT in the remainder of the paper). "Headache", for example, belongs to the disease class. Healthcare researchers defined these five categories in the corpora according to the utterance and the first occurrence of each extracted concept. Every class has its own concepts, which determine the overall classification of a context. Eleven pairwise category combinations of the healthcare domain are recognized, such as disease-symptom and disease-drug. Based on both concepts and contexts, the sentiment recognition model [3,4,5] is augmented with sense-based information; only positive and negative sentiments are considered. For example, "There is something wonderful about being pregnant" carries a positive sentiment. The output of sentiment recognition differs across concept classes: human anatomy concepts are treated as neutral, while symptom concepts can be positive or negative. These models are evaluated against an earlier lexicon, WordNet of Medical Events (WME), which is used to extract healthcare concepts from contexts. WME assigns linguistic and sentiment characteristics to healthcare concepts [6] and has two versions: WME 1.0 [7] and WME 2.0 [8]. WME 1.0 covers 6415 healthcare concepts with linguistic features such as POS, gloss with polarity score, and sentiment, but it cannot supply sentiment-oriented concepts with related knowledge-based data. WME 2.0 therefore extends coverage to 10,186 concepts and adds knowledge-based features such as affinity score, similar sentiment words (SSW), and gravity score. A hybrid model combines the linguistic features of WME 2.0 with a machine learning prototype, adding features such as negation [9], uni-grams, and bi-grams. Two classifiers, (i) Naive Bayes and (ii) Logistic Regression [13], achieve average F-measures of 0.81 and 0.86 for assigning categories to healthcare concepts and contexts in the categorization system. Using WME 2.0, the sentiment recognition system was evaluated with Naive Bayes and the support-vector-based Sequential Minimal Optimization (SMO) classifier, attaining average F-measures of 0.91 and 0.81 for recognizing the sentiments of healthcare concepts and contexts. While the uni-gram and bi-gram features identify the categories of the medical domain, the negation feature identifies the prior sentiments of medical concepts [10,11,12].

The rest of the paper is organized as follows: Sect. 2 discusses related work and literature on sentiment analysis. Section 3 states the problem and outlines our solution. The methodologies used in the proposed approach are described in Sect. 4. Experimentation and results are discussed in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related Study and Literature Survey

This section analyzes and describes the existing approaches built for analyzing the sentiment of patients affected by acute diseases.

Many researchers are examining how social media influences public perception of medical care. Text mining has been applied (Ficek and Kencl [14], Rahnama [15]) and plays a significant role in processing unstructured data at scale using Apache Spark. Baltas and Tsakalidis [16] introduced a Twitter sentiment analysis with binary and ternary classification. Oneto et al. [17] implemented a conventional extreme learning machine on a Spark cluster; combining Spark with deep learning yields higher performance than other Spark models. Chen et al. [18] introduced a deep learning framework for mobile big data analytics using the Apache Spark model. Nodarakis et al. [19] developed sentiment analysis over large-scale data with a Spark architecture.

To extract sentiments from HPV-vaccine-related tweets, Du et al. [20] applied the best-performing ML system. A ranking classifier combined with SVM was used, and 6000 tweets were annotated manually; the system outperformed the baselines with a best F-measure of 0.6732. Medical sentiment analysis is an emerging technology: Denecke and Nejdl [21] introduced a healthcare ontology to evaluate the factual level of healthcare texts, which differs from earlier sentiment analysis systems. Sentiment analysis is performed either by rules or by ML methods, and more works rely on ML methods than on rule-based methods.

To determine the polarity of patient data, Xia et al. [22] proposed a multi-step opinion classification. To measure the quality of healthcare, Cambria et al. [23] designed a framework combining Sentic PROMs with a sentiment analysis system. De la Torre-Diez et al. [24] classified diseases such as breast cancer, colorectal cancer, and diabetes. For social-media-based online cancer support groups, Portier et al. [25] applied sentiment analysis methods to detect negative sentiments and bad moods. Crannell et al. [26] subsequently examined the emotions of such groups, whose members are mentally supported by fellow patients. Chen and Zeng [27] extracted online e-liquid reviews by e-liquid characteristics to separate polarity features and performed sentiment analysis on large online e-liquid websites.

Ozcift and Gulten [28] studied ML algorithms for healthcare diagnosis. To evaluate classification performance, they combined ML classifiers with the CFS algorithm; the three healthcare datasets yield accuracies of 74.5%, 81%, and 87.2%, respectively, compared against basic classifiers. Chen et al. [29, 30] proposed the CNN-MDRP model to forecast disease complications from structured and unstructured data; it reaches 94.8% accuracy with a fast convergence speed on real-life data. Lu [33] introduced a concept-recognition method based on data categorization, performing feature-based categorization on data gathered from internet medical groups using classification models such as C4.5, SVM, and Naive Bayes; it achieves better classification results than the other methods.

Chen et al. [29, 30] proposed a distinctive model for refining word-level sentiment analysis that outperforms eleven older methods. Lin et al. [31] studied TCM (traditional Chinese medicine) clinical documents, obtaining a multi-feature fusion model by integrating various features using a weighted LDA topic model; the work proved effective with a high categorization rate and provides strong support for TCM clinical science. Jonnalagadda et al. [32] conducted research on recognizing decision experts from 178,527 news research articles, achieving 88.5% efficiency on 734,024 corpus samples. Monogram et al. [34] introduced a deep learning model with multiple kernel learning for cardiovascular disease. Minarro-Gimenez et al. applied neural language models such as CBOW and skip-gram, using skip-grams for the first time on the PubMed corpus and PubMed text articles. Th et al. [36] ran skip-gram and CBOW on 1.25M PubMed research papers to evaluate word associations between term pairs. For biomedical NLP, Chiu et al. [37] produced high-quality word embeddings on two different corpora, demonstrating that the skip-gram model surpasses the CBOW model. Spinczyk et al. [38] offered a rule-based method for analyzing the emotions of patients suffering from anorexia nervosa; emotional words are recognized from the record, and a group-of-phrases model assists during healing.

3 Problem Statement and Solution

From the literature survey, we identified the difficulties involved in analyzing the mindset of patients affected by acute diseases. Sentiment extraction involves many parameters, and analyzing the mindsets of diverse people is very challenging. Hence, we propose a practical framework to analyze the sentiment of people affected by acute diseases. The framework uses five widely used machine learning classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-nearest neighbor (KNN). Our proposed framework includes:

  • Initially, we pre-process the dataset: conversion to lowercase, elimination of special characters, elimination of stop words, conversion of numbers to words, and stemming and lemmatization. This makes the dataset accurate and crisp.

  • After pre-processing, N-gram tokenization is performed over the dataset.

  • Then a polarity score is assigned to each extracted review and the overall polarity score is calculated.

  • After combining the review data, we apply a probabilistic LDA to the resultant dataset.

  • Then the primary machine learning classifiers, Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-nearest neighbour (KNN), are used to classify the sentiment as positive, negative, or neutral.

4 Methodologies

This section describes all the methodologies used in our proposed work and details their functionality.

4.1 Data Pre-processing

Data pre-processing is the primary task of any data classification process. We use techniques including lemmatization and stop word removal, which are crucial and widely generalized. Five major pre-processing techniques are applied to our socio-medical dataset. Pre-processing improves the quality of the classification process and makes the features more robust.

4.1.1 Conversion of Lowercase

Initially, we apply lowercase conversion. The entire dataset is scanned sequentially for uppercase words, and each one found is converted into the corresponding lowercase letters using the Python library NumPy.

4.1.2 Eliminate Special Characters

This is the second step of data pre-processing. It eliminates special characters such as *, %, $, @, and # from the dataset. It takes the uppercase-free dataset as input, i.e., it starts after step 1 finishes.

4.1.3 Eliminate Stop Words

Stop words occur frequently in text and convey no meaningful information. We eliminate them using the NLTK library. Figure 1 gives a clear view of the workflow.

Fig. 1 Proposed workflow

4.1.4 Conversion of Number to Word

This step converts numbers to words using the Python library num2words (for instance, 7 is converted into "seven").

4.1.5 Stemming and Lemmatization

This is the final step of pre-processing. Stemming removes the suffix or prefix from a given word, reducing it to its stem or root and thereby reducing word complexity; we employ the Porter stemmer for this. Lemmatization reduces a word to a well-founded dictionary form.
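As an illustration, the five pre-processing steps above can be combined into a single routine. The sketch below is a minimal example under stated assumptions: it presumes the NLTK stopword and WordNet data are downloaded and the num2words package is installed, and the sample sentence is ours, not from the dataset.

import re
from nltk.corpus import stopwords                        # needs nltk "stopwords" data
from nltk.stem import PorterStemmer, WordNetLemmatizer   # needs nltk "wordnet" data
from num2words import num2words

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                             # 4.1.1 lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)                        # 4.1.2 special characters
    tokens = [t for t in text.split() if t not in stop_words]       # 4.1.3 stop words
    tokens = [num2words(int(t)) if t.isdigit() else t
              for t in tokens]                                      # 4.1.4 numbers to words
    return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]  # 4.1.5 stem + lemmatize

print(preprocess("The patient reported 7 severe headaches!"))
# e.g. ['patient', 'report', 'seven', 'sever', 'headach']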

4.2 N-gram Tokenization

N-gram tokenization [39,40,41] operates on collections of co-occurring tokens within a frame and is used to derive further n-grams. After pre-processing, it divides the text into different tokens for the purpose of feature extraction.

Based on the pre-processed data, the N-gram dictionary is annotated manually; words like 'a', 'and', 'the', and 'there' carry no critical information, so they are removed from the review sentences. Opinion words are separated from the N-gram dictionary through sentence-level annotation and summarization. The summary of every sentence, along with its frequency, is shown in Fig. 2. Using the summarized characteristics, the score of every aspect was evaluated to predict the number of positive, negative, and neutral words, as shown in Table 1; this is done after the manual annotation and summarization. Scores are calculated for positive and negative sentiments individually. The process has three steps: (i) recognizing the opinion terms, (ii) feature vectorization, and (iii) vector transformation. The feature vectors are computed with Term Frequency-Inverse Document Frequency (TF-IDF) [42], which assigns a weight to every feature vector. The importance of a feature vector is calculated by

$$TF\text{-}IDF\left(x, y, Y\right)=TF\left(x, y\right)\cdot IDF\left(x, Y\right)$$
(1)
$$TF=TF\left(x, y\right)$$
(2)
$$IDF\left(x, Y\right)=\log\frac{N}{\left|\left\{y\in Y: x\in y\right\}\right|}$$
(3)

where \(TF\left(x, y\right)\) is the number of times the term x occurs in document y, N is the total number of documents in the corpus, and \(\left|\left\{y\in Y: x\in y\right\}\right|\) is the number of documents in Y that contain the term x.
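A minimal sketch of this TF-IDF weighting with scikit-learn's TfidfVectorizer follows; note that sklearn uses a smoothed IDF rather than the bare ratio of Eq. (3), and the three toy reviews are placeholders, not the paper's corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "regular headache and fatigue",
    "headache after taking the drug",
    "no symptoms reported today",
]
vectorizer = TfidfVectorizer()                 # TF * IDF with sklearn's smoothed IDF
X = vectorizer.fit_transform(reviews)          # shape: (n_reviews, n_terms)
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(3))))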

Fig. 2 Token analysis

Table 1 Details regarding the statistical data

The resulting weighted feature vectors are fed to LDA (latent Dirichlet allocation) [43] through a pipeline. The Bayesian optimizer within LDA separates the features by mapping them into a number of topics; 'dimension' is the term used to describe each topic in LDA. The section below briefly describes how LDA works.
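The pipeline described here can be sketched with scikit-learn as below. This is an illustration under assumptions: sklearn's LatentDirichletAllocation is normally fit on raw counts, but it accepts the non-negative TF-IDF weights the text describes; the component names, n-gram range, placeholder corpus, and the small topic count are ours.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = ["regular headache and fatigue",
           "headache after taking the drug",
           "no symptoms reported today"]       # placeholder corpus

topic_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),            # uni-grams and bi-grams
    ("lda", LatentDirichletAllocation(n_components=5, random_state=0)),
])
doc_topics = topic_pipeline.fit_transform(reviews)             # (n_reviews, n_topics)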

4.3 Assign Polarity Score

Depending on the polarity level of all concepts in a context, sentiments are extracted by the learning section [44]. Phrases such as "no", "not", "never", and "neither" are used to identify the relevant sentiments of the healthcare domain [11, 45, 46]. The algorithm below assigns sentiments to healthcare concepts using the sentiment recognition system.

STEP 1: Set the polarity score (polarity level) and sentiment of both healthcare and non-healthcare concepts of the domain. The sentiment lexicons used to assign the polarity level are SenticNet and SentiWordNet.

STEP 2: Determine the negation words or phrases in order to assign the relevant sentiment of the domain.

STEP 3: Calculate the total polarity level of the domain using the following equation:

$$\mathrm{Polarity\ level}=\sum_{n=1}^{k}{\mathrm{Polarity\ level}}_{n}$$
(4)

where Polarity level is the total polarity level of the context and \({\mathrm{Polarity\ level}}_{n}\) is the polarity level of each individual topic n in the domain.
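A minimal sketch of Eq. (4) with the negation handling of STEP 2 is given below; the tiny lexicon is hypothetical (the paper uses SenticNet and SentiWordNet), and the sign-flip rule is one simple way to realize the negation step.

NEGATIONS = {"no", "not", "never", "neither"}
LEXICON = {"wonderful": 0.8, "headache": -0.6, "pain": -0.7}   # hypothetical scores

def polarity_level(tokens):
    total, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATIONS:                    # STEP 2: remember a preceding negation
            negate = True
            continue
        score = LEXICON.get(tok, 0.0)
        total += -score if negate else score    # STEP 3: summation of Eq. (4)
        negate = False
    return total

print(polarity_level("not wonderful but no pain".split()))     # -0.8 + 0.7 = -0.1 (approx.)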

The following algorithm examines a single person's emotions. It runs after the polarity level is determined and finds the category of the healthcare domain along with the context classification.

STEP 1: Set the types of healthcare concepts in the environment with the concept categorization system. We denote the medical concepts and their types as CM in an environment.

STEP 2: A new abbreviation \({P}_{cc}\) is introduced to denote successive healthcare concepts and their types.

If the two successive partner concept types are the same, then \({P}_{cc}\) is

$$ P_{cc1} = CM1 \cap CM2 $$
(5)

Else

$$ P_{cc2} = CM1 \cup CM2 $$
(6)

where \(CM1\) and \(CM2\) are two successive healthcare concepts and their types in an environment.

To find the total context category (\({C}_{c}\)), use the extracted partial context categories (\({P}_{cc}\)):

$${C}_{c}={P}_{cc1}\cap {P}_{cc2}$$
(7)

where \({P}_{cc1}\) and \({P}_{cc2}\) are the partial context categories of the environment.
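Eqs. (5)-(7) can be mimicked with Python sets, as in the sketch below. This is our reading of the algorithm: two successive concept-type sets are intersected when they share a type and unioned otherwise, and the concept types shown are illustrative.

def partial_category(cm1, cm2):
    common = cm1 & cm2
    return common if common else cm1 | cm2    # Eq. (5) if types match, else Eq. (6)

cm1, cm2, cm3 = {"disease"}, {"disease", "symptom"}, {"disease", "drug"}
p_cc1 = partial_category(cm1, cm2)            # {'disease'}
p_cc2 = partial_category(cm2, cm3)            # {'disease'}
context_category = p_cc1 & p_cc2              # Eq. (7): {'disease'}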

4.4 Latent Dirichlet Allocation (LDA) with Topic Modelling

LDA is a hierarchical Bayesian model applied to the feature vectors to inspect the text in the corpus; it is an unsupervised probabilistic model that maps a corpus into groups of topics [47]. Each topic is modeled as a distribution over words. Consider a possible topic assignment 't' for corpus 'c' holding 'r' reviews. Each probability distribution associated with a feature vector is a multinomial distribution, and each review generates an arbitrary constant 'k'. The following equation shows how feature extraction is done in the LDA model.

$$p\left({f}_{a}\right)=\sum_{b=1}^{n}P\left({f}_{a}\mid {t}_{a}=b\right)P\left({t}_{a}=b\right)$$
(8)

where \(P\left({t}_{a}=b\right)\) is the probability that topic b is sampled for feature \({f}_{a}\) for a review in corpus c, \(P\left({f}_{a}\mid {t}_{a}=b\right)\) is the probability of \({f}_{a}\) under topic b, and n denotes the total number of topics.

These terms are used when computing Eq. (8). The feature vector for a topic b is itself a multinomial distribution over features, and for a review r, P(t) is a multinomial distribution over topics. The estimated parameters \(\widehat{\phi }\) and \(\theta \) represent the feature and review distributions; the feature representation of the reviews is held in the review-feature matrix RN, where R is the entire set of reviews. The topic-word hyperparameter is \(\beta \), and the counts associated with the Dirichlet distributions are updated cell by cell.

\(\theta \) is a review-level variable, sampled once per review, while the feature-level variables f are sampled for each of the N features of each review r. Because the entailed features are difficult to score and process directly, Eq. (9) determines the conditional probability over all the possibilities available for each feature vector.

$$P\left({t}_{i}=k\mid {t}_{-i},{f}_{i},{r}_{i},\dots \right)\propto \frac{{C}_{{f}_{i}k}^{FN}+\beta }{{\sum }_{f=1}^{F}{C}_{fk}^{FN}+F\beta }\cdot \frac{{C}_{{r}_{i}k}^{RN}+\alpha }{{\sum }_{k=1}^{N}{C}_{{r}_{i}k}^{RN}+N\alpha }$$
(9)

where \({t}_{i}=k\) denotes that feature \({f}_{i}\) is assigned to topic k; \({t}_{-i}\) denotes the topic assignments of all the other features; R is the entire set of reviews; F is the entire set of features; \({C}^{FN}\) and \({C}^{RN}\) are the feature-topic and review-topic count matrices; \({C}_{{f}_{i}k}^{FN}\) is the number of times feature \({f}_{i}\) is assigned to topic k; and \({C}_{{r}_{i}k}^{RN}\) is the number of times topic k is assigned to some feature of review \({r}_{i}\), excluding \({f}_{i}\).

Eqs. (10) and (11) below give the estimates \({\widehat{\phi }}_{i}^{\left(k\right)}\) and \({\widehat{\phi }}_{i}^{\left(r\right)}\), where \(\alpha \) and \(\beta \) are the hyperparameters.

$${\widehat{\phi }}_{i}^{\left(k\right)}=\frac{{C}_{{f}_{i}k}^{FN}+\beta }{{\sum }_{f=1}^{F}{C}_{fk}^{FN}+F\beta }$$
(10)
$${\widehat{\phi }}_{i}^{\left(r\right)}=\frac{{C}_{{r}_{i}k}^{RN}+\alpha }{{\sum }_{k=1}^{N}{C}_{{r}_{i}k}^{RN}+N\alpha }$$
(11)

In this work, LDA discovers topics from the review corpus and fuses them into latent domains. We modeled the topics with K = 100, 200, 300, 400, and 500 on the review corpus. The essential features were identified from the topics based on the probabilities correlated with each segment. Feature selection, presented in the sections below, then handles the curse-of-dimensionality problem.
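The K-sweep can be sketched with scikit-learn as below; the three-document corpus and the tiny K values keep the example runnable, whereas the paper sweeps K = 100 to 500 on the full review corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["headache and nausea after the drug",
        "drug relieved the headache quickly",
        "nausea and fatigue with no drug"]        # placeholder corpus
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

for k in (2, 3):                                  # the paper uses K = 100 ... 500
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    for topic in lda.components_:                 # highest-probability terms per topic
        print(k, [vocab[i] for i in topic.argsort()[::-1][:3]])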

4.5 Classification Process

In this subsection, we discuss the classifiers used for evaluation; with them we analyze and classify the sentiment as positive, negative, or neutral.

4.5.1 Support Vector Machine (SVM)

The SVM model is widely used for text classification and hypertext-based categorization. It significantly reduces the need for labeled training samples in both inductive and transductive settings. The classifier minimizes a cost function and thereby enhances classification performance. We used the LibSVM library through sklearn's SVC function. SVM is a robust learning method that classifies the data appropriately by minimizing structural risk.

In the training phase, an optimally separating hyperplane is found by reducing the cost function so that the margin induced between the two classes in feature space is maximized. Consider m training instances, each a pair \((a_i, b_i)\), where \(a_i \in R^n\) is the attribute vector of instance i and \(b_i \in \{+1, -1\}\) is its class label.

The main objective of the SVM is to find the hyperplane that optimally separates the data belonging to the two classes, \(W\cdot a+c=0\). The decision function used to classify a test instance y is defined as

$$F\left(y\right)=W\cdot a+c$$
(12)
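A minimal sketch of this classifier with sklearn's SVC (which wraps LibSVM, as noted above) follows; the synthetic three-class data stands in for the TF-IDF/LDA features and is not the paper's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the extracted feature vectors and sentiment labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf = SVC(kernel="linear")          # decision function F(y) = W.a + c, Eq. (12)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))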

4.5.2 Naïve Bayes

The NB classifier applies Bayes' theorem with independence assumptions between predictors; since NB handles different feature types, multinomial naive Bayes with a fitted prior is used. It is simple, easy to build, and requires no iterative parameter estimation, even on big datasets, while producing effective classification models. Bayes' theorem provides a way of calculating the posterior probability \(P\left(g|h\right)\) from \(P\left(g\right)\), \(P\left(h\right)\), and \(P\left(h|g\right)\). Naive Bayes assumes that the effect of the value of a predictor (h) on a given class (g) is independent of the values of the other predictors; this assumption is called class conditional independence.

$$P\left(g|h\right)=P\left({h}_{1}|g\right)\times P\left({h}_{2}|g\right)\cdots P\left({h}_{n}|g\right)\times P\left(g\right)$$
(13)

where

\(P\left(g|h\right)\) is the posterior probability of the class (target) given the predictor (attribute),

\(P\left(g\right)\) is the prior probability of the class,

\(P\left(h|g\right)\) is the likelihood, i.e., the probability of the predictor given the class, and

\(P\left(h\right)\) is the prior probability of the predictor.
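A sketch of multinomial NB with a fitted prior, per the text, is below; the three labelled snippets echo the paper's earlier examples and are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["something wonderful about being pregnant",
         "uncontrolled jerking and constant pain",
         "anatomy of the human body"]
labels = ["positive", "negative", "neutral"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB(fit_prior=True).fit(X, labels)     # Eq. (13) over token counts
print(nb.predict(vec.transform(["wonderful news, no pain"])))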

4.5.3 Decision Tree (DT)

The decision tree comes from the classification tree algorithm family. Subtrees are formed by recursively splitting the source set; the maximum depth of the tree is 12. The tree structure is the basic form for classification and regression models and is built by splitting the dataset into smaller and smaller subsets while the associated tree grows. The final result consists of decision nodes and leaf nodes, with the root node at the top; the root node handles both categorical and numerical data. It is a top-down approach, so the data are divided into homogeneous subsets of instances. The entropy used by the decision tree is given by the following equation.

$$E\left(S\right)=-\sum_{i=1}^{d}{p}_{i}{\log}_{2}{p}_{i}$$
(14)
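A minimal sketch with sklearn's DecisionTreeClassifier follows, using the entropy criterion of Eq. (14) and the maximum depth of 12 stated above; the synthetic data is a placeholder for the extracted features.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
dt = DecisionTreeClassifier(criterion="entropy", max_depth=12, random_state=0)
dt.fit(X, y)
print(dt.get_depth(), dt.score(X, y))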

4.5.4 Random Forest (RF)

The RF builds multiple decision trees on samples of the dataset; each tree can have a maximum depth of 'n'. Individual decision trees are built during the learning phase, and the final prediction uses the predictions of all the trees: the mode of the classes for classification or the mean forecast for regression. Because a group of outputs is combined to reach a decision, it is called an ensemble method.

The following equation is used to compute the importance \({N}_{m}\) of node m in the binary tree, where \({W}_{m}\) is the weighted number of samples reaching node m and \({C}_{m}\) is its impurity:

$${N}_{m}={W}_{m}{C}_{m}-{W}_{left\left(m\right)}{C}_{left\left(m\right)}-{W}_{right\left(m\right)}{C}_{right\left(m\right)}$$
(15)
  • left(m) = child node from left split on node m

  • right(m) = child node from right split on node m

The importance of each feature in a decision tree is then calculated as:

$$F{i}_{i}=\frac{\sum_{j:\,\mathrm{node}\ j\ \mathrm{splits\ on\ feature}\ i}{N}_{j}}{\sum_{k\in \mathrm{all\ nodes}}{N}_{k}}$$
(16)

where \({N}_{j}\) is the importance of node j; the numerator sums over the nodes that split on feature i and the denominator over all nodes.
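The sketch below shows a random forest whose feature_importances_ attribute aggregates the node importances of Eqs. (15)-(16) across all trees; the data and the tree count are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(rf.feature_importances_):     # normalized as in Eq. (16)
    print("feature", i, round(imp, 3))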

4.5.5 K-nearest Neighbor (KNN)

K-nearest neighbor is a fast algorithm that produces accurate and precise classification results with enhanced performance. It is mainly used to find similar objects and is widely applicable; applications such as recommendation systems and search engines essentially work based on KNN.
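A minimal KNN sketch is given below; k = 5 is an illustrative choice, not a value stated in the paper, and the synthetic data stands in for the extracted features.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # majority vote of 5 neighbours
print(knn.predict(X[:3]))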

5 Experimentation and Result Discussion

This section presents the implementation details and a comparison with the existing approach. Experiments were run on a Windows machine with 12 GB RAM, a 2 GHz processor, and a 1 TB hard disk. The implementation uses the Python programming language with the appropriate library functions, developed in the PyCharm IDE; since a CPU is no longer sufficient when the dataset is huge, we used Google Colab for its GPU.

5.1 Dataset Description

Notably, 821,483,453 general tweets were collected from Twitter between 16 March 2019 and 2 October 2020; 438,072,932 of them concern healthcare issues, especially various social-environment health domains. Besides, three medical and health datasets are used to assess the coherence of the project; a brief explanation of this data is given below. Another standard dataset was collected between October 2013 and January 2016 from the UCI machine learning repository (a common Twitter dataset). This conventional dataset contains medical tweets gathered from many Twitter accounts: a. reutershealth b. kaiserhealthnews c. latimeshealth d. bbchealth e. msnhealthnews f. NBChealth g. cbchealth h. nytimeshealth i. gdnhealthcare j. everydayhealth k. nprhealth l. foxnewshealth. Table 1 gives the statistical details of the data, and Table 2 shows examples of the emo-tags list.

Table 2 Example of emo-tags list

Our experimentation consists of five main phases. It begins with data pre-processing, which includes conversion to lowercase, elimination of special characters, elimination of stop words, conversion of numbers to words, and stemming and lemmatization. The pre-processed dataset is then passed to N-gram tokenization, where each emotion expression is converted into tokens. Next, we assign a polarity value to each emotion and compute the cumulative score. The result is fed to Latent Dirichlet Allocation (LDA) with topic modelling, in which each topic is converted into a set of sentences. Finally, the significant classifiers SVM, NB, DT, RF, and KNN are used to analyze the sentiment of people affected by acute diseases.

5.2 Data Pre-processing

The collected dataset is passed through the set of pre-processing techniques, including lemmatization and stop word removal. We also include a manually developed wordlist that provides crucial sentiment keywords (positive, negative, and neutral), some of which are shown in Table 3; this list is embedded in the pre-processing step.

Table 3 Set of manually developed wordlist

5.3 N-gram Tokenization

We then applied N-gram tokenization to the resultant dataset. Sentence-level annotation and summarization are performed, and opinion words are extracted from the N-gram dictionary. Words such as "a", "the", "and", and "so" are removed from the dictionary, which clarifies the kinds of emotions we want to predict. The result of N-gram tokenization was simulated and the generated graph is shown below: the pink line is our approach and the orange line is the baseline approach without N-gram tokenization (i.e., pre-processing only). We found that our approach outperforms the baseline and provides more accurate results (Table 4).

Table 4 Categorization of sentiments

5.4 Assign Polarity Score and Calculate its Cumulative

This step happens after N-gram tokenization, whose output is its input. To enable exact prediction, we assign a polarity score to each entity so that negation words such as "not", "never", "none", and "neither" can be recognized and the exact sentiment extracted. After fixing a polarity score for every entity, we categorize the context into uni-grams, bi-grams, and tri-grams, which defines the priority for sentiment analysis and is also used to examine the disease-related statistics.

The categorization over each entity is most exact with tri-grams, for which the corresponding polarity score is 45. After this, a cumulative score is computed over each entity, and the entire sequence must recognize negation, i.e., negative sentiment. The comparison graph for each categorization is shown below.

5.5 Latent Dirichlet Allocation (LDA) with Topic Modelling

This step assigns a probability value to every emotion and categorizes it based on that value, from high risk to low risk. Furthermore, this is applied iteratively so that the model is most finely tuned. Here k = 100.

Figure 3 shows the overall entities of the dataset, scattering all the emotions obtained from the LDA model. The fine-tuned model provides highly reliable features with which classification is done with high accuracy. Each point scattered in Fig. 3 is an emotion, i.e., positive, negative, or neutral. Tables 5 and 6 describe the different emotions based on their random fixed probability values (Figs. 4 and 5).

Fig. 3 Generated after N-gram tokenization of the dataset

Table 5 Different emotions based on its random fixed probability values
Table 6 Comparison of various classifiers for analyzing sentiments
Fig. 4 Sentiments categorized based on polarity score

Fig. 5 Overall entities of the dataset

5.6 Evaluate Result Using Classifiers

For result evaluation, we use the most significant classifiers, SVM, KNN, DT, RF, and NB, which are widely used for analyzing crucial information. Each classifier extracts the exact sentiment from every existing entity. We use accuracy, recall, and precision as evaluation metrics for our analysis (Figs. 6 and 7).
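The three metrics can be computed with sklearn.metrics, as sketched below; y_true and y_pred are placeholder label sequences, and macro averaging is one reasonable choice for the three sentiment classes, not a setting stated in the paper.

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "negative", "neutral"]

print("A:", accuracy_score(y_true, y_pred))                                     # 0.6
print("P:", precision_score(y_true, y_pred, average="macro", zero_division=0))  # 0.5
print("R:", recall_score(y_true, y_pred, average="macro"))                      # 0.5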

Fig. 6 Overall comparison of classifiers' accuracy

Fig. 7 Comparison of the proposed methodology with SVM against the baseline methodology with SVM

The table above (Table 6) gives the overall comparison of the classifiers in analyzing sentiment with our proposed method, where A denotes accuracy, R denotes recall, and P denotes precision. In our experimental analysis, SVM outperforms the rest: it achieves the highest accuracy of 98.2% when classifying negative sentiment, 96.5% for positive sentiment, and 96.1% for neutral sentiment. After SVM, Naïve Bayes performs best, with 95.2% accuracy for classifying negative sentiment. The corresponding comparison graph is given as follows.

The final comparison is between our proposed methodology with SVM and the baseline methodology with SVM. We took our SVM classifier and compared it with the baseline SVM classifier, finding that the proposed methodology with SVM analyzes sentiment better than the traditional SVM classifier; the corresponding table and comparison are given as follows (Table 7).

Table 7 Comparison of proposed Methodology with SVM along with baseline methodology with SVM

6 Conclusion and Future Work

Analyzing the mindset of people affected by acute diseases is a challenging task, and the dataset must be very accurate. Hence, we collected datasets from three different environments: reviews from social media, critical reviews from Twitter, and abstracts of medical studies from the Wall Street Journal. We then proposed a practical framework that differs from traditional approaches and includes four crucial techniques: enhanced pre-processing, N-gram tokenization, polarity score assignment, and topic modelling with Latent Dirichlet Allocation. For evaluation, we used classifiers prominently applied in medical applications: SVM, NB, DT, KNN, and RF. Among these, SVM performs best with our proposed method, achieving an accuracy of 98.2%. We also compared our proposed practical framework with a baseline approach to show that it analyzes the emotions of people affected by acute diseases efficiently. In future work, we will incorporate a deep learning approach to process larger feature sets of the dataset.