1 Introduction

Research on medical concepts increasingly depends on concept classification and sentiment analysis. Progress is hampered by the absence of domain-specific lexicons and by the small number of researchers working in this area. A further difficulty is extracting semantic relations and knowledge-dependent features from the healthcare domain, chiefly because general-purpose medical lexicons do not record characteristics such as category and sentiment. Over the years, experts have built annotated resources such as GENIA and PennBioIE to address these issues; the primary need is to build, or rebuild, structured and unstructured corpus versions. Ontology-based methods have also been combined with linguistic and machine learning (ML) techniques [1] to extract healthcare concepts together with their syntactic and semantic characteristics [2]. Recent work has developed two systems for extracting semantic relations from the healthcare domain: the first performs tokenization and assigns concepts to groups; the second recognizes the sentiments of healthcare concepts and their contexts. A healthcare concept is a word or phrase whose entities, knowledge, and data belong to healthcare attributes. A recognized context is of two types: (i) medical and (ii) non-medical. Negation handling and the removal of stop words or sentences are applied when identifying the context. Consider two examples, "regular headache" and "uncontrolled jerking": each is labelled medical or non-medical depending on whether a healthcare concept is present. A headache can be a symptom of early-stage cancer, so it is a medical context accompanied by medical concepts. In our work, every word or phrase of the corpus is treated as a context; a sentence such as "Orange is good or bad" is a non-medical context because it contains no medical concepts.

Categorization and sentiment recognition systems are used to extract concepts and their contexts. In the categorization system, the extracted concepts are divided into five classes: (i) disease, (ii) symptoms, (iii) drugs, (iv) human anatomy, and (v) miscellaneous medical terms (unidentified concepts, abbreviated MMT in the remainder of the paper). "Headache", for example, belongs to the disease class. Healthcare researchers defined these five categories in the corpora according to the utterance and the first occurrence of each extracted concept. Every class has its own concepts, which determine the overall classification of a context. Eleven pairwise category combinations of the healthcare domain are recognized, such as disease-symptom and disease-drug. Based on both concepts and contexts, the sentiment recognition model [3,4,5] is augmented with sense-based information; only positive and negative sentiments are considered. For example, "There is something wonderful about being pregnant" carries a positive sentiment. The output of sentiment recognition differs across concept classes: human anatomy concepts are treated as neutral, while symptom concepts can be positive or negative. These models are evaluated against an earlier lexicon, WordNet of Medical Events (WME), which is used to extract healthcare concepts from contexts. WME assigns linguistic and sentiment characteristics to healthcare concepts [6] and has two versions: WME 1.0 [7] and WME 2.0 [8]. WME 1.0 covers 6415 healthcare concepts with linguistic features such as POS, gloss with polarity score, and sentiment, but it cannot supply sentiment-oriented concepts with related knowledge-based data. WME 2.0 therefore extends coverage to 10,186 concepts and adds knowledge-based features such as affinity score, similar sentiment words (SSW), and gravity score. A hybrid model combines the linguistic features of WME 2.0 with a machine learning prototype, adding features such as negation [9], uni-grams, and bi-grams. Two classifiers, (i) Naive Bayes and (ii) Logistic Regression [13], achieve average F-measures of 0.81 and 0.86 for assigning categories to healthcare concepts and contexts in the categorization system. Using WME 2.0, the sentiment recognition system was evaluated with Naive Bayes and the support-vector-based Sequential Minimal Optimization (SMO) classifier, attaining average F-measures of 0.91 and 0.81 for recognizing the sentiments of healthcare concepts and contexts. While the uni-gram and bi-gram features identify the categories of the medical domain, the negation feature identifies the prior sentiments of medical concepts [10,11,12].

The rest of the paper is organized as follows: Sect. 2 discusses related work and literature on sentiment analysis. Section 3 states the problem and outlines our solution. The methodologies used in the proposed approach are described in Sect. 4. Experimentation and results are discussed in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related Study and Literature Survey

This section analyzes and describes the existing approaches built for analyzing the sentiment of patients affected by acute diseases.

Many researchers are examining how social media influences public perception of medical care. Text mining has been applied (Ficek and Kencl [14], Rahnama [15]) and plays a significant role in processing unstructured data at scale using Apache Spark. Baltas and Tsakalidis [16] introduced a Twitter sentiment analysis with binary and ternary classification. Oneto et al. [17] implemented a conventional extreme learning machine on a Spark cluster; combining Spark with deep learning yields higher performance than other Spark models. Chen et al. [18] introduced a deep learning framework for mobile big data analytics using the Apache Spark model. Nodarakis et al. [19] developed sentiment analysis over large-scale data with a Spark architecture.

To extract sentiments from HPV-vaccine-related tweets, Du et al. [20] applied the best-performing ML system. A ranking classifier combined with SVM was used, and 6000 tweets were annotated manually; the system outperformed the baselines with a best F-measure of 0.6732. Medical sentiment analysis is an emerging technology: Denecke and Nejdl [21] introduced a healthcare ontology to evaluate the factual level of healthcare texts, which differs from earlier sentiment analysis systems. Sentiment analysis is performed either by rules or by ML methods, and more works rely on ML methods than on rule-based methods.

To determine the polarity of patient data, Xia et al. [22] proposed a multi-step opinion classification. To measure the quality of healthcare, Cambria et al. [23] designed a framework combining Sentic PROMs with a sentiment analysis system. De la Torre-Diez et al. [24] classified diseases such as breast cancer, colorectal cancer, and diabetes. For social-media-based online cancer support groups, Portier et al. [25] applied sentiment analysis methods to detect negative sentiments and bad moods. Crannell et al. [26] subsequently examined the emotions of such groups, whose members are mentally supported by fellow patients. Chen and Zeng [27] extracted online e-liquid reviews by e-liquid characteristics to separate polarity features and performed sentiment analysis on large online e-liquid websites.

Ozcift and Gulten [28] studied ML algorithms for healthcare diagnosis. To evaluate classification performance, they combined ML classifiers with the CFS algorithm; the three healthcare datasets yield accuracies of 74.5%, 81%, and 87.2%, respectively, compared against basic classifiers. Chen et al. [29, 30] proposed the CNN-MDRP model to forecast disease complications from structured and unstructured data; it reaches 94.8% accuracy with a fast convergence speed on real-life data. Lu [33] introduced a concept-recognition method based on data categorization, performing feature-based categorization on data gathered from internet medical groups using classification models such as C4.5, SVM, and Naive Bayes; it achieves better classification results than the other methods.

Chen et al. [29, 30] proposed a distinctive model for refining word-level sentiment analysis that outperforms eleven older methods. Lin et al. [31] studied TCM (traditional Chinese medicine) clinical documents, obtaining a multi-feature fusion model by integrating various features using a weighted LDA topic model; the work proved effective with a high categorization rate and provides strong support for TCM clinical science. Jonnalagadda et al. [32] conducted research on recognizing decision experts from 178,527 news research articles, achieving 88.5% efficiency on 734,024 corpus samples. Monogram et al. [34] introduced a deep learning model with multiple kernel learning for cardiovascular disease. Minarro-Gimenez et al. applied neural language models such as CBOW and skip-gram, using skip-grams for the first time on the PubMed corpus and PubMed text articles. Th et al. [36] ran skip-gram and CBOW on 1.25M PubMed research papers to evaluate word associations between term pairs. For biomedical NLP, Chiu et al. [37] produced high-quality word embeddings on two different corpora, demonstrating that the skip-gram model surpasses the CBOW model. Spinczyk et al. [38] offered a rule-based method for analyzing the emotions of patients suffering from anorexia nervosa; emotional words are recognized from the record, and a group-of-phrases model assists during healing.

3 Problem Statement and Solution

From the literature survey, we identified the difficulties involved in analyzing the mindset of patients affected by acute diseases. Sentiment extraction involves many parameters, and analyzing the mindsets of diverse people is very challenging. Hence, we propose a practical framework to analyze the sentiment of people affected by acute diseases. The framework uses five widely used machine learning classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-nearest neighbor (KNN). Our proposed framework includes:

  • Initially, we pre-process the dataset: conversion to lowercase, elimination of special characters, elimination of stop words, conversion of numbers to words, and stemming and lemmatization. This makes the dataset accurate and crisp.

  • After pre-processing, N-gram tokenization is performed over the dataset.

  • Then a polarity score is assigned to each extracted review and the overall polarity score is calculated.

  • After combining the review data, we apply a probabilistic LDA to the resultant dataset.

  • Then the primary machine learning classifiers, Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-nearest neighbour (KNN), are used to classify the sentiment as positive, negative, or neutral.

4 Methodologies

This section describes all the methodologies used in our proposed work and details their functionality.

4.1 Data Pre-processing

Data pre-processing is the primary task of any data classification process. We use techniques including lemmatization and stop word removal, which are crucial and widely generalized. Five major pre-processing techniques are applied to our socio-medical dataset. Pre-processing improves the quality of the classification process and makes the features more robust.

4.1.1 Conversion of Lowercase

Initially, we apply lowercase conversion. The entire dataset is scanned sequentially for uppercase words, and each one found is converted into the corresponding lowercase letters using the Python library NumPy.

4.1.2 Eliminate Special Characters

This is the second step of data pre-processing. It eliminates special characters such as *, %, $, @, and # from the dataset. It takes the uppercase-free dataset as input, i.e., it starts after step 1 finishes.

4.1.3 Eliminate Stop Words

Stop words occur frequently in text and convey no meaningful information. We eliminate them using the NLTK library. Figure 1 gives a clear view of the workflow.

Fig. 1 Proposed workflow

4.1.4 Conversion of Number to Word

This step converts numbers to words using the Python library num2words (for instance, 7 is converted into "seven").

4.1.5 Stemming and Lemmatization

This is the final step of pre-processing. Stemming removes the suffix or prefix from a given word, reducing it to its stem or root and thereby reducing word complexity; we employ the Porter stemmer for this. Lemmatization reduces a word to a well-founded dictionary form.
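As an illustration, the five pre-processing steps above can be combined into a single routine. The sketch below is a minimal example under stated assumptions: it presumes the NLTK stopword and WordNet data are downloaded and the num2words package is installed, and the sample sentence is ours, not from the dataset.

import re
from nltk.corpus import stopwords                        # needs nltk "stopwords" data
from nltk.stem import PorterStemmer, WordNetLemmatizer   # needs nltk "wordnet" data
from num2words import num2words

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                             # 4.1.1 lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)                        # 4.1.2 special characters
    tokens = [t for t in text.split() if t not in stop_words]       # 4.1.3 stop words
    tokens = [num2words(int(t)) if t.isdigit() else t
              for t in tokens]                                      # 4.1.4 numbers to words
    return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]  # 4.1.5 stem + lemmatize

print(preprocess("The patient reported 7 severe headaches!"))
# e.g. ['patient', 'report', 'seven', 'sever', 'headach']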

4.2 N-gram Tokenization

N-gram tokenization [39,40,41] operates on collections of co-occurring tokens within a frame and is used to derive further n-grams. After pre-processing, it divides the text into different tokens for the purpose of feature extraction.

Based on the pre-processed data, the N-gram dictionary is annotated manually; words like 'a', 'and', 'the', and 'there' carry no critical information, so they are removed from the review sentences. Opinion words are separated from the N-gram dictionary through sentence-level annotation and summarization. The summary of every sentence, along with its frequency, is shown in Fig. 2. Using the summarized characteristics, the score of every aspect was evaluated to predict the number of positive, negative, and neutral words, as shown in Table 1; this is done after the manual annotation and summarization. Scores are calculated for positive and negative sentiments individually. The process has three steps: (i) recognizing the opinion terms, (ii) feature vectorization, and (iii) vector transformation. The feature vectors are computed with Term Frequency-Inverse Document Frequency (TF-IDF) [42], which assigns a weight to every feature vector. The importance of a feature vector is calculated by

$$TF\text{-}IDF\left(x, y, Y\right)=TF\left(x, y\right)\cdot IDF\left(x, Y\right)$$
(1)
$$TF=TF\left(x, y\right)$$
(2)
$$IDF\left(x, Y\right)=\log\frac{N}{\left|\left\{y\in Y: x\in y\right\}\right|}$$
(3)

where \(TF\left(x, y\right)\) is the number of times the term x occurs in document y, N is the total number of documents in the corpus, and \(\left|\left\{y\in Y: x\in y\right\}\right|\) is the number of documents in Y that contain the term x.
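A minimal sketch of this TF-IDF weighting with scikit-learn's TfidfVectorizer follows; note that sklearn uses a smoothed IDF rather than the bare ratio of Eq. (3), and the three toy reviews are placeholders, not the paper's corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "regular headache and fatigue",
    "headache after taking the drug",
    "no symptoms reported today",
]
vectorizer = TfidfVectorizer()                 # TF * IDF with sklearn's smoothed IDF
X = vectorizer.fit_transform(reviews)          # shape: (n_reviews, n_terms)
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(3))))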

Fig. 2 Token analysis

Table 1 Details regarding the statistical data

The resulting weighted feature vectors are fed to LDA (latent Dirichlet allocation) [43] through a pipeline. The Bayesian optimizer within LDA separates the features by mapping them into a number of topics; 'dimension' is the term used to describe each topic in LDA. The section below briefly describes how LDA works.
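The pipeline described here can be sketched with scikit-learn as below. This is an illustration under assumptions: sklearn's LatentDirichletAllocation is normally fit on raw counts, but it accepts the non-negative TF-IDF weights the text describes; the component names, n-gram range, placeholder corpus, and the small topic count are ours.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = ["regular headache and fatigue",
           "headache after taking the drug",
           "no symptoms reported today"]       # placeholder corpus

topic_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),            # uni-grams and bi-grams
    ("lda", LatentDirichletAllocation(n_components=5, random_state=0)),
])
doc_topics = topic_pipeline.fit_transform(reviews)             # (n_reviews, n_topics)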

4.3 Assign Polarity Score

Depending on the polarity level of all concepts in a context, sentiments are extracted by the learning section [44]. Phrases such as "no", "not", "never", and "neither" are used to identify the relevant sentiments of the healthcare domain [11, 45, 46]. The algorithm below assigns sentiments to healthcare concepts using the sentiment recognition system.

STEP 1: Set the polarity score (polarity level) and sentiment of both healthcare and non-healthcare concepts of the domain. The sentiment lexicons used to assign the polarity level are SenticNet and SentiWordNet.

STEP 2: Determine the negation words or phrases in order to assign the relevant sentiment of the domain.

STEP 3: Calculate the total polarity level of the domain using the following equation:

$$\mathrm{Polarity\ level}=\sum_{n=1}^{k}{\mathrm{Polarity\ level}}_{n}$$
(4)

where Polarity level is the total polarity level of the context and \({\mathrm{Polarity\ level}}_{n}\) is the polarity level of each individual topic n in the domain.
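A minimal sketch of Eq. (4) with the negation handling of STEP 2 is given below; the tiny lexicon is hypothetical (the paper uses SenticNet and SentiWordNet), and the sign-flip rule is one simple way to realize the negation step.

NEGATIONS = {"no", "not", "never", "neither"}
LEXICON = {"wonderful": 0.8, "headache": -0.6, "pain": -0.7}   # hypothetical scores

def polarity_level(tokens):
    total, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATIONS:                    # STEP 2: remember a preceding negation
            negate = True
            continue
        score = LEXICON.get(tok, 0.0)
        total += -score if negate else score    # STEP 3: summation of Eq. (4)
        negate = False
    return total

print(polarity_level("not wonderful but no pain".split()))     # -0.8 + 0.7 = -0.1 (approx.)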

The following algorithm examines a single person's emotions. It runs after the polarity level is determined and finds the category of the healthcare domain along with the context classification.

STEP 1: Set the types of healthcare concepts in the environment with the concept categorization system. We denote the medical concepts and their types as CM in an environment.

STEP 2: A new abbreviation \({P}_{cc}\) is introduced to denote successive healthcare concepts and their types.

If the two successive partner concept types are the same, then \({P}_{cc}\) is

$$ P_{cc1} = CM1 \cap CM2 $$
(5)

Else

$$ P_{cc2} = CM1 \cup CM2 $$
(6)

where \(CM1\) and \(CM2\) are two successive healthcare concepts and their types in an environment.

To find the total context category (\({C}_{c}\)), use the extracted partial context categories (\({P}_{cc}\)):

$${C}_{c}={P}_{cc1}\cap {P}_{cc2}$$
(7)

where \({P}_{cc1}\) and \({P}_{cc2}\) are the partial context categories of the environment.
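Eqs. (5)-(7) can be mimicked with Python sets, as in the sketch below. This is our reading of the algorithm: two successive concept-type sets are intersected when they share a type and unioned otherwise, and the concept types shown are illustrative.

def partial_category(cm1, cm2):
    common = cm1 & cm2
    return common if common else cm1 | cm2    # Eq. (5) if types match, else Eq. (6)

cm1, cm2, cm3 = {"disease"}, {"disease", "symptom"}, {"disease", "drug"}
p_cc1 = partial_category(cm1, cm2)            # {'disease'}
p_cc2 = partial_category(cm2, cm3)            # {'disease'}
context_category = p_cc1 & p_cc2              # Eq. (7): {'disease'}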

4.4 Latent Dirichlet Allocation (LDA) with Topic Modelling

LDA is a hierarchical Bayesian model applied to the feature vectors to inspect the text in the corpus; it is an unsupervised probabilistic model that maps a corpus into groups of topics [47]. Each topic is modeled as a distribution over words. Consider a possible topic assignment 't' for corpus 'c' holding 'r' reviews. Each probability distribution associated with a feature vector is a multinomial distribution, and each review generates an arbitrary constant 'k'. The following equation shows how feature extraction is done in the LDA model.

$$p\left({f}_{a}\right)=\sum_{b=1}^{n}P\left({f}_{a}\mid {t}_{a}=b\right)P\left({t}_{a}=b\right)$$
(8)

where \(P\left({t}_{a}=b\right)\) is the probability that topic b is sampled for feature \({f}_{a}\) for a review in corpus c, \(P\left({f}_{a}\mid {t}_{a}=b\right)\) is the probability of \({f}_{a}\) under topic b, and n denotes the total number of topics.

These terms are used when computing Eq. (8). The feature vector for a topic b is itself a multinomial distribution over features, and for a review r, P(t) is a multinomial distribution over topics. The estimated parameters \(\widehat{\phi }\) and \(\theta \) represent the feature and review distributions; the feature representation of the reviews is held in the review-feature matrix RN, where R is the entire set of reviews. The topic-word hyperparameter is \(\beta \), and the counts associated with the Dirichlet distributions are updated cell by cell.

\(\theta \) is a review-level variable, sampled once per review, while the feature-level variables f are sampled for each of the N features of each review r. Because the entailed features are difficult to score and process directly, Eq. (9) determines the conditional probability over all the possibilities available for each feature vector.

$$P\left({t}_{i}=k\mid {t}_{-i},{f}_{i},{r}_{i},\dots \right)\propto \frac{{C}_{{f}_{i}k}^{FN}+\beta }{{\sum }_{f=1}^{F}{C}_{fk}^{FN}+F\beta }\cdot \frac{{C}_{{r}_{i}k}^{RN}+\alpha }{{\sum }_{k=1}^{N}{C}_{{r}_{i}k}^{RN}+N\alpha }$$
(9)

where \({t}_{i}=k\) denotes that feature \({f}_{i}\) is assigned to topic k; \({t}_{-i}\) denotes the topic assignments of all the other features; R is the entire set of reviews; F is the entire set of features; \({C}^{FN}\) and \({C}^{RN}\) are the feature-topic and review-topic count matrices; \({C}_{{f}_{i}k}^{FN}\) is the number of times feature \({f}_{i}\) is assigned to topic k; and \({C}_{{r}_{i}k}^{RN}\) is the number of times topic k is assigned to some feature of review \({r}_{i}\), excluding \({f}_{i}\).

Eqs. (10) and (11) below give the estimates \({\widehat{\phi }}_{i}^{\left(k\right)}\) and \({\widehat{\phi }}_{i}^{\left(r\right)}\), where \(\alpha \) and \(\beta \) are the hyperparameters.

$${\widehat{\phi }}_{i}^{\left(k\right)}=\frac{{C}_{{f}_{i}k}^{FN}+\beta }{{\sum }_{f=1}^{F}{C}_{fk}^{FN}+F\beta }$$
(10)
$${\widehat{\phi }}_{i}^{\left(r\right)}=\frac{{C}_{{r}_{i}k}^{RN}+\alpha }{{\sum }_{k=1}^{N}{C}_{{r}_{i}k}^{RN}+N\alpha }$$
(11)

In this work, LDA discovers topics from the review corpus and fuses them into latent domains. We modeled the topics with K = 100, 200, 300, 400, and 500 on the review corpus. The essential features were identified from the topics based on the probabilities correlated with each segment. Feature selection, presented in the sections below, then handles the curse-of-dimensionality problem.
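The K-sweep can be sketched with scikit-learn as below; the three-document corpus and the tiny K values keep the example runnable, whereas the paper sweeps K = 100 to 500 on the full review corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["headache and nausea after the drug",
        "drug relieved the headache quickly",
        "nausea and fatigue with no drug"]        # placeholder corpus
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

for k in (2, 3):                                  # the paper uses K = 100 ... 500
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    for topic in lda.components_:                 # highest-probability terms per topic
        print(k, [vocab[i] for i in topic.argsort()[::-1][:3]])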

4.5 Classification Process

In this subsection, we discuss the classifiers used for evaluation; with them we analyze and classify the sentiment as positive, negative, or neutral.

4.5.1 Support Vector Machine (SVM)

The SVM model is widely used for text classification and hypertext-based categorization. It significantly reduces the need for labeled training samples in both inductive and transductive settings. The classifier minimizes a cost function and thereby enhances classification performance. We used the LibSVM library through sklearn's SVC function. SVM is a robust learning method that classifies the data appropriately by minimizing structural risk.

In the training phase, an optimally separating hyperplane is found by reducing the cost function so that the margin induced between the two classes in feature space is maximized. Consider m training instances, each a pair \((a_i, b_i)\), where \(a_i \in R^n\) is the attribute vector of instance i and \(b_i \in \{+1, -1\}\) is its class label.

The main objective of the SVM is to find the hyperplane that optimally separates the data belonging to the two classes, \(W\cdot a+c=0\). The decision function used to classify a test instance y is defined as

$$F\left(y\right)=W\cdot a+c$$
(12)
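A minimal sketch of this classifier with sklearn's SVC (which wraps LibSVM, as noted above) follows; the synthetic three-class data stands in for the TF-IDF/LDA features and is not the paper's dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the extracted feature vectors and sentiment labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf = SVC(kernel="linear")          # decision function F(y) = W.a + c, Eq. (12)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))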

4.5.2 Naïve Bayes

The NB classifier applies Bayes' theorem with independence assumptions between predictors; since NB handles different feature types, multinomial naive Bayes with a fitted prior is used. It is simple, easy to build, and requires no iterative parameter estimation, even on big datasets, while producing effective classification models. Bayes' theorem provides a way of calculating the posterior probability \(P\left(g|h\right)\) from \(P\left(g\right)\), \(P\left(h\right)\), and \(P\left(h|g\right)\). Naive Bayes assumes that the effect of the value of a predictor (h) on a given class (g) is independent of the values of the other predictors; this assumption is called class conditional independence.

$$P\left(g|h\right)=P\left({h}_{1}|g\right)\times P\left({h}_{2}|g\right)\cdots P\left({h}_{n}|g\right)\times P\left(g\right)$$
(13)

where

\(P\left(g|h\right)\) is the posterior probability of the class (target) given the predictor (attribute),

\(P\left(g\right)\) is the prior probability of the class,

\(P\left(h|g\right)\) is the likelihood, i.e., the probability of the predictor given the class, and

\(P\left(h\right)\) is the prior probability of the predictor.
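A sketch of multinomial NB with a fitted prior, per the text, is below; the three labelled snippets echo the paper's earlier examples and are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["something wonderful about being pregnant",
         "uncontrolled jerking and constant pain",
         "anatomy of the human body"]
labels = ["positive", "negative", "neutral"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB(fit_prior=True).fit(X, labels)     # Eq. (13) over token counts
print(nb.predict(vec.transform(["wonderful news, no pain"])))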

4.5.3 Decision Tree (DT)

The decision tree comes from the classification tree algorithm family. Subtrees are formed by recursively splitting the source set; the maximum depth of the tree is 12. The tree structure is the basic form for classification and regression models and is built by splitting the dataset into smaller and smaller subsets while the associated tree grows. The final result consists of decision nodes and leaf nodes, with the root node at the top; the root node handles both categorical and numerical data. It is a top-down approach, so the data are divided into homogeneous subsets of instances. The entropy used by the decision tree is given by the following equation.

$$E\left(S\right)=-\sum_{i=1}^{d}{p}_{i}{\log}_{2}{p}_{i}$$
(14)
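A minimal sketch with sklearn's DecisionTreeClassifier follows, using the entropy criterion of Eq. (14) and the maximum depth of 12 stated above; the synthetic data is a placeholder for the extracted features.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
dt = DecisionTreeClassifier(criterion="entropy", max_depth=12, random_state=0)
dt.fit(X, y)
print(dt.get_depth(), dt.score(X, y))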

4.5.4 Random Forest (RF)

The RF builds multiple decision trees on samples of the dataset; each tree can have a maximum depth of 'n'. Individual decision trees are built during the learning phase, and the final prediction uses the predictions of all the trees: the mode of the classes for classification or the mean forecast for regression. Because a group of outputs is combined to reach a decision, it is called an ensemble method.

The following equation is used to compute the importance \({N}_{m}\) of node m in the binary tree, where \({W}_{m}\) is the weighted number of samples reaching node m and \({C}_{m}\) is its impurity:

$${N}_{m}={W}_{m}{C}_{m}-{W}_{left\left(m\right)}{C}_{left\left(m\right)}-{W}_{right\left(m\right)}{C}_{right\left(m\right)}$$
(15)
  • left(m) = child node from left split on node m

  • right(m) = child node from right split on node m

The importance of each feature in a decision tree is then calculated as:

$$F{i}_{i}=\frac{\sum_{j:\,\mathrm{node}\ j\ \mathrm{splits\ on\ feature}\ i}{N}_{j}}{\sum_{k\in \mathrm{all\ nodes}}{N}_{k}}$$
(16)

where \({N}_{j}\) is the importance of node j; the numerator sums over the nodes that split on feature i and the denominator over all nodes.
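The sketch below shows a random forest whose feature_importances_ attribute aggregates the node importances of Eqs. (15)-(16) across all trees; the data and the tree count are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, imp in enumerate(rf.feature_importances_):     # normalized as in Eq. (16)
    print("feature", i, round(imp, 3))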

4.5.5 K-nearest Neighbor (KNN)

K-nearest neighbor is a fast algorithm that produces accurate and precise classification results with enhanced performance. It is mainly used to find similar objects and is widely applicable; applications such as recommendation systems and search engines essentially work based on KNN.
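A minimal KNN sketch is given below; k = 5 is an illustrative choice, not a value stated in the paper, and the synthetic data stands in for the extracted features.

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # majority vote of 5 neighbours
print(knn.predict(X[:3]))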

5 Experimentation and Result Discussion

This section presents the implementation details and a comparison with the existing approach. Experiments were run on a Windows machine with 12 GB RAM, a 2 GHz processor, and a 1 TB hard disk. The implementation uses the Python programming language with the appropriate library functions, developed in the PyCharm IDE; since a CPU is no longer sufficient when the dataset is huge, we used Google Colab for its GPU.

5.1 Dataset Description

Notably, 821,483,453 general tweets were collected from Twitter between 16 March 2019 and 2 October 2020; 438,072,932 of them concern healthcare issues, especially various social-environment health domains. Besides, three medical and health datasets are used to assess the coherence of the project; a brief explanation of this data is given below. Another standard dataset was collected between October 2013 and January 2016 from the UCI machine learning repository (a common Twitter dataset). This conventional dataset contains medical tweets gathered from many Twitter accounts: a. reutershealth b. kaiserhealthnews c. latimeshealth d. bbchealth e. msnhealthnews f. NBChealth g. cbchealth h. nytimeshealth i. gdnhealthcare j. everydayhealth k. nprhealth l. foxnewshealth. Table 1 gives the statistical details of the data, and Table 2 shows examples of the emo-tags list.

Table 2 Example of emo-tags list

Our experimentation consists of five main phases. It begins with data pre-processing, which includes conversion to lowercase, elimination of special characters, elimination of stop words, conversion of numbers to words, and stemming and lemmatization. The pre-processed dataset is then passed to N-gram tokenization, where each emotion expression is converted into tokens. Next, we assign a polarity value to each emotion and compute the cumulative score. The result is fed to Latent Dirichlet Allocation (LDA) with topic modelling, in which each topic is converted into a set of sentences. Finally, the significant classifiers SVM, NB, DT, RF, and KNN are used to analyze the sentiment of people affected by acute diseases.

5.2 Data Pre-processing

The collected dataset is passed through the set of pre-processing techniques, including lemmatization and stop word removal. We also include a manually developed wordlist that provides crucial sentiment keywords (positive, negative, and neutral), some of which are shown in Table 3; this list is embedded in the pre-processing step.

Table 3 Set of manually developed wordlist

5.3 N-gram Tokenization

We then applied N-gram tokenization to the resultant dataset. Sentence-level annotation and summarization are performed, and opinion words are extracted from the N-gram dictionary. Words such as "a", "the", "and", and "so" are removed from the dictionary, which clarifies the kinds of emotions we want to predict. The result of N-gram tokenization was simulated and the generated graph is shown below: the pink line is our approach and the orange line is the baseline approach without N-gram tokenization (i.e., pre-processing only). We found that our approach outperforms the baseline and provides more accurate results (Table 4).

Table 4 Categorization of sentiments

5.4 Assign Polarity Score and Calculate its Cumulative

This step happens after N-gram tokenization, whose output is its input. To enable exact prediction, we assign a polarity score to each entity so that negation words such as "not", "never", "none", and "neither" can be recognized and the exact sentiment extracted. After fixing a polarity score for every entity, we categorize the context into uni-grams, bi-grams, and tri-grams, which defines the priority for sentiment analysis and is also used to examine the disease-related statistics.

The categorization over each entity is most exact with tri-grams, for which the corresponding polarity score is 45. After this, a cumulative score is computed over each entity, and the entire sequence must recognize negation, i.e., negative sentiment. The comparison graph for each categorization is shown below.

5.5 Latent Dirichlet Allocation (LDA) with Topic Modelling

This step assigns a probability value to every emotion and categorizes it based on that value, from high risk to low risk. Furthermore, this is applied iteratively so that the model is most finely tuned. Here k = 100.

Figure 3 shows the overall entities of the dataset, scattering all the emotions obtained from the LDA model. The fine-tuned model provides highly reliable features with which classification is done with high accuracy. Each point scattered in Fig. 3 is an emotion, i.e., positive, negative, or neutral. Tables 5 and 6 describe the different emotions based on their random fixed probability values (Figs. 4 and 5).

Fig. 3 Generated after N-gram tokenization of the dataset

Table 5 Different emotions based on its random fixed probability values
Table 6 Comparison of various classifiers for analyzing sentiments
Fig. 4 Sentiments categorized based on polarity score

Fig. 5 Overall entities of the dataset

5.6 Evaluate Result Using Classifiers

For result evaluation, we use the most significant classifiers, SVM, KNN, DT, RF, and NB, which are widely used for analyzing crucial information. Each classifier extracts the exact sentiment from every existing entity. We use accuracy, recall, and precision as evaluation metrics for our analysis (Figs. 6 and 7).
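The three metrics can be computed with sklearn.metrics, as sketched below; y_true and y_pred are placeholder label sequences, and macro averaging is one reasonable choice for the three sentiment classes, not a setting stated in the paper.

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["positive", "negative", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "negative", "neutral"]

print("A:", accuracy_score(y_true, y_pred))                                     # 0.6
print("P:", precision_score(y_true, y_pred, average="macro", zero_division=0))  # 0.5
print("R:", recall_score(y_true, y_pred, average="macro"))                      # 0.5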

Fig. 6 Overall comparison of classifiers' accuracy

Fig. 7 Comparison of the proposed methodology with SVM against the baseline methodology with SVM

The table above (Table 6) gives the overall comparison of the classifiers in analyzing sentiment with our proposed method, where A denotes accuracy, R denotes recall, and P denotes precision. In our experimental analysis, SVM outperforms the rest: it achieves the highest accuracy of 98.2% when classifying negative sentiment, 96.5% for positive sentiment, and 96.1% for neutral sentiment. After SVM, Naïve Bayes performs best, with 95.2% accuracy for classifying negative sentiment. The corresponding comparison graph is given as follows.

The final comparison is between our proposed methodology with SVM and the baseline methodology with SVM. We took our SVM classifier and compared it with the baseline SVM classifier, finding that the proposed methodology with SVM analyzes sentiment better than the traditional SVM classifier; the corresponding table and comparison are given as follows (Table 7).

Table 7 Comparison of proposed Methodology with SVM along with baseline methodology with SVM

6 Conclusion and Future Work

Analyzing the mindset of people affected by acute diseases is a challenging task, and the dataset must be very accurate. Hence, we collected datasets from three different environments: reviews from social media, critical reviews from Twitter, and abstracts of medical studies from the Wall Street Journal. We then proposed a practical framework that differs from traditional approaches and includes four crucial techniques: enhanced pre-processing, N-gram tokenization, polarity score assignment, and topic modelling with Latent Dirichlet Allocation. For evaluation, we used classifiers prominently applied in medical applications: SVM, NB, DT, KNN, and RF. Among these, SVM performs best with our proposed method, achieving an accuracy of 98.2%. We also compared our proposed practical framework with a baseline approach to show that it analyzes the emotions of people affected by acute diseases efficiently. In future work, we will incorporate a deep learning approach to process larger feature sets of the dataset.