1 Introduction

According to worldwide cancer statistics, around 14.1 million people are affected by cancer. In 2018, an estimated 9.6 million deaths were due to cancer, and approximately 70% of them occurred in low- and middle-income countries. During diagnosis and treatment, it is vital to consider the mood and behaviour of cancer patients. Nowadays, many patients share their health conditions with different online health community portals, supporting organizations, pharmaceutical companies and cancer centres through social media. These activities raise public awareness of cancer, as patients express their views and opinions and give support to other patients in the portals. Today’s social media hosts many cancer support communities (Qiu et al. 2011; Whitten et al. 2002; Zhao et al. 2014) that provide an open channel for people to explore cancer-related issues. Typically, the bonds formed through these interactions provide support that could play an important role in survivorship (Shaw et al. 2000; Kim et al. 2012). Accordingly, sentiment analysis or opinion mining (Fang and Zhan 2015; Pang and Lee 2008) methods can help to track the feelings of cancer patients by examining the attitudes and behaviour expressed in their posts and comments. Over the past decade, a lot of work has been carried out on cancer and heart diseases, as their incidence keeps growing. Many studies have determined that “text mining” is helpful in medical research, particularly for diseases such as heart disease (Torii et al. 2015) and cancer. Text mining is needed because extracting useful knowledge from large amounts of unstructured data is a highly tedious task. In such cases, “Big Data” processing has proved useful and has gained much attention in recent years as it has become a very common requirement (Cheng and Lau 2015).
Nowadays, with rapid advances in technology, data can be collected faster and more easily. This raises the problem of how to discover useful knowledge from high volumes of data. Big Data processing gathers bulk data from various sources and organizes it in a meaningful way. Analysing large amounts of data helps in the discovery of useful knowledge and meaningful patterns. Thus, Big Data requires scalable and efficient solutions that let users reach knowledge at all levels without difficulty.

For this study, we used the Apache Spark framework with various text mining and machine learning strategies to examine and determine the sentiments of cancer patients (Tonks and Smith 1996). This framework extracts knowledge and information from a wide range of text and can be widely used in the field of medical research. The main objectives of this work are: to analyse the posts of various cancer patients obtained from different online peer support groups; and to understand and identify the methods that make the sentiment analysis task more convenient in the health care area, by studying the literature in this field and the methods it uses. Moreover, we also suggest an archetype design for the analysis of user sentiments and opinions from large-scale unstructured data. We also show that the proposed distributed framework is suitable for faster analysis and computing. The rest of this paper is organized as follows: Sect. 2 describes the related work in detail. Section 3 summarizes the proposed methodology. Section 4 describes the experimental results and, finally, the conclusion is presented in Sect. 5.

2 Related work

Recent literature has mainly analysed how social media helps the public to share health information. In this regard, many studies use distributed analytics (Ficek and Kencl 2012; Rahnama 2014), in which text mining strategies play a vital role in processing high volumes of unstructured data. Baltas and Tsakalidis (2017) performed Twitter sentiment analysis using Apache Spark with binary and ternary classification. Oneto et al. (2016) proposed a conventional extreme learning machine (ELM) model using a Spark cluster. Chen et al. (2016) proposed a scalable deep learning framework for mobile big data analytics using Apache Spark. Their results indicate that deep learning with Spark achieves higher performance than other Spark models. Nodarakis et al. (2016) also performed sentiment analysis using the Spark framework on large-scale data. Du et al. (2017) proposed an optimized machine learning system to extract sentiments from tweets related to HPV vaccines. They manually annotated 6000 tweets and performed hierarchical classification with an SVM model. Their results show a better performance, with an F-score of 0.6732, than other baseline models. Like general sentiment analysis, medical sentiment analysis has become an active research area. As an example, Denecke and Nejdl (2009) proposed a method to measure credibility and content quality in patient-generated content using subjectivity words. They also developed a medical ontology to assess factual content in the medical texts that appear in social media. Generally, sentiment analysis is performed with either rule-based or machine learning based approaches. In terms of methods, the majority of works use machine learning methods rather than rule-based approaches. Xia et al. (2009) proposed a multi-step opinion classification model to determine polarity in patient data. Cambria et al. (2012) proposed a framework that integrates Sentic PROMs with emotion analysis methods to measure healthcare quality. De la Torre-Díez et al. (2012) attempted to characterize breast cancer, diabetes and colorectal cancer content from social media groups. Later work characterized the relationships of cancer patients on Twitter (Murthy and Eldredge 2016). Portier et al. (2013) applied sentiment analysis techniques to detect negative emotions and unenthusiastic mood changes in a person based on interactions in online cancer communities. Crannell et al. (2016) analysed the sentiment of families who are psychologically influenced by the patient. Chen and Zeng (2017) analysed online e-liquid reviews by extracting e-liquid features. They performed sentiment analysis to classify the polarity of features obtained from large online e-liquid websites. Ozcift and Gulten (2011) studied how to improve the performance of machine learning algorithms in medical diagnosis. They combined one machine learning classifier with a CFS algorithm to evaluate the classification performance. The resulting model was assessed on three medical datasets and produced improved accuracy rates of 74.47%, 80.49% and 87.13% compared with the base classifiers.

Chen et al. (2017a, b) proposed a CNN-MDRP model to predict disease risk from structured and unstructured data. The experiment was carried out on a real-life hospital dataset and reached 94.8% accuracy, with a faster convergence speed than other existing algorithms. Lu (2013) proposed a topic identification model based on text classification. The proposed model was evaluated on data collected from online health communities using different feature sets and classification methods. Feature-based classification was also performed with C4.5, SVM and Naïve Bayes. The experimental results showed that SVM outperformed the other methods. Chen et al. (2017a, b) proposed a unique approach for improving sentence-level sentiment analysis. The evaluation was performed on different sentence-level sentiment analysis datasets in comparison with eleven approaches. The results show that their proposed approach outperforms the other existing methods. Lin et al. (2016) studied TCM clinic records and obtained a multi-relationship model by combining several features using a weighted LDA topic model. The proposed model achieved a better classification rate and provided novel support for TCM clinical research. Jonnalagadda et al. (2012) identified opinion leaders from 147,528 obesity news articles. They prepared a corpus with 734,204 samples and achieved 88.5% efficiency. A novel deep learning model was proposed by Manogaran et al. (2018) for heart disease diagnosis with multiple kernel learning. Minarro-Gimenez et al. (2014) applied neural language models to the PubMed corpus for the first time. They aimed to learn word representations from the large amount of PubMed text articles using skip-grams. Afterwards, interest in the neural language models CBOW and Skip-gram grew (Carod et al. 1997).
Later, TH et al. (2015) applied skip-gram and CBOW to 1.25 million PubMed articles and assessed the word embeddings with word pairs. Chiu et al. (2016) discussed how to train good word embeddings for biomedical NLP. Their experiments on two different corpora showed that the skip-gram model achieves better results than CBOW. Spinczyk et al. (2018) proposed a rule-based model for analysing the sentiments of patients suffering from anorexia nervosa. Using a bag-of-words approach, the sentiment terms are identified from the documents, which could help people to focus on specific topics during therapy.

2.1 Problem synopsis

To overcome the shortcomings found in the literature, the proposed approach serves as a novel way of discovering trust in a straightforward manner. In this study, we tested our approach with various supervised and unsupervised algorithms for determining opinions from health reviews. During opinion extraction, many challenges arose (Liang et al. 2014); these were solved by integrating a generative statistical model known as Latent Dirichlet Allocation (LDA). The identified opinions were evaluated after removing inappropriate terms from the LDA model using various feature selection and reduction approaches. The proposed work is organized as follows:

Input: A review corpus

Output: A predictive model

(i) Initially, a corpus of reviews was prepared by applying different pre-processing techniques.

(ii) Important features were extracted using the N-gram tokenization method.

(iii) TF-IDF was computed for each term to discover opinion polarity.

(iv) A probabilistic LDA (P-LDA) model was employed to combine the review data into distributed topics.

(v) Significant terms were selected by the Chi square feature selector.

(vi) Optimal terms were extracted by handling the curse of dimensionality using principal component analysis.

(vii) The reduced set of terms was classified with different classification models.

(viii) Finally, one classifier was chosen among all the models as the most accurate model, based on evaluation metrics such as efficiency and time complexity.

3 Methodology

3.1 Distributed computing framework

MapReduce (Ha et al. 2015) is a parallel and distributed paradigm that enables the processing of large-scale datasets across a Hadoop cluster (Madani et al. 2018). Basically, MapReduce reads its input from the Hadoop distributed file system (HDFS) and comprises two main tasks called mapper and reducer. The mapper is a base class that processes the input split assigned to a worker node and outputs key-value pairs. The sort and shuffle stages sort the data by the specified key and put it into a format that the reducer can consume. The reducer then collects the intermediate data and performs the transformation for each key. In general, MapReduce is not suitable for real-time data processing because it requires data to be shuffled over the network. For large datasets, the mapper and reducer take a long time to process and incur high latency. To overcome these limitations, we performed sentiment analysis on the Apache Spark framework, as depicted in Fig. 1.
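The mapper, sort-and-shuffle and reducer stages described above can be illustrated with a minimal single-process sketch. It is plain Python rather than Hadoop code, and the function names and the word-count task are assumptions chosen only for illustration.

```python
from collections import defaultdict

def mapper(split):
    """Emit a (word, 1) pair for every token in one input split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)

def map_reduce(splits):
    # In a real cluster each split is mapped on a different worker node.
    intermediate = [pair for split in splits for pair in mapper(split)]
    return dict(reducer(key, values) for key, values in shuffle(intermediate))

splits = [["cancer support group", "support helps"], ["cancer care"]]
counts = map_reduce(splits)
```

In this sketch the shuffle happens in memory; in a cluster the same grouping step moves data over the network, which is the source of the latency mentioned above.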

Fig. 1 Architecture of spark framework

Apache Spark is one of the fastest data processing frameworks; it is ten times faster than the MapReduce model and is used to address the limitations of that model. Spark does not write data to disk at each iteration; it processes the data in memory until the memory reaches its capacity. Once the memory is full, it spills the data to disk. As shown in Fig. 1, the Spark driver acts as the master node and several Spark workers act as worker nodes, and these worker nodes are handled by the Spark driver. Each Spark worker contains executors that connect to Spark to distribute the data across the cluster. The cluster manager is responsible for executing the tasks by instructing all the worker nodes in the cluster. The API, as a driver program, enables users to post a request and to get the reply in the form of a Spark session or Spark context. The Spark session or Spark context serves as the central point of communication with all the Spark workers. At its core, Spark works with resilient distributed datasets (RDDs) to provide services such as data collection, parallelism and identification of faulty nodes. In this work, the proposed approach employs data frames for operating on and storing the data. Data frames offer a more logical way of operating on tabular data. Spark performs two essential operations on RDDs, called transformations and actions.

A transformation is applied to an RDD to generate a new RDD from an existing one, and an action is then performed to collect the data from those RDDs. Figure 2 describes the overall implementation of the proposed system with the Spark framework. The framework starts with data collection and performs pre-processing on the collected data. Tables 1 and 2 present the detailed data statistics and the data pre-processing techniques that are used in this study. Later, feature extraction and feature selection are performed to extract relevant and more significant features from the pre-processed data. Finally, classification is performed on the obtained features in the distributed computing model. The following sub-sections explain the workflow in detail.
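The difference between transformations and actions can be sketched with a toy class. This is not Spark's API: `LazyDataset` and its methods are hypothetical names that only mimic the idea that a transformation returns a new dataset without computing anything, while an action triggers the computation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: apply the recorded steps in order and return the results.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

calls = []

def length(post):
    calls.append(post)  # records when the function is really evaluated
    return len(post)

posts = LazyDataset(["stay strong", "ok", "thank you all"])
lengths = posts.map(length).filter(lambda n: n > 2)
evaluated_before_action = len(calls)  # still 0: nothing has run yet
result = lengths.collect()            # the action triggers evaluation
```

Each transformation leaves the original dataset untouched, which mirrors the statement above that a new RDD is generated from an existing one.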

Fig. 2 Overview of proposed framework in spark cluster

Table 1 Statistics about the data
Table 2 List of example emoticons

3.2 Data collection

In this work, the efficiency of the proposed approach is evaluated on three health and medical datasets. First, we collected health tweets related to cancer from February 2, 2018 to October 2, 2018 using the Twitter API (footnote 1). We acquired 821,483 public tweets containing cancer-related terms from 438,072 users in various online cancer communities, as represented in Fig. 3. This data is described in detail in the sections below. Second, we utilized a benchmark Twitter dataset from the UCI Machine Learning Repository (footnote 2). This dataset contains health-related tweets collected from various Twitter accounts such as reutershealth, kaiserhealthnews, bbchealth, NBChealth, nytimeshealth, everydayhealth, foxnewshealth, goodhealth, latimeshealth, msnhealthnews, cbchealth, wsjhealth, usnewshealth, cnnhealth, gdnhealthcare, and nprhealth from August 2011 to December 2014. The third dataset is a larger dataset containing medical abstracts collected from the Wall Street Journal (footnote 3).

Fig. 3 Online cancer related communities in twitter social media

  1. https://developer.twitter.com/en/docs.html

  2. https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter

  3. https://sourceforge.net/projects/corpusredundanc/files/?source=navbar

3.3 Data pre-processing

The real-world data collected from various communities contains a lot of noise and must be pre-processed to extract the relevant features needed for modelling. The raw corpus is mostly of poor quality, so it is essential to pre-process the data before moving it into further phases of analysis. Initially, all the instances with emoticons and URLs were removed and then the significance of each instance was identified. Hashtags were replaced with the corresponding words without the hash mark.

Abbreviations, misspelled words and slang words are processed by using regular expressions. In order to build a meaningful corpus, all the less informative words are eliminated during this pre-processing phase. Pre-processing techniques such as sanitization, stop-word removal and tokenization have been applied to reduce the corpus size by removing unnecessary information. Through sanitization, all numerical information is removed and the text is converted from upper case to lower case. Lemmatization was performed to remove inflectional endings from the words. Stop-word removal eliminates all the English words that do not carry any necessary information. Additionally, we developed three wordlists with 1200 extremely positive words, 1800 extremely negative words and 53 stop-words of our own, which extend the existing positive, negative and stop-word lists. These wordlists were developed for understanding the influence of sentiment words in the data. The words in the wordlists, together with the emoticons, are manually annotated with positive, extremely positive, negative and extremely negative sentiment labels.
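A minimal sketch of the sanitization, hashtag handling, tokenization and stop-word removal steps is given below. The regular expressions and the tiny stop-word set are illustrative assumptions, not the lists used in this study, and lemmatization and the emoticon dictionary are left out because they need external resources.

```python
import re

# Illustrative stop-word set only; the study extends a standard English list.
STOP_WORDS = {"a", "an", "and", "the", "there", "is", "to", "of", "for", "my"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

def preprocess(tweet):
    """Return the cleaned tokens of one tweet."""
    text = URL_RE.sub(" ", tweet)          # drop URLs
    text = HASHTAG_RE.sub(r"\1", text)     # keep the hashtag word, drop '#'
    text = text.lower()                    # upper case to lower case
    text = NON_ALPHA_RE.sub(" ", text)     # drop digits, punctuation, emoticons
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess(
    "Day 12 of chemo: feeling HOPEFUL :) #CancerSupport https://example.org/x"
)
```

The order of the steps matters in this sketch: URLs are removed before punctuation is stripped, otherwise fragments of the address would survive as tokens.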

Generally, emoticons also carry sentiment and are habitually typed in tweets. Therefore, emoticons were taken from Wikipedia (Szegedy et al. 2015; Devi et al. 2018) and an emoticon dictionary with 140 emoticons was developed by labelling them as positive, negative, extremely positive or extremely negative. Table 1 gives the preliminary statistics of the collected corpus and Table 2 shows examples of the emoticon classification used in this work. Table 3 shows examples from the developed wordlists.

Table 3 Manually developed wordlist

3.4 Feature extraction and selection

After pre-processing, N-gram tokenization (Timusk et al. 1995; Aisopos et al. 2011; Dey et al. 2018) was performed to extract the features by partitioning the text into a number of tokens. Basically, an N-gram is a sequence of N consecutive tokens, and it is mostly used to predict the next tokens. Further, sentence-level annotation and summarization are performed to extract the opinion words from the N-gram dictionary. This N-gram dictionary is manually annotated according to the pre-processed data and, at summarization, terms like “a”, “and”, “the”, “there”, etc. were eliminated from the review sentences because they do not carry any necessary information. Figure 4 shows the summary of each sentence with a frequency. After summarization and manual annotation, the score of each feature was computed to find the number of positive, negative and neutral words among the summarized features, as tabulated in Table 2. The score was computed separately for all positive words and negative words, including their emoticons. Further, the opinion terms were identified and transformed into feature vectors. These feature vectors were weighted with the Term Frequency and Inverse Document Frequency (TF-IDF) measure (Vittayakorn et al. 2016). The significance of a feature vector is measured using Eqs. (1) and (2).

$$ TFIDF\left( {t,d,D} \right) = TF\left( {t,d} \right) \cdot IDF\left( {t,D} \right) $$
(1)
$$ IDF\left( {t,D} \right) = log\frac{N}{{\left| {\left\{ {d \in D:t \in d} \right\}} \right|}} $$
(2)

where ‘t’ denotes a term and \( TF\left( {t,d} \right) \) is the number of times it occurs in a document ‘d’. N represents the total number of documents in the corpus ‘D’ and \( \left| {\left\{ {d \in D:t \in d} \right\}} \right| \) represents the number of documents in ‘D’ in which the term ‘t’ appears. Based on this calculation, all the weighted feature vectors are passed to Latent Dirichlet Allocation (Miura et al. 2013) through a pipeline. LDA uses a “Bayesian” optimizer to extract relevant features by transforming them into a number of topics. Each topic in LDA is considered as a “dimension”, and the following sub-section describes the working of the LDA model in more detail.
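As a worked illustration of N-gram tokenization and of Eqs. (1) and (2), the sketch below computes N-grams and unsmoothed TF-IDF weights in plain Python. The raw term count is assumed for TF, the natural logarithm is assumed for the log, and the function names are hypothetical; library implementations often add smoothing to the IDF ratio, which is not shown here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumed TF definition)
        weights.append(
            {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}  # Eqs. (1), (2)
        )
    return weights

bigrams = ngrams(["feeling", "very", "hopeful"], 2)
docs = [["good", "news", "good"], ["bad", "news"]]
weights = tf_idf(docs)
```

A term that appears in every document, such as "news" here, receives a weight of zero, which is why TF-IDF suppresses uninformative terms.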

Fig. 4 Example word cloud for summarized sentiment words of cancer related terms

3.4.1 Topic modelling with Latent Dirichlet Allocation (LDA)

LDA (Bashri and Kusumaningrum 2017) is an unsupervised probabilistic method that models a corpus as a set of topics, where each topic is modelled as a distribution over words. In this work, LDA is applied to the feature vectors to model the text in the corpus, following a hierarchical Bayesian approach. Assume there are ‘t’ topics in a corpus ‘c’, where the corpus contains a number of reviews ‘r’. Topics in the corpus can be considered as multinomial probability distributions over feature vectors, and each review in the corpus is randomly generated by ‘k’ topics. Equation (3) represents the feature extraction process with LDA. A feature vector ‘f’ is obtained by combining the topics ‘t’ for ‘r’ in corpus ‘c’ and is sampled from each word distribution.

$$ P\left( {f_{i} } \right) = \mathop \sum \limits_{k = 1}^{N} P\left( {f_{i} |t_{i} = k} \right)P\left( {t_{i} = k} \right) $$
(3)

where \( P\left( {t_{i} = k} \right) \) represents the probability of topic \( k \) being sampled for feature \( f_{i} \) for each review in corpus c. \( P\left( {f_{i} |t_{i} = k} \right) \) represents the probability of \( f_{i} \) under topic \( k \) and \( N \) denotes the total number of topics. The above equation can be simplified by writing \( \phi^{k} = P\left( {f_{i} |t_{i} = k} \right) \), which refers to a multinomial distribution of feature vectors for topic k, and \( \theta^{r} = P\left( t \right) \), which refers to a multinomial distribution of topics for review \( r \). \( \phi \) and \( \theta \) are the estimated parameters defined for the semantic representation of feature vectors and reviews. Here, \( N_{r} \) denotes the number of features in a review ‘\( r \)’ and R represents the total number of reviews. \( \alpha \) and \( \beta \) are the hyperparameters of the review-topic and topic-word Dirichlet distributions, respectively, required in the process of corpus generation (Yu et al. 2017). \( \theta \) is a review-level variable sampled once per \( r \), and f is a feature-level variable sampled once for each of the \( N_{r} \) words in r. At the same time, if there are any entailed features, it is hard to score them directly. Hence, Gibbs sampling is used to overcome this limitation by acquiring the desired parameter values. It determines the number of topics from the review corpus and generates the probability distributions of the topics from reviews and features. The Gibbs sampler uses a conditional probability, which is given in Eq. (4),

$$ P\left( {t_{i} = k\left| {t_{ - i} ,f_{i} ,r_{i} , \ldots } \right.} \right) \propto \frac{{C_{{f_{ij} }}^{FN} + \beta }}{{\mathop \sum \nolimits_{f = 1}^{F} C_{{f_{j} }}^{FN} + F \cdot \beta }} \cdot \frac{{C_{{r_{ij} }}^{RN} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{N} C_{{r_{j} }}^{RN} + N \cdot \alpha }} $$
(4)

where \( t_{i} = k \) indicates that feature \( f_{i} \) is allocated to topic \( k \), and \( t_{ - i} \) denotes the topic allocations of all features other than \( f_{i} \). \( r_{i} \) represents the ith review. \( R \) represents the number of reviews and \( F \) denotes the number of features used in this study. \( C^{FN} \) and \( C^{RN} \) represent the feature-topic and review-topic count matrices, respectively. \( C_{{f_{j} }}^{FN} \) denotes the number of times a feature is assigned to topic \( k \), excluding the current word \( f_{i} \). \( C_{{r_{j} }}^{RN} \) denotes the number of times topic \( k \) is assigned to some word in a review \( r \), excluding \( f_{i} \). By sampling, the parameters \( \theta \) and \( \phi \) are obtained as follows,

$$ \hat{\phi }_{i}^{\left( k \right)} = \frac{{C_{{f_{ij} }}^{FN} + \beta }}{{\mathop \sum \nolimits_{f = 1}^{F} C_{{f_{j} }}^{FN} + F \cdot \beta }} $$
(5)
$$ \hat{\theta }_{j}^{\left( r \right)} = \frac{{C_{{r_{ij} }}^{RN} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{N} C_{{r_{j} }}^{RN} + N \cdot \alpha }} $$
(6)

In this work, LDA is used to discover the text in the review corpus and fuse it into latent topics. Table 4 shows an example of probabilistic LDA topics on the review corpus. We modelled the review corpus with \( K = \) 100, 200, 300, 400 and 500 topics. The important features were identified from the topics based on the probabilities associated with each feature. The feature selection used to handle the curse of dimensionality is presented in the next section.
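The sampling step of Eq. (4) and the estimates of Eqs. (5) and (6) can be sketched with a small collapsed Gibbs sampler. This is a minimal single-machine illustration, not the distributed implementation used in this study; the function name, the default hyperparameter values and the toy corpus are assumptions. The review-side denominator of Eq. (4) does not depend on the topic and is therefore dropped inside the sampling loop.

```python
import random

def lda_gibbs(reviews, n_topics, n_features, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampler; reviews are lists of integer feature ids."""
    rng = random.Random(seed)
    c_fn = [[0] * n_topics for _ in range(n_features)]  # feature-topic counts
    c_rn = [[0] * n_topics for _ in reviews]            # review-topic counts
    per_topic = [0] * n_topics                          # features assigned per topic
    assign = [[rng.randrange(n_topics) for _ in review] for review in reviews]

    def update(r, f, k, delta):
        c_fn[f][k] += delta
        c_rn[r][k] += delta
        per_topic[k] += delta

    for r, review in enumerate(reviews):
        for f, k in zip(review, assign[r]):
            update(r, f, k, +1)

    for _ in range(iters):
        for r, review in enumerate(reviews):
            for i, f in enumerate(review):
                update(r, f, assign[r][i], -1)  # exclude the current feature
                weights = [
                    (c_fn[f][k] + beta) / (per_topic[k] + n_features * beta)
                    * (c_rn[r][k] + alpha)
                    for k in range(n_topics)
                ]  # unnormalized Eq. (4)
                assign[r][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(r, f, assign[r][i], +1)

    # Eq. (5): feature distribution of each topic.
    phi = [[(c_fn[f][k] + beta) / (per_topic[k] + n_features * beta)
            for f in range(n_features)] for k in range(n_topics)]
    # Eq. (6): topic distribution of each review.
    theta = [[(c_rn[r][k] + alpha) / (len(review) + n_topics * alpha)
              for k in range(n_topics)] for r, review in enumerate(reviews)]
    return phi, theta

reviews = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
phi, theta = lda_gibbs(reviews, n_topics=2, n_features=4)
```

Each row of `phi` and of `theta` is a probability distribution, so the rows sum to one regardless of the random seed.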

Table 4 An example LDA model with k number of topics

3.4.2 Dimensionality reduction and feature selection with Chi square and PCA

The Chi square selector (Meesad et al. 2011) and principal component analysis (Underhill et al. 2007; Vinodhini and Chandrasekaran 2014, 2015) are used to extract relevant features from the LDA topic model for use in model construction. LDA yields a vast number of dimensions (topics), so a feature selection model is required to handle the curse of dimensionality. Both models reduce the feature space, which can also improve training speed and learning performance. In this paper, the Chi square selector is used to extract highly relevant features from all the dimensions, and PCA is then applied to reduce the number of dimensions further. To obtain principal components (uncorrelated variables), PCA computes orthogonal transformations of the variables that constitute the dimensions of the existing features. In general, PCA deals mostly with variability rather than correlation, so Chi square is used to find the variables with a high degree of correlation.
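The Chi square selection step can be illustrated with the standard statistic for a 2×2 term/class contingency table. The sketch covers only the Chi square part of this section (PCA is not shown); the function names and the toy counts are assumptions made for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

def select_top_k(tables, k):
    """Keep the k terms whose occurrence depends most strongly on the class."""
    scores = {term: chi_square(*cells) for term, cells in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical counts over 100 documents, 50 of them in the positive class.
tables = {
    "hope": (40, 10, 10, 40),
    "the": (25, 25, 25, 25),
    "pain": (5, 45, 45, 5),
}
selected = select_top_k(tables, 2)
```

A term that is spread evenly over both classes, such as "the" here, scores zero and is dropped, while terms tied to either class are kept.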

3.5 Text categorization and sentiment analysis with LSTM

In this work, we explored five different types of classifiers available in the Spark MLlib platform to perform sentiment analysis. Multinomial Logistic Regression (MLR), Multinomial Naive Bayes (NB), Linear Support Vector Machine (LSVC), Multilayer Perceptron (MLP), and Decision Tree (DT) were used for classification. A Long Short-Term Memory (LSTM) classifier (Liang et al. 2017) is called from PySpark for the classification of opinions in the proposed model. MLR (Hamdan et al. 2015) generalizes logistic regression, a binary classifier used to model dichotomous outcome variables, to more than two classes. It uses a logistic function to determine the correlation between the sample class and the features extracted from the input. MLR handles the multi-class problem by arbitrarily selecting one target class as a reference class and fitting (N-1) independent binary logistic regression models that compare each of the remaining classes to the reference class. The limitation of the MLR model is that it cannot handle data with a large number of target classes. Additionally, it requires a larger dataset to obtain better performance. NB is a well-known binary classifier (Brody and Davidson 1998) which assumes that, for any given label \( b \), the features \( a_{i} \) are conditionally independent, as expressed in Eqs. (7) and (8):

$$ P\left( {b|a_{1} , \ldots ,a_{\omega } } \right) \propto P\left( b \right)\mathop \prod \limits_{i = 1}^{\omega } P(a_{i} |b) $$
(7)
$$ P\left( b \right)\mathop \prod \limits_{i = 1}^{\omega } P\left( {a_{i} |b} \right) \to \hat{b} = \mathop {\arg \max }\limits_{b} P\left( b \right)\mathop \prod \limits_{i = 1}^{\omega } P\left( {a_{i} |b} \right) . $$
(8)

In Eqs. (7) and (8), \( \omega \) denotes the number of features in a review with positive or negative label \( b. \) In this paper, however, we used Multinomial Naïve Bayes (MNB) (Vittayakorn et al. 2016; Szegedy et al. 2015; Yu et al. 2017). Multinomial Naive Bayes is a probabilistic learning method used for efficient document classification. Initially, the probability of each class is computed with the following equation:

$$ P\left( C \right) = \frac{{T_{C} }}{{T_{R} }} $$
(9)

where \( T_{C} \) represents the number of reviews labelled with class ‘C’ and \( T_{R} \) represents the number of reviews given for training. Then, the probability of a review belonging to each class is computed with:

$$ P\left( {C|R} \right) = P\left( C \right)\mathop \prod \limits_{i = 1}^{m} P(a_{i} \in R|C) . $$
(10)

According to Eq. (10), the class with the highest probability is assigned to the review ‘R’. LSVC with a linear kernel function supports only binary classification (Esuli and Sebastiani 2006). LSVC from Spark’s ML library is well suited to scalable datasets, and it is widely used in machine learning for classification tasks that are posed as optimization problems. SVM finds an optimal hyperplane that acts as a separator between two classes and identifies the maximum margin between the two classes, which is why it is called a maximum margin classifier.
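Returning to the Naive Bayes scoring of Eqs. (9) and (10), a minimal multinomial sketch is given below. It works in log space to avoid numerical underflow and uses add-one (Laplace) smoothing for the term likelihoods; the smoothing, the function names and the toy training data are assumptions and are not taken from this study.

```python
import math
from collections import Counter, defaultdict

def train_mnb(reviews, labels, smoothing=1.0):
    """Estimate log priors (Eq. 9) and smoothed log term likelihoods."""
    vocab = {term for review in reviews for term in review}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for review, label in zip(reviews, labels):
        term_counts[label].update(review)
    # Eq. (9): P(C) = T_C / T_R
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {
            t: math.log((term_counts[c][t] + smoothing) / total) for t in vocab
        }
    return log_prior, log_like

def predict_mnb(model, review):
    """Eq. (10) in log space: return the class with the highest score."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in review if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

train_reviews = [["great", "support"], ["feeling", "hopeful"],
                 ["terrible", "pain"], ["pain", "worse"]]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_mnb(train_reviews, train_labels)
label_pain = predict_mnb(model, ["pain"])
label_hope = predict_mnb(model, ["hopeful", "support"])
```

Terms that were never seen in training are skipped at prediction time in this sketch, which is one simple way of handling out-of-vocabulary words.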

The larger the margin around the hyperplane, the better the generalization for the classification of data. MLP is a feed-forward artificial neural network that maps a set of inputs onto a set of suitable outputs. Generally, an MLP consists of multiple layers of nodes, and each layer is connected to the next layer to form a network. The nodes of the hidden layer use an activation function based on the sigmoid function, whereas the output layer nodes use an activation function based on the Softmax function. For network training, the MLP contains multiple layers of computational units connected in a feed-forward way.

$$ {\text{Sigmoid Function}}:\sigma \left( {z_{i} } \right) = \frac{1}{{\left( {1 + e^{{ - z_{i} }} } \right)}} . $$
(11)
$$ {\text{Softmax Function}}:f\left( {z_{i} } \right) = \frac{{e^{{z_{i} }} }}{{\left( {\mathop \sum \nolimits_{k = 1}^{N} e^{{z_{k} }} } \right)}} . $$
(12)

‘N’ denotes the number of nodes in the output layer, and \( z_{i} \) is computed as \( z_{i} = w_{i} x + b_{i} \), where \( b_{i} \) is the bias and \( w_{i} \) is the weight of the \( i{\text{th}} \) node.
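The two activation functions of Eqs. (11) and (12) can be written directly in a few lines. The sketch subtracts the maximum input before exponentiating in the Softmax, a common numerical-stability step that does not change the result; this detail and the function names are assumptions added for illustration.

```python
import math

def sigmoid(z):
    """Eq. (11): squashes a real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Eq. (12): turns the output-layer values into class probabilities."""
    shift = max(zs)  # stability shift; cancels out in the ratio
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

hidden_activation = sigmoid(0.0)
class_probabilities = softmax([1.0, 2.0, 3.0])
```

The Softmax outputs are positive and sum to one, so the largest input always receives the largest probability.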

LSTM is a special class of recurrent neural network (RNN) that is able to learn long-term dependencies (Soutner and Müller 2013) and overcomes the limitations of the RNN. LSTM takes the input terms of a sentence in a distributed term representation, which represents each term of the vocabulary with continuous values. Each term \( w \) in dictionary \( W \) is embedded into an n-dimensional space (\( L \in R^{n \times \left| W \right|} \)). Typically, an LSTM network contains a cell state \( C_{i} \) and a hidden state \( h_{i} \). It also contains a set of recurrently connected memory cells, each of which consists of three multiplicative units: the forget unit \( F_{i} \), the input unit \( I_{i} \) and the output unit \( O_{i} \), with weights \( W_{F} , W_{I} , W_{O} \) and biases \( B_{F} , B_{I} , B_{O} \) respectively. These multiplicative units enable the LSTM memory cell to execute operations such as read, write and reset, and allow the memory cell to access and store information over long periods of time. “σ” is the sigmoid function used in the input, forget, and output units to generate values between 0 and 1. The following equations describe an LSTM memory cell:

$$ {\text{I}}_{\text{i}} = \, \sigma \left( {{\text{W}}_{\text{I}} \left[ {{\text{x}}_{{{\text{i}},}} {\text{h}}_{{{\text{i}} - 1}} } \right] + {\text{B}}_{\text{I}} } \right) $$
(13)

Input gate’s function ‘Ii’ generates new memory state if the significance of the new word is considerable. Based on the input and past hidden states, input gate determines the worth of preserving the new word, and thus allows creation of new memory.

$$ {\text{F}}_{\text{i}} = \, \sigma \left( {{\text{W}}_{\text{F}} \left[ {{\text{x}}_{\text{i}} ,{\text{ h}}_{{{\text{i}} - 1}} } \right] + {\text{B}}_{\text{F}} } \right) $$
(14)

The forget gate ‘Fi’ is similar to the input gate, but it determines whether the past memory cell is useful for computing the current memory cell. The forget gate acts on the input word and the past hidden state and produces Fi.

$$ \tilde{C}_{\text{i}} = { \tanh }\left( {{\text{W}}_{\text{C}} \left[ {{\text{x}}_{\text{i}} ,{\text{ h}}_{{{\text{i}} - 1}} } \right] + {\text{B}}_{\text{C}} } \right) $$
(15)

where ‘\( \tilde{C}_{i} \)’ is the new memory, which is based on the new word ‘xi’ and the past hidden state ‘hi−1’, with weight \( W_{C} \) and bias \( B_{C} \).

$$ {\text{C}}_{\text{i}} = {\text{ F}}_{\text{i}} \times {\text{C}}_{{{\text{i}} - 1}} + {\text{ I}}_{\text{i}} \times {\tilde{\text{C}}}_{\text{i}} $$
(16)

Based on the outcome of the forget gate ‘Fi’, the past memory ‘Ci−1’ is retained or left out at this stage. In the same way, the outcome of the input gate Ii scales the new memory \( \tilde{C}_{i} \). The model then sums these two results to produce the final memory ‘Ci’.

$$ {\text{O}}_{\text{i}} = \, \sigma ({\text{W}}_{\text{O}} [{\text{x}}_{\text{i}} ,{\text{h}}_{{{\text{i}} - 1}} ] + {\text{B}}_{\text{O}} ) $$
(17)
$$ {\text{h}}_{\text{i}} = {\text{O}}_{\text{i}} \times { \tanh }\left( {{\text{C}}_{\text{i}} } \right) $$
(18)

The output gate ‘Oi’ determines when to output the value stored in the memory cell to the hidden layer. ‘hi’ is the new hidden state, computed by pointwise multiplication of the output gate and the tanh of the new cell state.
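To make the data flow of Eqs. (13)–(18) concrete, the following is a minimal pure-Python sketch of a single LSTM step. It is not the implementation used in the experiments: the dimensions, the constant toy weights and the helper names (`affine`, `lstm_step`, `toy_params`) are assumptions chosen only for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Return W v + b for a matrix W (list of rows), a vector v and a bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following Eqs. (13)-(18); p holds the weights and biases."""
    xh = x + h_prev                                                    # [x_i, h_{i-1}]
    gate_i = [sigmoid(a) for a in affine(p["W_I"], xh, p["B_I"])]      # Eq. (13)
    gate_f = [sigmoid(a) for a in affine(p["W_F"], xh, p["B_F"])]      # Eq. (14)
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], xh, p["B_C"])]   # Eq. (15)
    c = [f * cp + i * ct
         for f, cp, i, ct in zip(gate_f, c_prev, gate_i, c_tilde)]     # Eq. (16)
    gate_o = [sigmoid(a) for a in affine(p["W_O"], xh, p["B_O"])]      # Eq. (17)
    h = [o * math.tanh(cv) for o, cv in zip(gate_o, c)]                # Eq. (18)
    return h, c

def toy_params(n_in, n_hid, weight):
    """Hypothetical parameters: every weight equals `weight`, every bias is zero."""
    p = {}
    for name in ("I", "F", "C", "O"):
        p["W_" + name] = [[weight] * (n_in + n_hid) for _ in range(n_hid)]
        p["B_" + name] = [0.0] * n_hid
    return p

params = toy_params(n_in=2, n_hid=2, weight=0.1)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):   # three made-up word vectors
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the same parameters are reused at every step, the cell state `c` is the only channel through which earlier words influence later hidden states, which is the mechanism Eq. (16) describes.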

4 Experimental results and discussions

This section presents a detailed study of the corpora and of the performance of the proposed method in terms of efficiency and computational complexity. Initially, a cluster was set up with 6 computing nodes, each running a Linux operating system with 8 GB RAM, a 2.4 GHz processor and a 1 TB hard disk. One of the six nodes served as the master node and the others as data nodes. The applications were built on Spark version 2.3.0, using the PySpark library (the Python API for Spark) on top of Hadoop. The proposed approach was executed on Spark data frames, which take the input in tabular form.

4.1 Feature analysis and performance evaluation with running time on corpus 1, 2, 3

The corpus was processed according to the following rules: (1) significant terms associated with cancer must be present, (2) duplicate texts are removed and (3) non-English characters other than emoticons are removed from the text, to avoid the difficulty of handling multilingual tweets. Most of the tweets are short and contain many odd words, which makes the sentiment analysis task more complex. The resulting dataset is termed corpus-1, and the other benchmark datasets used in this study are termed corpus-2 and corpus-3. The aim of using corpus-1 is to determine the opinions of people on cancer across various online social media communities. During the pre-processing phase, all similar tweets were discarded from the corpora, as they do not provide any additional information. Table 5 describes the statistics of the terms extracted with different information retrieval techniques. After pre-processing, corpus-1 comprised 680,193 tweets with 7,652,217 terms, which were reduced to 7,173,144 terms by summarization. All stop-words and sparse terms were removed during pre-processing, leaving 6,503,347 terms with TF-IDF. Further, the tokenized terms numbered 577,318, 288,660 and 192,442 for unigrams, bigrams and trigrams, respectively, and were assigned the polarity labels positive, negative and neutral.
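The three corpus-construction rules can be sketched as a simple filter. The keyword list, the sample tweets and the interpretation of "non-English characters" as anything outside printable ASCII (so that ASCII emoticons such as ":)" survive) are assumptions made for this illustration, not the exact rules used to build corpus-1.

```python
import re

# Hypothetical keyword list standing in for the cancer-related terms of rule (1).
CANCER_TERMS = {"cancer", "chemo", "tumor", "oncology"}

def clean(text):
    """Rule (3), as assumed here: drop characters outside printable ASCII."""
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(tweets):
    seen, kept = set(), []
    for raw in tweets:
        text = clean(raw)
        key = text.lower()
        if not any(term in key for term in CANCER_TERMS):   # rule (1): cancer term present
            continue
        if key in seen:                                      # rule (2): drop duplicates
            continue
        seen.add(key)
        kept.append(text)
    return kept

tweets = [
    "Finished chemo today :)",
    "finished chemo today :)",
    "Lovely weather this morning",
    "Cancer support group tonight 勇気",
]
print(preprocess(tweets))
```

In this sketch, duplicates are detected after cleaning and lower-casing, so tweets that differ only in case or in removed characters count as the same text.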

Table 5 Statistics of terms extracted using different information retrieval techniques

Similarly, corpus-2 consists of 58,927 health news tweets with 395,635 terms. After pre-processing, 52,317 tweets were obtained, with 285,987 terms after summarization. All stop-words and sparse terms were removed during pre-processing, leaving 11,974 terms with TF-IDF. The tokenized terms numbered 1204 for unigrams, 603 for bigrams and 402 for trigrams, and were assigned the polarity labels positive, negative and neutral. Further, corpus-3 contains a total of 34,611 medical documents with 7,964,227 terms. After tokenization and summarization, 7,236,137 terms remained. After removing stop-words and sparse terms, 6,592,318 terms were obtained with TF-IDF, and 601,498 were considered for unigrams, 300,750 for bigrams and 200,450 for trigrams. Sample unigrams, bigrams and trigrams extracted from a sample text review of corpus 1, 2 and 3 are presented in Table 6.
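As a rough illustration of the TF-IDF weighting referred to above, the sketch below scores the terms of three toy documents. It assumes raw term counts for TF and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is the variant documented for Spark MLlib; the exact variant used in the experiments is not stated, and the documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw count of the term in the document; IDF is the smoothed
    form log((N + 1) / (df + 1)), assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf})
    return weights

# Made-up, already tokenized documents.
docs = [
    ["chemo", "support", "hope"],
    ["chemo", "pain"],
    ["support", "hope", "hope"],
]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```

Terms that occur in few documents, such as "pain" here, receive a higher weight per occurrence than terms spread across the collection.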

Table 6 List of unigram, bigram and trigrams from a sample text review of Corpus 1, 2 and 3

In this work, we chose to work with bigrams because many previous works have shown that bigrams capture more sentiment information and achieve better results than other n-grams (Ando et al. 2002; Barry 2017). The extracted bigrams were manually annotated by adding medical terms to available sentiment dictionaries such as SentiWordNet and labelling them as positive, negative or neutral. Later, these reduced terms of the various corpora were modelled into several topics using LDA, as shown in Figs. 5, 6 and 7.
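For reference, n-gram extraction in its usual sliding-window form can be sketched as follows. The sentence is invented, and the exact tokenizer used in the experiments is not specified, so this shows only the standard definition.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence, tokenized on whitespace.
tokens = "the treatment gave me new hope".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

A bigram such as ("new", "hope") keeps a modifier together with the word it modifies, which is one reason bigrams can carry more sentiment information than isolated unigrams.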

Fig. 5 No. of reduced topics with different feature extraction techniques when N = 300

Fig. 6 No. of reduced topics with different feature extraction techniques when N = 200

Fig. 7 No. of reduced topics with different feature extraction techniques when N = 100

Each topic in LDA yields the top terms of each tweet based on their frequency. As shown in Figs. 5, 6 and 7, LDA yields a higher number of dimensions with methods such as Word2vec, TF-IDF and Doc2vec. For topic modelling, we set N to 100, 200 and 300 topics, based on the size of the extracted features. At this stage, feature selection is required to obtain an optimal number of dimensions and achieve better accuracy. To this end, SVD and the Chi square selector (CSS) were applied to extract the more significant features, and PCA was then employed to handle the curse of dimensionality on the corpora. Finally, the optimal topics obtained with CSS and PCA were used for classification, for easier and faster computation.
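The idea behind Chi square feature selection can be illustrated with the Pearson chi-square statistic of a contingency table between a feature and the class label: the larger the statistic, the more strongly the feature depends on the class. The two tables below are invented counts (rows: feature present or absent; columns: positive, negative, neutral), and the sketch assumes that no row or column total is zero.

```python
def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the class were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts for two candidate features.
tables = {
    "topic_a": [[30, 5, 5], [10, 25, 25]],    # concentrated in the positive class
    "topic_b": [[10, 10, 10], [20, 20, 20]],  # spread evenly over the classes
}
ranked = sorted(tables, key=lambda name: chi_square(tables[name]), reverse=True)
print(ranked)
```

A selector of this kind keeps the highest-ranked features and discards the rest; here "topic_b" is independent of the class and would be dropped first.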

First, singular value decomposition (SVD) was applied to corpus-1 with 300 topics, extracting significant features with 230 topics. Chi square was then applied as a feature selector to obtain 185 relevant topics. Finally, PCA was applied as a dimensionality reduction approach, yielding an optimal 117 features. The same topic modelling and feature selection were also carried out with 200 and 100 topics. The same approach was then applied to the other two datasets, and the resulting features were classified with LSTM and with various machine learning classifiers available in MLlib. The performance of each algorithm was assessed with metrics such as accuracy and running time, using tenfold cross validation. The trained LSTM model performed best among the classifiers. Tables 7 and 8 present the results achieved on all corpora by MLR, NB, LSVC, DTree, MLP and LSTM on both single-node and multi-node cluster machines. The results show that the proposed framework performed efficiently in most cases, with an accuracy of 97.84% on corpus-1, 88.37% on corpus-2 and 84.1% on corpus-3. Finally, the running times of all models with all feature extraction and feature selection techniques (Yan et al. 2012) on a single node and on a multi-node cluster are shown in Figs. 8 and 9.
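The tenfold cross validation used for assessment can be sketched as follows. The function names and the sample size are illustrative assumptions; the sketch splits the indices in order without shuffling, whereas a practical evaluation would normally shuffle the samples first.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k (train, test) index pairs.

    Every sample appears in exactly one test fold; fold sizes differ by at most one.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

folds = k_fold_indices(25, k=10)
print([len(test) for _, test in folds])
```

A classifier is trained on each `train` split and scored with `accuracy` on the matching `test` split, and the ten scores are then averaged to give the reported figure.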

Table 7 Performance evaluation of different classifiers on single node with various feature selection techniques
Table 8 Performance evaluation of different classifiers on a cluster with various feature selection techniques
Fig. 8 Time complexities achieved on corpus-1, 2 and 3 using various feature extraction methods on a single node

Fig. 9 Time complexities achieved on corpus-1, 2 and 3 using various feature extraction methods on multi-node cluster

4.2 Discussions and findings

This work extends the body of literature that emphasizes the significance of machine learning for publicly available large-scale datasets. Working with social media allows us to examine discussions happening outside the public health space. The outcome of this work is an example of using various machine learning methods to assess the vast social media landscape around cancer. We performed sentiment analysis of cancer patients by collecting 821,483 user tweets from various online cancer communities between February 2, 2018 and October 2, 2018 using the Twitter API. Our analysis found that the majority of people with cancer openly express positive feelings on social media regarding their support, treatment and awareness. However, our study also found some errors in the positive, negative and neutral categories, which account for the low accuracy obtained with some existing models. To examine these errors, we analysed the misclassified samples in each corpus and identified three major causes: sarcastic sentences, slang words and word indistinctness. The first issue arises when the polarity is difficult to determine. For example, a text that combines two sentences with different polarities is regarded as sarcastic. Resolving such cases remains a challenging task in opinion mining. The second issue concerns slang words, which are commonly coined on social media. They affect the text because words are run together without spaces, giving a new meaning to the original text. Updating the existing polarity lexicon is the only solution to this challenge, and updating it continually may be difficult. The third problem concerns the difficulty of assigning polarities to words. Depending on the context, a simple word may have many meanings, which makes it difficult to detect its polarity. To solve this issue, words connected with each other must be considered together for successful opinion identification.

The framework presented in this study addresses those issues and can be applied to find similar knowledge about other public health-related topics. This study used a manually annotated corpus to train machine learning classifiers to analyse the sentiment of cancer-related content on Twitter using the Spark framework. Among the feature extraction and feature selection techniques examined, bigrams with LDA, the Chi square selector and PCA outperformed other models such as doc2vec, word2vec and SVD, and were generally found to be successful in extracting more significant and optimal features. According to the literature, manual annotation of sentiment data in the medical field gives better results because it produces a quality dataset. Generally, about 30% of the content on social media platforms is non-relevant information, which restricts the accuracy of real-time sentiment analysis. Automatic annotation requires considerable investment in preparing the lexicon; scripting the automated tagging is a complex task and results in the loss of syntactic information and its relations. Manual annotation, on the other hand, overcomes this problem with minimal enhancement of the lexicon; its main limitation is the time taken to construct the training data. The findings from the above figures and tables assess the accuracy of the classification methods on the various corpora using the proposed approach on both a single node and a cluster. Apache Spark, a distributed computing framework, supports data processing and querying on large-scale datasets, and was found to be highly productive and fast in real-time data analytics. LSTM handles time series data better than CNN and FCNN because it can use its internal memory to process text sequences of arbitrary length. Word embeddings are also retained more effectively in LSTMs than in CNNs. The experiments show that the proposed work outperforms CNN on all three corpora. Popular models such as NB, LSVC, DTree and MLP also show good accuracy rates; they do not attain the highest accuracy but are successful in terms of runtime. These results indicate that the proposed approach is suitable for performing sentiment analysis in the medical research field.

5 Conclusion and future scope

In this work, we implemented a distributed framework to analyse the mood of cancer-affected patients from various online cancer support communities. The corpus was constructed by collecting patient reviews from various domains using the Twitter API. It was manually annotated and pre-processed to remove unnecessary information. Later, feature extraction followed by N-gram tokenization was employed to extract highly relevant features using LDA topic modelling. The performance of the proposed work was evaluated and compared across various classifiers such as MLR, NB, LSVC, MLP, DTree and LSTM. The results show that the proposed approach performs best on both single-node and multi-node machines. Finally, we found that the majority of patients expressed positive feelings about their disease, while some expressed negative or neutral feelings. In the immediate future, the proposed methodology can be extended with feature extraction and feature selection techniques that work more efficiently in a distributed environment. Furthermore, we plan to propose an improved machine learning model that can process huge volumes of data at a faster rate. We also expect further extensions of this methodology to advance sentiment analysis in the health care field and to provide valuable contributions for future researchers.