
1 Introduction

Sentiment analysis is the process of systematically identifying, extracting, quantifying, and studying subjective information by means of natural language processing, computational linguistics, biometrics, and text analysis. The sentiment analyzer assesses whether a piece of writing is positive, neutral, or negative. Sentiment analysis helps data analysts within large enterprises conduct market research, gauge public opinion, monitor brand reputation, and understand customer experiences [1]. An already existing product or service is typically the object considered for sentiment analysis. Sentiment analysis can be broadly classified into four types, namely, fine-grained, emotion-based, aspect-based, and intent-based sentiment analysis. In fine-grained sentiment analysis, the polarity of the opinion is determined by a simple binary classification into positive and negative sentiment; depending on the use case, this type may also be extended to more granular polarity categories [2]. Emotion detection is used to find indications of particular emotional states mentioned in the text. Aspect-based sentiment analysis is a more advanced form whose objective is to associate an opinion with a specific aspect of the input. Intent-based sentiment analysis ascertains the type of intention that the message expresses. Opinion mining is synonymous with sentiment analysis, despite the general understanding that sentiments are emotionally loaded opinions [3].

Parsing an enormous amount of unstructured text into more specific news articles, devising suitable algorithms to extract the opinion from the text, and deriving positive and negative scores from it is a challenging task. Thus, there is a need to incorporate suitable techniques to improve the accuracy of the results obtained from sentiment analysis of news articles [4]. Although several approaches are concerned with sentiment analysis of news articles, the outputs provided by these approaches lack accuracy to a considerable extent [5].

Negative stopwords carry important information about the sentiment of a sentence, yet most sentiment analysis pre-processing techniques discard these stopwords [6]. As a result, semantic information is often lost, leading to inaccurate sentiment analysis. The main contributions of this article are as follows:

  1. This article presents an approach where every sentence is processed and examined at the sentence level to establish its polarity. In order to minimize the loss of significant information for labelling news articles, the proposed approach performs pre-processing that retains the negative stopwords and labels the sentiments of the article using a Support Vector Machine (SVM).

  2. The results obtained after applying the proposed approach on datasets of different categories of news articles obtained from the BBC are generally found to yield a comparatively higher accuracy in providing the sentiment polarity of various news items.

  3. A regression analysis is presented to confirm that the proposed approach provides comparatively better accuracy for sentiment analysis of news articles.

A survey of relevant research on sentiment analysis of text data is described in Sect. 2 of this article. The proposed approach for sentiment analysis of news items is described in Sect. 3, followed by an implementation using news articles from the BBC dataset in Sect. 4, which also provides an evaluation of the proposed approach. Section 5 concludes with a discussion of the benefits and limitations of the proposed approach and outlines goals and enhancements for future work.

2 Related Work

In Natural Language Processing (NLP), the field of sentiment analysis has been explored with a variety of approaches, ranging from dictionary-based methods to machine learning methods, and many researchers have contributed to its development and to extending its applications in various fields. In 2018, Urologin [7] proposed techniques for extracting and displaying text data that perform combined text summarization and sentiment analysis: a text summarization technique based on pronoun replacement is created, and sentiment data is gathered using the VADER sentiment analyzer (the standard VADER sentiment analysis GitHub repository is used). However, their summarization approach may cause a loss of semantic information, which can lead to an incorrect sentiment analysis of the document. Taj et al. [8] characterized news articles into positive, negative, and neutral classes by aggregating the overall opinion scores of the sentences in each article using a lexicon-based methodology. Their lexicon-based sentiment assessment of news stories suffers from insufficient or restricted word coverage, so numerous new lexical items with distinct semantics would need to be added to the lexical database, and it solely employs English news articles from a single source for sentiment analysis. Vilasrao et al. [9] intended to develop a system with an emotion dataset and a training dataset to obtain valence (in the form of emotional and neutral classes), presenting their estimation and investigation using a lexicon-based methodology together with deep learning techniques to deal with emotion classes. The output of their proposed framework (which detects the presence of sentiment polarity) can be used to enhance a sentiment analysis framework, but their approach is only lexicon based and uses a very small amount of training data with traditional machine learning (ML) algorithms; as a result, the accuracy of their output is not very high. Souma et al.'s [10] work forecast the sentiment of financial news. They used a simple sequential LSTM network architecture for the analysis, but their work makes the (possibly error-prone) assumption that if the stock log return is negative then the sentiment is negative, and vice versa. No standard dataset was used to perform the experiment and obtain the results, and no normalization of the statements in the news articles was done, since the same weightage was given to all statements. Shirsat et al.'s [11] work was concerned with sentence-level negation identification in news articles. They used a dictionary-based approach with ML techniques, but no standard dataset was used (a scraped dataset was used) and that dataset was quite small; since their approach was dictionary based, no semantic information was used. For obtaining word-level emotion distributions, Li et al. [12] considered the use of a dictionary with word-level emotion distributions (known as NRC Valence, Arousal, and Dominance) to be an efficient way of assigning emotions, along with intensities, to sentiment words. Basiri et al. [13] employed three-way decision theory in their study and proposed two models.
The first model (3W1DT) used a three-way fusion of one deep learning model and a conventional learning method, whereas the second model (3W3DT) used a three-way fusion of three deep learning models. The results obtained on the Drugs.com dataset showed that both frameworks outperformed traditional deep learning methods; in addition, the first fusion model performed significantly better than the second in terms of accuracy and F1 score. Using the Rotten Tomatoes movie review dataset, Tiwari et al. [14] implemented three ML algorithms (Maximum Entropy, Naive Bayes, and SVM) with an n-gram feature extraction technique. They noted a drop in accuracy for n-grams with larger values of n, such as n = 4, 5, and 6. Using various feature vectors such as Bag of Words (BOW) and unigrams with SentiWordNet, Soumya et al.'s [15] work divided 3184 Malayalam tweets into negative and positive opinions. They used the ML algorithms Naive Bayes and Random Forest and found that Random Forest performed better (with an accuracy of 95.6%) with unigram SentiWordNet features when negation words were taken into account. Rao et al. [16] used Long Short-Term Memory (LSTM) to improve sentiment analysis by first cleaning the datasets and removing the sentences with weaker emotional polarity. On three publicly accessible document-level review datasets, their model outperforms the state-of-the-art models.

From the survey, it is found that although there are several approaches concerned with sentiment analysis, the output provided by these approaches lacks accuracy to a considerable extent. In many approaches, no standard dataset was used to perform the experiments and obtain the results. In some approaches, no normalization of the statements in the news articles was done, since the same weightage was given to all statements [17]. The datasets used in some of the approaches were quite small, and in some approaches no semantic information was used. In a few approaches, there was a loss of semantic information, which led to an incorrect sentiment analysis of the document. Parsing an enormous amount of unstructured text into more specific news articles, devising suitable algorithms to understand the opinion expressed in the text, and deriving positive and negative scores from it is a challenging task. In essence, sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions, and emotions expressed within an online mention. Negative stopwords carry significant information about the sentiment of a sentence, but most of the approaches concerned with sentiment analysis remove these stopwords during the preprocessing stage [18]. As a result, there can be a loss of semantic information, leading to incorrect sentiment analysis [19]. Thus, there is a need to incorporate suitable techniques to improve the accuracy of the results obtained from sentiment analysis [20]. Hence, the intention is to develop a suitable approach that improves the accuracy of sentiment analysis by retaining the negative stopwords [21].
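As a small illustration of this problem (a minimal sketch assuming NLTK is installed), NLTK's default English stopword list contains negation words such as "not", so naive stopword removal strips the negation from a negated sentence and can flip its apparent polarity:

```python
# Demonstrate how standard stopword removal discards negation words.
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

sentence = "The market reaction was not good"
stop_words = set(stopwords.words('english'))  # this list includes "not"

tokens = nltk.word_tokenize(sentence.lower())
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['market', 'reaction', 'good'] -- the negation is lost
```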

3 Proposed Approach

The sentiment analysis approach determines the polarity of text data. Essentially, there are three levels of sentiment analysis. At the document level, sentiment analysis ascertains the polarity of the whole document; for a text file containing only product reviews, for example, the algorithm decides the polarity of all the content in the document. Because of this, the document can only convey opinions about one specific subject and cannot be used to evaluate other products. At the sentence level, every sentence is processed and examined to establish its polarity. Aspect-level sentiment analysis makes it possible to find emotions about entities and their characteristics. The proposed algorithm, Negative Stopwords Aware Sentiment Analysis (NSASA), performs pre-processing that retains the negative stopwords and labels the sentiments of the article using SVM. The steps of the proposed algorithm NSASA are as follows:

Algorithm

Negative Stopwords Aware Sentiment Analysis (NSASA)

Algorithm NSASA begins with the initialization and storing of negative words in the negation_words variable, as depicted in step 1. A set of positive words and a set of negative words are generated in steps 2 and 3. English stopwords are stored in step 3, and a lemmatizer object is created in step 4. A dictionary is created to store the positive and negative word sets separately, as depicted in steps 6–9. The dictionary containing stopwords is updated by removing the negative words list from it, as depicted in step 10. Training and testing of the dataset take place in steps 11 to 13. Step 14 fits the model over the dataset. SVM is applied to the model, predictions are made, and the final label is printed in steps 14 to 18. The model used in this proposed algorithm corresponds to LinearSVC.
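A minimal Python sketch of this pipeline is given below, assuming NLTK and scikit-learn are available. The TF-IDF features, the 80/20 train/test split, the illustrative negation word list, and the lexicon-count labelling rule are assumptions introduced for illustration; the exact helper functions and parameters of NSASA are those defined in Tables 1 and 2.

```python
import nltk
from nltk.corpus import opinion_lexicon, stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('opinion_lexicon')
nltk.download('wordnet')

# Step 1: negation words that must survive stopword removal (illustrative list)
negation_words = {"not", "no", "nor", "never", "cannot"}

# Steps 2-3: positive and negative word sets from the Bing Liu opinion lexicon
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

# Steps 3-4: English stopwords and a lemmatizer object
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Steps 6-9: dictionary holding the positive and negative word sets separately
lexicon = {"positive": positive_words, "negative": negative_words}

# Step 10: remove the negation words from the stopword set so they are retained
stop_words -= negation_words


def preprocess(text):
    """Tokenize, drop stopwords (keeping negation words), and lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]


def lexicon_label(tokens):
    """Coarse label from lexicon hit counts (assumed labelling scheme)."""
    pos = sum(t in lexicon["positive"] for t in tokens)
    neg = sum(t in lexicon["negative"] for t in tokens)
    return "positive" if pos > neg else ("negative" if neg > pos else "neutral")


def train_nsasa(sentences):
    """Steps 11-18: vectorize, split, fit LinearSVC, and return its predictions."""
    processed = [preprocess(s) for s in sentences]
    labels = [lexicon_label(tokens) for tokens in processed]
    docs = [" ".join(tokens) for tokens in processed]
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)
    model = LinearSVC()
    model.fit(X_train, y_train)
    return model, vectorizer, model.predict(X_test)
```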

Various user-defined functions which are used in the algorithm NSASA are defined in Table 1. Table 2 provides the descriptions of various predefined functions and parameters which are used in the algorithm NSASA.

Table 1 User-defined functions and their definitions
Table 2 Predefined functions, parameters, and their description

4 Implementation Details and Evaluation

The proposed approach has been implemented using Python 3.7 in Google Colaboratory with the NLTK package. From the NLTK package, nltk.download('punkt'), nltk.download('stopwords'), and nltk.download('opinion_lexicon') were used. Table 3 below displays the categorization of the articles and the proportion of neutral, positive, and negative terms in them. This study makes use of the Bing Liu dictionary, which has 4783 negative words and 2006 positive terms [11].
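As a quick sanity check, the quoted lexicon counts can be reproduced directly from NLTK's bundled copy of the Bing Liu opinion lexicon (a small sketch, assuming the standard NLTK distribution of this corpus):

```python
# Verify the Bing Liu opinion lexicon sizes quoted above.
import nltk
from nltk.corpus import opinion_lexicon

nltk.download('opinion_lexicon')

print(len(opinion_lexicon.positive()))  # expected: 2006 positive terms
print(len(opinion_lexicon.negative()))  # expected: 4783 negative terms
```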

Table 3 Category wise document polarity using Shirsat et al.

Table 4 shows the categorization of the articles and the proportion of neutral, positive, and negative terms in them after applying the proposed algorithm NSASA. It likewise makes use of the Bing Liu dictionary, which has 4783 negative words and 2006 positive terms.

Table 4 Category wise document polarity using NSASA

Figure 1 shows the percentage accuracy values obtained from the approach proposed by Shirsat et al. and the proposed NSASA for four different types of datasets.

Fig. 1
Accuracy values obtained for four different datasets. The double bar chart shows that both Shirsat et al.'s approach and the proposed NSASA achieve their highest accuracy on the Tech dataset, compared to Sport, Entertainment, and Business.

For the purpose of evaluation, the proposed NSASA approach has been compared with the approach provided by Shirsat et al. [11], and the results obtained are summarized in Table 5. Based on the results obtained after the experimentation, the highest accuracy achieved by Shirsat et al.'s approach is 86%, on the Tech data, and its lowest accuracy is 75%, on the Business data, whereas the highest accuracy achieved by the proposed NSASA is 98%, on the Tech data, and its lowest accuracy is 73%, on the Business data.

Table 5 Evaluation results

Thus, the approach of Shirsat et al. is found to provide results with an average accuracy of 80.25%, whereas the proposed NSASA is found to provide results with an average accuracy of 85.75%. The accuracy values obtained from Shirsat et al.'s technique and from the proposed NSASA were also employed in a regression analysis, with the outputs reported in Table 6. Upon analysis of the data, the p-value for the proposed NSASA algorithm is found to be 0.009156.

Table 6 Regression analysis results

A regression coefficient r (Multiple R) of 0.9908 and p < 0.05 are obtained in Table 6. This suggests that the accuracy values obtained from Shirsat et al. [11] and those obtained from the proposed NSASA algorithm have a positive relationship. Thus, the proposed NSASA appears to provide better accuracy in sentiment analysis of news articles as compared to the approach specified by Shirsat et al.
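A sketch of how such a regression can be reproduced is given below, assuming SciPy is available; the per-category accuracy values themselves are those listed in Table 5 and are passed in by the caller rather than repeated here.

```python
# Reproduce the regression statistics reported in Table 6 from the two sets of
# category-wise accuracy values (Table 5). linregress returns the correlation
# coefficient (Multiple R) and the p-value of the fitted relationship.
from scipy.stats import linregress

def compare_accuracies(shirsat_accuracy, nsasa_accuracy):
    result = linregress(shirsat_accuracy, nsasa_accuracy)
    return result.rvalue, result.pvalue

# Example usage with the four category-wise accuracies from Table 5:
# r, p = compare_accuracies(shirsat_values, nsasa_values)
```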

5 Conclusion

The continually expanding amount of text is an invaluable source of knowledge and information that must be effectively retrieved to reap its benefits. It can be quite challenging to parse the vast amount of unstructured content into more specific news items, to devise proper algorithms to extract opinions from the text, and to assign positive and negative scores to it. The existing techniques for sentiment analysis encounter problems in the presence of punctuation, ironical sentences, etc., which results in incoherent sentiment. In the proposed approach, every sentence is processed and examined at the sentence level to establish its polarity, while aspect-level sentiment analysis is used for finding emotions about entities and their characteristics. The proposed algorithm NSASA performs the preprocessing while retaining the negative stopwords and labels the sentiments of the article using SVM. The inclusion of negative stopwords in the proposed approach ensures that there is minimal loss of significant information for labeling news articles. The proposed approach using SVM can be considered an improvement over Shirsat et al.'s approach (which did not consider the negative stopwords), providing more accurate labeled positive, negative, and neutral sentiments from news articles. However, SVM does not perform very well when the target classes overlap and the dataset contains more noise.

In the future, the proposed use of negative stopwords can be applied to enhance the accuracy of the Naïve Bayes approach and other machine learning algorithms. Presently, the proposed approach has been tested on the BBC news article dataset; it can be further applied to other datasets traditionally used in sentiment analysis, such as DUC-2002 and DUC-2004. The proposed approach is anticipated to be very helpful in determining the sentiment polarity of various news items while retaining the information with the least exclusion of important terms in the article. Apart from sentiment analysis of news articles, the proposed approach can also help governments determine, with better accuracy, public opinion related to policies and program implementation expressed on various social networking platforms.