Abstract
Developing suitable techniques for analyzing the enormous amount of unstructured content in news items and extracting opinions from it can be quite challenging. Understanding the attitudes, opinions, and feelings expressed in an online mention generally involves determining the emotional undertone of a string of words. Many existing approaches produce incoherent sentiment analysis because of punctuation, ironic sentences, and similar phenomena. Negative stopwords carry significant information about the sentiment of a sentence, yet most sentiment analysis approaches remove these stopwords during the pre-processing stage. Thus, an approach that retains negative stopwords is proposed to minimize information loss and improve the accuracy of sentiment analysis. The proposed approach has been evaluated on a dataset of news articles from various categories obtained from BBC, and it is found to yield an average accuracy of 85.75% in determining the sentiment polarity of news items.
Keywords
- Sentiment analysis
- Machine learning
- Support vector machine
- Natural language processing
- Naive Bayes
- Deep learning
- Stopwords
1 Introduction
Sentiment analysis is the systematic identification, extraction, quantification, and study of subjective information by means of natural language processing, computational linguistics, biometrics, and text analysis. A sentiment analyzer assesses whether a piece of writing is positive, neutral, or negative. Sentiment analysis helps data analysts within large enterprises conduct market research, gauge public opinion, monitor brand reputation, and understand customer experiences [1]. The object considered for sentiment analysis is typically an existing product or service. Sentiment analysis can be broadly classified into four types, namely, fine-grained, emotion-based, aspect-based, and intent-based sentiment analysis. In fine-grained sentiment analysis, the polarity of the opinion is determined by simple binary classification of positive and negative sentiment. Depending on the use case, this type may also be extended to a finer polarity scale [2]. Emotion detection is used to find indications of particular emotional states mentioned in the text. Aspect-based sentiment analysis is a more advanced form whose objective is to determine the opinion expressed about a certain part of the input. Intent-based sentiment analysis ascertains the type of intention that the message expresses. Opinion mining is often treated as synonymous with sentiment analysis, despite the general understanding that sentiments are emotionally loaded opinions [3].
Analyzing an enormous amount of unstructured text from news articles, devising suitable algorithms to understand the opinion expressed in the text, and deriving positive and negative scores from it is a challenging task. Thus, there is a need to incorporate suitable techniques to improve the accuracy of the results obtained from sentiment analysis of news articles [4]. Although there are several approaches concerned with sentiment analysis of news articles, the outputs provided by these approaches lack accuracy to a considerable extent [5].
Negative stopwords carry important information about the sentiment of a sentence, yet most sentiment analysis pre-processing techniques discard these stopwords [6]. As a result, semantic information is often lost, resulting in inaccurate sentiment analysis. The main contributions of this article are as follows:
1. This article presents an approach where every sentence is processed and examined at the sentence level to establish its polarity. In order to minimize the loss of significant information for labelling news articles, the proposed approach does the pre-processing considering the negative stopwords and labels the sentiments of the article using Support Vector Machine (SVM).
2. The results obtained after applying the proposed approach on the dataset of different categories of news articles obtained from BBC are usually found to yield a comparatively higher accuracy for providing the sentiment polarity of various news items.
3. A regression analysis is presented to confirm that the proposed approach provides comparatively better accuracy for sentiment analysis of news articles.
A survey of relevant research on sentiment analysis of text data is described in Sect. 2 of this article. The proposed approach for sentiment analysis of news items is described in Sect. 3, which is followed by an implementation utilizing news articles from the BBC dataset in Sect. 4. This section also provides an evaluation of the proposed approach. Section 5 concludes with a discussion on the benefits and limitations of the proposed approach as well as outlines the goals and enhancements for future work.
2 Related Work
In Natural Language Processing (NLP), sentiment analysis has been explored with a variety of approaches, ranging from dictionary-based methods to machine learning methods, and many researchers have contributed to its development and to its applications in various fields. In 2018, Urologin [7] proposed techniques for extracting and displaying text data that combine text summarization with sentiment analysis. A text summarization technique based on pronoun replacement was created, and sentiment data was gathered using the VADER sentiment analyzer, taken from its standard GitHub repository. However, the summarization approach may cause loss of semantic information, which can lead to an incorrect sentiment analysis of the document. Taj et al. [8] classified news articles as positive, negative, or neutral by aggregating the sentiment scores of the sentences in each article using a lexicon-based methodology. Their lexicon has limited word coverage, so numerous new lexical items with distinct semantics would need to be added to it, and the work uses only English news articles from a single source. Vilasrao et al. [9] aimed to develop a system that uses an emotion dataset and a training dataset to obtain valency (in the form of emotional and neutral). They presented their evaluation and analysis using a lexicon-based approach together with deep learning techniques to deal with emotion classes.
The output of their proposed framework (which detects the presence of sentiment polarity) can be used to enhance a sentiment analysis framework, but their approach is only lexicon based and uses a very small amount of data for training with a traditional machine learning (ML) algorithm. As a result, the accuracy of their output is not very high. Souma et al. [10] forecast financial news sentiment using a simple sequential LSTM network architecture. However, their work makes the assumption (which may be error prone) that the sentiment is negative if the stock log return is negative and vice versa. No standard dataset was used to perform the experiment, and the statements in the news articles were not normalized, since the same weight was given to all statements. Shirsat et al. [11] addressed sentence-level negation identification in news articles. They used a dictionary-based approach with ML techniques, but their dataset was scraped rather than standard and was quite small, and because the approach was dictionary based, no semantic information was used. To obtain word-level emotion distributions, Li et al. [12] found it efficient to use a dictionary with word-level emotion distribution (NRC Valence, Arousal, and Dominance) for assigning emotions along with intensities to the sentiment words. Basiri et al. [13] proposed two models based on three-way decision theory. The first model (3W1DT) used a three-way fusion of one deep learning model with a conventional learning method, whereas the second model (3W3DT) used a three-way fusion of three deep learning models. The results obtained on the Drugs.com dataset showed that both frameworks outperformed the traditional deep learning methods.
In addition, it was noted that the first fusion model performed significantly better than the second model in terms of accuracy and F1-metric. Using the Rotten Tomatoes movie review dataset, Tiwari et al. [14] implemented three ML algorithms (Maximum Entropy, Naive Bayes, and SVM) with the n-gram feature extraction technique. They noted a drop in accuracy for n-grams with larger values of n, such as n = 4, 5, and 6. Soumya et al. [15] classified 3184 Malayalam tweets into negative and positive opinions using various feature vectors such as Bag of Words (BOW) and Unigram with SentiWordNet. They used the ML algorithms Naive Bayes and Random Forest and found that Random Forest performed better (with an accuracy of 95.6%) with Unigram SentiWordNet when negation words were taken into account. Rao et al. [16] used Long Short-Term Memory (LSTM) to improve sentiment analysis by first cleaning the datasets and removing the sentences with weaker emotional polarity. On three publicly accessible document-level review datasets, their model outperformed the state-of-the-art models.
From the survey, it is found that although there are several approaches concerned with sentiment analysis, the output provided by these approaches lacks accuracy to a considerable extent. In many approaches, no standard dataset was used to perform the experiment and obtain the results. In some approaches, the statements in the news articles were not normalized, since the same weight was given to all statements [17]. The datasets used in some of the approaches were quite small. In some approaches, no semantic information was used. In a few approaches, there was a loss of semantic information, which led to incorrect sentiment analysis of the document. Analyzing an enormous amount of unstructured text from news articles, devising suitable algorithms to understand the opinion expressed in the text, and deriving positive and negative scores from it is a challenging task. In essence, sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions, and emotions expressed within an online mention. Negative stopwords carry significant information about the sentiment of a sentence, but most of the approaches concerned with sentiment analysis remove these stopwords during the preprocessing stage [18]. As a result, there can be a loss of semantic information leading to incorrect sentiment analysis [19]. Thus, there is a need to incorporate suitable techniques to improve the accuracy of the results obtained from sentiment analysis [20]. Hence, the intention is to develop a suitable approach that improves the accuracy of sentiment analysis by considering the negative stopwords [21].
3 Proposed Approach
Sentiment analysis determines the polarity of text data. Essentially, there are three levels of sentiment analysis: document level, sentence level, and aspect level. Document-level sentiment analysis ascertains the polarity of a whole document. In the case of a text file containing only product reviews, the algorithm decides the polarity of all the content in the document. This presumes that the document conveys opinions about only one specific subject, so it cannot be used to evaluate other products. In sentence-level analysis, every sentence is processed and examined to establish its polarity. Aspect-level sentiment analysis makes it possible to find emotions about things and their characteristics. The proposed algorithm, Negative Stopwords Aware Sentiment Analysis (NSASA), performs the preprocessing while retaining the negative stopwords and labels the sentiments of the article using SVM. The steps of the proposed algorithm NSASA are as follows:
Algorithm
Negative Stopwords Aware Sentiment Analysis (NSASA)
Algorithm NSASA begins by initializing the negative words and storing them in the negation_words variable, as depicted in step 1. The sets of positive and negative words are generated in steps 2 and 3. English stopwords are stored in step 3, and a lemmatizer object is created in step 4. A dictionary is created for storing the positive and negative sets of words separately, as depicted in steps 6–9. The dictionary containing stopwords is updated by removing the negative words list from it, as depicted in step 10. Training and testing on the dataset take place in steps 11 to 13. Step 14 fits the model over the dataset. SVM is applied, predictions are made, and the final label is printed in steps 14 to 18. The model used in the proposed algorithm is LinearSVC.
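The negation-preserving pre-processing at the core of NSASA can be illustrated with a minimal sketch. The word lists below are tiny hypothetical stand-ins for NLTK's English stopword list and the Bing Liu opinion lexicon, and the rule that a negation word flips the polarity of the next sentiment word is an assumption made for illustration, not a detail confirmed by the algorithm description.

```python
import re

# Hypothetical miniature word lists, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "of", "and", "to",
             "not", "no", "nor", "never"}
NEGATION_WORDS = {"not", "no", "nor", "never"}
POSITIVE_WORDS = {"good", "strong", "gain"}
NEGATIVE_WORDS = {"bad", "weak", "loss"}


def preprocess(sentence, keep_negations=True):
    """Lowercase, tokenize, and drop stopwords, optionally keeping negations."""
    stop = STOPWORDS - NEGATION_WORDS if keep_negations else STOPWORDS
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tok for tok in tokens if tok not in stop]


def label(tokens):
    """Lexicon score; a negation flips the next sentiment word (assumed rule)."""
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATION_WORDS:
            negate = True
            continue
        polarity = (tok in POSITIVE_WORDS) - (tok in NEGATIVE_WORDS)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sentence = "The quarterly results were not good"
print(label(preprocess(sentence, keep_negations=True)))   # negative
print(label(preprocess(sentence, keep_negations=False)))  # positive
```

With the negation words removed along with the other stopwords, the sentence is labeled positive; keeping them yields the intended negative label, which is the information loss the proposed approach aims to avoid.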
Various user-defined functions which are used in the algorithm NSASA are defined in Table 1. Table 2 provides the descriptions of various predefined functions and parameters which are used in the algorithm NSASA.
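The training and prediction stage of NSASA can be sketched with scikit-learn's LinearSVC, the model named in the algorithm description. The TF-IDF features, the pipeline layout, the toy sentences with their hand-assigned labels, and the small stopword list are all assumptions made for illustration; they are not taken from the paper's implementation or from the BBC dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the NLTK stopword list and the negation list.
STOPWORDS = {"the", "a", "is", "was", "were", "by", "with", "at", "all",
             "not", "no", "nor", "never"}
NEGATION_WORDS = {"not", "no", "nor", "never"}

# Toy, hand-labelled sentences (hypothetical, not from the BBC dataset).
sentences = [
    "The company reported strong growth",
    "Fans praised the excellent performance",
    "The new phone is not bad at all",
    "Investors were not disappointed by the results",
    "The team suffered a terrible defeat",
    "The market crash was a disaster",
    "The film was not good",
    "Critics were not happy with the launch",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Hold out a quarter of the sentences, keeping both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=0, stratify=labels
)

# Remove ordinary stopwords but keep the negation words as features.
model = make_pipeline(
    TfidfVectorizer(stop_words=sorted(STOPWORDS - NEGATION_WORDS)),
    LinearSVC(random_state=0),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Because the negation words are subtracted from the stopword list before vectorization, "not" survives as a feature that the classifier can weight, whereas passing the full stopword list would discard it.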
4 Implementation Details and Evaluation
The proposed approach has been implemented in Python 3.7 on Google Colaboratory using the NLTK package. The NLTK resources punkt, stopwords, and opinion_lexicon were obtained with nltk.download('punkt'), nltk.download('stopwords'), and nltk.download('opinion_lexicon'). Table 3 displays the categorization of the articles and the proportion of neutral, positive, and negative terms in them. This study makes use of the Bing Liu dictionary, which has 4783 negative words and 2006 positive words [11].
Table 4 shows the categorization of the articles and the proportion of neutral, positive, and negative terms in them after applying the proposed algorithm NSASA, which makes use of the same Bing Liu dictionary.
Figure 1 shows the percentage accuracy values obtained from the approach proposed by Shirsat et al. and the proposed NSASA for four different types of datasets.
For the purpose of evaluation, the proposed NSASA has been compared with the approach of Shirsat et al. [11], and the results obtained are summarized in Table 5. The approach of Shirsat et al. achieves its highest accuracy, 86%, on the Tech. data and its lowest accuracy, 75%, on the business data. The proposed NSASA achieves its highest accuracy, 98%, on the Tech. data and its lowest accuracy, 73%, on the business data.
Thus, the approach of Shirsat et al. is found to provide results with an average accuracy of 80.25%, whereas the proposed NSASA is found to provide results with an average accuracy of 85.75%. The accuracy values obtained from the technique of Shirsat et al. and the proposed NSASA were also used in a regression analysis, with the outputs reported in Table 6. The analysis yields a p-value of 0.009156 for the proposed NSASA algorithm.
A correlation coefficient r (Multiple R) = 0.9908 and p < 0.05 are obtained in Table 6. This suggests that the accuracy values obtained from Shirsat et al. [11] and the accuracy values obtained from the proposed NSASA algorithm have a strong positive relationship. Taken together with its higher average accuracy, the proposed NSASA thus seems to provide better accuracy in sentiment analysis of news articles as compared to the approach specified by Shirsat et al.
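The reported regression figures can be checked for internal consistency with a short sketch. It assumes a simple linear regression over the four news categories (n = 4, so 2 degrees of freedom), which is an inference from the text rather than something stated with Table 6. Under that assumption the Student's t distribution has a closed-form CDF, and the two-sided p-value reduces to 1 - |r|.

```python
import math


def two_sided_p_value_df2(r):
    """Two-sided p-value for a correlation r estimated from n = 4 points."""
    df = 2  # n - 2 degrees of freedom, assuming n = 4 categories
    t = r * math.sqrt(df / (1.0 - r * r))
    # Closed-form CDF of Student's t with 2 degrees of freedom:
    # F(t) = 1/2 + t / (2 * sqrt(2 + t^2))
    cdf = 0.5 + abs(t) / (2.0 * math.sqrt(2.0 + t * t))
    return 2.0 * (1.0 - cdf)


p = two_sided_p_value_df2(0.9908)
print(round(p, 4))  # 0.0092
```

The value of about 0.0092 obtained from the rounded r = 0.9908 agrees with the reported p-value of 0.009156 to within the rounding of r, so the two reported figures are mutually consistent under the stated assumption.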
5 Conclusion
The continually expanding amount of text is an invaluable source of knowledge and information that must be effectively retrieved to reap its benefits. Analyzing the vast amount of unstructured content in news items and creating suitable algorithms to extract opinions from the text and assign positive and negative scores to it can be quite challenging. Existing techniques for sentiment analysis encounter problems in the presence of punctuation, ironic sentences, and similar phenomena, which result in incoherent sentiment. In the proposed approach, every sentence is processed and examined at the sentence level to establish its polarity. Aspect-level sentiment analysis is used for finding emotions about things and their characteristics. The proposed algorithm NSASA performs the preprocessing while retaining the negative stopwords and labels the sentiments of the article using SVM. The inclusion of negative stopwords helps minimize the loss of significant information for labeling news articles. The proposed approach using SVM can therefore be considered an improvement over the approach of Shirsat et al. (which does not consider the negative stopwords), providing more accurate labeled positive, negative, and neutral sentiments from news articles. A limitation is that SVM does not perform very well when the target classes overlap and the dataset includes more noise. In the future, the proposed use of negative stopwords can be applied to enhance the accuracy of the Naïve Bayes approach and other machine learning algorithms. Presently, the proposed approach has been tested on the news article dataset from BBC. It can be further applied to other benchmark text datasets, such as DUC-2002 and DUC-2004.
The proposed approach is anticipated to be very helpful in determining the sentiment polarity of various news items while excluding as few important terms from the article as possible. Apart from producing a sentiment analysis of news articles, the proposed approach can also help governments determine, with better accuracy, public opinion on policies and program implementation expressed on various social networking platforms.
References
MonkeyLearn. Sentiment analysis: A definitive guide, https://monkeylearn.com/sentiment-analysis/. Last accessed 5 Dec 2022
A. Zhao, Y. Yu, Knowledge-enabled BERT for aspect-based sentiment analysis. Knowledge-Based Syst. 227, 107220 (2021) ISSN 0950-7051
K. Roebuck, Sentiment Analysis: High-Impact Strategies What You Need to Know: Definitions, Techniques and Applications for Sentiment Analysis, Adoptions, Impact, Benefits, Maturity (Emereo Publishing, 2012)
P. Pooja, G. Sharvari, A survey of sentiment classification techniques used for Indian regional languages. Int. J. Comput. Sci. Appl. 5(2), 13–26 (2015)
A. Shoukry, Sentence-level Arabic sentiment analysis, in 2012 International Conference on Collaboration Technologies and Systems (CTS), (IEEE, 2012), pp. 546–550
P. Melville, W. Gryc, R.D. Lawrence, Sentiment analysis of blogs by combining lexical knowledge with text classification, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (ACM, 2009), pp. 1275–1284
S. Urologin, Sentiment analysis, visualization and classification of summarized news articles: A novel approach. Int. J. Adv. Comput. Sci. Appl. 9(8), 616–625 (2018)
S. Taj, B.B. Shaikh, A.F. Meghji, Sentiment analysis of news articles: A lexicon based approach, in 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), (IEEE, 2019), pp. 1–5
G.S. Vilasrao, P.D. Sathya, Lexical approach for sentiment analysis on news articles with deep learning method. Int. J. Sci. Res. 8(12), 725–730 (2019)
W. Souma, I. Vodenska, H. Aoyama, Enhanced news sentiment analysis using deep learning methods. J. Comput. Soc. Sci. 2, 33–46 (2019)
V.S. Shirsat, R.S. Jagdale, S.N. Deshmukh, Sentence level sentiment identification and calculation from news articles using machine learning techniques, in Computing, Communication and Signal Processing, (Springer, 2019), pp. 371–376
Z. Li, H. Xie, G. Cheng, Q. Li, Word-level emotion distribution with two schemas for short text emotion classification. Knowledge-Based Syst., Elsevier 227, 107163 (2021)
M.E. Basiri, M. Abdar, M.A. Cifci, S. Nemati, U.R. Acharya, A novel method for sentiment classification of drug reviews using fusion of deep and machine learning techniques. Knowledge-Based Syst., Elsevier 198, 105949 (2020)
P. Tiwari, B.K. Mishra, S. Kumar, V. Kumar, Implementation of n-gram methodology for rotten tomatoes review dataset sentiment analysis. Int. J. Knowl. Discovery Bioinf., IGI Global 7(1), 30–41 (2017)
S. Soumya, K.V. Pramod, Sentiment analysis of Malayalam tweets using machine learning techniques. ICT Express, Elsevier 6(4), 300–305 (2020)
G. Rao, W. Huang, Z. Feng, Q. Cong, LSTM with sentence representations for document-level sentiment classification. Neurocomputing, Elsevier 308, 49–57 (2018)
R. Mishra, T. Gayen, Automatic lossless-summarization of news articles with abstract meaning representation. Procedia Comput. Sci., Elsevier 135, 178–185 (2018)
M. Birjali, M. Kasri, A. Beni-Hssane, A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowledge-Based Syst. 226, 107134 (2021) ISSN 0950-7051
P. Wang, J. Li, J. Hou, S2SAN: A sentence-to-sentence attention network for sentiment analysis of online reviews. Decis. Support Syst. 149, 113603 (2021) ISSN 0167-9236
W. Liao, B. Zeng, J. Liu, P. Wei, X. Cheng, W. Zhang, Multi-level graph neural network for text sentiment analysis. Comput. Electr. Eng. 92, 107096 (2021) ISSN 0045-7906
S. Zitnik, S. Blagus, M. Bajec, Target-level sentiment analysis for news articles. Knowledge-Based Syst. 249, 108939 (2022) ISSN 0950-7051
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yadav, C., Gayen, T. (2023). Stopwords Aware Emotion-Based Sentiment Analysis of News Articles. In: Haldorai, A., Ramu, A., Mohanram, S. (eds) 5th EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing. BDCC 2022. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-28324-6_15
DOI: https://doi.org/10.1007/978-3-031-28324-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28323-9
Online ISBN: 978-3-031-28324-6
eBook Packages: Engineering, Engineering (R0)