
1 Introduction

News has always played a dynamic role in shaping a person’s thoughts, vision and perception. It can affect people’s emotions, negatively or positively, on a large scale. Headlines alone can have a great impact on people’s minds, as most people simply scan the headlines and judge a news article by them. Since the advent of the internet, online media has gained great momentum, with the latest news just a click away. This has created a need to analyze online news content and understand what type of emotion news organizations spread, through a process of sentiment analysis. Sentiment analysis is a text classification process that highlights the sentiment behind a piece of information. It can help the public understand the sentiment behind any news being published and hence alter their choice of newspapers.

Considerable development has been made in the area of sentiment analysis of news articles, but work on Indian news and COVID-19 related news is very limited. There has been some progress in sentiment analysis of COVID-19 news headlines since the outbreak of COVID-19. For example, the authors of [16] divided their research into two parts: in the first part they used a topic model to produce topics for each country, and in the second part they used the RoBERTa model for sentiment classification of headlines. They examined key topics of English-language COVID-19 news articles for four countries and analysed the associated sentiments. The authors of [17] investigated the relation between the stock market and news sentiment related to COVID-19; sentiment scores for COVID-19 related news were generated using a BERT-based Financial Sentiment Index.

In this work, sentiment analysis has been done on the headlines of Indian news, and the idea is driven by two primary questions:

  1. What emotions are conveyed by the headlines, and how do they affect the audience?

  2. How well do machine learning algorithms classify these news headlines?

Online news headlines of two leading Indian newspapers were collected for the period September 2019 to September 2020 and then pre-processed to avoid future inconsistencies. The preprocessed dataset was classified into three polarity classes (positive, negative and neutral) on the basis of the polarity scores received from a lexicon dictionary. A self-created scored corpus of corona-related words was then applied to the dataset to correctly classify COVID-19 news headlines that had been misclassified by the lexicon. The updated dataset was fed into two ML algorithms, decision trees and random forest, to check how well they could classify the news headlines. Their performance was measured using metrics such as accuracy, F1 score, precision, recall and the confusion matrix.

2 Literature Survey

There has been significant work in analyzing sentiments of news headlines using lexicon and machine learning approaches.

The authors of [1] performed analysis of news related to the Coordinating Minister of Maritime Affairs for the period 2016–2019 using a naïve bayes classifier, support vector machine and particle swarm optimization. Researchers of [2] tried to derive the degree of correlation between the information released by the Chinese government about coronavirus and Chinese public opinion on the issue using the BosonNLP sentiment dictionary. In paper [3], the authors studied news articles reported on the BBC news website, categorized into 5 classes (business, entertainment, politics, sport and tech); sentiments were assigned using the WordNet dictionary and the TF-IDF method, which fall under the lexicon model. The work in [4] presented a way to enlarge small candidate seed lists of positive and negative words into larger lexicons using path-based analysis of synonym and antonym sets in WordNet; the new seed lists were used to check public sentiment towards each entity. The research done in [5] analyzed news articles associated with companies by preprocessing the data into candidate words, which were compared against a positive and a negative dictionary to check for a match, with matched words accumulated into the corresponding classes; the task was accomplished using naïve bayes, Bernoulli naïve bayes and Laplacian smoothing. The work in [6] focused on the effect of certain words used in political texts on public sentiment in polls for 3 Indian political parties; the results obtained from the SentiWordNet dictionary were compared with naïve bayes and support vector machine results. Researchers of [7] attempted to classify news comments into positive, negative and neutral classes using the AFINN-111 word list, TF-IDF, support vector machine and the k-nearest neighbor algorithm; the results indicate better performance of the SVM model. The work accomplished in [8] provided a comparative study of methods for the analysis of quotations in newspaper articles, using a bag-of-words approach and a similarity approach; the labelled dataset became a training set for an SVM classifier to determine sentiment classification. The authors of [9] proposed a technique to analyze the sentiments of headlines using three models, namely linear SVM alone, TF-IDF with linear SVM, and SGD with linear SVM; the results indicate better performance of TF-IDF with linear SVM for small datasets and of SGD with linear SVM for large datasets. The study conducted in [10] employed a lexicon-based approach to identify the sentiments of news articles: each article was considered a document and assigned polarity using the WordNet dictionary, and the results showed only a few articles under the neutral class. In paper [11], the researchers tried to provide a platform that serves positive news to the public by eliminating negative-sentiment news from a pool of articles; this was achieved using a news aggregator and processing engine, SentiWordNet for feature extraction, an SVM model and filtering out negative-polarity news. The work done in [12] involved sentiment classification of Indian news extracted from Indian journals; the approach was machine learning based and made use of recurrent neural networks with long short-term memory units. The author of [13] studied financial news articles to determine their effect on future stock trends: a dictionary of polarity words was created and applied to the news documents, which were converted into a set of vectors and classified using random forest and naïve bayes algorithms.
An extensive study was conducted in [14] that aimed to determine the effect of mutual fund news in India on investors amidst the COVID-19 outbreak, and to create a model to forecast the assets under management indicator. The articles extracted from Indian journals were assigned sentiments using the VADER lexicon tool. Assets under management values and sentiment scores were used as variables to train linear regression and multiple regression models. The results showed that sentiment scores and assets under management were generally directly proportional and that a regression model could be used to predict assets under management. Researchers of [15] performed sentiment analysis on Punjabi news using a machine learning approach. The dataset was pre-processed and transformed using a TF-IDF vectorizer, followed by application of SVM. The classifier first classified news into categories such as crime, politics, entertainment, weather and sports, and then sorted them into positive and negative classes with high accuracy.

3 Methodology

This study follows a hybrid approach that uses lexicon and machine learning techniques to determine the overall polarity of Indian news headlines reported during the one-year period from September 2019 to September 2020. Figure 1 illustrates the flow diagram of the proposed work, which is divided into several phases; each phase is discussed in detail in the following sections.

3.1 Dataset Collection

The headline texts for this study were collected from two leading Indian English newspapers, namely The Hindu and The Times of India. A python script was written using Beautiful Soup, a python package for parsing HTML and XML files, to scrape the headlines from the newspapers’ websites. All the scraped headlines were stored in a CSV file for pre-processing.
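A minimal scraping sketch is shown below. The archive URL and the CSS selector are hypothetical placeholders, since the actual page structure of the two newspapers’ websites is not described here; the general pattern of fetching a page, parsing it with Beautiful Soup and writing headlines to a CSV file is what matters.

```python
# Minimal scraping sketch; URL and tag/class are placeholders, not the real site structure.
import csv
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://example.com/news-archive"  # hypothetical archive page

response = requests.get(ARCHIVE_URL, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Hypothetical selector: headlines assumed to sit in <h3 class="headline"> tags.
headlines = [tag.get_text(strip=True) for tag in soup.find_all("h3", class_="headline")]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    for text in headlines:
        writer.writerow([text])
```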

3.2 Data Preprocessing

Data preprocessing was done to transform the raw data into a useful and effective format and to minimize inconsistencies. The steps involved (see the sketch after this list) were:

  • Removal of punctuations

  • Tokenization

  • Stop word removal

  • Lemmatization

  • Removal of duplicate and missing headlines
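The following is a minimal sketch of these steps using NLTK and pandas; the column names and the exact order of operations are assumptions for illustration, not the authors’ exact script (NLTK resources such as punkt, stopwords and wordnet are assumed to have been downloaded once beforehand).

```python
# Preprocessing sketch for the listed steps.
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(headline: str) -> str:
    # Removal of punctuation
    headline = headline.translate(str.maketrans("", "", string.punctuation))
    # Tokenization
    tokens = word_tokenize(headline.lower())
    # Stop word removal and lemmatization
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

# Removal of missing headlines, then cleaning, then removal of duplicates.
df = pd.read_csv("headlines.csv").dropna(subset=["headline"])
df["clean"] = df["headline"].apply(preprocess)
df = df.drop_duplicates(subset=["clean"])
```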

3.3 Lexicon Based Approach

In this phase, the unlabeled dataset was labeled using a bag-of-words approach in which each word is associated with an opinion value that contributes to the overall polarity score of a headline. Initially, TextBlob (a python library) was used to assign polarity to headlines using its polarity sentiment function. However, while skimming through the labeled dataset, many mislabeled tuples were found, which led to the use of another python library, NLTK VADER. Its SentimentIntensityAnalyzer function was used to assign polarity scores and gave an improved dataset. The labels were assigned according to the following rules:

$$\text{IF polarity\_score} < 0 \text{ THEN negative.}$$
(1)
$$\text{IF polarity\_score} = 0 \text{ THEN neutral.}$$
(2)
$$\text{IF polarity\_score} > 0 \text{ THEN positive.}$$
(3)
Fig. 1.

Flow diagram of the research work.

The work proceeded with the labelling executed by NLTK VADER.
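A minimal labelling sketch under these rules is given below, assuming the cleaned DataFrame from the preprocessing sketch. It shows both TextBlob’s polarity attribute and VADER’s SentimentIntensityAnalyzer, with thresholds following rules (1)–(3); the column names are assumptions.

```python
# Labelling sketch comparing TextBlob and NLTK VADER; labels follow rules (1)-(3).
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_from_score(score: float) -> str:
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

def textblob_label(headline: str) -> str:
    return label_from_score(TextBlob(headline).sentiment.polarity)

def vader_label(headline: str) -> str:
    return label_from_score(analyzer.polarity_scores(headline)["compound"])

df["textblob_label"] = df["headline"].apply(textblob_label)
df["vader_label"] = df["headline"].apply(vader_label)
```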

3.4 Covid Headlines Classification

NLTK VADER could not correctly classify most news headlines related to coronavirus. This was due to the different interpretations of many words when used in a general context versus a coronavirus context: such words carry opposite polarities in the two scenarios. For example, the word “positive” conveys a positive sentiment in a general sentence, but in a coronavirus headline it is rather negative; in such cases, VADER’s classification of the headline as positive was a misclassification. To overcome this shortcoming and cater to news related to coronavirus, a separate covid headlines classification function was created, built around a self-scored corpus.

Since the term covid19 is polyonymous, a corpus was generated containing alternate terms for covid19, and another corpus was created containing words that were wrongly classified by VADER but commonly appeared in COVID-19 related news. The polarity scores of these words were modified in the new corona corpus, relative to the VADER corpus, to fit the coronavirus context so that covid19 headlines could be correctly classified. The polarity scores assigned by VADER and the polarity scores assigned using the self-created corpus were then compared on the following grounds:

$$\text{IF } (\text{Positive Score})_{\text{Covid}} > (\text{Neutral Score})_{\text{Vader}} + (\text{Negative Score})_{\text{Vader}} \text{ THEN } (\text{Compound Score})_{\text{Vader}} = 1.$$
(4)
$$\text{IF } (\text{Negative Score})_{\text{Covid}} > (\text{Neutral Score})_{\text{Vader}} + (\text{Positive Score})_{\text{Vader}} \text{ THEN } (\text{Compound Score})_{\text{Vader}} = -1.$$
(5)
$$\begin{aligned} & \text{IF } (\text{Negative Score})_{\text{Covid}} < (\text{Neutral Score})_{\text{Vader}} \text{ AND } (\text{Negative Score})_{\text{Covid}} > (\text{Positive Score})_{\text{Vader}} \\ & \text{AND } (\text{Neutral Score})_{\text{Vader}} - (\text{Negative Score})_{\text{Covid}} \le 0.5 \text{ THEN } (\text{Compound Score})_{\text{Vader}} = -1. \end{aligned}$$
(6)
$$\begin{aligned} & \text{IF } (\text{Positive Score})_{\text{Covid}} < (\text{Neutral Score})_{\text{Vader}} \text{ AND } (\text{Positive Score})_{\text{Covid}} > (\text{Negative Score})_{\text{Vader}} \\ & \text{AND } (\text{Neutral Score})_{\text{Vader}} - (\text{Positive Score})_{\text{Covid}} \le 0.5 \text{ THEN } (\text{Compound Score})_{\text{Vader}} = -1. \end{aligned}$$
(7)

Based on the above cases, the final polarity was updated in the database.
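A sketch of how such a re-scoring function could look is given below. The two corpora shown are illustrative placeholders (the actual self-scored corpus is not reproduced here), and the rules mirror Eqs. (4)–(7), with `vader` being the score dictionary returned by VADER’s polarity_scores().

```python
# Sketch of the covid headlines classification step; the two corpora below are
# illustrative placeholders, not the paper's actual self-scored corpus.
COVID_TERMS = {"covid", "covid19", "coronavirus", "corona", "pandemic"}   # assumed synonym list
COVID_SCORES = {"positive": -1.0, "recovered": 1.0, "cases": -0.5}        # assumed re-scored words

def covid_scores(tokens):
    """Crude positive/negative scores of a headline under the corona corpus."""
    pos = sum(COVID_SCORES[t] for t in tokens if COVID_SCORES.get(t, 0.0) > 0)
    neg = -sum(COVID_SCORES[t] for t in tokens if COVID_SCORES.get(t, 0.0) < 0)
    return pos, neg

def adjust_compound(tokens, vader):
    """Apply rules (4)-(7); `vader` is the dict returned by polarity_scores()."""
    if not COVID_TERMS & set(tokens):
        return vader["compound"]                      # not a covid headline, keep VADER's score
    pos_c, neg_c = covid_scores(tokens)
    if pos_c > vader["neu"] + vader["neg"]:           # rule (4)
        return 1.0
    if neg_c > vader["neu"] + vader["pos"]:           # rule (5)
        return -1.0
    if vader["pos"] < neg_c < vader["neu"] and vader["neu"] - neg_c <= 0.5:
        return -1.0                                   # rule (6)
    if vader["neg"] < pos_c < vader["neu"] and vader["neu"] - pos_c <= 0.5:
        return -1.0                                   # rule (7), as stated in the text
    return vader["compound"]
```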

3.5 Decision Tree

This phase comprised the application of the decision tree algorithm to the dataset obtained from the lexicon phase. Decision tree was chosen for the analysis of Indian headlines as comparatively little work has applied it in this domain. Since decision trees cannot operate directly on textual data, a feature weighting technique, TF-IDF, was used to assign numeric scores to the text. The dataset was divided into 80% training and 20% testing data.

In this research, scikit-learn’s DecisionTreeClassifier was used with the Gini criterion (i.e., the CART variant of decision tree), and its various parameters were tuned to optimize the classification result. After drawing validation curves over a range of values for each parameter, optimal values were chosen to avoid overfitting or underfitting and to obtain the best result.
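A minimal sketch of this pipeline is shown below; the hyperparameter values and the label column name are placeholders meant to be read off the validation curves, not the tuned values used in this study.

```python
# Decision tree sketch: TF-IDF features, 80/20 split, Gini (CART) criterion.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean"])
y = df["label"]                                   # assumed column holding the final polarity labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_depth and min_samples_leaf are illustrative; choose them from validation curves.
clf = DecisionTreeClassifier(criterion="gini", max_depth=50, min_samples_leaf=2)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```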

3.6 Random Forest

This phase encompassed the application of the random forest algorithm to the dataset, since it is an ensemble technique built on decision trees and has been less explored by other researchers. Due to the textual nature of the data and the inability of ML algorithms to work directly on text, a feature weighting technique, TF-IDF, was applied to assign numeric scores to the text. Based on various observations, the default value of the maximum features parameter of the TF-IDF function was modified so as to omit rare words from the vocabulary.

Later, the data was fed into the random forest classifier provided by the scikit-learn module in python and performance metrics were obtained. The metrics were then improved by tuning the hyperparameters, which was achieved by plotting validation curves for each parameter. After studying the curves, optimal values for each attribute were chosen so that overfitting and underfitting could be minimized and the algorithm’s performance improved.
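A minimal sketch of this phase is shown below; the vocabulary cap of 5000 and the n_estimators range are assumptions for illustration, not the values used in this study, and in practice a validation curve would be drawn for each tuned parameter.

```python
# Random forest sketch: TF-IDF with a capped vocabulary plus a validation curve
# for one hyperparameter (n_estimators) as an example of the tuning procedure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, validation_curve

vectorizer = TfidfVectorizer(max_features=5000)   # drops rare vocabulary terms (assumed cap)
X = vectorizer.fit_transform(df["clean"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_range = np.array([50, 100, 200, 400])       # assumed search range
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=42), X_train, y_train,
    param_name="n_estimators", param_range=param_range, cv=5,
)

best_n = param_range[test_scores.mean(axis=1).argmax()]
rf = RandomForestClassifier(n_estimators=int(best_n), random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```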

4 Results

4.1 Lexicon Phase

After the assignment of polarities by the lexicon dictionaries TextBlob and NLTK VADER, the following results were obtained.

TextBlob assigned 72% neutral, 18% positive and 10% negative labels, which indicated that most of the data was neutral. NLTK VADER, in comparison, gave 50% neutral, 22% positive and 28% negative labels. Figure 2 gives a comparison of the results obtained by TextBlob and NLTK VADER.

Fig. 2.

Comparison of polarity classification done by TextBlob and NLTK.

4.2 Covid Headlines Classification

After using the scored corpus of words to correctly classify the covid19 headlines and modifying the polarity scores relative to the VADER corpus, the number of headlines under each label changed to a certain extent. The positive class saw a decrease of 139 headlines, the negative class an increase of 342 headlines, and the neutral class a decrease of 285 headlines. Table 1 compares the total number of headlines under each label before and after the application of the scored corpus, listing the changes in magnitude along with their signs.

Table 1. Comparison of total number of headlines before and after covid headlines classification function.

4.3 Decision Tree

The decision tree algorithm performed differently on each dataset. Its results were analyzed using four metrics, accuracy, precision, recall and F1 score, along with the confusion matrix, which displays the number of true and false classifications. These values were obtained using the built-in functions of the scikit-learn python module.
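A minimal sketch of how these metrics can be obtained with scikit-learn, assuming a fitted classifier `clf` and the test split from the earlier sketches, is:

```python
# Evaluation sketch: accuracy, confusion matrix, and weighted precision/recall/F1.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
print(f"Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}")
```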

Applying the algorithm to the dataset obtained from the lexicon phase, without the covid headline classification function, gave an accuracy score of 94% at the classifier’s default parameter values. Decision trees tend to overfit at default hyperparameter values; thus, optimal hyperparameter values were set to reduce overfitting.

The accuracy score obtained after tuning the hyperparameters and using the modified dataset is 93.05%, along with high values of precision, recall and F1-score, as shown in Table 2. These results show that the algorithm performed adequately on the dataset.

Table 2. Classification report for decision tree.

4.4 Random Forest

The random forest algorithm also performs differently on each dataset, so it is important to analyze its performance with suitable metrics. This study used four metrics, accuracy, precision, recall and F1 score, to judge the classifier’s results, along with the confusion matrix, which displays the number of true and false classifications. These values were obtained using functions of the scikit-learn python module.

The algorithm gave an accuracy score of 87.98% when executed on the dataset obtained after the lexicon phase. At that point, the covid headlines classification function had not been applied to the headlines and the classifier had all its parameters set to their default values.

However, after all the optimizations, the classification of news headlines by random forest gave a promising accuracy score of 92.36%, supported by high values of precision, recall and F1-score, as shown in Table 3. These results indicate that the algorithm worked well on the dataset.

Table 3. Classification report for random forest.

5 Conclusion

In this research project, sentiment analysis of news headlines from Indian journals was successfully implemented using both lexicon and machine learning approaches. Of the two dictionaries used in the lexicon model, NLTK VADER gave more convincing results than TextBlob. An effective corpus of COVID-19 words was created to correctly classify coronavirus-related news, and the resulting dataset was fed to the machine learning classifiers, decision trees and random forest. The algorithms gave good accuracy scores of 93.05% and 92.36% respectively, along with high F1-scores, precisions and recalls. It was concluded that Indian newspapers highlight neutral news the most, followed by negative and then positive news.

This finding can be very useful for Indian news agencies when targeting readers and contemplating their reporting strategies. If they report a lot of negative news, their sales and subscriber numbers could decrease. Hence, they can study the effects of their reported news on the general public and alter their strategies for greater viewership and more subscribers.

The public, too, can take advantage of this work by understanding the sentiments behind Indian journalism and filtering the newspapers they want to read. They can skip or read a newspaper on the basis of the sentiment conveyed by its headlines: if it appears appealing and of a desired sentiment, they can continue reading it, or else move to another one. This could be a way to avoid highly negative or depressing information.

The work can be further expanded into a platform, or integrated with an application, that segregates news and serves it to the public based on the sentiment of their choice.

Corporate organizations can make use of this research to strategize the advertisement and branding of their products, as the language of advertisements in the news media forms a sentiment in a person’s mind. A person forms an opinion about a product even before knowing its details; thus, the opinions people form of a product through advertisement play a major role in its sales.

This study can be extended to other fields, such as analyzing the mental and behavioral effects of Indian news headlines and taglines, because the type of news one reads affects mental and psychological well-being. Reading too much of a particular type of news can impact one’s mind and thought process; it can make the brain perceive only a certain type of information, which can lead to unwanted emotions like fear, aggression and greed. The nature of news can also affect the behavior and actions of an individual, since a person can imitate things from the news in both positive and negative ways.

The research can also be used to study the impact of news in promoting the art and culture of India, since Indian journals contain extensive cultural information which helps people form an opinion of India’s culture, food, art, traditions and so on. It lets a person form an opinion of a place they may or may not have visited, and this opinion can considerably impact the tourism industry of that area.