Abstract
In the new normal of fake news, wide-scale disinformation and alternate facts, the need for fact-checking and bot detection is real and immediate. It is generally accepted that altering the psyche, our thinking, is the most potent form of controlling and shaping human behavior. Fake news, disinformation and alternate reality are all aimed at shaping our beliefs. In this experiment, we set out to understand whether the stream of news articles is itself designed to influence society at large to think positively or negatively; in other words, whether it can dictate how society views itself, disconnected from reality. We seek to identify whether there is a systematic prevalence of positive or negative sentiment in a given stream of news articles, using standard NLP techniques.
1 Cogito Ergo Sum
The often quoted “I think, therefore I am” [1] is a profound reflection on the human condition. To think is distinctly human, and it is upon us to nurture and protect that faculty. Sometimes our thinking springs forth from an unknown source, as in the case of inspired works such as E = mc² [2] or Paradise Lost [3], and we do not need any protection from such sources. Then there are ephemeral sources of information, mostly rooted in some local context and short-lived, such as media in the myriad forms in which it is delivered to us. While it is up to individuals to choose where they source information, left unchecked, the potential for outlets with undesirable objectives to misrepresent reality, spread falsehood, mislead and shape societal thinking is real and present. Arguably, both in the case of Brexit in recent history and during World War II, public opinion was shaped by a select few with access to media outlets.
1.1 Age of Disinformation and Fake-News
In the new normal of fake news, wide-scale disinformation and alternate facts, the need for fact-checking and bot detection is real and immediate. It is generally accepted that altering the psyche, our thinking, is the most potent form of controlling and shaping human behavior. Fake news, disinformation and alternate reality are all aimed at shaping our beliefs. In this experiment, we set out to understand whether the stream of news articles is itself designed to influence society at large to think positively or negatively; in other words, whether it can dictate how society views itself, disconnected from reality. To answer this question, we processed a large number of articles published over a period of time. Each article was first prepared for NLP and then classified as either negative (depressing) or positive (uplifting) using several different classifiers. We then present a time series of the sentiment to understand whether there has been a demonstrable shift in the sentiment of the article stream.
1.2 Technology Is a Double-Edged Sword
In this new era of numerous technological advances, such as the internet and social media tools, this problem is further exacerbated. One can therefore posit that technology amplifies negative societal tendencies. However, other concomitant advances, such as machine learning, natural language processing and API-driven access to data, allow us to devise solutions to counteract anti-social behaviors and possibly mitigate this risk. The solution to a problem induced by technology happens to be rooted in technology as well.
This is urgent, and in the rest of this paper we present our efforts to engineer a solution that classifies individual articles using NLP, characterizes streams of articles, and determines the sentiment projected in a given stream of news articles.
2 Technical Overview
We performed a broad sentiment analysis of articles published by several digital outlets as a function of time and developed a time series of promoted sentiment for each media outlet. We do not know, and we are not seeking to establish, whether public opinion was indeed shaped during these periods. What we will establish is the sentiment, article by article, over time.
2.1 The Experiment
We processed articles from two outlets, CNN and the Guardian, published between Jan 2011 and Nov 2019, and retrieved 68,158 articles from CNN and 38,625 articles from the Guardian. Our scope was to analyze one geographical region at a time. In this study we processed US news from both CNN and the Guardian. Although we wanted to study news articles from other outlets, these were the only two news corpora we could find.
Each article, once retrieved, was prepared for NLP tasks; we then performed sentiment analysis and labeled each article as having either positive or negative sentiment. This sentiment, the article, the publisher and the date of publication were persisted. This is a classic big data “pipeline” problem. With this pipeline pattern, parallelizing is straightforward: each stage can be run in parallel using a shared queue.
We now discuss the technical considerations and the architecture of our solution in detail for each pipeline stage. We implemented the pipeline in Python.
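The pipeline pattern can be sketched as follows. This is a minimal illustration, not our production code: it uses one thread per stage and in-process queues, and the names `stage`, `run_pipeline`, `prepare` and `classify` are hypothetical stand-ins for the real stages.

```python
import queue
import threading

SENTINEL = None  # marks the end of the article stream


def stage(worker, inbox, outbox):
    """Apply `worker` to every item taken from `inbox` and forward the result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(worker(item))


def run_pipeline(items, workers):
    """Chain the workers with shared queues; each stage runs in its own thread."""
    queues = [queue.Queue() for _ in range(len(workers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
        for i, w in enumerate(workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for the preparation and classification stages.
def prepare(text):
    return text.lower().split()


def classify(tokens):
    return "positive" if "good" in tokens else "negative"


labels = run_pipeline(["Good news today", "Storm damage reported"], [prepare, classify])
```

Because every stage only talks to its neighbours through a queue, a slow stage can be given more workers, or moved to another node, without changing the others.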
2.2 Data Retrieval
Data sources usually limit the rate at which clients retrieve data. Rate limits are imposed per IP address, per API key, or sometimes on both. One must manage this tactfully so that a session completes in reasonable time without being blocked by the content provider.
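One common way to stay under such a limit is a sliding-window limiter on the client side. The sketch below is an assumption about how this could be done, not a description of our scraper; the class name `RateLimiter` and its parameters are hypothetical. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of the calls still inside the window

    def _prune(self, now):
        self.calls = [t for t in self.calls if now - t < self.period]

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = self.clock()
        self._prune(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._prune(now)
        self.calls.append(now)


# One limiter per IP address or API key; the first call never sleeps.
limiter = RateLimiter(max_calls=3500, period=24 * 3600)
limiter.wait()
```

Calling `limiter.wait()` before every request keeps a single client within its quota; a separate limiter instance would be kept for each IP address or key.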
Using the standard HTML parser that comes with the Beautiful Soup package, we extract and store the URLs. Independently, content is retrieved from each URL using text libraries available in Python and stored in the file system as files, so that we could leverage file processing capabilities. The metadata is persisted in the database with the following tuple structure:
Outlet: the source from which we gathered the data (in our case CNN)
Date: the published date of the article
Title: the title of the article
Url: the actual URL of the article, for anyone who wants to read it on the website
File_name
Guardian data is only marginally different, allowing us to reuse much of the utilities we wrote for CNN.
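The two steps above, link extraction and metadata persistence, can be sketched with the standard library alone. This is an illustration under stated assumptions: `html.parser` stands in for Beautiful Soup, SQLite stands in for the unspecified database, and `LinkExtractor`, `ArticleMeta`, `save_meta`, the table name and the example.com URL are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass, astuple
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag (stdlib stand-in for Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


@dataclass
class ArticleMeta:
    outlet: str
    date: str       # published date, ISO format
    title: str
    url: str
    file_name: str  # where the article text is stored on disk


def save_meta(conn, meta):
    """Insert one metadata tuple; a repeated URL is ignored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(outlet TEXT, date TEXT, title TEXT, url TEXT PRIMARY KEY, file_name TEXT)"
    )
    conn.execute("INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)", astuple(meta))
    conn.commit()


# Hypothetical sitemap fragment; example.com is a placeholder domain.
extractor = LinkExtractor()
extractor.feed('<a href="https://example.com/us/story-1">Story</a><a>no link</a>')

conn = sqlite3.connect(":memory:")
for url in extractor.links * 2:  # saving the same URL twice stores one row
    save_meta(conn, ArticleMeta("CNN", "2019-11-30", "Story", url, "story-1.txt"))
row_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Using the URL as the primary key makes the save step idempotent, so an interrupted scraping session can simply be rerun.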
Articles from the Guardian are available from 2008 and CNN articles are available from 2011. There were a lot more US articles from CNN, as one would expect, but by partitioning the tasks as described above, we distributed the load across 3 nodes and processed approximately 47 K articles in 80 min, at times processing more than 500 articles per minute.
We show here the scraping table.
2.3 Managing IP Address Using Proxies
As mentioned before, CNN limits us to 3500 calls per day; below we elaborate on the techniques we used to retrieve ~50 K articles in 80 min. We could not achieve the same level of throughput from the Guardian, perhaps because of internet latencies and possibly because our public proxy IP addresses were shared.
2.4 Proxies
Exceeding the CNN limit results in a 24 h block. To overcome this constraint we used a rotating IP service from US Proxy. After much trial and error, we chose a paid proxy service so that our proxies were not shared over the internet. We used a total of 50 IP addresses, of which 30 were valid.
The randomized proxy approach resulted in a significant reduction in data gathering time.
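A randomized proxy pool of this kind can be sketched as below. The class `ProxyPool`, the helper `daily_capacity` and the 203.0.113.x addresses (a documentation-only range) are hypothetical, and the capacity figure assumes the 3500-call limit applies separately to each IP address.

```python
import random


class ProxyPool:
    """Pick a random proxy for each request and drop proxies that fail."""

    def __init__(self, proxies, rng=None):
        self.alive = list(proxies)
        self.rng = rng or random.Random()

    def pick(self):
        if not self.alive:
            raise RuntimeError("no working proxies left")
        return self.rng.choice(self.alive)

    def mark_bad(self, proxy):
        """Remove a proxy after a failed or blocked request."""
        if proxy in self.alive:
            self.alive.remove(proxy)


def daily_capacity(n_proxies, per_ip_limit):
    """Upper bound on calls per day if the limit applies to each IP separately."""
    return n_proxies * per_ip_limit


# 50 placeholder addresses, of which 20 turn out to be invalid.
pool = ProxyPool([f"203.0.113.{i}:8080" for i in range(1, 51)])
for bad in list(pool.alive[:20]):
    pool.mark_bad(bad)

capacity = daily_capacity(len(pool.alive), 3500)  # 30 * 3500 calls per day
proxy = pool.pick()
```

Under that per-IP assumption, 30 valid addresses allow up to 105,000 calls per day, which comfortably covers a session of roughly 50 K articles.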
2.5 Preprocessing
In this phase we removed duplicate articles, and performed the required NLP tasks, as follows:
1. removed extra spaces, special characters, single characters, new lines,
2. converted entire text to lowercase,
3. removed stop words using stop words library from nltk,
4. performed lemmatization/stemming,
5. removed participle, tense form of the words.
This preprocessing resulted in a 50% reduction in the number of bytes to be processed.
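The cleaning steps can be sketched as follows. This is a simplified illustration: the stop-word set is a tiny hypothetical sample rather than the nltk list, lemmatization/stemming is omitted because it needs nltk, and the function names are ours for this example only.

```python
import re

# Tiny illustrative stop-word set; the nltk list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "of", "and", "to", "in", "it"}


def unique_articles(texts):
    """Drop exact duplicate articles while keeping the original order."""
    seen, unique = set(), []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def preprocess(text):
    """Lowercase, strip non-letters, and remove single characters and stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits, punctuation, special characters
    tokens = text.split()                   # also collapses extra spaces and new lines
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


articles = unique_articles(["The  storm was   BAD!\nA new wave is coming.", "Other text"] * 2)
tokens = preprocess(articles[0])
```

Note that the regular expression keeps only ASCII letters, which is adequate for English news text but would need widening for other languages.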
2.6 NLP
In this phase we classified each article using 5 different binary classifiers namely
1. Naive Bayes,
2. MultinomialNB,
3. BernoulliNB,
4. LogisticRegression, and
5. LinearSVC
and assigned a sentiment to the article using a majority voting scheme.
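Majority voting over the five classifier outputs can be sketched as below. The function name `majority_vote` and the returned agreement ratio are illustrative additions, not part of the pipeline described above.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label and the fraction of classifiers that chose it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# One label per classifier, in the order listed above (hypothetical outputs).
label, agreement = majority_vote(["pos", "neg", "pos", "pos", "neg"])
```

With an odd number of binary classifiers a tie cannot occur, so the vote always yields a single label.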
2.7 Training Data for Sentiment Labeling
We use lists of known positive and negative words to train our classifier models.
short_pos = open("trainning_files/positive.txt", "r", encoding="iso-8859-1").read()
short_neg = open("trainning_files/negative.txt", "r", encoding="iso-8859-1").read()
Words associated with positive sentiment
Words associated with negative sentiment
We trained on 80% and tested on 20% of the data.
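An 80/20 split of labeled samples can be sketched as follows. The helper `split_train_test`, the fixed seed and the sample words are assumptions made for illustration; the actual split procedure may differ.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled words standing in for the positive and negative word lists.
labeled = [(w, "pos") for w in ["rich", "clear", "best", "attractive", "marvelous"]]
labeled += [(w, "neg") for w in ["junk", "horrible", "sucks", "dull", "poor"]]

train, test = split_train_test(labeled)
```

Shuffling before the cut matters here because the samples are built label by label; without it the test part would contain only negative words.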
2.8 Verification and Testing
Let us consider three sentences:
“This article was rich, clear, willing, ingenuous, attractive, sensational, and hot”
“This is the best marvelous, imaginative, and realistic one I have seen”
“This article was utter junk. There were absolutely 0 points. I don’t see what the point was at all. Horrible essay, sucks”
with the corresponding result shown above.
We applied this to entire documents and, for each document, tabulated the sentiment generated by the 5 classifiers as shown.
3 Result Analysis
We achieved an accuracy of approximately 72% with an 80/20 split across the 5 classifiers, as shown below.
The table below presents the average execution time (over around 16,800 .txt files) and the accuracy achieved for each classifier:
Classifier | Naive Bayes | MultinomialNB | BernoulliNB | LogisticRegression | LinearSVC |
---|---|---|---|---|---|
Accuracy percentage | 72.16% | 72.61% | 72.24% | 72.38% | 72.83% |
Execution time | 0.00639 | 0.00191 | 0.00184 | 0.00219 | 0.00177 |
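As a quick check of the overall figure, the per-classifier accuracies in the table can be averaged directly. The short dictionary keys are labels chosen for this example, in the same column order as the table.

```python
# Accuracy percentages copied from the table, in column order.
accuracy = {
    "NaiveBayes": 72.16,
    "MultinomialNB": 72.61,
    "BernoulliNB": 72.24,
    "LogisticRegression": 72.38,
    "LinearSVC": 72.83,
}

mean_accuracy = sum(accuracy.values()) / len(accuracy)
best = max(accuracy, key=accuracy.get)
spread = max(accuracy.values()) - min(accuracy.values())
```

The mean is about 72.4%, and the gap between the best and worst classifier is under one percentage point, so the five models behave almost identically on this task.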
4 Visualization
Below we show the number of positive/negative articles for CNN and the Guardian for the entire period.
Sentiment analysis results of ten years.
4.1 Yearly Sentiment
In each year, blue represents negative articles and red represents positive articles.
We count the negative and positive articles in each month and present the monthly counts for all of 2019 using bar charts.
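The monthly counting can be sketched as below, assuming each persisted record carries an ISO-formatted publication date and a sentiment label. The function `monthly_counts` and the sample records are hypothetical.

```python
from collections import Counter


def monthly_counts(records):
    """Map 'YYYY-MM' to a Counter of sentiment labels for that month."""
    counts = {}
    for date, label in records:
        month = date[:7]  # 'YYYY-MM' prefix of an ISO date
        counts.setdefault(month, Counter())[label] += 1
    return counts


# Hypothetical (date, label) records.
records = [
    ("2019-01-03", "neg"),
    ("2019-01-20", "pos"),
    ("2019-01-25", "neg"),
    ("2019-02-01", "pos"),
]
counts = monthly_counts(records)
```

Each month's counter then supplies the two bar heights, for example `counts["2019-01"]["neg"]` and `counts["2019-01"]["pos"]`.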
4.2 Sentiment Trend
In addition, we visualize the trend.
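One simple way to expose a trend in such counts is to track the share of positive articles per period and smooth it with a moving average. This is only one possible choice and is not necessarily how the trend chart was produced; the function names, the window size and the sample counts are assumptions.

```python
def positive_share(pos, neg):
    """Fraction of articles in a period that were labeled positive."""
    return pos / (pos + neg)


def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical (positive, negative) article counts for four consecutive periods.
period_counts = [(1, 3), (2, 2), (3, 1), (2, 2)]
shares = [positive_share(p, n) for p, n in period_counts]
smoothed = moving_average(shares, window=2)
```

A flat smoothed series would correspond to no discernible shift in sentiment over time.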
4.3 Production Deployment URLs
All work has been deployed on the Microsoft Azure cloud and may be viewed here:
- 1.
- 2.
- 3.
- 4.
5 Conclusions
We find no discernible change in the positive/negative sentiment for CNN and the Guardian, as one would expect. We are actively seeking data to conduct additional experiments.
References
Python newspaper documentation. https://buildmedia.readthedocs.org/media/pdf/newspaper/latest/newspaper.pdf
Text Summarization of an Article. https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
Insights about Nltk library. https://medium.com/datadriveninvestor/python-data-science-getting-started-tutorial-nltk-2d8842fedfdd
Usage of MultiCore Processing. https://medium.com/python-pandemonium/how-to-speed-up-your-python-web-scraper-by-using-multiprocessing-f2f4ef838686
How Lemmatization works. https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/
Data Source. https://www.cnn.com/us/article/sitemap-{yyyy}-{mm}.html, starting from 2011-07 till 2019-12-07
Data Source. https://www.theguardian.com/us-news/{yyyy}/{mm}/{dd}/all starting from 2008/jan/01 day till 2019/dec/31
Web Construction. https://getbootstrap.com/, https://www.echartsjs.com/en/index.html, and https://www.wix.com/
Data is uploaded on GitHub. https://github.com/Darshansol9/News_Articles_Sentiment_Analysis
Acknowledgements
Generous support from IBM Power Systems Academic Initiatives (IBM PSAI) is acknowledged. Raman is grateful to the many determined volunteers and their fine contribution.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Solanki, D., Hsu, HM., Zhao, O., Zhang, R., Bi, W., Kannan, R. (2020). The Way We Think About Ourselves. In: Schmorrow, D.D., Fidopiastis, C.M. (eds) Augmented Cognition. Theoretical and Technological Approaches. HCII 2020. Lecture Notes in Computer Science(), vol 12196. Springer, Cham. https://doi.org/10.1007/978-3-030-50353-6_21
DOI: https://doi.org/10.1007/978-3-030-50353-6_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50352-9
Online ISBN: 978-3-030-50353-6
eBook Packages: Computer Science, Computer Science (R0)