
1 Introduction

In recent years, with the rapid growth of social media, sharing, commenting on and reading various kinds of articles, including news and articles of a social or political nature, has become a central part of people's daily entertainment. As vast amounts of news, rumors and stories are published every day, there is a need to predict whether a news piece will go viral before it is published. Predicting news popularity has become a popular field of research, helping researchers, authors and advertisers build their strategies and reach as many individuals as possible. It also helps in extracting and exploiting the features that contribute to the spread of a viral article. Moreover, some politicians are concerned about the influence of news articles on the population and the effects of spreading such news. In this paper, the dataset used is a real-world dataset taken from the UCI Machine Learning Repository [1] that contains over 39000 articles collected from the Mashable website [2].

The dataset has various informative features [3]. We intend to compare and analyze the performance of several machine learning algorithms in predicting the popularity of news articles. Popularity is measured by the number of times an article is shared, liked and commented on. For the popularity measure we adopt a common binary task that classifies articles as popular or unpopular, and then use the machine learning algorithms to build a classification model that can be used to classify new articles based on their features.

There are two main prediction approaches to measure popularity [4].

The first approach uses features that are only known and observed after an article is published, while the second approach does not use these features. The first approach is more common. An example of the first approach can be found in [5], where the evolution of the popularity of user-generated content is discussed. Another methodology, which predicts the popularity of online content more precisely rather than attempting to infer the probability that a piece of content will become popular, can be found in [6].

A statistical analysis of the time it takes users to react to a newly opened online discussion thread was carried out on the popular news website Slashdot [7, 8]. It also performed a characterization that enabled predicting intermediate and long-term user behavioral patterns with an acceptable level of precision.

Predicting the popularity of online content was elaborated in [9, 10]. In [11] the authors proposed a framework for modeling and predicting the popularity of online content that aims to infer the likelihood with which the content will become popular.

Since the first approach uses features observed after publication, the prediction task is easier and higher accuracies are often achieved. Predicting the popularity of articles without using such features is a less common approach, as lower prediction performance is expected; however, relying only on features known before publication makes it possible to improve the content before it is published.

2 Literature Review

The work in [3] consists of a robust evaluation of five state-of-the-art classification models on around 39 thousand articles that were collected and labeled from the Mashable website [2]. The Random Forest experiments in [3] produced the best result, with a discrimination power of 73% for binary classification.

A study addressing the prediction task both as a regression and as a classification problem, with the aim of predicting the number of tweets about a news item, was presented in [12]. The paper showed that even though predicting the exact number of tweets may have a high error percentage, it is possible to predict ranges of popularity of news tweets with 84% overall accuracy. Furthermore, it considered four types of features (news source, category of the article, subjectivity of the language used, and names mentioned in the article) to predict the number of tweets that mention the article. Three popularity classes were studied, covering 1–20 tweets, 20–100 tweets and more than 100 tweets, discarding articles with no tweets.

Another study based on news tweets proposed the passive aggressive algorithm to predict how many times a tweet containing a news link will be retweeted [13]. The study also discussed some social features, such as how many users follow the tweeting user, which can determine the number of times an article will be retweeted. Moreover, this research observed that the number of URLs and hashtags can also boost a tweet and help it reach as many people as possible.

Xuandong et al. researched the topic of predicting whether a Mashable news article will be viral or not by addressing two dimensions of the problem: multi-class classification and numerical regression [14]. They applied linear regression, polynomial regression, GAM with smoothing splines and Lasso to predict the exact number of shares of a news article; of these, GAM with smoothing splines gave the best cross-validation error of 0.7649. They also used SVM, Random Forest and Bagging to predict the popularity of the news, assigning each article to one of four categories, with Random Forest giving the best prediction accuracy of 50.4%.

The research in [15] tested two binary classification tasks: popular vs. unpopular, and appealing vs. non-appealing relative to articles published on the same day, using one year of data from 10 English news outlets. The paper combined bag-of-words features of the title and description, keywords and characteristics such as the date of publishing with a Support Vector Machine (SVM). The appealing task gave better accuracy results of 62–86%, compared with the popular/unpopular task, which gave results ranging from 51% to 62%.

3 Methodology

In this section, the methodology of the classification algorithms is discussed. Four classification algorithms are discussed and implemented to classify the data collected from different articles. The goal of this study is to build a classification model with high accuracy that can eventually be used to predict the popularity of articles before a decision is made on whether to publish them. Subsections 3.1 to 3.6 discuss and elaborate the methodology in detail.

3.1 Dataset

The dataset obtained from [1] consists of over 39 thousand articles from Mashable [2], one of the largest and best known news websites. The data was retrieved and prepared by [3] over a two-year period, from January 7, 2013 to January 7, 2015. Special-occasion articles, which made up only a small portion of the dataset and did not follow the general HTML structure, were discarded because processing each occasion would require a specific parser. The collected data was donated to the UCI Machine Learning Repository [1] for public use. The collection and processing of the Mashable [2] data in [3] was implemented in Python, while we use the Weka tool for our work. After preprocessing, the dataset contained a total of 39 thousand data points with 60 features.

We summarize the work done on the Mashable [2] dataset before proceeding with the classification. The classification considered 47 features in total, extracted from the HTML code. The features of the dataset are shown in Table 1 [3]. These features are of different types: an integer number, a ratio within [0, 1], a Boolean that can be either 0 or 1, or a nominal value. Columns marked with (#) indicate the number of variables within each feature. The number-of-shares attribute, which we work with in this paper, was derived in [3] by selecting a large list of characteristics that describe different aspects of the article and were considered potentially relevant to the number of shares; the dataset also contains the minimum, maximum and average number of shares in various social networks.

In this paper, a binary classifier is used. Two classes are considered: popular and unpopular. If an article has more than 1400 shares it is considered popular, otherwise it is considered unpopular. The classification algorithm then uses the existing data to predict these classes based on the 47 attributes.

The data is divided into two parts: two thirds are used for building (training) the model and one third for validation, in order to avoid overfitting.
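As a minimal sketch of this preparation step (not the exact Weka workflow used in our experiments), the labeling and split could be reproduced in Python as follows; the file name and the column names are assumptions based on the UCI dataset description [1]:

  # Hypothetical preparation script; column names such as "shares" follow
  # the UCI description of the Mashable dataset and may need adjusting.
  import pandas as pd
  from sklearn.model_selection import train_test_split

  df = pd.read_csv("OnlineNewsPopularity.csv")
  df.columns = df.columns.str.strip()  # the UCI file pads column names with spaces

  # Label articles with more than 1400 shares as popular.
  df["popularity"] = (df["shares"] > 1400).map({True: "pop", False: "not-pop"})

  X = df.drop(columns=["url", "shares", "popularity"], errors="ignore")
  y = df["popularity"]

  # Two thirds for training, one third for validation.
  X_train, X_val, y_train, y_val = train_test_split(
      X, y, test_size=1/3, random_state=42, stratify=y)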

In the next subsections we discuss the data labeling process and the different algorithms that were used in building the classification model.

Table 1. Statistical measures of the articles in Mashable dataset [3]

3.2 Process of Labeling Classes and Evaluating Results

We added a new attribute named popularity and modified the dataset so that articles having more than 1400 shares are labeled as popular ('pop') and all others are labeled as 'not-pop'. The Excel function applied was:

  • =IF(BI2>1400, “pop”, “not-pop”)

  • We then observe that the classes are balanced in Weka, as shown in the figure (Fig. 1)

    Fig. 1. Classified dataset based on popularity

The total number of attributes is 61, but we used only the 47 discussed before and shown in Table 1. We used 10-fold cross-validation as the testing mode for all selected algorithms, with all 39644 class instances (rows) used in every algorithm.
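A minimal sketch of this 10-fold cross-validation protocol, assuming X and y hold the 47 selected features and the popularity labels prepared above (scikit-learn is used here only for illustration; the experiments in this paper were run in Weka):

  from sklearn.model_selection import StratifiedKFold, cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
  clf = DecisionTreeClassifier(random_state=1)  # any classifier from Sects. 3.3-3.6 can be plugged in

  scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
  print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))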

3.3 AdaBoost

AdaBoost, short for “Adaptive Boosting”, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire [16]. AdaBoost is used in conjunction with other learning algorithms to improve their performance.

We used Weka's AdaBoostM1 [20], a class for boosting a nominal class classifier that can only tackle nominal class problems. AdaBoostM1 was used in combination with the J48 classifier.
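A hedged sketch of this setup, using scikit-learn's AdaBoostClassifier with a CART decision tree as an approximation of Weka's AdaBoostM1 with J48 (the parameter is named base_estimator in older scikit-learn releases; the number of boosting rounds shown is an assumption):

  from sklearn.ensemble import AdaBoostClassifier
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.model_selection import cross_val_predict
  from sklearn.metrics import confusion_matrix

  # Boost a decision tree, roughly analogous to AdaBoostM1 + J48 in Weka.
  boosted = AdaBoostClassifier(
      estimator=DecisionTreeClassifier(random_state=1),
      n_estimators=10,
      random_state=1)

  y_pred = cross_val_predict(boosted, X, y, cv=10)
  print(confusion_matrix(y, y_pred, labels=["not-pop", "pop"]))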

The confusion matrix, also known as an error matrix or a contingency table, is a table layout that visualizes the performance of an algorithm: every column stands for the instances of a predicted class, while every row stands for the instances of the actual class. The confusion matrix structure for any classification algorithm is given by Table 2 [17].

Table 2. Confusion matrix structure

Testing the models with just the accuracy and sensitivity measures is not adequate to ensure that the classifications give reliable results, so we use additional measures for model evaluation, which are as follows.

To check the classification performance of an algorithm, different measures can be used. Some of the most common measures that can be calculated from the confusion matrix are listed below; a short computation sketch follows the list:

  • Classification accuracy: the true rate of the model, measured as the number of correctly classified instances divided by the total number of instances. Accuracy = \( \left( {{\text{tp}} + {\text{tn}}} \right)/\left( {{\text{tp}} + {\text{tn}} + {\text{fp}} + {\text{fn}}} \right) \)

  • Sensitivity: the true positive rate, which measures the proportion of actual positives that are correctly identified. Sensitivity = \( {\text{tp}}/\left( {{\text{tp}} + {\text{fn}}} \right) \)

  • Specificity: also known as the true negative rate, which measures the proportion of actual negatives that are correctly identified. Specificity = \( {\text{tn}}/\left( {{\text{tn}} + {\text{fp}}} \right) \)

  • Precision: measures the fraction of instances predicted to belong to a class that actually belong to it. Precision = \( {\text{tp}}/\left( {{\text{tp}} + {\text{fp}}} \right) \) [12]
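The sketch below shows how these measures can be computed from the four cells of a binary confusion matrix (tp, fn, fp, tn); the function and variable names are ours, introduced only for illustration:

  def classification_measures(tp, fn, fp, tn):
      # Measures defined above, computed from the confusion matrix cells.
      accuracy = (tp + tn) / (tp + tn + fp + fn)
      sensitivity = tp / (tp + fn)   # true positive rate
      specificity = tn / (tn + fp)   # true negative rate
      precision = tp / (tp + fp)
      return accuracy, sensitivity, specificity, precision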

The resulting confusion matrix, where ‘A’ stands for the not-pop class and ‘B’ stands for the popular class, is shown in Table 3. Results of running AdaBoost and the remaining algorithms are detailed in the next section for comparison. It can be seen from the confusion matrix of the AdaBoost algorithm that 23397 data points were correctly classified, which gives an accuracy of 23397/39644 ≈ 0.59.

Table 3. Confusion matrix for AdaBoost classification

3.4 K Nearest Neighbor K-NN

K-NN is a simple supervised learning algorithm that stores all available instances and classifies them based on distance functions (similarity measures). K-NN has been used in statistical estimation and pattern recognition as a non-parametric method. K-NN classifies each instance based on its neighbors, assigning each data point to the most similar class as gauged by a distance function such as the Euclidean distance [18]. In our work we first used K-NN with K = 1, where a data instance is simply assigned to the class of its single nearest neighbor. K-NN gave the worst result among all the tested algorithms when the value of K was 1. Experimenting further by increasing the value of K eventually gave the best result among all the tested algorithms; however, increasing K too far eventually started to lower the accuracy again.

The Weka KNN classifier (IBk) was used with different values of K. The results for the performance measures are shown in Table 4. We can see from the table that the best results were obtained when K = 37. The confusion matrix for each value of K is shown in Tables 5 through 16.
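As an illustration of the K sweep summarized in Table 4, a scikit-learn equivalent of the Weka IBk runs could look as follows (a sketch only, assuming X and y are prepared as in Sect. 3.2):

  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.model_selection import cross_val_score

  # Sweep the same K values as in Tables 5-16; Euclidean distance is the default metric.
  for k in (1, 3, 5, 7, 11, 13, 33, 35, 37, 40, 50, 66):
      knn = KNeighborsClassifier(n_neighbors=k)
      acc = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()
      print("K = %2d: accuracy = %.3f" % (k, acc))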

Table 4. K-NN performance measures for different value of K
Table 5. Confusion matrix for K-NN (K = 1)
Table 6. Confusion matrix for K-NN (K = 3)
Table 7. Confusion matrix for K-NN (K = 5)
Table 8. Confusion matrix for K-NN (K = 7)
Table 9. Confusion matrix for K-NN (K = 11)
Table 10. Confusion matrix for K-NN (K = 13)
Table 11. Confusion matrix for K-NN (K = 33)
Table 12. Confusion matrix for K-NN (K = 35)
Table 13. Confusion matrix for K-NN (K = 37)
Table 14. Confusion matrix for K-NN (K = 40)
Table 15. Confusion matrix for K-NN (K = 50)
Table 16. Confusion matrix for K-NN (K = 66)

The accuracy calculated from Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 is shown in Fig. 2. We can see how the accuracy changes as the value of K in K-NN increases; the best accuracy was at K = 37. The x-axis shows the value of K and the y-axis shows the accuracy.

Fig. 2. K-NN accuracy for different values of K

3.5 J48 Decision Tree (Pruned Tree)

A decision tree is a tree-like flowchart structure in which each internal node performs a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The topmost node of the tree is the root node. Decision trees can be built with different algorithms; J48 is an algorithm for generating a decision tree developed by Ross Quinlan [19] and is commonly used for classification. J48 was implemented using Weka.
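A minimal sketch of a comparable decision tree in scikit-learn, used here as a stand-in for J48 (scikit-learn implements CART rather than C4.5, so the splitting and pruning behavior differs; the min_samples_leaf value is an assumed surrogate for pruning):

  from sklearn.tree import DecisionTreeClassifier
  from sklearn.model_selection import cross_val_score

  tree = DecisionTreeClassifier(
      criterion="entropy",    # information gain, closer in spirit to C4.5/J48
      min_samples_leaf=20,    # assumed value to keep the tree small ("pruned")
      random_state=1)

  print(cross_val_score(tree, X, y, cv=10, scoring="accuracy").mean())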

We can see from the resulting confusion matrix (Table 17) that the accuracy was similar to that obtained when we used AdaBoost with J48 to classify popularity:

Table 17. Confusion matrix for J48 decision tree

3.6 Naïve Bayes

Naïve Bayes is a probabilistic machine learning algorithm based on Bayes’ theorem with a strong (naïve) independence assumption between features; it is trained on labeled data for supervised learning tasks such as classification and prediction.
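A sketch of an equivalent classifier in scikit-learn, assuming the numeric attributes are modeled with Gaussian likelihoods (Weka's NaiveBayes also uses normal densities for numeric attributes by default):

  from sklearn.naive_bayes import GaussianNB
  from sklearn.model_selection import cross_val_predict
  from sklearn.metrics import confusion_matrix

  nb = GaussianNB()
  y_pred = cross_val_predict(nb, X, y, cv=10)
  print(confusion_matrix(y, y_pred, labels=["not-pop", "pop"]))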

The resulting confusion matrix, where ‘A’ stands for the not-popular class and ‘B’ stands for the popular class, is shown in Table 18. We notice a relatively high error rate.

Table 18. Confusion matrix for Naïve Bayes

4 Results and Summary

Four different algorithms have been used to classify the articles’ data based on their popularity. Table 19 shows a comparison of these algorithms.

Table 19. Comparison of the 4 classification algorithms

From this comparison we can see that we got the same number of correctly and incorrectly classified instances when running AdaBoost with the J48 classifier as when running J48 alone, while AdaBoost gave a slightly lower root mean squared error and took more time to build.

Naïve Bayes gave the worst result of all, with the lowest number of correctly classified instances and the highest root mean squared error.

K-NN gave the best result among all the algorithms when K = 37, with an accuracy better than that reported for Random Forest and SVM in previous work [3]. K-NN was also the fastest in terms of the time it took to build the model, and it gave the lowest root mean squared error among the compared algorithms.

5 Conclusion

In this paper, we have introduced and discussed four machine learning algorithms that are commonly used in supervised learning: Adaptive Boosting, K-NN, Decision Trees and Naïve Bayes. The data classified in this work consists of articles described by 47 attributes and one target class (popular or unpopular); an article with more than 1400 shares was considered popular. We have seen that K-NN with K = 37 has the best performance among the four algorithms.

For future work we will consider splitting the articles into three target classes and performing the comparison between the mentioned algorithms and additional classification algorithms such as SVM and Artificial Neural Networks.