1 Introduction

Social media research has attracted many researchers because of the rapid growth of online social networks across most platforms. Social networks are now the most familiar interactive medium for communicating, sharing, and publishing an unlimited amount of information about human life. Communication here means the exchange of particular types of content, including text, photo, audio, and video data. Online social networks offer very little support for keeping unwanted data off a user's timeline. Shared information is sometimes vulgar or unwanted, and seeing it is unavoidable. Facebook, for example, lets users declare who is allowed to add data to their walls (i.e., friends, friends of friends, or defined groups of friends). However, Facebook does not check the content itself, so it is highly likely that offensive content is posted without being checked or filtered, regardless of the user's preferences [1].

Most of the data exchanged over social networks is in text format. Text mining is a technique that combines data mining, machine learning, and information retrieval. It is also referred to as text data analysis, in which significant information is retrieved from text. Text preprocessing comprises phases such as parsing, tokenization, and normalization [2]. Text classification techniques are used to automatically assign each text one of a set of categories based on its content. In this work, each text is classified into one of the following categories: politics, economics, sport, religion, technology, TV, ads, foods, health, and porno.

Online Social Networks (OSNs) are used by speakers of many languages, not only English. Many OSN users use other languages, e.g., Arabic. As of June 2017, Arabic was the fourth of the top ten languages on the Internet [3]. Interest in classifying Arabic text has increased recently for several reasons: the Arabic language is rich in content, and there are about 184 million Arab Internet users, a large percentage of whom cannot read English [3]. In addition, online Arabic content has grown quickly in the last decade, exceeding 3% of all online content and ranking eighth in overall Internet content [4]. However, there is a lack of language resources and text processing techniques for the Arabic language [5].

This paper presents a dataset collected from Arabic Facebook pages. The dataset contains 25,000 posts, collected automatically and manually labeled into 10 categories. We then apply text processing techniques that include removing non-Arabic letters, removing word suffixes and prefixes, normalization, and transformation. The labeling phase includes two stages: in the first, we labeled each post with a specific class; in the second, we asked 10 Facebook users who are native Arabic speakers to label the posts, and the classification of some posts was changed according to their feedback.
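As a rough illustration of the preprocessing steps named above, the following Python sketch normalizes a post and strips a few common affixes. The character mappings and the prefix and suffix lists are assumptions drawn from common practice in Arabic text processing, not the rules used in this paper, and the function names `normalize` and `light_stem` are hypothetical.

```python
import re

# All rules below are illustrative assumptions, not the rules used in this paper.
DIACRITICS = re.compile(r"[\u064B-\u0652]")       # short-vowel marks, shadda, sukun
ALEF_FORMS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")    # anything other than Arabic letters or whitespace

# Hypothetical affix lists, ordered longest first.
PREFIXES = ("\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644")  # wal-, bil-, al-
SUFFIXES = ("\u0627\u062A", "\u0648\u0646", "\u064A\u0646")              # -at, -un, -in


def normalize(text: str) -> str:
    """Remove diacritics and non-Arabic characters and unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace("\u0640", "")         # tatweel (elongation mark)
    text = ALEF_FORMS.sub("\u0627", text)     # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")   # ta marbuta -> ha
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())             # collapse repeated whitespace


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix, keeping at least three letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(post: str) -> list:
    """Normalize a post and stem each of its tokens."""
    return [light_stem(token) for token in normalize(post).split()]
```

The three-letter minimum in `light_stem` is a simple guard against stripping letters that belong to a short word's root; a production stemmer would need a more careful treatment.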

The remaining sections are organized as follows: Sect. 2 reviews related work. Sect. 3 presents the data collection methodology. Section 4 contains the results and evaluation, followed by Sect. 5, which contains the conclusion.

2 Related Work

Datasets are built differently depending on the research purpose, such as Natural Language Processing (NLP) or text mining. They also differ in their sources, i.e., websites, social networks, news pages, and blogs, and they vary in size and language. Some datasets are freely available and others are available commercially. In recent years, interest has grown in building datasets from online social networks because of the large amounts of data available on them. Unlike English text classification, there are no free standard datasets available for Arabic text classification research, so researchers rely on collecting their own dataset for each study [4]. Few research efforts have addressed building Arabic datasets.

Al-Kabi et al. [6] collected 4050 comments from social media such as Facebook, YouTube, Twitter, Digg, and Yahoo. These comments were used to build a dataset in Arabic and English; the SocialMention and Twendz tools were used to gather comments together with reviews in both languages. The dataset was classified into three classes only: political news, commercial, and academic. Three classification models were used to evaluate the dataset (Naïve Bayes, Support Vector Machine (SVM), and K-Nearest Neighbor (K-NN)), and the results showed that the Naïve Bayes algorithm gave the best results for both the SocialMention and Twendz tools, with accuracies of 66.2% and 45.3%, respectively.

Abdul-Mageed et al. [7] created annotated data comprising four datasets that contain different dialects. The first dataset contains 2798 chat messages collected randomly from an Egyptian chat room session in Maktoob chat. The second dataset contains 3015 Arabic tweets collected from Twitter. The third dataset consists of 3008 sentences collected from 30 Talk Pages on Wikipedia. The fourth comprises 3097 Web forum sentences collected from a larger pool of threaded conversations pertaining to different varieties of Arabic; the forum covers religion or politics. The system proposed by Abdul-Mageed et al. gives its best accuracy on the first dataset, achieving 84.65% in subjectivity classification.

Yin et al. [8] built a dataset for short text classification. The data was collected from two microblogs and contains five classes: Politics, Economy, Education, Entertainment, and S&T. Semi-supervised learning and SVM were used to improve classification accuracy, and the highest performance of this algorithm in terms of precision and recall was 80.49% and 81.77%, respectively.

Al-Tahrawi and Al-Khatib [4] used the Al-Jazeera News (Alj-News) dataset to evaluate the performance of a polynomial networks classifier. The Alj-News dataset was collected from the Al-Jazeera News Arabic website. It contains 1500 Arabic documents divided evenly into five classes: Politics, Science, Economic, Art, and Sport. The performance in terms of precision and recall was 90% and 89%, respectively.

Al Mukhaiti et al. [9] built a dataset for Arabic sentiment analysis. The data was collected from Facebook, YouTube, Twitter, Keek, and Instagram, and contains 2009 tweets/reviews. A system developed by Siddiqui et al. [10] was used to evaluate the dataset; the evaluation metrics were precision, recall, and accuracy, and the results were 75.9%, 79.8%, and 77.7%, respectively.

Alayba et al. [11] built a dataset for Arabic sentiment analysis on health services. The dataset contains 2026 tweets collected from Twitter using the Twitter API. Three machine learning algorithms were used: Naïve Bayes (NB), Logistic Regression (LR), and SVM, with the sizes of the training and test sets varied across three phases. The accuracy results were between 85% and 91%, and the best classifier was SVM using a linear support vector classifier.

3 Data Collection

Data collection is the most time-consuming and most important phase of text mining [9]. In this section, we give an overview of the dataset development process. This process is divided into three phases: the data acquisition phase, the data filtering phase, and the data labeling phase. Figure 1 depicts the dataset development phases.

Fig. 1. Dataset development phases

The following steps depict the dataset development phases:

  1. Collect data by crawling the Arabic Facebook pages.

  2. Filter the data collected in the previous step:

    a. Remove URLs.

    b. Remove non-Arabic text.

    c. Remove repeated posts.

  3. Manually label the filtered posts with one of the ten chosen categories.

We chose ten categories (classes) for the dataset that cover most social network topics: politics, economics, sport, religion, technology, TV, ads, foods, health, and porno. Figure 2 shows the process of building the proposed dataset. The algorithm used for building the dataset can be summarized as:

Fig. 2. Arabic dataset building process

3.1 Data Acquisition Phase

We collected about 40,000 Arabic Facebook posts. To collect the posts, we developed a custom web browser that gathers the data automatically. It can collect posts, comments, and replies by automatically scrolling down a Facebook page to show all posts since the page's creation date, then gathering all the posts and saving them in a text file.

3.2 Data Filtering Phase

This phase removed the following types of posts: URL-only posts, non-Arabic posts, and repeated posts. Some posts contain only URL(s); these URLs are written in English letters and are not useful in an Arabic dataset. Arabic Facebook pages sometimes share non-Arabic posts, in English or Franco Arabic (Arabic spelled with English letters and digits). The last type is repeated posts: different pages may publish the same post, or a page may re-post an old post, so each such post was added only once. If a post contains Arabic text along with URLs, digits, non-Arabic letters, or emoticons, the post is filtered so that only the Arabic text is saved and the remaining parts are removed. Table 1 depicts some examples.

Table 1. Filtering examples
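The filtering rules of this phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool used to build the dataset: the regular expressions, the function names `clean_post` and `filter_posts`, and the choice to detect repeated posts by comparing the cleaned text are all assumptions.

```python
import re

# Assumed patterns: a URL is anything starting with http(s):// or www.,
# and Arabic text is any run of characters in the basic Arabic letter and
# diacritic range of Unicode.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
ARABIC_RUN_RE = re.compile(r"[\u0621-\u0652]+")


def clean_post(post: str) -> str:
    """Keep only the Arabic words of a post; return '' if none remain."""
    without_urls = URL_RE.sub(" ", post)
    return " ".join(ARABIC_RUN_RE.findall(without_urls))


def filter_posts(posts):
    """Drop URL-only, non-Arabic, and repeated posts, preserving order."""
    seen = set()
    kept = []
    for post in posts:
        cleaned = clean_post(post)
        if not cleaned or cleaned in seen:
            continue  # nothing Arabic left, or the post was already added
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


if __name__ == "__main__":
    raw = [
        "http://example.com/page",                  # URL-only post
        "mar7aba ya shabab",                        # Franco Arabic post
        "\u0645\u0631\u062D\u0628\u0627 :) 123",    # Arabic text plus symbols and digits
        "\u0645\u0631\u062D\u0628\u0627",           # repeat of the previous post
    ]
    print(filter_posts(raw))
```

Comparing cleaned text rather than raw text means that two posts differing only in a URL or an emoticon count as the same post; whether that matches the original procedure is not stated in the paper.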

3.3 Data Labeling Phase

The filtered posts were used in the labeling phase, in which each post was labeled as politics, economics, sport, religion, technology, TV, ads, foods, health, or porno. No system is available for assessing posts for Arabic text classification, so we divided the labeling phase into two parts. In the first part, the author, as an Arabic speaker, categorized the collected data by reading each post and assigning it to one of the ten classes above. Table 2 shows examples of the classified posts.

Table 2. Labeling phase

To assess the labeling made during the first part, the dataset was given to ten native Arabic speakers, who additionally confirmed the validity of the created dataset. We built an online web page so that these users could assess the dataset remotely, at any time, from any location, and from any device, whether PC or smartphone. Figures 3 and 4 show screenshots of the web page on a PC and a smartphone, respectively.

Fig. 3. Webpage screenshot from a PC

Fig. 4. Webpage screenshot from a smartphone
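The paper does not state the rule by which reviewer feedback changed a post's label. One plausible rule, shown here purely as an assumption, is to replace the initial label only when a strict majority of the reviewers agree on a different category. The function name `revise_label` is hypothetical.

```python
from collections import Counter

# The ten categories chosen for the dataset.
CATEGORIES = {
    "politics", "economics", "sport", "religion", "technology",
    "TV", "ads", "foods", "health", "porno",
}


def revise_label(initial: str, reviewer_labels: list) -> str:
    """Return the label to keep for one post.

    Assumed rule (not taken from the paper): the initial label is replaced
    only if more than half of all reviewers chose the same different category.
    """
    votes = Counter(label for label in reviewer_labels if label in CATEGORIES)
    if not votes:
        return initial
    top_label, top_count = votes.most_common(1)[0]
    if top_label != initial and top_count > len(reviewer_labels) / 2:
        return top_label
    return initial


if __name__ == "__main__":
    reviews = ["health"] * 6 + ["foods"] * 4   # ten hypothetical reviewers
    print(revise_label("foods", reviews))
```

Under this rule a five-to-five split leaves the initial label unchanged, which keeps the first-stage labeling as the default when reviewers disagree.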

4 Results and Evaluation

Data classification is a two-step process: (1) the training (or learning) phase, in which the classifier is built from labeled instances, and (2) the test (or evaluation) phase, where the actual class of each instance is compared with the predicted class. If the hit rate is acceptable to the analyst, the classifier is accepted as capable of classifying future instances of unknown class [12].
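The two phases can be illustrated with a small multinomial Naive Bayes classifier written in plain Python. This is only a sketch of the train-then-test idea: it is not the RapidMiner operator used in our experiments, the whitespace tokenization and add-one smoothing are assumptions, and the toy English documents stand in for real posts.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Training phase: count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Return the class with the highest log-probability for one document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # class prior
        for token in doc.split():
            if token not in vocab:
                continue                               # ignore unseen words
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def hit_rate(model, docs, labels):
    """Test phase: fraction of documents whose predicted class is correct."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


if __name__ == "__main__":
    train_docs = ["goal match team", "match player goal",
                  "vote election party", "party minister vote"]
    train_labels = ["sport", "sport", "politics", "politics"]
    model = train(train_docs, train_labels)
    print(hit_rate(model, ["team goal", "election minister"],
                   ["sport", "politics"]))
```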

Our dataset building process contains three phases: (1) data acquisition, (2) data filtering, and (3) data labeling. Table 3 shows the total number of posts in each class after phase 3.

Table 3. Labeling dataset phase

To evaluate our dataset, RapidMiner Studio Professional 7.6 was used to analyze the data. RapidMiner is an open-source platform used for data mining [13]. It is code-free software for designing advanced analysis processes with machine learning, data and text mining, business analytics, and predictive analytics; through its graphical user interface, data mining processes can be easily designed and executed. It provides operators for tokenization, stemming, and stop-word filtering, as well as extensions for data loading, data transformation, data modeling, and data visualization. One of the most useful extensions in RapidMiner is the Text Processing package, which includes operators that support text mining. An important feature of RapidMiner is that it can process many languages, including Arabic.

We applied the following RapidMiner operators: Naive Bayes, k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Performance (for classification), and Apply Model (for testing). The evaluation metrics include weighted mean recall, weighted mean precision, the Kappa statistic, and accuracy. The weighted mean recall is the average of the recall calculated per class. The weighted mean precision is the average of the precision obtained per class. The Kappa statistic measures the agreement of the predictions with the true classes, corrected for the agreement expected by chance: a value of 1 means total agreement, while a value of 0 means the classifier agrees with the true classes no more than a random classifier would. The accuracy is defined as the ratio of the number of correctly classified posts to the total number of posts. Recall, precision, and accuracy are defined as:

$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{1} $$
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{2} $$
$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{3} $$

where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions of the classifier, respectively.
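For a multi-class problem such as ours, these quantities are computed per class and then averaged. The Python sketch below shows one way to obtain the four metrics from lists of true and predicted labels. It follows the definitions given in this section (plain averages over classes, Cohen's kappa for chance-corrected agreement) and is not RapidMiner's own implementation; the function name `evaluate` and the toy labels are hypothetical.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Compute mean recall, mean precision, accuracy, and Cohen's kappa."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))   # (actual, predicted) -> count
    true_totals = Counter(y_true)          # posts actually in each class
    pred_totals = Counter(y_pred)          # posts predicted as each class

    recalls, precisions = [], []
    for c in classes:
        tp = pairs[(c, c)]
        fn = true_totals[c] - tp           # class c posts predicted as another class
        fp = pred_totals[c] - tp           # other posts predicted as class c
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)

    # Accuracy: correctly classified posts over all posts.
    accuracy = sum(pairs[(c, c)] for c in classes) / n
    # Agreement expected by chance, from the class marginals.
    expected = sum(true_totals[c] * pred_totals[c] for c in classes) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 1.0

    return {
        "mean_recall": sum(recalls) / len(classes),
        "mean_precision": sum(precisions) / len(classes),
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    actual = ["sport", "sport", "health", "health"]
    predicted = ["sport", "health", "health", "health"]
    print(evaluate(actual, predicted))
```

In the toy example, three of four posts are classified correctly, so the accuracy is 0.75, while the kappa of 0.5 is lower because half of that agreement would be expected by chance given the class marginals.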

Table 4 shows the results of testing our dataset with the RapidMiner tool. The SVM model achieves an accuracy of 95.12%, the highest among the models used. Figure 5 depicts the output from RapidMiner. Comparing our results with those reported for the other datasets mentioned in Sect. 2 shows that the accuracy obtained on our dataset is high, as shown in Table 5.

Table 4. Evaluation results

Fig. 5. RapidMiner results

Table 5. Comparison of some existing works

5 Conclusions

In recent years, online social networking sites have spread widely, and data is published and shared in large quantities every moment. Social networking sites do not give users the ability to filter or categorize content on their walls. Therefore, in this paper we created a dataset of online Arabic text collected from Facebook for use in text classification. We chose Arabic because of the paucity of available Arabic datasets, while online Arabic content is increasing and the number of online Arab users is growing. The dataset building process is divided into three phases: data acquisition, data filtering, and data labeling. The dataset was collected from Arabic Facebook pages; in the data filtering phase, URLs, non-Arabic text, and repeated posts were removed from the dataset. Finally, in the labeling phase, each post was given a label from the ten chosen categories, and ten Arabic-speaking Facebook users were involved in the labeling process. To evaluate our dataset, the RapidMiner tool was used, and the accuracy achieved was 95.12% with the SVM model. Our dataset will help researchers in the field of short Arabic text processing.