1 Introduction

Online social networks (OSNs) have become an essential tool in everyday life. Platforms such as Facebook, Twitter and Instagram are widely used to share daily activities, pictures and videos, to name a few. With these platforms, people can communicate with one another without boundaries. According to the Malaysian Communications and Multimedia Commission, the rate of online social network usage in Malaysia had reached 87.4% by 2018 [1]. This finding suggests that Malaysian citizens increasingly communicate and interact with each other through OSN platforms.

An OSN account is commonly associated with a person's identity. According to [2], an identity is a distinct object connected to a human being. A name is a typical instance: every individual has a unique name to represent his or her identity. A passport is also typically used to represent an individual's identity, and normally contains the name, date of birth, address, telephone number, nationality, fingerprints and photograph of the person. Although each person may have several identities, he or she must have a unique one, in the sense that the identity belongs only to him or her.

Recently, issues regarding the use of false identities in OSNs have been increasing [3]. A typical use of such false identities is to impersonate someone in order to carry out criminal activities in cyberspace, such as gathering further information for a spear-phishing attack, pursuing personal interests, or spreading propaganda or campaigns. False identities are also used for distributing malware and for phishing, spamming and scamming [4]. The spread of false identities thus leads to the creation of fake accounts in OSNs, which runs against the actual purpose of social network platforms.

Instagram and Twitter have rich and fully functional Application Programming Interfaces (APIs) for acquiring relevant, real-time and up-to-date user data, while the Facebook APIs facilitate access to profile data such as user operations, friends, colleagues, and most fundamental user details (age, birthday, profile status, relationship status, likes, group details, etc.). Typically, a social network profile consists of two primary components of data: static and dynamic. Static data are set by the user and cover the user's demographics and interests, while dynamic data refer to user activity and position in the social network [5]. These types of data can be analysed using machine learning techniques such as K-nearest neighbour (KNN), support vector machine (SVM) and neural network (NN) [6].

As Facebook is one of the most popular OSNs [2] and is used by many people, particularly in Malaysia [1], this study aims to identify fake Facebook accounts by applying machine learning techniques to data representing Southeast Asian countries. This scope is chosen mainly because most Facebook users in these countries share similar demographic data such as time stamp, language and culture.

2 Literature Review

In order to use the service, most OSNs require a user to create a network profile containing their fundamental (sometimes private) data such as name, gender, place and e-mail address. The openness of these social networking sites allows adversaries to exploit the service by generating various types of fake profiles to perform illegal, adversarial, false or malicious actions such as spamming, promotion and marketing, stalking, intimidation and defamation. The specific reasons for setting up false profiles, however, usually depend on the type of social network being targeted. Adversaries generate forged identities on networks such as Facebook and Twitter to access users' private data, endorse a specific brand or individual, or defame a user. On professional sites such as LinkedIn and ResearchGate, they strive to monitor members' behaviour or gain the confidence of professionals. Attackers also often target dating websites to take advantage of individuals looking for perfect matches and partners, by playing with their feelings or stealing private data to obtain money from these users. One of the most hazardous types of fake profile on dating OSNs is the catfish, a person who uses online dating websites to lure individuals into a romance scam.

According to [7], fake profiles can be split into five classes, namely compromised profiles, cloned profiles, sybil accounts, sockpuppets and fake bot profiles, as depicted in Fig. 1. Cloned profiles are divided into inter-site cloning and intra-site cloning, while fake bot profiles are split into social bots, spam bots, like bots, influential bots and botnets. Across the various online social networking sites, these five categories can be regarded as the distinct ways in which adversaries accomplish their malicious goals.

Fig. 1. Types of false online social network profile [7]

Researchers should consider a number of attributes when analysing the characteristics that differentiate fake profiles from real ones. Preparing a precise and accurate feature set must be prioritised in order to produce an efficient fake profile detector. The features can either be noted manually from social network sites or drawn from the literature. It is possible, however, that some of the features in the literature are no longer effective, as adversaries continue to change their behaviour to fool and bypass detection systems. Over time, researchers have identified various characteristics of online profiles in order to train their fake profile detection models [8]. Based on their nature, this study classifies them into five groups as follows:

  a. Network-based attributes

     These attributes show how a fake account is connected to its contacts according to the degree of relation: first-degree contacts are friends, and second-degree contacts are friends of friends.

  b. Content-based attributes

     These attributes describe how posted content can be used to detect anomalous activities around a fake Facebook profile.

  c. Temporal features

     These features concern the timing of a Facebook profile's behaviour, such as time-based activities.

  d. Profile-based features

     These features concern profile-based activity, such as the number of other Facebook accounts followed and posting activity.

  e. Action-based features

     These features concern the daily activities that have been performed, including how many tags have been posted, location sharing and friend tags.

By studying all of these attributes and features, it is possible to detect many malicious activities that do not tally with the behaviour of real Facebook profiles. These findings can also inform the decision-making rules created when implementing the classification of fake Facebook accounts.

3 Methodology

3.1 Data Collecting Method

This study requires real-world Facebook datasets, which are not openly accessible. Some social graph datasets with profile-based feature data are available; however, such datasets are anonymised and cannot be used for this purpose. The data would therefore have to be obtained from the Facebook API, access to which is restricted to authorised users. These problems are often cited by authors working on Facebook, such as [6]. As Facebook constantly updates its security and privacy policies, it is hard to access the data without Facebook's permission [9]. Figure 2 shows the types of data collection techniques.

Fig. 2. Data collection techniques [7]

According to [7], API-based and bot-crawler approaches to data collection are time consuming and heavily constrained by user privacy and security settings. To address the fake account problem on Facebook, this study therefore used artificially generated data, in which synthetic samples are produced based on a network structure or the characteristics of existing datasets. Synthetic data can also be produced with various available tools based on the known statistics or parameters of an existing social network. For example, a dummy dataset can be generated for analysis purposes if the degree distribution, clustering coefficient, average betweenness centrality and other statistical parameters are known. Artificial or synthetic data can be generated using various online data generators such as GEDIS Studio, Databene Benerator and Mockaroo [10]. This study chose the Mockaroo online generator because the data it creates are more realistic and similar to real data [11, 12]. A total of 800 data samples that comply with the features of a fake account dataset were successfully generated with Mockaroo. Table 1 shows the details of the collected Facebook user data.
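The idea of generating labelled synthetic profiles can be illustrated with a minimal Python sketch. This is not the Mockaroo configuration used in the study: the feature names (`friends`, `posts_per_week`, `has_profile_photo`), the value ranges and the function `generate_profiles` are hypothetical and chosen only to show the principle of drawing samples from assumed per-class distributions.

```python
import random

def generate_profiles(n, fake_ratio=0.5, seed=0):
    """Generate n synthetic profile records (hypothetical features and ranges)."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    profiles = []
    for i in range(n):
        is_fake = rng.random() < fake_ratio
        if is_fake:
            # Assumed pattern: few friends, few posts, often no profile photo.
            friends = rng.randint(0, 50)
            posts_per_week = rng.randint(0, 3)
            has_photo = rng.random() < 0.3
        else:
            # Assumed pattern: more friends, more posts, usually a photo.
            friends = rng.randint(50, 1000)
            posts_per_week = rng.randint(1, 20)
            has_photo = rng.random() < 0.9
        profiles.append({
            "id": i,
            "friends": friends,
            "posts_per_week": posts_per_week,
            "has_profile_photo": has_photo,
            "label": "fake" if is_fake else "real",
        })
    return profiles

# 800 samples, matching the sample size reported in the study.
sample = generate_profiles(800)
```

A real generator would derive these distributions from known statistics of an existing network rather than from hand-picked ranges.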

Table 1. Facebook user data collection

3.2 Features Identification

Following the collection of the different data attributes, the next stage is to identify and define a set of features extracted from these data that helps as far as possible to distinguish real users from fake users. Initially, a set of 17 features was selected from the candidates described in [13]. After several revisions based on [7, 13,14,15], the study identified the most important features for detecting fake accounts, as shown in Table 2.

Table 2. Feature set table with description and intuitive justifications

3.3 Learning Classifiers

This study uses supervised machine learning classification algorithms as the final phase of the methodology for detecting fake accounts on Facebook. Supervised learners take annotated datasets as input and build predictive models, which are used for tasks that involve predicting one value from the other values in the dataset. In this study, the two classes are real users and fake users.

The assumption behind using learning classifiers (the strategy followed by many other authors cited in the related work) is that the long-term values of the features are likely to differ between real user accounts and fake accounts engaged in multiple anomalous operations.

K-Nearest Neighbor (KNN).

KNN is a technique for classifying objects based on the nearest training examples in the feature space. The k-nearest neighbour algorithm is one of the simplest of all machine learning algorithms. Training consists only of storing the feature vectors and labels of the training data. In the classification step, the unlabelled query point is simply assigned the label that is most common among its k nearest neighbours [16].
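The classification step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Orange implementation used in the study; the two features (friend count, posts per week) and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # "Training" is just keeping train_X and train_y; all work happens at query time.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical feature vectors: (friend count, posts per week).
train_X = [(10, 0), (20, 1), (15, 0), (400, 12), (650, 8), (300, 15)]
train_y = ["fake", "fake", "fake", "real", "real", "real"]

prediction = knn_predict(train_X, train_y, (25, 1), k=3)
```

Because Euclidean distance is sensitive to feature scale, features with large ranges (such as friend count here) would normally be normalised before applying KNN.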

Support Vector Machine (SVM).

SVM is based on the idea of decision planes that define decision boundaries. The objective of SVM is to find a hyperplane in the feature space (whose dimension equals the number of features) that clearly separates the data points. It is primarily a classifier that operates in a multidimensional space by constructing hyperplanes that separate instances with different class labels. SVM can handle multiple continuous and categorical variables, and supports regression as well as classification [17].
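As a rough illustration of how a separating hyperplane can be found, the following sketch trains a linear SVM by subgradient descent on the regularised hinge loss. It is a simplified stand-in, not the solver used by Orange; the function names, hyperparameters and toy data (features assumed to be scaled to [0, 1]) are all hypothetical.

```python
def train_linear_svm(X, y, lr=0.01, reg=0.01, epochs=1000):
    """Fit a hyperplane w.x + b = 0 by subgradient descent on the hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane.
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: only apply the regularisation shrinkage.
                w = [wj - lr * reg * wj for wj in w]
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data: -1 = fake, +1 = real (assumed encoding).
X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

Practical SVM implementations also support kernels for non-linear boundaries, which this linear sketch omits.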

Neural Network (NN).

In an NN, nodes are linked and share their outputs to find the most precise outcome, updating that outcome as they learn. An NN is also known as a connectionist network, in which nodes transmit their internal values to one another. It has an input layer, an output layer and a hidden layer: the input layer is where the data are fed in, the output layer produces the outcome, and the hidden layer is where the network learns from the dataset in order to generate the output [18].
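The flow of values from the input layer through the hidden layer to the output can be sketched as a single forward pass. This is only an illustration: the weights below are arbitrary rather than learned, the sigmoid activation is an assumption, and the training step (adjusting the weights from labelled data) is omitted.

```python
import math

def sigmoid(z):
    """Squash a real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass through a network with a single hidden layer."""
    # Hidden layer: each node combines all inputs, then applies the activation.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: one node combining the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Arbitrary illustrative weights for 2 inputs and 2 hidden nodes.
hidden_weights = [[0.5, -0.2], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1]
output_weights = [1.0, -1.0]
output_bias = 0.0

score = forward([0.2, 0.7], hidden_weights, hidden_biases, output_weights, output_bias)
```

The output can be read as a score between 0 and 1, which a classifier would threshold (for example at 0.5) to choose between two classes.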

4 Evaluation

This section assesses the techniques used, in order to determine how efficiently they detect fake accounts on Facebook. All of the above classifiers were applied to a mixed dataset consisting of accounts known a priori to be real, belonging to first- and second-degree users in the social neighbourhood, as well as fake accounts. The dataset also includes user accounts that are friends of friends in the social neighbourhood, which are assumed to be real if active and fake if inactive. The study assessed the ability of different machine learning classification models (KNN, SVM and NN) to make predictions for unknown user accounts using the Orange tool.

The first step is cleaning the dataset for the learning classifiers. In this step, the Orange tool's Linear Projection and Circular Placement are applied to the dataset for clustering. The combination of features listed in Table 2 drives the clustering process, which categorises the data into four clusters as follows (Fig. 3):

Fig. 3. Clustering process

  a. Fake account user (G1)

  b. Assumed fake account user (G2)

  c. Inactive user (G3)

  d. Real user (G4)

Second, the clusters that were created are used as input to classifier learning, with three techniques (KNN, SVM and NN) applied to obtain the best detection result. According to previous research, the best classifiers for detecting fake accounts on Facebook are KNN, SVM and NN [5, 19]. This makes it possible to compare the three learning classifiers and determine which delivers the best results. Table 3 presents the comparison between these classifiers:

Table 3. Result of classifier

In Table 3, classification accuracy (CA) is the fraction of predictions that the model gets right. The closer the value is to 1, the more likely the model's predictions are to be correct. On the testing dataset, the KNN model has the highest CA value (0.829) compared with the other models. Figures 4, 5 and 6 depict the results of all the prediction models (KNN, SVM and NN).
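For clarity, CA can be computed as the number of correct predictions divided by the total number of predictions. The sketch below shows this calculation on made-up labels using the cluster names G1 to G4; the values are illustrative and are not the study's results.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical example: 4 of 5 predictions are correct.
true_labels = ["G1", "G4", "G3", "G2", "G4"]
predicted = ["G1", "G4", "G2", "G2", "G4"]
ca = classification_accuracy(true_labels, predicted)
```

The corresponding error rate is simply `1 - ca`.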

Fig. 4. Prediction result for KNN

Fig. 5. Prediction result for SVM

Fig. 6. Prediction result for NN

Figure 4 shows the prediction of the KNN algorithm. For the G1 group, almost all fake accounts can be detected because of the strict number of neighbours that was set. For G2, G3 and G4, up to 70% detection can be achieved. Referring to Fig. 5, the SVM algorithm also achieves up to 70% detection for the respective groups. Figure 6 shows the result obtained using NN, which likewise achieves up to 70% detection. In summary, all classifiers provided about 70% detection precision and a 30% error rate.

5 Conclusion

Over the years, fake accounts have evolved constantly to avoid detection. It is therefore essential to create methods to detect them. Based on user profile operations and communication with other users on Facebook, this study presents a fundamental attempt to detect fake Facebook accounts among users from Southeast Asian countries. The study used an artificially generated dataset of Facebook features, as the fine-grained privacy settings on Facebook posed a major challenge to data collection. The most frequently used machine learning classification methods were then applied to identify the best-performing classifier. Future research is recommended to use a hybrid approach to detecting fake accounts. Other characteristic parameters that could help detect fake accounts, such as account ID, location data and the devices used to browse social media, should also be considered in future research.