Keywords

1 Introduction

In recent years, every person across the globe uses social networks as a platform to publicly share their knowledge and views, as an embodiment of their life. Fake profile identification is a salient problem within the field of social network analysis (SNA) and knowledge discovery (KDD). To maintain the network security and privacy in an interactive environment like social media, it is required to detect structural abnormalities that violate typical behavior of social networks. An outlier is a data point or a collection which deviates so much from the other observations as to arouse suspicions [1, 2]. Generally, normal data follow a salient pattern behavior differentiating it from anomaly. A general overview of graph-based anomaly detection methods is shown in Fig. 1. The anomaly in the graph can prevail as individual data outlier, in groups and disguises its nature on contextual analysis. Detection strategies calibers in using approaches based on statistics, classification, clustering, subspace, community detection, labeling/ranking, and so on.

Fig. 1
figure 1

General overview of anomaly detection in social network

Studies show that the vertices which deviate from the typical behavior of social networks offer necessary insights into a network. There is a vast amount of bogus id existing in social networks than legitimate users present in different communities. In order to identify the fake profiles which are a serious threat to the secure use of social media, it is highly mandated to detect the anomaly behavior. Graph data reveals inter-dependencies that should be accounted for during the outlier detection process. When compared to traditional data analysis, graph data processing is highly beneficial. They maintain significant interconnections between data in the long run and are flexible to change.

In this paper, anomalous vertices in graph topology are identified by two phases. The primary phase predicts the edge probability in the network by applying a link prediction classifier. Next phase produces the new features set as output. The contributions of the work:

  • Deploy sampling of vertices and edges using positive and negative sample strategy.

  • Generate a link prediction classifier to find anomalous vertices in a graph

  • We adopt a Meta classifier model in the training and testing phase.

2 Related Work

Graphical data analysis has grown over the past years and hence has increased the need for research in social networks. In a graph, data structure substructure reoccur [3]. Hence, it was proposed that the anomalies are substructures that occur infrequently. To understand the problems associated with fake profile creation, several detection methods have been discussed [4]. Most research contributions in machine learning and deep learning prevail but still fake accounts thrive in social networks [5]. In network sampling, there is a comparison of two alternatives for sampling networks that have a predefined number of edges [6].  A random selection of edges with uniform distribution and the edges of the 1.5 ego-networks  finds network hubs that are the nodes with high traffic. The hubs are used as they are showing new information due to their high degree. However, the use of hubs may result in major bottlenecks in the networks. Infiltration of fake users into social networks using random friend requests [7]. Feature extraction of the networks are done based on the degree of user, connections that exist midst the friends of the user, amount of communities the user is coupled to. However, this may result in high anonymity level and may fail in detecting the spammer profiles. Feature extraction uses principal component analysis [8]. Main components in the networks are evaluated by reducing the total number of variables and also reduces the dimensions of all observations based on the classification of alike observations. Communication structure among the variables is also found by PCA. However, another dilemma in the PCA method is based on the selection of core components. Relational dependency networks accomplish collective classification of attributes of the designated variables and are assessed based on a joint probability distribution model. The technique should be trained with more unstructured multimedia data, databases, graph-based analytics, and conception for improved results.

Detection of strange nodes in bipartite graphs involved calculation of normality scores [9] created on the neighborhood relevance score where nodes with a minor normality score had a higher probability of being irregular. In machine learning, using simple heuristics or more refined algorithms is created that is based on the strength of the connection between the users [10]. The first evaluation method involves splitting of datasets hooked on training sets and testing sets. In the second evaluation method, the goal was to measure the classifiers recommendations average users precision. The connection-strength heuristic does not propose a general method of identifying which of the users are to be avoided.

Relational dependency networks (RDN) [11] can achieve collective classification when several attributes midst the selected variables are assessed along on the basis of a joint probability model. Multiple Gibbs sampling iterations were used to fairly accurate the joint distribution. The RDN results are nearly as good as the upper bound data-rich condition. The RDN is capable as it becomes possible to predict leadership parts with some degree of accuracy for circumstances in which very little specific data is known about the individual actors. The suggested method should be qualified with more unstructured data like multimedia data, graph data which will give imagining for improved results. In profile similarity technique [12] similar user profiles are grouped on the basis of profile attributes. User profiles are validated remotely. Using this strategy will drastically improve the search speed of a profile and gradually reduce the memory consumption.

Randomwalk [5, 13] extracts random path sequence from graph. Randomwalk is often used as a part of algorithms that create node embeddings. Deepwalk [14] builds upon a random walk approach which aids to learn latent meaningful graph representations. After the online clustering, anomaly detection is done. If the vertex or edges are far from all current clustering centroid points, then it will be claimed as abnormal. Deepwalk [15] is scalable [4]. Results show that when compared to spectral methods, this approach shows significantly better results for large graphs. This method is designed to operate for sparse graphs too. Netwalk [16], predicts anomalous nodes on dynamic traversal of nodes. As the graph network evolves, the network feature representation is updated dynamically. SDNE [16] and dLSTM [17] are deep network embedding models that facilitate learning vertex representations. It studies the network behavior using autoencoder and locality-preserving constraints.

A generic link prediction classifier [18, 19] aims to detect anomalous vertices by estimating the probability of edge absence in the graph. Based on features collected from graph topology, for each vertex the average probability of missing edges is calculated. Nodes with the highest average probability are predicted as anomalous nodes. This method was not applied on bipartite and weighted graphs [20, 21]. Label propagation algorithms (LPA) [22] is the fastest algorithm that involves labeling nodes in graph structure. The core idea is to handle highly influential nodes through label propagation. On edge traversal, we find the nodes with the more link connectivity are inferred as leader nodes or influential nodes. LPA is applicable to static plain graphs and it suffers from inconsistency and unstable behavior. Random forest algorithm [2325] is a classification method to find link patterns in a given graph structure. It is handled on static plain graphs. This method performs probabilistic ranking to find irregular vertices with good accuracy prediction results. Node similarity communication matching algorithm [26] is proposed to identify the cloned profile in online social network (OSN). It monitors the behavior of profile and finds for similarity matching with recent activity with 93% detection accuracy. It outperforms well over K-nearest neighbor (KNN) and support vector machine (SVM) methods.

In this section, many researchers have contributed their machine learning strategies to solve the problem domain. Most of the strategies either use feature reduction or feature extraction to identify the characteristic behavior of the graph data. The analysis is focused to find patterns from node interconnections as tracing malicious behavior is problematic. Though KNN and random forest classification provides better prediction independently, the performance aspect on integrating the classifier to anomaly detection is targeted.

3 Proposed Methodology

The input of the proposed system is a graph dataset that represents the social networks in the form of nodes and edges. The work flow of the proposed methodology is shown in Fig. 2. The working model is categorized into three stages involving anomalous node identification processes from the original graph.

Fig. 2
figure 2

Work model of the proposed methodology

3.1 Phase 1: Building a Classifier Model for Edge Connectivity

For every different graph manifest, a link prediction classifier is built which predicts a link between two unconnected nodes. For this, a training dataset is prepared from the graph that consists of negative and positive and samples. Negative samples are the major part of the dataset as they consist of the unconnected node pairs. Both Algorithm 1 and 2 shows steps in generating the positive and negative samples from graph input data, respectively.

Algorithm 1

Negative samples generation

Input: Graph G with node pairs

Output: Negative sample pairs

# Get unconnected node-pairs

all_unconnected_pairs ← new list()

#Traverse adjacency matrix

Initialize offset ← 0

for each i ← tqdm(range(G[0]) do

    for each j ← range(offset, G[6]) do

        if ( (i \(\ne\) j) and (shortest_path_length(G,i, j) \(\le \user2{ }2\))) and (adjacency(i,j) = 0) then

          all_unconnected_pairs ← append( [node_list[i], node_list[j]])

         end if;

      offset = offset + 1

    end for;

end for;

Algorithm 2

Positive samples generation

Input: Graph G with node pairs

Output: Positive sample pairs

Initialize

         initial_node_count ← len(G nodes)

         #copy of nodes pairs in facebook graph dataframe

         fb_df_temp ← fb_ df

#empty list to store removable links

omissible_links_index ← new list()

for i ← tqdm(fb_df.index.values) do

   # remove a node pair and build a new graph

     G_temp ← fb_df_temp.drop(index = i)

        # check if there is no splitting of graph and number of nodes is same

       if (number_connected_components(G_temp) =  = 1) and (len(G_temp.nodes) =  = initial_node_count) then

        omissible_links_index ← append(i)

        fb_df_temp ← fb_df_temp.drop(index = i)

    end if;

end for;

Edges are randomly dropped from the graph in a condition that even in the process of dropping edges the nodes of the graph remain connected. These removable edges are appended to the data frame of unconnected node pairs. Feature extraction is performed on the graph after the dropping of the links. In the stage to obtain a trained dataset, jaccard coefficients/jaccard index is used to integrate the overlapping features. Let x,y be some graph vertex, the jaccard index is performed by comparing its neighbor nodes which is represented in Eq. 1.

$${\text{jaccard Index}}\left( {x,y} \right) = \frac{{\left| {{\text{neighborhood }}\left( x \right) \cap {\text{neighborhood}} \left( y \right)} \right|}}{{\left| {{\text{neighborhood }}\left( x \right) \cup {\text{neighborhood }}\left( y \right)} \right|}}$$
(1)

After feature extraction to validate the model, the dataset is split for the training and testing of the model’s performance. Random forest classifier is employed to generate the link prediction model based on the obtained features. Figure 3 represents the resultant graph after building the link prediction classifier.

Fig. 3
figure 3

a Original graph, b graph from link prediction classifier

3.2 Phase 2: Sampling Vertices from Graphs

In this phase, the samples are generated from a graph using a specific criterion. The threshold is set to be min_friends to satisfy the feature extraction process. The calculation of min_friends uses the knowledge of degree distribution to set the scale. Algorithm 3 shows the steps to perform the sampling of vertices and edges for a given graph G.

Algorithm 3

Graph vertex sampling

Input: Graph G with node pairs, min_ friends

Output: Selected edge list

     Initialize

        selected_edges ← new set()

        list_nodes = list(G.nodes())

     for each val \(\in\) list_nodes do

        if len(list(G.neighbors(val))) \(\ge\) min_friends then

           Temp_edges ← new set()

           for node in list(G.neighbors(val)) do

              if len(list(G.neighbors(node))) \(\ge \user2{ }\) min_friends then

                Temp_edges ← add(val)

                Temp_edges ← add(node)

         end if;

         if len(Temp_edges) \(\ge \user2{ }\) min_friends then

             selected_edges ← union(selected_edges, Temp_edges)

           end if;

        end for

   end if;

end for;

3.3 Phase 3: Detection of Outlier from Resultant Graph

In this phase, the link prediction classifier will generate interesting features that enable it to perform outlier detection. This method is centered on the hypothesis that a vertex with many low variance link connections has a higher likelihood of being anomalous. Test set is obtained by performing random edge samples from the graph. The algorithm steps involved in selecting node links that are completely traversed and chooses neighbors with more than minimum friend neighbors. We configure the min friend to be set to three. We omit the nodes with very low neighbors connected as it provides poor network characteristics. These sets of vertices are considered for further evaluation of anomaly detection and the outliers are detected which is represented in Fig. 4.

Fig. 4
figure 4

Result of graph vertex outlier

4 Experimental Evaluation

4.1 Data Preparation

The experiments are conducted on a system operating with a 64 bit windows platform with Intel quad core processors supporting GPU specification. For experimental evaluation, we use Python-3.6 utilizing machine learning libraries comprising scikit-learn version 0.19.1 and NumPy version 1.14.0 to perform mathematical operations.

We evaluated our algorithm on the Facebook dataset from network repository which has 620 nodes and 2102 edges. The fb_pages_food is an undirected graph with nodes representing the Facebook pages and edges having mutual likes among them. In addition to that we have also added 20 randomly generated edges or anomalies using Erdos_Renyi algorithm. To incorporate anomalous nodes, reindexing of the node pairs can be done in the initial dataset.

4.2 Generation of Anomalous Edges

To generate random anomalous edges in the existing graph, Erdos–Renyi Model is used. In this model, each edge distribution has a fixed probability of being available or missing, regardless of the number of connecting links in the network. The probability of edge distribution existing for our model has been set as 0.5 since it gives an equal distribution of the edges. Figure 5 gives the degree of distribution of the edges across the existing network.

Fig. 5
figure 5

Degree distribution of Facebook network

4.3 Result and Experimental Setup

Figure 6 shows the anomalous nodes that are obtained from the proposed algorithm. The node information helps to prevent malicious activity. Nodes are crawled such that outlier nodes are extracted on configuring the threshold value set to 0.8. A node exceeding the limit is outliers and monitoring the network activity of the node is encouraged.

Fig. 6
figure 6

Outlier nodes detected

To evaluate the performance metric of the detection model, a confusion matrix is employed. The confusion matrix is used to solve this binary classification as either genuine/ normal node or outlier node. The confusion matrix gives values such as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The mathematical formulation of the performance parameters are represented in Eqns. 25, where TP is nodes correctly classified as outlier, TN is node correctly classified as normal, FP is nodes misclassified as outlier from normal, and FN is nodes misclassified as normal from outlier.

$${\text{Accuracy }} = \left( {{\text{TP }} + {\text{TN}}} \right)/{\text{Total Nodes}}$$
(2)
$${\text{Precision}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FP}}} \right)$$
(3)
$${\text{Recall}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right)$$
(4)
$$f1 - {\text{score}} = \left( {2 \times {\text{Precision}} \times {\text{Recall}}} \right)/\left( {{\text{Precision }} + {\text{ Recall}}} \right)$$
(5)

Figure 7 tabulates the f1-score, support, precision, and recall for the trained model. The KNN uses the Minkowski metric for this step. We have also measured the error rate for different k values (k lies between 1 and 40).

Fig. 7
figure 7

Classification report

Figure 8 depicts that the highest error rate is 11% for k = 0 and the lowest error rate of 6% is for k = 6. The error rate remains constant (i.e.) 6% for the value of k greater than 6. The algorithm has been trained using KNN model, and its accuracy is found to be 93% with the maximum error rate of 0.14. Table 1 summarizes the comparison of the proposed model outperforming other methods [23, 26].

Fig. 8
figure 8

Error rate calculation

Table 1 Comparison table

5 Conclusion

To understand the behavior patterns of large networks, detection of anomalies are very important. We propose a novel strategy that discovers anomalous nodes on the basis of characteristic traits of the underlying network structure. The method adopts current techniques in machine learning to generate useful communities. We experimented this method in the Facebook social network dataset, but there is considerable scope to be tested with a diverse application. The proposed method is trained using KNN model and its accuracy is found to be 93% with the maximum error rate of 0.14. As for future work, the algorithm can be prolonged to be applied on undirected and dynamic real-world graph datasets. We plan to extend the work to incorporate a neural network learning model so as to increase the performance of the proposed model.