Abstract
Nowadays, social networks have become part of our everyday routine to connect and communicate to diverse communities. The detection of abnormal vertices in these networks has become more vital in exposing network intruders or malicious users. In this paper, we have included a novel meta-classifier which will notice anomalous vertices in complicated interactive network topology by extracting useful patterns. We identify that a vertex with several link connections will incorporate a higher probability of being abnormal. Henceforth, we choose to apply the link prediction model on the Facebook dataset with high interconnectivity. In each step, we determine abnormal vertices are detected with minimal false positive estimates and better accuracy when compared to alternative prevailing methods. Hence, the method incorporated reveals the outliers in the social networks. We proposed an approach that can identify the false profiles with 93% accuracy and lower false positive rates with sustainable accuracy.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
In recent years, every person across the globe uses social networks as a platform to publicly share their knowledge and views, as an embodiment of their life. Fake profile identification is a salient problem within the field of social network analysis (SNA) and knowledge discovery (KDD). To maintain the network security and privacy in an interactive environment like social media, it is required to detect structural abnormalities that violate typical behavior of social networks. An outlier is a data point or a collection which deviates so much from the other observations as to arouse suspicions [1, 2]. Generally, normal data follow a salient pattern behavior differentiating it from anomaly. A general overview of graph-based anomaly detection methods is shown in Fig. 1. The anomaly in the graph can prevail as individual data outlier, in groups and disguises its nature on contextual analysis. Detection strategies calibers in using approaches based on statistics, classification, clustering, subspace, community detection, labeling/ranking, and so on.
Studies show that the vertices which deviate from the typical behavior of social networks offer necessary insights into a network. There is a vast amount of bogus id existing in social networks than legitimate users present in different communities. In order to identify the fake profiles which are a serious threat to the secure use of social media, it is highly mandated to detect the anomaly behavior. Graph data reveals inter-dependencies that should be accounted for during the outlier detection process. When compared to traditional data analysis, graph data processing is highly beneficial. They maintain significant interconnections between data in the long run and are flexible to change.
In this paper, anomalous vertices in graph topology are identified by two phases. The primary phase predicts the edge probability in the network by applying a link prediction classifier. Next phase produces the new features set as output. The contributions of the work:
-
Deploy sampling of vertices and edges using positive and negative sample strategy.
-
Generate a link prediction classifier to find anomalous vertices in a graph
-
We adopt a Meta classifier model in the training and testing phase.
2 Related Work
Graphical data analysis has grown over the past years and hence has increased the need for research in social networks. In a graph, data structure substructure reoccur [3]. Hence, it was proposed that the anomalies are substructures that occur infrequently. To understand the problems associated with fake profile creation, several detection methods have been discussed [4]. Most research contributions in machine learning and deep learning prevail but still fake accounts thrive in social networks [5]. In network sampling, there is a comparison of two alternatives for sampling networks that have a predefined number of edges [6]. A random selection of edges with uniform distribution and the edges of the 1.5 ego-networks finds network hubs that are the nodes with high traffic. The hubs are used as they are showing new information due to their high degree. However, the use of hubs may result in major bottlenecks in the networks. Infiltration of fake users into social networks using random friend requests [7]. Feature extraction of the networks are done based on the degree of user, connections that exist midst the friends of the user, amount of communities the user is coupled to. However, this may result in high anonymity level and may fail in detecting the spammer profiles. Feature extraction uses principal component analysis [8]. Main components in the networks are evaluated by reducing the total number of variables and also reduces the dimensions of all observations based on the classification of alike observations. Communication structure among the variables is also found by PCA. However, another dilemma in the PCA method is based on the selection of core components. Relational dependency networks accomplish collective classification of attributes of the designated variables and are assessed based on a joint probability distribution model. The technique should be trained with more unstructured multimedia data, databases, graph-based analytics, and conception for improved results.
Detection of strange nodes in bipartite graphs involved calculation of normality scores [9] created on the neighborhood relevance score where nodes with a minor normality score had a higher probability of being irregular. In machine learning, using simple heuristics or more refined algorithms is created that is based on the strength of the connection between the users [10]. The first evaluation method involves splitting of datasets hooked on training sets and testing sets. In the second evaluation method, the goal was to measure the classifiers recommendations average users precision. The connection-strength heuristic does not propose a general method of identifying which of the users are to be avoided.
Relational dependency networks (RDN) [11] can achieve collective classification when several attributes midst the selected variables are assessed along on the basis of a joint probability model. Multiple Gibbs sampling iterations were used to fairly accurate the joint distribution. The RDN results are nearly as good as the upper bound data-rich condition. The RDN is capable as it becomes possible to predict leadership parts with some degree of accuracy for circumstances in which very little specific data is known about the individual actors. The suggested method should be qualified with more unstructured data like multimedia data, graph data which will give imagining for improved results. In profile similarity technique [12] similar user profiles are grouped on the basis of profile attributes. User profiles are validated remotely. Using this strategy will drastically improve the search speed of a profile and gradually reduce the memory consumption.
Randomwalk [5, 13] extracts random path sequence from graph. Randomwalk is often used as a part of algorithms that create node embeddings. Deepwalk [14] builds upon a random walk approach which aids to learn latent meaningful graph representations. After the online clustering, anomaly detection is done. If the vertex or edges are far from all current clustering centroid points, then it will be claimed as abnormal. Deepwalk [15] is scalable [4]. Results show that when compared to spectral methods, this approach shows significantly better results for large graphs. This method is designed to operate for sparse graphs too. Netwalk [16], predicts anomalous nodes on dynamic traversal of nodes. As the graph network evolves, the network feature representation is updated dynamically. SDNE [16] and dLSTM [17] are deep network embedding models that facilitate learning vertex representations. It studies the network behavior using autoencoder and locality-preserving constraints.
A generic link prediction classifier [18, 19] aims to detect anomalous vertices by estimating the probability of edge absence in the graph. Based on features collected from graph topology, for each vertex the average probability of missing edges is calculated. Nodes with the highest average probability are predicted as anomalous nodes. This method was not applied on bipartite and weighted graphs [20, 21]. Label propagation algorithms (LPA) [22] is the fastest algorithm that involves labeling nodes in graph structure. The core idea is to handle highly influential nodes through label propagation. On edge traversal, we find the nodes with the more link connectivity are inferred as leader nodes or influential nodes. LPA is applicable to static plain graphs and it suffers from inconsistency and unstable behavior. Random forest algorithm [23– 25] is a classification method to find link patterns in a given graph structure. It is handled on static plain graphs. This method performs probabilistic ranking to find irregular vertices with good accuracy prediction results. Node similarity communication matching algorithm [26] is proposed to identify the cloned profile in online social network (OSN). It monitors the behavior of profile and finds for similarity matching with recent activity with 93% detection accuracy. It outperforms well over K-nearest neighbor (KNN) and support vector machine (SVM) methods.
In this section, many researchers have contributed their machine learning strategies to solve the problem domain. Most of the strategies either use feature reduction or feature extraction to identify the characteristic behavior of the graph data. The analysis is focused to find patterns from node interconnections as tracing malicious behavior is problematic. Though KNN and random forest classification provides better prediction independently, the performance aspect on integrating the classifier to anomaly detection is targeted.
3 Proposed Methodology
The input of the proposed system is a graph dataset that represents the social networks in the form of nodes and edges. The work flow of the proposed methodology is shown in Fig. 2. The working model is categorized into three stages involving anomalous node identification processes from the original graph.
3.1 Phase 1: Building a Classifier Model for Edge Connectivity
For every different graph manifest, a link prediction classifier is built which predicts a link between two unconnected nodes. For this, a training dataset is prepared from the graph that consists of negative and positive and samples. Negative samples are the major part of the dataset as they consist of the unconnected node pairs. Both Algorithm 1 and 2 shows steps in generating the positive and negative samples from graph input data, respectively.
Algorithm 1
Negative samples generation | |
---|---|
Input: Graph G with node pairs Output: Negative sample pairs # Get unconnected node-pairs all_unconnected_pairs ← new list() #Traverse adjacency matrix Initialize offset ← 0 for each i ← tqdm(range(G[0]) do for each j ← range(offset, G[6]) do if ( (i \(\ne\) j) and (shortest_path_length(G,i, j) \(\le \user2{ }2\))) and (adjacency(i,j) = 0) then all_unconnected_pairs ← append( [node_list[i], node_list[j]]) end if; offset = offset + 1 end for; end for; |
Algorithm 2
Positive samples generation | |
---|---|
Input: Graph G with node pairs Output: Positive sample pairs Initialize initial_node_count ← len(G nodes) #copy of nodes pairs in facebook graph dataframe fb_df_temp ← fb_ df #empty list to store removable links omissible_links_index ← new list() for i ← tqdm(fb_df.index.values) do # remove a node pair and build a new graph G_temp ← fb_df_temp.drop(index = i) # check if there is no splitting of graph and number of nodes is same if (number_connected_components(G_temp) = = 1) and (len(G_temp.nodes) = = initial_node_count) then omissible_links_index ← append(i) fb_df_temp ← fb_df_temp.drop(index = i) end if; end for; |
Edges are randomly dropped from the graph in a condition that even in the process of dropping edges the nodes of the graph remain connected. These removable edges are appended to the data frame of unconnected node pairs. Feature extraction is performed on the graph after the dropping of the links. In the stage to obtain a trained dataset, jaccard coefficients/jaccard index is used to integrate the overlapping features. Let x,y be some graph vertex, the jaccard index is performed by comparing its neighbor nodes which is represented in Eq. 1.
After feature extraction to validate the model, the dataset is split for the training and testing of the model’s performance. Random forest classifier is employed to generate the link prediction model based on the obtained features. Figure 3 represents the resultant graph after building the link prediction classifier.
3.2 Phase 2: Sampling Vertices from Graphs
In this phase, the samples are generated from a graph using a specific criterion. The threshold is set to be min_friends to satisfy the feature extraction process. The calculation of min_friends uses the knowledge of degree distribution to set the scale. Algorithm 3 shows the steps to perform the sampling of vertices and edges for a given graph G.
Algorithm 3
Graph vertex sampling | |
---|---|
Input: Graph G with node pairs, min_ friends Output: Selected edge list Initialize selected_edges ← new set() list_nodes = list(G.nodes()) for each val \(\in\) list_nodes do if len(list(G.neighbors(val))) \(\ge\) min_friends then Temp_edges ← new set() for node in list(G.neighbors(val)) do if len(list(G.neighbors(node))) \(\ge \user2{ }\) min_friends then Temp_edges ← add(val) Temp_edges ← add(node) end if; if len(Temp_edges) \(\ge \user2{ }\) min_friends then selected_edges ← union(selected_edges, Temp_edges) end if; end for end if; end for; |
3.3 Phase 3: Detection of Outlier from Resultant Graph
In this phase, the link prediction classifier will generate interesting features that enable it to perform outlier detection. This method is centered on the hypothesis that a vertex with many low variance link connections has a higher likelihood of being anomalous. Test set is obtained by performing random edge samples from the graph. The algorithm steps involved in selecting node links that are completely traversed and chooses neighbors with more than minimum friend neighbors. We configure the min friend to be set to three. We omit the nodes with very low neighbors connected as it provides poor network characteristics. These sets of vertices are considered for further evaluation of anomaly detection and the outliers are detected which is represented in Fig. 4.
4 Experimental Evaluation
4.1 Data Preparation
The experiments are conducted on a system operating with a 64 bit windows platform with Intel quad core processors supporting GPU specification. For experimental evaluation, we use Python-3.6 utilizing machine learning libraries comprising scikit-learn version 0.19.1 and NumPy version 1.14.0 to perform mathematical operations.
We evaluated our algorithm on the Facebook dataset from network repository which has 620 nodes and 2102 edges. The fb_pages_food is an undirected graph with nodes representing the Facebook pages and edges having mutual likes among them. In addition to that we have also added 20 randomly generated edges or anomalies using Erdos_Renyi algorithm. To incorporate anomalous nodes, reindexing of the node pairs can be done in the initial dataset.
4.2 Generation of Anomalous Edges
To generate random anomalous edges in the existing graph, Erdos–Renyi Model is used. In this model, each edge distribution has a fixed probability of being available or missing, regardless of the number of connecting links in the network. The probability of edge distribution existing for our model has been set as 0.5 since it gives an equal distribution of the edges. Figure 5 gives the degree of distribution of the edges across the existing network.
4.3 Result and Experimental Setup
Figure 6 shows the anomalous nodes that are obtained from the proposed algorithm. The node information helps to prevent malicious activity. Nodes are crawled such that outlier nodes are extracted on configuring the threshold value set to 0.8. A node exceeding the limit is outliers and monitoring the network activity of the node is encouraged.
To evaluate the performance metric of the detection model, a confusion matrix is employed. The confusion matrix is used to solve this binary classification as either genuine/ normal node or outlier node. The confusion matrix gives values such as True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The mathematical formulation of the performance parameters are represented in Eqns. 2–5, where TP is nodes correctly classified as outlier, TN is node correctly classified as normal, FP is nodes misclassified as outlier from normal, and FN is nodes misclassified as normal from outlier.
Figure 7 tabulates the f1-score, support, precision, and recall for the trained model. The KNN uses the Minkowski metric for this step. We have also measured the error rate for different k values (k lies between 1 and 40).
Figure 8 depicts that the highest error rate is 11% for k = 0 and the lowest error rate of 6% is for k = 6. The error rate remains constant (i.e.) 6% for the value of k greater than 6. The algorithm has been trained using KNN model, and its accuracy is found to be 93% with the maximum error rate of 0.14. Table 1 summarizes the comparison of the proposed model outperforming other methods [23, 26].
5 Conclusion
To understand the behavior patterns of large networks, detection of anomalies are very important. We propose a novel strategy that discovers anomalous nodes on the basis of characteristic traits of the underlying network structure. The method adopts current techniques in machine learning to generate useful communities. We experimented this method in the Facebook social network dataset, but there is considerable scope to be tested with a diverse application. The proposed method is trained using KNN model and its accuracy is found to be 93% with the maximum error rate of 0.14. As for future work, the algorithm can be prolonged to be applied on undirected and dynamic real-world graph datasets. We plan to extend the work to incorporate a neural network learning model so as to increase the performance of the proposed model.
References
Hodge VJ, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Saranya S, Rajalakshmi M (2022) Certain St rategic study on machine learning-based graph anomaly detection. In: Shakya S, Bestak R, Palanisamy R, Kamel KA (eds) Mobile computing and sustainable informatics. Lecture notes on data engineering and communications technologies, vol 68. Springer, Singapore. https://doi.org/10.1007/978-981-16-1866-6_5
Noble CC, Cook DJ (2003) Graph-based anomaly detection”, ACM SIGKDD ’03, August 24–27, 2003
Roy PK, Chahar S (Dec.2020) Fake profile detection on social networking websites: a comprehensive review. IEEE Trans Artif Intell 1(3):271–285. https://doi.org/10.1109/TAI.2021.3064901
Moonesinghe HDK, Tan PN (2008) OutRank: a graph-based outlier detection framework using random walk. Int J Artif Intell Tools 17(1)
Altshuler Y et al (2013) Detecting Anomalous behaviors using structural properties of social networks. In: Greenberg AM, Kennedy WG, Bos ND (eds) Social computing, behavioral-cultural modeling and prediction, SBP 2013. Lecture Notes in Computer Science, vol 7812. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37210-0_47
Fire M, Katz G, Elovici Y (2012) Strangers Intrusion detection—detecting spammers and fake profiles in social networks based on topology anomalies. In: ASE Human J
Mohammadrezaei M, Shiri ME, Rahmani AM (2018) Identifying fake accounts on social networks based on graph analysis and classification algorithms. Secur Commun Netw 1–8. https://doi.org/10.1155/2018/5923156
Sun J, Qu H, Chakrabarti D, Faloutsos C (200) Neighborhood formation and anomaly detection in bipartite graphs. In: Fifth IEEE international conference on data mining (ICDM'05), p 8. https://doi.org/10.1109/ICDM.2005.103
Fire M, Kagan D, Elyashar A et al (2014) Friend or foe? Fake profile identification in online social networks. Soc Netw Anal Min 4:194. https://doi.org/10.1007/s13278-014-0194-4
Campbell WM, Dagli CK, Weinstein CJ (2013) Social network analysis with content and graphs
Nandhini DM, Das BB (2016) Profile similarity technique for detection of duplicate profiles in online social network
Vempala S (2005) Geometric random walks: a survey. Comb Comput Geom MSRI Publ 52:573–612
Perozzi B, Al-Rfou R, Skiena S (2019) DeepWalk: online learning of social representations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining. https://doi.org/10.1145/2623330.2623732
Tan Q, Liu N, Hu X (2014) Deep representation learning for social network analysis. Front Big Data 2:2
Yu W, Cheng W, Aggarwal CC, Zhang K, Chen H, Wang W (2018) NetWalk: a flexible deep embedding approach for anomaly detection in dynamic networks. pp 2672–2681. https://doi.org/10.1145/3219819.3220024.
Maya S, Ueno K, Nishikawa T (2019) dLSTM: a new approach for anomaly detection using deep learning with delayed prediction. Int J Data Sci Anal
Kagan D, Elovichi Y, Fire M (2018) Generic anomalous vertices detection utilizing a link prediction algorithm. Soc Netw Anal Mining
Fadaee SA, Haeri MA (2019) Classification using link prediction. Neurocomputing 359:395–407. https://doi.org/10.1016/j.neucom.2019.06.026
Fire M, Tenenboim L, Lesser O, Puzis R, Rokach L, Elovici Y (2011) Link prediction in social networks using computationally efficient topological features. In: IEEE Third Int’l conference on privacy, security, risk and trust and IEEE third Int’l conference on social computing. https://doi.org/10.1109/passat/socialcom.2011.20
Lichtenwalter RN, Lussier JT, Chawla NV (2010) New perspectives and methods in link prediction. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining—KDD ’10,2010. https://doi.org/10.1145/1835804.1835837
Bhatia V, Saneja B, Rani (2017) INGC: graph clustering & outlier detection algorithm using label propagation. In: International conference on machine learning and data science 2017
Homsi A, Al Nemri J, Naimat N, Kareem HA, Al-Fayoumi M, Snober MA (2021) Detecting twitter fake accounts using machine learning and data reduction techniques. DATA
Primartha R, Tama BA (2017) Anomaly detection using random forest: a performance revisited. https://doi.org/10.1109/ICODSE.2017.8285847
Rahman O, Quraishi MA (2019) Experimental analysis of random forest, K-nearest neighbor and support vector machine anomaly detection. https://doi.org/10.13140/RG.2.2.19998.18245
Revathi S, Suriakala M (2018) Profile similarity communication matching approaches for detection of duplicate profiles in online social network. In: 2018 3rd International conference on computational systems and information technology for sustainable solutions, pp 174–182
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Saranya, S., Rajalakshmi, M., Devi, S., Suruthi, R.M. (2023). A Meta-Classifier Link Prediction Model for False Profile Identification in Facebook. In: Smys, S., Kamel, K.A., Palanisamy, R. (eds) Inventive Computation and Information Technologies. Lecture Notes in Networks and Systems, vol 563. Springer, Singapore. https://doi.org/10.1007/978-981-19-7402-1_2
Download citation
DOI: https://doi.org/10.1007/978-981-19-7402-1_2
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-7401-4
Online ISBN: 978-981-19-7402-1
eBook Packages: EngineeringEngineering (R0)