Abstract
Link prediction refers to the process of mining and determining whether a link between two nodes in a given network may emerge in the future or it is already present but hidden in the network. Link prediction may be categorized under the class of recommendation systems, e.g., finding or predicting link/recommendation between users and items. Thus, efficient link prediction in social networks is the focus of the study described in this paper. Finding hidden links and extracting missing information in a network will aid in identifying a set of new interactions. We developed a technique for link prediction by exposing the benefits of social network analysis tools and algorithms. We used popular network models commonly used by the research community for testing our algorithm accuracy against well-known algorithms leading to similarity measures. We also decided on using a graph database to model the network for providing better scalability and efficiency compared to storing graph information in a relational database. The experimental results reported in this paper demonstrate how the proposed algorithm outperforms traditional link prediction algorithms described in the literature.
Access provided by CONRICYT-eBooks. Download chapter PDF
Similar content being viewed by others
Keywords
1 Introduction
In the recent surge and evolution of the world wide web, many opportunities arose for analyzing user-generated data, where the term big data became a buzz word and is now used almost everywhere such that high volume of data is available at all times in the Internet. Content of the available web-based data is mostly generated from on-line social networks and e-commerce web applications, among others. Domain specific data encapsulates either homogeneous or heterogeneous actors and the links connecting them leading to an n-mode network, where n is number of heterogeneous groups of actors. For instance, data generated from social networks relates mostly to interactions between users/visitors of the networks, where people are modeled as nodes and a friendship relationship is reflected as links connecting people. On the other hand, data generated from e-commerce websites models items (clothes, food, electronics, etc.) and people as nodes to reflect items viewed and bought by people. Accomplished purchase may suggest linking people to items. This behavior of interaction, whether between people or people and items, may be modeled as a social network. A social network can be viewed as a graph where a vertex represents a person or an item, and an edge corresponds to the underlying relationship between vertexes, e.g., friendship, collaboration, among others.
One of the attractive areas for network analysis is collaborations in research where researchers mostly coauthor papers reporting their findings. Collaboration between authors may last short or long leading to a number of coauthored papers over a period of time. Thus, collaboration may be modeled as a social network, and hence can be represented as a graph G(V, E) where V is set of nodes or vertexes representing authors and E is set of edges or links that exist only between researchers who have coauthored at least one paper. Building such a network or the similar, e.g., whether representing scholars or friends on Facebook or other networking sites, provides the possibility of analyzing and may be predicting or uncovering hidden links in the graph. The latter predictions may highlight a possible fruitful collaboration between potential researchers and hence would lead to a recommendation system (RS) which may bring to the attention of target researchers the importance of initiating a collaboration.
A recommendation system is an important mechanism which assists people in exploring items of their interest by guiding them into the specific set of available items in a system’s directory. This kind of systems do their recommendations and predictions based on user’s preferences and behavior. A separate profile is built for each user and items previously searched for or preferred by a user would help a recommender system in deciding what similar products to recommend. Recommendation systems are used in different domains and are very common in websites such as Google, Amazon, and other e-commerce websites in order to recommend to their users some suggested searches or guide them in buying new items. For example, Amazon recommendation system works by getting a list of items “user A” searched for and viewed, then uses this historical information, and checks what other users examined and purchased while also looking at the same set of items. After this step, the recommendation system will use the set of similar items for recommendations to the selected user. Recommendation systems are also used nowadays in social networking platforms such as Facebook and others to help in suggesting friends. This is done by predicting hidden links between actors and using some common features between users of social media. Such information may lead to new friendships between individuals in a social networking platforms such as Facebook and Twitter. In a similar settings, scientists are in need of different collaboration partners, i.e., experts in a specific topic similar to their research field. Indeed, research interests, co-citation, and bibliographic coupling have constituted some key metric and measure in searching for potential collaboration within a network.
Link prediction is also extensively used and important in the security domain. Since criminal activities occur in groups, finding a criminal may lead to identifying his/her whole criminal partners. Such that we can build a criminal network where nodes represent criminals and relationships represent an involvement of two criminals in an act. Performing link prediction in this kind of network will help governments, intelligence agencies, and other security companies to identify criminals and unveil possible actors involved.
Motivated by the above description, the work described in this paper tackles the issue of relating nodes in a general social network and then making appropriate suggestions by finding hidden links in the analyzed network. Completing this work will help in:
-
Uncovering hidden relationships between nodes
-
Categorizing and filtering the network
-
Predicting links between nodes.
The method described in this paper has been tested on some benchmark networks. The reported results demonstrate the applicability and effectiveness of the proposed approach.
The rest of the paper is organized as follows: Section 2 reviews the most popular previous works. The proposed method is described in Sect. 3. Section 5 presents the conducted experiments, the evaluation process, and the results. Section 6 is conclusions and future research directions.
2 Related Work
A considerable amount of research work cover recommendation systems and link prediction, and how they may be applied in different fields. For example, in [1] the authors worked on rating prediction and recommendation of items for users. They carry out the ratings prediction by treating individual user-item ratings as predictors of missing ratings. The final rating is estimated by fusing predictions from the following sources: predictions based on ratings of same item by other users, predictions based on different item ratings made by same user, and ratings predicted based on data from similar users’ ratings of similar items. Also in [2], the authors built an algorithm FolkRank ranking scheme which generates personalized rankings of items in a folksonomy and recommends users, tags, and resources. The basic idea is to extract communities of interest from folksonomy, which are represented by their top tags and most influential persons and resources. Once these communities are identified, interested users can find them and participate in. This way community members can more easily get to know each other by using link prediction. Furthermore, [3] proposed two new improved approaches for link prediction: (1) CNGF algorithm based on local information network and (2) KatzGF algorithm based on global information network.
There are also several efforts that investigate expert recommendation for business institutions, e.g., [2, 4]. Petry et al. [5] developed an expert recommendation system called ICARE, which recommends in an organization experts to work with. The focus of the work does not include relations between authors from their publications and citations, it rather considers organizational level of people, their availability, and their reputation. Reichling and Wulf [6] investigated effectiveness of a recommender system for a European industrial association in supporting their knowledge management, foregone a field study, and interviews with the employees. Experts were defined according to their collection of written documents which were automatically analyzed. Additionally, a post-integrated user profile with information about their background and job is used. Using bookmarking services of individual users in building user profiles provides further information about users’ interests and confirms their recommendations.
Research on link prediction can also be found in [7], where authors proposed a supervised machine learning framework for co-offence prediction. The authors build a network of criminals and offenders first, then they started to find hidden links between known criminals and potential ones by relating offenders to socially related, geographically related, or experience related criminal cooperation opportunities. Additionally, Benchettara et al. [8] proposed a new link prediction algorithm to predict links in large-scale two-mode social networks. Based on topological attributes introduced in the paper, the score (or likelihood) of a link between two nodes can be measured, and they defined link prediction as a two class discrimination problem. Thus, a supervised machine learning approach is applied using these attributes to learn a prediction model. Finally, they validated their results on real datasets which are DBLP bibliographical database and bipartite transaction graph of an on-line music e-commerce site. Hasan et al. [9] developed another successful work using supervised learning for prediction; BIOBASE and DBLP networks are used to validate the model.
Another domain of link prediction in the research domain is to recommend possible future partnership to authors who never worked together before. Using this link prediction, the system will suggest people from other domains to work on similar projects, and this may lead to a fruitful partnership to the benefit of the community. However, the focus of the research is mostly focused on homogeneous networks of authors. Brandão et al. [10] modeled a social network of authors for recommending collaborations in academic networks. They presented two new metrics for their social network, namely institutional affiliation aspect and the geographic localization information. They analyzed how these metrics influence the resulting recommendations. Chen et al. [11] proposed a new way for scholar recommendations based on community detection. They used SCHOLST dataset in order to build a network of authors who are clustered into communities based on their research fields. Then they calculated friendship scores for each community in order to do coauthor recommendation based on communities. Davis et al. [12] introduced two approaches for link prediction in heterogeneous networks. In the first algorithm called unsupervised multi-relational link predictor (MRLP), they extended the well-known Adamic/Adar approach. Secondly, they used their previous research based on homogeneous networks in this study by extending for heterogeneous networks. A supervised framework for high performance link prediction (HPLP) shows that a supervised approach is superior to others, including MRLP. Tang et al. [13] proposed a methodology based on modularity analysis of heterogeneous YouTube dataset. Finally, Radivojac et al. [14] analyzed disease-gene networks.
Heck [15] focused on author link prediction, where authors are also modeled as nodes in a social network. What makes this work interesting is the selection of links between authors where bookmarking services are included in edge identification along with author co-citation and bibliographic coupling measurements. They argued how it is important to consider bookmarking along with the other metrics for better link prediction. Sun et al. [16] developed a methodology to predict coauthor relationship among authors in heterogeneous bibliographic network. They tested their algorithm on DBLP bibliographic network and according to their results prediction can be improved using logistic regression-based coauthorship prediction model based on meta path-based topological features. These are the combination of different meta paths and measures.
Discovering new hidden links in a social network is not a trivial task. In [17], when recommending new friendships in a traditional social network, the number of friends in common can be used to estimate the social proximity between users’ ground model to smooth the rating predictions. Lichtnwalter and Chawla [18] showed how to evaluate developed methodologies in order to select the best technique. For more detailed information on this topic, the reader may refer to the review reported in [19, 20] (Table 5.1).
3 The Methodology
3.1 The Algorithm
In our algorithm, we used centrality measures and path information to predict new links between nodes. Eigenvector centrality points popular nodes in a network. However, unpopular nodes (with not many connected links) may be more informative and discover strong links due to rarity in real networks. Betweenness centrality shows whether a certain part of a network is centralized or not. Centralized networks have a higher betweenness value since they have controller nodes to which everyone is connected. This situation will lead to less interaction between nodes because their connections will be over central nodes. Decentralized networks can have more shortest paths and can be more flexible. Also, we are not only looking for common neighbors while predicting new links to get information. The path-passed approach can provide more information compared to locally dealing with nodes in a network. Hence, close nodes will serve more possible connections, we are only considering shortest paths between nodes, meanwhile gaining from the complexity.
where nodes u and v satisfy {u, v ∈ V | e u, v ∈ E} in a given network G(V, E). (V ) and (E) are sets of vertices and edges, respectively. z is a shortest path, denoted s. paths, between u and v; x ∈ V (z). c eigen (x) is eigenvector centrality of node x, and c betw (x) represents betweenness centrality of node x. length(z) shows number of hops in path z between nodes u and v.
- Case 1: :
-
In Fig. 5.1a, shortest paths between 0 and 4 pass via nodes 2 and 6, while shortest paths from 0 to 3 pass via nodes 1 and 2. Node 2 is common for both cases. In this situation, the edge from 0 to 4 is more probable than the edge from 0 to 3 since node 6 has lower eigenvector and betweenness centralities compared to node 1. In our algorithm, we used exp(x) function to avoid negative values and higher weight for rare neighbors. Also, the difference using exp is insignificant when the value is small, while it is larger when the value is large. The number of hops is 1 in this case since there is no shorter path of length 2.
- Case 2: :
-
In Fig. 5.1b, the edge from 0 to 2 is more probable than the edges from 0 to 3 and from 0 to 5 since it has many more shortest paths than the others.
3.2 Graph Database
In the implementation of our algorithm, we have used a graph database to store the graph network data. Traditionally storing data in a relational database dominated due to its high performance. But recently data types have changed in the Internet more towards social networking and big data domains; this involves complex interconnected information. Thus, storing and manipulating complex data has become an issue using traditional relational databases. This motivated for the development of several database structures like graph databases. A graph database provides a method or a tool to model and store a graph related data by focusing on the relationship between entities and attributes of the nodes as basic constructs of a data model [21].
We have used Neo4j tool which is an open source graph database based on Java combining graph storage services and a high performance scalable architecture for graph-based operations. In our work, we used Java libraries provided by Neo4j to create and store the datasets. We also used the graph-based methods in the library in order to get the shortest paths between nodes, eigenvalue, and betweenness centralities of each node.
4 Datasets
The best way to test our algorithm is to apply it on real-world networks to check if we can successfully predict links between real entities. Accordingly, we have applied our algorithm on six well-known real-world network datasets where the number of nodes and edges is shown in Table 5.2:
-
Zachary Karate Club [22]: is one of the most popular networks in terms of community structure. This network corresponds to members of a karate club at a US university in the 1970s; members are friends. This club has induced fights between its members such that members were split into half. This makes it a perfect real scenario network for link prediction on a member to check to which group he belongs.
-
Dolphin Social Network [23]: is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound.
-
Les Misérables [24]: is a network corresponding for co-appearance of characters in the novel Les Misérables. It is interesting to test on this network as there are several communities to apply link prediction on them.
-
Books About US Politics (orgnet.com): is a network of books about US politics sold by Amazon where edges between books represent frequent co-purchasing of books by same buyers.
-
Word Adjacencies [25]: is an adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens.
-
American College Football [26]: is a network of American football games between Division IA colleges during regular season in Fall 2000.
5 Experiments and Results
After collecting the datasets related to the various networks, the following steps are applied to run our algorithm which will output the confusion matrix for the evaluation code:
-
1.
Randomly remove δ percentage of edges
-
2.
Run the algorithms presented in Table 5.1 on the new network and get the corresponding confusion matrix
-
3.
Calculate eigenvalue and betweenness centralities for all nodes in the network
-
4.
Run our proposed algorithm
-
5.
Select value α, which serves as a threshold for the algorithms predicted results.
The results shown in Tables 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8 are average results where for each network we perform Step 1 of removing edges randomly ten times. Moreover, we chose three different values of δ for removing 10, 30, and 50% of the edges. We decided on these values based on clustering coefficient analysis. This analysis may help in finding missing links between nodes, called structural holes. High degree nodes will have lower local clustering values in this analysis which means more structural holes will exist in the network so that central nodes will collect all the flow of information and reduce alternative paths. In Fig. 5.2, we can see that we got more less local clustering values for 0.7 and 0.9 to the edge removal percentage. In addition, we experimentally used α = 0. 1 in all tests as threshold value for accepting an edge as predicted. We did this testing for the networks by choosing ℓ = 5, which is the maximum depth used in all algorithms to search for a shortest paths between two nodes. In Fig. 5.3, we figured the path distance distribution of the datasets to show why we chose 5 as maximum depth. As shown for all networks, the performance of the algorithms are close to each other but our algorithm most of the times reports better values for the evaluation metrics than the other algorithms. In the various tables, I means δ = 0. 1, II means δ = 0. 3, and III means δ = 0. 5.
According to these tables, our precision and F1-score values are better than others in many cases by considerable margins. However, it is hard to decide which algorithm is better than others from accuracy results since the results are close to each other by small fraction. Also, we can clearly see from specificity results that we are not predicting non-existing links since our results are the higher compared to others, and our dataset is imbalanced which means that the number of negative examples (FP + TN, connections not to be predicted) is not close to the number of positive examples (TP + FN, connections to be predicted). Because of this, our sensitivity values are low since we have many false negatives by having more positive examples than negative examples.
To further check into how our algorithm is functioning and the advantage of using social network analysis in investigating the results of link prediction, we show in Fig. 5.4 a sample run of our algorithm on the Karate network with 30% of its edges removed. After removing edges, we ran the evaluation metrics we presented above to investigate the behavior of different algorithms compared to ours. In order to explain how our algorithm is predicting different than others, we used the color of the nodes to represent betweenness values on a white–red scale where white corresponds to low betweenness while red represents high betweenness. We also used size of the nodes to represent eigenvalues where the size of a node is directly proportional to its eigenvalue.
After running the algorithms on this network, we show edges which were successfully predicted by our algorithm as blue dashed lines; these edges connect nodes 32–33, 8–33, 15-32, and 20–32. All other algorithms have predicted the link between nodes 32 and 32 except Leicht–Holme–Newman algorithm. This is because these nodes have a large number of common neighbors (5) facilitating the prediction of this edge. While none of the other algorithms predicted the existence of the other edges which were successfully predicted by our algorithm, except for Katz which predicted the edge between nodes 8 and 32. This reported result is due to the fact that there is no common neighbors between nodes 8–33, 15–32, and 20–32. Thus, the other algorithms failed to predict these links. While our algorithm successfully predicts the mentioned links because it does not only use common neighbors between two nodes but also considers the sum of all shortest paths between the two nodes.
6 Conclusions
In this paper, we tackled the problem of predicting the existence of links in a graph by using network analysis. Finding hidden relationships between actors in a network has various advantages in predicting different future partnership, collaboration, etc., that based on the actors properties can help and accomplish a new trend in the research domain. It also provides the ability to unveil already existing links between people. For example, detecting series of related criminals for security reasons. By performing link prediction using social network analysis, we are able to benefit from existing graph theory algorithms that provide good analytical solutions to the problem. In our paper, we used a combination of shortest path, betweenness, and eigenvalue centralities for the link prediction algorithm. We showed with examples how our algorithm can perform better on real-world datasets than other link prediction algorithms which mostly focus on common neighbors for prediction. We will continue to expand the algorithm by incorporating various other features from the network like closeness, connectedness, etc. We will also investigate the possibility of developing a classifier to help in the process.
References
Wang J, De Vries AP, Reinders MJT (2006) Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 501–508
Hotho A, Jäschke R, Schmitz C, Stumme G (2006) Information retrieval in folksonomies: search and ranking. Springer, Berlin
Dong L, Li Y, Yin H, Le H, Rui M (2013) The algorithm of link prediction on social network. Math Probl Eng 2013:7 pp.
Heck T, Peters I, Stock WG (2011) Testing collaborative filtering against co-citation analysis and bibliographic coupling for academic author recommendation. In: Proceedings of the 3rd ACM RecSys’ 11 workshop on recommender systems and the social web, pp 16–23
Petry H, Tedesco P, Vieira V, Salgado AC (2008) Icare. a context-sensitive expert recommendation system. In: ECAI’08, pp 53–58
Reichling T, Wulf V (2009) Expert recommender systems in practice: evaluating semi-automatic profile generation. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, New York, pp 59–68
Tayebi MA, Ester M, Glässer U, Brantingham PL (2014) Spatially embedded co-offence prediction using supervised learning. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 1789–1798
Benchettara N, Kanawati R, Rouveirol C (2010) Supervised machine learning applied to link prediction in bipartite social networks. In: 2010 international conference on advances in social networks analysis and mining (ASONAM). IEEE, New York, pp 326–330
Hasan MA, Chaoji V, Salem S, Zaki M (2006) Link prediction using supervised learning. In: SDM’06: workshop on link analysis, counter-terrorism and security
Brandão MA, Moro MM, Lopes GR, Oliveira JPM (2013) Using link semantics to recommend collaborations in academic social networks. In: Proceedings of the 22nd international conference on World Wide Web companion. International World Wide Web conferences steering committee, pp 833–840
Chen J, Tang Y, Li J, Mao C, Xiao J (2014) Community-based scholar recommendation modeling in academic social network sites. In: Web information systems engineering–WISE 2013 workshops. Springer, Berlin, pp 325–334
Davis D, Lichtenwalter R, Chawla NV (2011) Multi-relational link prediction in heterogeneous information networks. In: 2011 international conference on advances in social networks analysis and mining (ASONAM). IEEE, New York, pp 281–288
Tang L, Wang X, Liu H (2009) Uncoverning groups via heterogeneous interaction analysis. In Ninth IEEE international conference on data mining, 2009. ICDM’09. IEEE, New York, pp 503–512
Radivojac P, Peng K, Clark WT, Peters BJ, Mohan A, Boyle SM, Mooney SD (2008) An integrated approach to inferring gene–disease associations in humans. Proteins Struct Funct Bioinf 72(3):1030–1037
Heck T (2013) Combining social information for academic networking. In: Proceedings of the 2013 conference on computer supported cooperative work. ACM, New York, pp 1387–1398
Sun Y, Barber R, Gupta M, Aggarwal CC, Han J (2011) Co-author relationship prediction in heterogeneous bibliographic networks. In: 2011 international conference on advances in social networks analysis and mining (ASONAM). IEEE, New York, pp 121–128
Lopes GR, Moro MM, Wives LK, De Oliveira JPM (2010) Collaboration recommendation on academic social networks. In: Advances in conceptual modeling–applications and challenges. Springer, Berlin, pp 190–199
Lichtnwalter R, Chawla NV (2012) Link prediction: fair and effective evaluation. In: Proceedings of the 2012 international conference on advances in social networks analysis and mining (ASONAM 2012). IEEE Computer Society, New York, pp 376–383
Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci Technol 58(7):1019–1031
Lü L, Zhou T (2011) Link prediction in complex networks: a survey. Physica A 390(6):1150–1170
Silvescu A, Caragea D, Atramentov A (2002) Graph databases
Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473
Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM (2003) The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol 54(4):396–405
Knuth DE (1993) The Stanford GraphBase: a platform for combinatorial computing, vol 37. Addison-Wesley, Reading, MA
Newman MEJ (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036104
Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Afra, S., Aksaç, A., Õzyer, T., Alhajj, R. (2017). Link Prediction by Network Analysis. In: Kawash, J., Agarwal, N., Özyer, T. (eds) Prediction and Inference from Social Networks and Social Media. Lecture Notes in Social Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-51049-1_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-51049-1_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51048-4
Online ISBN: 978-3-319-51049-1
eBook Packages: Computer ScienceComputer Science (R0)