2.1 Introduction

In this section, we discuss what a social network is, what its components are, and how a social network can be visualized as a graph. After these preliminaries, we focus on centrality analysis and its measures to understand how they can play a crucial role in understanding the evolution of social web graphs.

2.1.1 What is Social Networking?

Before the late 1990s, people relied heavily on newspapers, television, radio, and similar media for information, mainly news-related content. With the inception of social networking websites in the early 2000s, the paradigm shifted from hand-delivered material such as newspapers and telegrams to electronic delivery [1]. By 2005–06, Facebook and Twitter had entered the social networking arena, and they remain highly popular platforms for social interaction. Other sites such as Tumblr, Spotify, Foursquare, and Pinterest tried to fill specific social networking niches. Over the last decade, social networking websites have served communication needs so well that they have become an integral part of the daily routine of almost every person. The impact has been so powerful that the number of online apps, news apps, and channels has increased drastically. News channels have expanded their reach from national television broadcasts to live updates on social sites and their own apps [2]. The growth in people’s involvement with social networking websites has not been confined to communication; e-commerce has also grown rapidly [3]. The emergence of Amazon, Flipkart, and other e-commerce websites has brought a sharp rise in the digital retail market. Similarly, YouTube, Netflix, Amazon Prime, and others (offering various channels and web series) have created a whole new world of viewers. In simple words, social networking has connected the world: people can do most of what they need while remaining where they are.

This discussion leads to the following definition of social networking [4]: “Social networking is the use of Internet-based social media sites to stay connected with friends, family, colleagues, customers, or clients. Social networking can have a social purpose, a business purpose, or both, through sites such as Facebook, Twitter, LinkedIn, and Instagram, among others. Social networking has become a significant base for marketers seeking to engage customers.” Following this, we now discuss some interesting insights and analyses over social web graphs.

2.1.2 Social Networks as Graph

Social network analysis is the study of social structures, behaviors, and interactions between individuals using network analysis and graph theory. To perform this analysis, the nodes and edges of a graph are used to model the individuals and their interactions and relationships [5]. Each individual is represented by a node, and an edge between two individuals is drawn based on some notion such as friendship, similar likes, or recommendations. This connecting tie is of very high importance, as it forms the basis of the analysis of the network. A wide literature is available on how to define social ties between the entities in the graph [6,7,8]. The connecting edges store the information that enables us to associate nodes with one another, which is referred to as generating graph embeddings [9].

2.1.3 Why Centrality Analysis?

Given a connected graph simulating a real scenario, a few node positions are more central while the rest are peripheral, with respect to the view of the graph at that instant. These key positions help in identifying the nodes of prime importance and deciding upon their roles in the network. This notion was first discussed by Roethlisberger et al. [10]. Such information can be of great significance in determining the flow of information and in forming strategies so that information can be channeled through the network as quickly and as appropriately as possible. Consider a small example of a connectivity network as shown in Fig. 2.1.

Fig. 2.1 Connectivity network

In this small network, node 1 may not be a good choice of central node for circulating information, as it is distant from most of the nodes in the network. In contrast, node 2 is a good choice, as it is close (just an edge away) to all the other nodes in the graph. However, the concept of centrality varies with the objective and purpose, and so does the applicability of a given centrality measure. In the rest of the chapter, we discuss various centrality measures, their advantages, limitations, and implications.

2.2 Network Centrality: Measures and Concepts

Before delving into the details of network centrality measures and their types, let's first understand the categorization. The network centrality measures can be broadly classified into three classes:

  • Geometric Measures

    In this class of measures, the centrality of a node is a function of its distances to other nodes in the network. The importance of a node is determined by how easily it can reach other nodes or how easily other nodes can reach it.

  • Spectral Measures

    In this class of centrality measure, a node’s importance depends upon the eigenstructure of some graph-related matrix. In other words, a node is central depending upon its neighborhood nodes. Here the centrality is a function of the nodes associated with the node of interest.

  • Path-Based Measures

    Here the centrality measure depends upon how often a node is visited on the way between a defined source and destination. The concept originates from the idea of edge betweenness, which gives the number of shortest paths passing over an edge.

Apart from these, there exist other centrality measures, but they may be considered variations of the basic versions; the deviations are problem and application specific. We now focus on the centrality measures in these three categories.

2.2.1 Geometric Measures

  • Degree Centrality

It is the simplest and historically the first centrality measure; it counts the number of ties of a node and thus indicates the size of an individual’s network. For a directed graph, in-degree and out-degree centrality may be defined separately. For example, consider a web page navigation network, where the nodes represent web pages. The in-degree of a page ‘x’ is the number of web pages referring to it, and its out-degree is the number of web pages that ‘x’ refers to. Depending upon the situation or need, one or the other aspect is taken into consideration. Functionally, in-degree centrality can be defined as per the following equation (see Fig. 2.2 for an example).

$$c_{deg}(x) = d_{in}(x)$$

Fig. 2.2 Graph instance representing degree centralities: (i) the node with the largest number of connections has the highest centrality score and importance; (ii) the node with the smallest size has only one connection and hence the lowest centrality score; (iii) all remaining nodes have the same centrality score owing to the same number of connections
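
As a minimal sketch of the definition above, the following Python snippet counts in-degree and out-degree on a small web-page graph. The page names and links are invented for illustration and are not taken from Fig. 2.2; the function name `degree_centrality` is likewise only illustrative.

```python
from collections import Counter

# Hypothetical directed edges (referring page -> referred page); illustrative only.
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("x", "d"), ("d", "a")]

def degree_centrality(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # src refers to one more page
        in_deg[dst] += 1    # dst is referred to by one more page
    return in_deg, out_deg

in_deg, out_deg = degree_centrality(edges)
print("in-degree of x:", in_deg["x"], "out-degree of x:", out_deg["x"])
```

Whether the in-degree or the out-degree is the relevant score depends on the question being asked, as noted above.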

Liu et al. [11] used these centrality measures to study the effect of networked, criterion-based community engagement on student performance. In their study, in-degree centrality accounted for the popularity of a student in the network, while out-degree centrality measured how actively a student links to other students. Ergun et al. [12] used degree centrality to study the effect of the social networking structure formed in an online learning environment. Other applications of this centrality measure are reported in [13,14,15].

  • Closeness Centrality

Alexander Bavelas (December 26, 1913 [16]–August 16, 1993), an American psychosociologist, is credited as the first to define closeness centrality. Degree centrality only takes connections into account and weighs each link as equally important. However, that may not hold for many real-world networks. For example, in a road traffic network, nodes that are connected to many other nodes may not be as important as nodes that can reach the other nodes in the least time. In such situations, nodes that have smaller distances to the other nodes in the network are considered more central and more significant. Based on this notion, the functional definition can be given as

$$c_{close}(x) = \frac{1}{\sum\nolimits_{y} d(y, x)}$$

Here d(y, x) represents the length of the shortest path from node y to node x. Let us consider the case shown in Fig. 2.3.

Fig. 2.3 Closeness centrality scores

Here the closeness centrality for the first node is calculated as

$$c_{close}(1) = \frac{1}{1 + 2 + 3 + 4} = 0.1$$
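
The calculation above can be reproduced with a breadth-first search. The sketch below assumes that Fig. 2.3 is a five-node line graph 1-2-3-4-5, which is consistent with the distances 1, 2, 3, 4 used for node 1; the helper names are illustrative.

```python
from collections import deque

# Path graph 1-2-3-4-5, assumed here to match the graph of Fig. 2.3.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def bfs_distances(graph, source):
    """Shortest-path distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(graph, x):
    """1 / (sum of distances between x and all other nodes) in an undirected graph."""
    dist = bfs_distances(graph, x)
    return 1 / sum(d for node, d in dist.items() if node != x)

scores = {x: closeness(graph, x) for x in graph}
for node, score in scores.items():
    print(node, round(score, 3))
```

Because the graph is undirected, the distance from y to x equals the distance from x to y, so a single search from x is sufficient.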

Similarly, the closeness centrality is calculated for the other nodes. Clearly, the middle node has the highest centrality score, as it can reach any node in the network in at most 2 steps (\(c_{close}(3) = 1/(2 + 1 + 1 + 2) \approx 0.17\)). The notion here is how well a vertex can communicate with other nodes without relying on many in-between nodes to propagate the message. However, this centrality measure fails if the graph is disconnected. For example, in Fig. 2.4 the sum in the denominator is infinite for every node, because the distance between a node ‘x’ and a node ‘y’ that is disconnected from it is defined as \(\infty\); every score therefore degenerates to zero and carries no information.

Fig. 2.4 Disconnected graph

The closeness centrality measure for node A will be

$$c_{close}(A) = \frac{1}{1 + 1 + \infty} = 0$$

To counter this, the measure was remodeled by replacing the reciprocal of the summed distances with the sum of the reciprocal distances, in the spirit of the harmonic mean:

$$c_{harm}(x) = \sum_{d(y, x) < \infty,\; y \ne x} \frac{1}{d(y, x)}$$

This modification helps in addressing the anomaly caused due to non-connected nodes and thus can be applied to graphs that are not strongly connected (Fig. 2.5).

Fig. 2.5 Harmonic centrality scores
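
The difference between the two definitions can be seen on a small disconnected graph. The sketch below uses a stand-in for Fig. 2.4 in which A is adjacent to B and C while D is isolated; this layout is an assumption that merely matches the distances 1, 1, and ∞ used above, and the function names are illustrative.

```python
import math
from collections import deque

# Assumed stand-in for Fig. 2.4: A is adjacent to B and C, D is disconnected.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}

def distances(graph, source):
    """BFS distances from source; unreachable nodes keep the distance math.inf."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if dist[v] == math.inf:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(graph, x):
    total = sum(d for y, d in distances(graph, x).items() if y != x)
    return 1 / total  # 1 / inf evaluates to 0.0 for Python floats

def harmonic(graph, x):
    # Only finite distances contribute, as in the condition d(y, x) < infinity.
    return sum(1 / d for y, d in distances(graph, x).items()
               if y != x and d < math.inf)

for node in graph:
    print(node, closeness(graph, node), harmonic(graph, node))
```

Closeness collapses to zero for every node, whereas the harmonic variant still separates A from B and C and assigns zero only to the isolated node D.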

Kas et al. [17] proposed an incremental closeness centrality algorithm for dynamic social networks, in which edges and nodes are continuously added and removed. Mateusz et al. [18] used this centrality measure to identify the bus stops common to several bus lines using the idea of overlapping community structure. There are various other applications of this centrality measure [19,20,21].

Geometric measures discussed so far account for the node’s importance based on the node’s position in the network. In the next section, the discussion is focused upon how the centrality score of a node depends on the neighborhood nodes and how the centrality scores of the neighbor nodes too get influenced by central nodes.

2.2.2 Spectral Measures

The basic intuition of this class of centrality measure is that the nodes in contact with the central nodes have high centrality scores and those far away from these central nodes are considered to be low significance nodes.

  • Eigenvector Centrality

Unlike degree centrality, the score here depends on the kind of nodes to which node ‘x’ is connected. It is better to be connected to a few popular (well connected) nodes than to many nodes of low importance [22]. This measure of the influence of a node was proposed by Phillip Bonacich in his 1987 paper Power and Centrality: A Family of Measures [23].

$$c_{eig}(x) = \frac{1}{\lambda} \sum_{y \to x} c_{eig}(y)$$

where λ is a normalization constant, taken as the 2-norm \(\left\| {c_{eig} } \right\|_{2}\) of the updated score vector.

Under this iteration, \(c_{eig}\) converges to the dominant eigenvector of the adjacency matrix A, and λ converges to the dominant eigenvalue of A. Initially, each node is assigned a centrality score of 1. Then, in each successive iteration, the scores are revised as per the formula above. The matrix formulation of the same can be given as

$$AX=\lambda X$$

To understand it more clearly, let us consider an illustration on the five-node graph whose adjacency matrix is given below (the same matrix is used again for the graph of Fig. 2.6).

Matrix A for this graph will be defined as \(A = \left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]\) and initial centrality score, \(c = \left[ {\begin{array}{*{20}c} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{array} } \right]\). So for the first iteration, centrality scores will be evaluated as

$$\text{Iteration 1:}\quad A\,c^{(0)} = \left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} 2 \\ 4 \\ 3 \\ 1 \\ 2 \\ \end{array} } \right]$$

And, finally, defining the normalized scores as

$$c^{(1)} = \frac{A\,c^{(0)}}{\left\| A\,c^{(0)} \right\|_{2}} = \frac{1}{\sqrt{2^{2} + 4^{2} + 3^{2} + 1^{2} + 2^{2}}}\left[ {\begin{array}{*{20}c} 2 \\ 4 \\ 3 \\ 1 \\ 2 \\ \end{array} } \right] \approx \left[ {\begin{array}{*{20}c} {0.34} \\ {0.68} \\ {0.51} \\ {0.17} \\ {0.34} \\ \end{array} } \right]$$

$$\text{Iteration 2:}\quad A\,c^{(1)} \approx \left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {0.34} \\ {0.68} \\ {0.51} \\ {0.17} \\ {0.34} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {1.19} \\ {1.36} \\ {1.36} \\ {0.68} \\ {1.19} \\ \end{array} } \right],\qquad c^{(2)} = \frac{A\,c^{(1)}}{\left\| A\,c^{(1)} \right\|_{2}} \approx \left[ {\begin{array}{*{20}c} {0.45} \\ {0.51} \\ {0.51} \\ {0.26} \\ {0.45} \\ \end{array} } \right]$$

Progressing in this manner, the scores converge. Rescaled so that node 1 has score 1, the converged centrality vector for the example is \(c=\) \(\left[\begin{array}{c}1\\ 1.41\\ 1.27\\ 0.53\\ 1\end{array}\right]\)
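
The iteration traced above can be sketched in plain Python as follows. The function name `power_iteration`, the tolerance, and the iteration cap are illustrative choices, not part of the original description; the final rescaling by the first entry is only there to make the result comparable with a vector whose first entry is 1.

```python
import math

# Adjacency matrix A of the five-node example above.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

def power_iteration(A, tol=1e-10, max_iter=1000):
    """Repeat c <- A c / ||A c||_2 until the scores stop changing."""
    n = len(A)
    c = [1.0] * n                      # every node starts with score 1
    for _ in range(max_iter):
        new = [sum(A[i][j] * c[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        new = [v / norm for v in new]
        if sum(abs(a - b) for a, b in zip(new, c)) < tol:
            return new
        c = new
    return c

c = power_iteration(A)
scaled = [v / c[0] for v in c]         # rescale so that node 1 has score 1
print([round(v, 2) for v in scaled])
```

Only the ratios between the scores matter for ranking, so the 2-norm scaling used during the iteration and the rescaling by the first entry describe the same eigenvector.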

Carreras et al. [24] used this centrality measure to analyze the spread of an epidemic in a highly decentralized mobile network. Baldesi et al. [25] used it for the efficient cooperative distribution of streamlined content. Determining the centrality scores helps in understanding the topology of the network. A number of related articles discuss the use of this centrality measure. However, the measure has its limitations: eigenvector centrality is well behaved only for connected, undirected graphs. To counter this, the Katz centrality index was proposed by making a slight modification to the centrality calculation discussed above.

  • Katz’s Centrality

This centrality measure proposed by Leo Katz [26] defines a node’s importance by taking into account the total number of walks between a pair of nodes, defined as

$$c_{{katz}} ~\left( x \right) = ~\beta ~\mathop \sum \limits_{{k = 0}}^{\infty } \mathop \sum \limits_{{x \to y}} \alpha ^{k} ~\left( {A^{k} } \right)_{{xy}}$$

where α is the attenuation factor, taken from the interval \(\left( {0,\frac{1}{\lambda }} \right)\), λ being the largest eigenvalue of A. The attenuation factor penalizes connections made through distant neighbors by the factor \(\alpha^{k}\). The entry \(\left( {A^{k} } \right)_{xy}\) gives the number of walks of length k between nodes x and y. β assigns extra importance to particular nodes; its value is kept at one if no node in the network is to be given special privilege. For the graph of Fig. 2.6, the matrices \(A^{k}\) for k = 1, 2, 3 are

$$\begin{gathered} A^{1} = ~\left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]~,~~\,\,\,\,A^{2} = ~~\left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]^{2} = ~~\left[ {\begin{array}{*{20}c} 2 & 1 & 1 & 1 & 2 \\ 1 & 4 & 2 & 0 & 1 \\ 1 & 2 & 3 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 \\ 2 & 1 & 1 & 1 & 2 \\ \end{array} } \right] \hfill \\ A^{3} = ~~\left[ {\begin{array}{*{20}c} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ \end{array} } \right]^{3} = ~\left[ {\begin{array}{*{20}c} 2 & 6 & 5 & 1 & 2 \\ 6 & 4 & 6 & 4 & 6 \\ 5 & 6 & 4 & 2 & 5 \\ 1 & 4 & 2 & 0 & 1 \\ 2 & 6 & 5 & 1 & 2 \\ \end{array} } \right] \hfill \\ \end{gathered}$$

Fig. 2.6 Connected graph
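
The matrix powers listed above can be reproduced with a few lines of plain Python. The sketch below (the helper `matmul` and the variable names are illustrative) also enumerates the walks of length 3 from vertex 2 to vertex 5 and compares their number with the corresponding entry of the third power.

```python
# Adjacency matrix of the five-node example graph (vertices 1..5 map to indices 0..4).
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A2 = matmul(A, A)
A3 = matmul(A2, A)

# Enumerate the length-3 walks 2 -> a -> b -> 5 directly from the adjacency matrix.
walks_2_to_5 = [(2, a + 1, b + 1, 5)
                for a in range(5) for b in range(5)
                if A[1][a] and A[a][b] and A[b][4]]

print(A3[1][4], len(walks_2_to_5))
print(walks_2_to_5)
```

Several of the enumerated walks revisit a vertex, which is why the entries of the matrix powers count walks rather than simple paths.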

The entry in the second row and fifth column of \(A^{3}\) indicates that there exist 6 walks of length 3 between vertices 2 and 5 [(2,1,3,5), (2,4,2,5), (2,3,2,5), (2,1,2,5), (2,5,3,5), (2,5,2,5)]; these are walks rather than simple paths, since vertices may repeat. Katz centrality can thus be redefined recursively as

$$c_{{katz}} ~\left( x \right) = ~\alpha ~\mathop \sum \limits_{{y \to x}} \left( {c_{{katz}} ~\left( y \right) + ~\beta } \right)$$

This measure looks suitable for directed acyclic graphs. Since β is to assign a prioritized weightage to the nodes in the graph and is kept constant initially for a graph, it is α over which the centrality score of the node depends:

  • For α \(\approx\) 0, paths with length > 1 have low contribution and are less influential.

  • For a large value of α, Katz scores are more influenced by topology and long paths are penalized gently.

  • The measure diverges for \(\alpha \ge \frac{1}{\lambda }\); hence \(\frac{1}{\lambda }\) is the upper limit for α.
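
The effect of α can be explored by iterating the recursive definition to a fixed point. The sketch below reuses the five-node example matrix, for which the largest eigenvalue is about 2.69, so α must stay below roughly 0.37; the function name `katz`, the tolerance, and the two α values are illustrative assumptions.

```python
# Adjacency matrix of the five-node example graph used earlier.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

def katz(A, alpha, beta=1.0, tol=1e-12, max_iter=10000):
    """Iterate c(x) = alpha * sum over y -> x of (c(y) + beta) to a fixed point."""
    n = len(A)
    c = [0.0] * n
    for _ in range(max_iter):
        # A[y][x] == 1 means an edge y -> x (A is symmetric here).
        new = [alpha * sum(A[y][x] * (c[y] + beta) for y in range(n))
               for x in range(n)]
        if max(abs(a - b) for a, b in zip(new, c)) < tol:
            return new
        c = new
    raise ValueError("no convergence: alpha is probably >= 1 / lambda")

c_small = katz(A, alpha=0.05)   # short walks dominate: scores follow the degrees
c_mid = katz(A, alpha=0.2)      # longer walks contribute noticeably
print([round(v, 3) for v in c_small])
print([round(v, 3) for v in c_mid])
```

For the small α the scores are close to α times the degree of each node, in line with the first bullet above; the larger α lets the wider topology influence the scores.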

For the graph shown in Fig. 2.7, the initial centrality scores for the nodes are calculated for α = 0.85 and β = 1 (for all nodes). For a high α value, more paths of length greater than 1 end at node U than at node V. Changing α to 0.15 revises the scores, bringing node V’s importance score closer to that of node U, as longer paths are penalized and shorter paths become more important. Further, it can also be observed that increasing the β value for node B to 2 raises the centrality scores of nodes A and U and of all the nodes in contact with node B [27].

Fig. 2.7 Instance graph with Katz index for each node

Zhao et al. [28] used this centrality measure to rank candidate disease genes and protein–protein interactions in order to predict disease occurrence. Zhang et al. [29] used Katz's centrality measure to identify important nodes in a graph where each path has a different weightage. The results were found to be closely coherent with the local path index. Similarly, many research articles have utilized Katz's centrality measure to identify nodes of importance and interest in a network. Landherr et al. [30] have given a comprehensive survey of the usage of various centrality measures and algorithms.

  • PageRank and HITS Centrality Measures

The PageRank algorithm, developed by Larry Page and Sergey Brin in 1996 at Stanford University, is still used by Google to rank web pages. In its simplest form, the PageRank algorithm assigns scores to the nodes as

$$r_{j}^{(t + 1)} = \sum_{i \to j} \frac{r_{i}^{(t)}}{d_{i}} \tag{2a}$$

where \(r_{j}^{(t + 1)}\) is the score of node j at time t + 1 and \(r_{i}^{(t)}\) is the importance contribution of node i to node j, normalized by its out-degree \(d_{i}\). The normalization is needed because node i contributes to its other out-neighbors as well. The process assigns each node an initial score and updates the scores of all nodes in every iteration until they converge, where the convergence criterion is given by

$$\sum_{i} \left| r_{i}^{(t + 1)} - r_{i}^{(t)} \right| < \epsilon$$

Based on this, the algorithmic steps can be defined as

  • 0: Set \(r_{j} = \frac{1}{N}\), where N is the total number of nodes in the graph.

  • 1: \(r^{\prime}_{j} = \sum\nolimits_{i \to j} \frac{r_{i}}{d_{i}}\)

  • 2: If \(\sum\nolimits_{j} \left| r^{\prime}_{j} - r_{j} \right| > \epsilon\): set \(r \leftarrow r^{\prime}\) and go to 1; otherwise stop and return \(r^{\prime}\).

Let us trace the above algorithm over the example shown in Fig. 2.8.

Fig. 2.8 Graph instance for PageRank algorithm

Score calculation equations over this graph can be defined as

$$r_{y} = \frac{r_{y}}{2} + \frac{r_{a}}{2} \tag{2.1}$$

$$r_{a} = \frac{r_{y}}{2} + r_{m} \tag{2.2}$$

$$r_{m} = \frac{r_{a}}{2} \tag{2.3}$$

Based on these flow equations, the algorithm can be run to get the final PageRank scores of the nodes as

$$\begin{array}{c|cccccc} & \text{Iteration 0} & \text{Iteration 1} & \text{Iteration 2} & \text{Iteration 3} & \cdots & \text{Final scores} \\ \hline r_{y} & 1/3 & 1/3 & 5/12 & 9/24 & \cdots & 6/15 \\ r_{a} & 1/3 & 3/6 & 1/3 & 11/24 & \cdots & 6/15 \\ r_{m} & 1/3 & 1/6 & 3/12 & 1/6 & \cdots & 3/15 \\ \end{array}$$
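
The iterations tabulated above can be reproduced exactly with rational arithmetic. In the sketch below, the out-links are read off the flow equations (2.1)–(2.3), namely y → {y, a}, a → {y, m}, and m → {a}; the function name `pagerank_step` is illustrative. Python's `Fraction` prints values in lowest terms, so 9/24 appears as 3/8.

```python
from fractions import Fraction

# Out-links implied by the flow equations: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def pagerank_step(r, out_links):
    """One application of Eq. (2a): each node splits its score over its out-links."""
    new = {node: Fraction(0) for node in out_links}
    for i, targets in out_links.items():
        for j in targets:
            new[j] += r[i] / len(targets)
    return new

r = {node: Fraction(1, len(out_links)) for node in out_links}
history = [r]
for _ in range(3):
    r = pagerank_step(r, out_links)
    history.append(r)

for t, scores in enumerate(history):
    print(t, scores["y"], scores["a"], scores["m"])

# The reported final scores form a fixed point of the update.
final = {"y": Fraction(6, 15), "a": Fraction(6, 15), "m": Fraction(3, 15)}
print(pagerank_step(final, out_links) == final)
```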

Thus, we get the final scores for all the nodes once the algorithm converges. However, the algorithm may fail to converge to meaningful scores under two conditions:

  • Dead ends: the flow equations get stuck at nodes having no out-links. Such pages cause importance to leak out.

  • Spider traps: the flow gets stuck in a group of nodes that only send to and receive from one another. Such spider traps eventually absorb all the importance.

The solution to these problems was a slight modification to Eq. (2a) as per [31]

$$r_{j} = ~\mathop \sum \limits_{{i \to j}} \beta \frac{{r_{i} }}{{d_{i} }} + ~\left( {1 - \beta } \right)\frac{1}{n}$$

where β is the probability of following a link at random. Thus, (1 – β) is the probability of teleporting, i.e., jumping to a random page to get unstuck. Generally, the value of β ranges from 0.8 to 0.9. With the scores summing to one, the above equation is equivalent to stating that the row vector r of scores is the dominant left eigenvector of the following matrix:

$$r = r\left( \beta A_{r} + \left( {1 - \beta } \right)\frac{1}{n}\mathbf{1}^{T} \mathbf{1} \right)$$

Here \(A_{r}\) represents the graph adjacency matrix in which rows are normalized to row sum one, and \(\mathbf{1}\) is the all-ones row vector of length n. Figure 2.9 shows an instance of a graph with PageRank scores inside the nodes.

Fig. 2.9 Graph instance with PageRank scores of the nodes
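
The effect of teleporting can be illustrated on a small spider trap. The graph below is hypothetical (node m links only to itself) and is not one of the chapter's figures; the function name `pagerank` and its defaults are likewise illustrative. The sketch assumes that every node has at least one out-link, so dead ends are not handled.

```python
def pagerank(out_links, beta=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / n.

    Assumes every node has at least one out-link (no dead ends).
    """
    n = len(out_links)
    r = {node: 1.0 / n for node in out_links}
    for _ in range(max_iter):
        new = {node: (1.0 - beta) / n for node in out_links}
        for i, targets in out_links.items():
            for j in targets:
                new[j] += beta * r[i] / len(targets)
        if sum(abs(new[k] - r[k]) for k in r) < tol:
            return new
        r = new
    return r

# Hypothetical spider trap: m links only to itself.
trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

no_teleport = pagerank(trap, beta=1.0)    # reduces to the plain update of Eq. (2a)
with_teleport = pagerank(trap, beta=0.8)
print({k: round(v, 3) for k, v in no_teleport.items()})
print({k: round(v, 3) for k, v in with_teleport.items()})
```

Without teleporting, node m ends up with essentially all of the importance; with β = 0.8 the other two nodes keep a non-zero share.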

In Fig. 2.9, node B has more in-links and therefore receives importance contributions from a greater number of nodes than the others; thus, it has the highest PageRank score. In contrast, node C has only one in-link, but it is referred to by a node of high importance in the network; hence, its popularity score also becomes high. By the same reasoning, node E has a number of in-links contributing to its popularity score, but it is referred to only by nodes of low importance in the network, which limits its score.

The above discussion gives rise to the concept of hubs and authorities in a social network and to the HITS centrality algorithm. The basic idea follows from what we have discussed for the PageRank algorithm so far. Pages derive their importance from the kind of links (in-links or out-links) they exhibit and are accordingly categorized into two classes:

  • Authorities are nodes containing useful information (like the homepages of newspapers, course homepages, Wikipedia web pages, etc.). They have many incoming links or visits.

  • Hubs are nodes that link to authorities (like lists of newspapers, course bulletins, etc.). These nodes have many outgoing links or visits made.

These two notions of nodes have a mutually recursive definition given as: A good hub links to many good authorities and a good authority is linked from many good hubs. Based on this, the authority and hub scores for a node can be defined as

$$c_{{aut}} \left( x \right) = ~\mathop \sum \limits_{{y \to x}} c_{{hub}} \left( y \right)~\,\,\,and\,\,\,c_{{hub}} \left( x \right) = ~\mathop \sum \limits_{{x \to y}} c_{{aut}} \left( y \right)$$

Each page i thus has two scores: an authority score \(a_{i}\) and a hub score \(h_{i}\). The HITS algorithm can then be defined as

  • Initialize: \(a_{j}^{(0)} = 1/\sqrt{n},\;\; h_{j}^{(0)} = 1/\sqrt{n}\)

  • Keep iterating till convergence:

    • \(\forall i:\;\; \text{Authority:}\;\; a_{i}^{(t + 1)} = \sum\nolimits_{j \to i} h_{j}^{(t)}\)

    • \(\forall i:\;\; \text{Hub:}\;\; h_{i}^{(t + 1)} = \sum\nolimits_{i \to j} a_{j}^{(t)}\)

    • \(\text{Normalize:}\;\; \sum\nolimits_{i} \left( a_{i}^{(t + 1)} \right)^{2} = 1,\;\; \sum\nolimits_{i} \left( h_{i}^{(t + 1)} \right)^{2} = 1\)

In vector notation, these formulas can be expressed as follows:

  • Vectors \(a = (a_{1}, a_{2}, \ldots, a_{n}),\;\; h = (h_{1}, h_{2}, \ldots, h_{n})\)

  • Adjacency matrix A (n × n): \(A_{ij} = 1\) if i \(\to\) j

  • Can rewrite \(h_{i} = \sum\nolimits_{i \to j} a_{j}\) as \(h_{i} = \sum\nolimits_{j} A_{ij} a_{j}\)

  • So: \(h = A\,a\) and similarly: \(a = A^{T} h\)

An interesting result obtained by combining the two expressions is that the authority score a is an eigenvector corresponding to the largest eigenvalue of \(A^{T}A\). Similarly, the hub score h is the eigenvector corresponding to the largest eigenvalue of \(AA^{T}\).
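
A minimal sketch of the iteration in plain Python is given below. The four-node directed graph is hypothetical and chosen only so that one node is an obvious hub and another an obvious authority; the function names and the fixed number of iterations are illustrative assumptions.

```python
import math

# Hypothetical directed graph: A[i][j] == 1 means node i links to node j.
A = [
    [0, 1, 1, 1],   # node 0 links to 1, 2 and 3 (hub-like)
    [0, 0, 1, 1],   # node 1 links to 2 and 3
    [0, 0, 0, 1],   # node 2 links to 3
    [0, 0, 0, 0],   # node 3 links to nobody (pure authority)
]

def normalize(v):
    """Scale v to unit 2-norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(A, iterations=100):
    """Return (authority, hub) scores after a fixed number of synchronous updates."""
    n = len(A)
    a = [1 / math.sqrt(n)] * n
    h = [1 / math.sqrt(n)] * n
    for _ in range(iterations):
        # a = A^T h: the authority of i sums the hub scores of nodes j with j -> i.
        new_a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        # h = A a: the hub score of i sums the authority scores of nodes j with i -> j.
        new_h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        a, h = normalize(new_a), normalize(new_h)
    return a, h

a, h = hits(A)
print("authority:", [round(x, 3) for x in a])
print("hub:      ", [round(x, 3) for x in h])
```

Node 3, which is linked from every other node, obtains the largest authority score, while node 0, which links to all the others, obtains the largest hub score.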

Fig. 2.10 Graph instance with authority and hub scores of the nodes

Figure 2.10 shows a graph instance in which the nodes are annotated with authority and hub scores. Hub scores are accumulated based on the outgoing links from a node; similarly, authority scores are based on the incoming links to a node [27]. Moreover, some nodes act both as hubs and as authorities.

These algorithms have found importance in several fields. Coppola et al. [32] used PageRank scores to evaluate and optimize the global performance of swarm-based path evaluation for a robot. Zhao et al. [33] proposed a motif-based PageRank mechanism to find the top researchers in a citation network. Yin et al. [34] proposed a variant of the PageRank algorithm, termed the Signed PageRank algorithm, that includes both positive and negative recommendations from neighbors simultaneously for product recommendation.

De Blas et al. [35] used a weighted HITS centrality algorithm to identify and rank the most influential nodes by considering the impact of relations between DMUs (Decision Making Units). A few other works reported in the literature [36, 37] demonstrate the high utility of the concept in social networks and other fields. The centrality measure is highly popular in social network analysis for influence maximization, influencer detection, etc., and thus this class of algorithms is of high significance in the current scenario.

2.2.3 Path-Based Measures

In this category of centrality measures, the centrality scores are defined by how often a particular node or edge lies on the paths along which information travels from one part of the network to other parts. This measure is often referred to as the betweenness centrality measure and is closely related to closeness centrality, since both build on shortest paths. Betweenness centrality counts the number of times a given node is encountered on the shortest paths between pairs of other nodes, whereas closeness centrality weighs the score by the lengths of the shortest paths only. When several shortest paths exist between a pair, the credit is shared: for example, if there are three shortest paths from node A to node Z, and node B is along two of them, B is given two-thirds of a point for the A to Z pair.
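
The two-thirds example can be checked with a breadth-first search that counts shortest paths. The graph below is hypothetical and constructed only so that there are exactly three shortest A to Z paths (A-B-D-Z, A-B-E-Z, and A-C-F-Z); the function names are illustrative, and the sketch assumes a connected, undirected graph.

```python
from collections import deque
from fractions import Fraction

# Hypothetical graph with three shortest A-Z paths: A-B-D-Z, A-B-E-Z and A-C-F-Z.
graph = {
    "A": ["B", "C"], "B": ["A", "D", "E"], "C": ["A", "F"],
    "D": ["B", "Z"], "E": ["B", "Z"], "F": ["C", "Z"],
    "Z": ["D", "E", "F"],
}

def count_shortest_paths(graph, source):
    """BFS returning (dist, sigma): distances and numbers of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:              # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def pair_credit(graph, y, z, x):
    """Fraction of the shortest y-z paths that pass through x."""
    dist_y, sigma_y = count_shortest_paths(graph, y)
    dist_x, sigma_x = count_shortest_paths(graph, x)
    if dist_y[x] + dist_x[z] != dist_y[z]:
        return Fraction(0)                 # x lies on no shortest y-z path
    return Fraction(sigma_y[x] * sigma_x[z], sigma_y[z])

print(pair_credit(graph, "A", "Z", "B"), pair_credit(graph, "A", "Z", "C"))
```

The number of shortest paths through x is the product of the path counts of the two halves, which holds whenever x lies on some shortest path between the pair.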

  • Betweenness Centrality

The notion of betweenness centrality, proposed by Freeman in 1977 [38], comes in two variants: edge betweenness and node betweenness. The notion of edge betweenness ultimately leads to the latter, but it provides useful insight into path contribution, i.e., the number of paths through which a node ‘x’ can reach node ‘y’ [27]. Let us consider the example in Fig. 2.11. The figure shows the number of shortest paths from node A to all other nodes in the network. Based on this, the node flow can be defined as

$$\text{node flow} = 1 + \sum \text{child edges}$$

Fig. 2.11 Count of the number of shortest paths from node

Further, the flow is split up based on the parent node’s contribution. The paths are explored using the BFS (breadth-first search) mechanism. Multiple paths between a given source and destination need to be counted fractionally, as shown in Fig. 2.12.

Fig. 2.12 Node flows to the path

This edge betweenness centrality can help us leverage the information to evaluate node betweenness centrality as well. The betweenness centrality for node x can be defined as the probability that the shortest path passes through x. Thus, we have node centrality measure defined as

$$c_{bet}(x) = \sum_{y, z \ne x,\; \sigma_{yz} \ne 0} \frac{\sigma_{yz}(x)}{\sigma_{yz}}$$

where \(\sigma_{yz}\) is the number of shortest paths going from y to z and \(\sigma_{yz}(x)\) is the number of such paths that pass through x.

Removing nodes in decreasing order of betweenness disrupts the network quickly, because a node with a high betweenness score acts as a mediator between other nodes.

As per Fig. 2.13,

Fig. 2.13 Line graph with betweenness centrality scores of each node

  • A lies between no other two vertices

  • B lies between A and 3 other vertices: C, D, and E

  • C lies between 4 pairs of vertices: (A, D), (A, E), (B, D), and (B, E)

There are no alternate paths for these pairs to take without C; thus, C has high betweenness centrality. Consider another example.
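The counts for the line graph can be reproduced programmatically. The sketch below swaps in Brandes' accumulation algorithm, which is not described in this chapter but is a standard way to compute betweenness; it assumes the line graph is the path A-B-C-D-E and counts each unordered pair once.

```python
from collections import deque


def brandes_betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    scores = dict.fromkeys(adj, 0.0)
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors.
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies from the farthest nodes towards s.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each unordered pair was seen from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in scores.items()}


# Assumed layout of the line graph: A - B - C - D - E.
line = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
line_scores = brandes_betweenness(line)
print(line_scores)  # {'A': 0.0, 'B': 3.0, 'C': 4.0, 'D': 3.0, 'E': 0.0}
```

The output agrees with the bullet points: A lies between no pair, B between three pairs, and C between four.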

The betweenness centrality scores for the graph shown in Fig. 2.14 can be computed as follows:

$$\begin{aligned} Betweenness\left( E \right) = & A \to F + A \to G + A \to D + A \to C \\ & + \,B \to F + B \to G + B \to D + B \to C = 8 \\ \end{aligned}$$
Fig. 2.14 Graph with betweenness centrality scores of each node

Similarly,

$$\begin{aligned} Betweenness\left( F \right) = & A \to G + A \to D + A \to C + B \to G + B \to D \\ & + B \to C + \,E \to G + E \to C + E \to D = 9 \\ \end{aligned}$$

In the same manner, the betweenness centrality score of every node of the graph can be calculated. Because betweenness is one of the most powerful centrality measures, many applications have used it as a metric for problem-solving approaches in which the interest lies in finding the bridges of the network. Daly et al. [39] used this metric to find routes in a MANET environment by mapping the concept of small-world dynamics to identify the best message delivery routes. Kazerani et al. [40] discussed how betweenness centrality can be used to model the traffic flow of cities. Haghir et al. [41] proposed a novel k-path betweenness centrality measure in which start and endpoints are sampled for path evaluation until there are enough samples to converge. The method is reported to have superior performance over the conventional algorithm. Likewise, many papers cite the importance of this metric for identifying influential or highly important entities that govern the flow of information in a network.

Apart from this categorization of centrality measures, there exist modified versions, such as applying the betweenness and PageRank centrality measures in combination. There is also the notion of the induced centrality measure, explained at the end of the discussion of Katz centrality, which suggests that the importance score of a node rises as soon as it comes in contact with an influential node. Likewise, derived versions and variations of these centrality measures are possible, which provide new evaluation metrics for judging importance. In the next section, we will see the evaluation of these centrality metrics over real-world graph networks using SNAP (Stanford Network Analysis Platform).

2.3 Experimental Results and Analysis

To conduct experimental simulations, we have considered gemsec_facebook_dataset [42], which contains datasets of 8 different categories of Facebook Page networks. The data was collected in November 2017 through the Graph Embedding with Self Clustering framework proposed in [43]. The dataset contains networks of various government websites, TV shows, actors, etc. Here the nodes represent the individual entities, while the edges between the nodes represent mutual likes. The edge lists of these networks are stored in CSV files, where the nodes have been numbered from index value zero to maintain anonymity. For the purpose of comparative analysis, we considered the network of TV shows, where the file contains the edge list and two TV shows are connected if they mutually like each other (undirected graph). The graph contains 3,892 nodes and 17,662 edges. The top-10 central nodes identified by the various measures are as follows:
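As an illustration of how such an edge list can be turned into a ranking, the sketch below reads a CSV edge list into an undirected adjacency structure and lists the top-k nodes by degree centrality. It is not the SNAP workflow used for the reported results. The header row (`node_1,node_2`), the in-memory sample data, and the function names are assumptions made for the example; the real file would be opened from disk instead.

```python
import csv
import io
from collections import defaultdict


def read_edge_list(handle, has_header=True):
    """Build an undirected adjacency map from a two-column CSV edge list."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)                    # skip the assumed header row
    adj = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue
        u, v = int(row[0]), int(row[1])
        if u == v:
            continue                          # choice made here: ignore self-loops
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def top_k(scores, k=10):
    """Highest-scoring nodes first; ties broken by node id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Tiny made-up sample standing in for the real edge-list file.
sample = io.StringIO("node_1,node_2\n0,1\n0,2\n0,3\n1,2\n")
graph = read_edge_list(sample)
ranking = top_k(degree_centrality(graph), k=2)
print(ranking)
```

For the actual dataset, `io.StringIO(...)` would be replaced by `open(path, newline="")` with the path of the downloaded edge-list file.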

These results have been evaluated using the SNAP centrality functions. From the table of scores (Table 2.1), a few interesting facts can be determined:

  • The node with id 2008 has high centrality scores under the degree, closeness, betweenness, and PageRank centrality measures. Thus, it can be inferred that this TV show is the most liked one.

  • Eigenvector centrality scores and HITS centrality scores for the graph have the same top-10 nodes with identical scores. The reason is that the graph is undirected: every edge acts as both an in-link and an out-link, so the hub and authority scores of a node coincide with each other and with its eigenvector centrality score.

  • A number of nodes appear in the top-10 lists of both closeness and betweenness centrality. This relates to the first point: a node that ranks highly under one measure may also rank highly under another.
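The agreement between eigenvector and HITS scores on an undirected graph can be observed with plain power iteration. The sketch below is a minimal illustration, not the SNAP implementation: it assumes a connected, non-bipartite undirected graph, uses unit-length normalization, and runs a fixed number of iterations instead of testing for convergence. The small graph is hypothetical.

```python
import math


def normalize(vec):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {k: x / norm for k, x in vec.items()}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration: each score becomes the sum of the neighbours' scores."""
    x = normalize({v: 1.0 for v in adj})
    for _ in range(iterations):
        x = normalize({v: sum(x[u] for u in adj[v]) for v in adj})
    return x


def hits(adj, iterations=200):
    """HITS updates; in an undirected graph in-links and out-links are the same."""
    hub = normalize({v: 1.0 for v in adj})
    auth = dict(hub)
    for _ in range(iterations):
        auth = normalize({v: sum(hub[u] for u in adj[v]) for v in adj})
        hub = normalize({v: sum(auth[u] for u in adj[v]) for v in adj})
    return hub, auth


# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
eig = eigenvector_centrality(g)
hub, auth = hits(g)
largest_gap = max(
    max(abs(eig[v] - hub[v]), abs(eig[v] - auth[v])) for v in g
)
print(largest_gap)
```

The printed gap is at the level of floating-point noise, which is consistent with the identical top-10 lists noted above. A library may scale the scores differently, so only the ranking and relative values should be compared.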

Different centrality measures have different implications and meanings in the context of the network. In this case, high degree centrality means that the node has mutual likes with many other nodes, i.e., a TV show is mutually liked with many other TV shows. Closeness centrality refers to the close association of TV shows that are liked together more often. Betweenness centrality refers to the shows that are more central in the graph and connect the likes of one kind of show to those of other kinds. In some cases, the centralities may also be correlated with each other. However, this cannot be stated in general, as it depends entirely on the topology of the network. Nevertheless, when studying a highly dense network like this one, the centrality trends may help to identify influential nodes according to the objective to be attained. For instance, high-degree nodes will transmit the information and cover the span of the graph. If we want the information to pass through particular nodes on the maximum number of routes, betweenness centrality is to be weighted high. If we want fast information localization, closeness and eigenvector centrality measures are of high importance. Based on the scores in Table 2.1, scatter plots of node ids versus centrality scores are shown in Figs. 2.15 and 2.16.

Table 2.1 Top-10 central nodes based on various centrality measures
Fig. 2.15 Scatter plot for degree centrality

Fig. 2.16 Scatter plot for closeness centrality

The degree centrality distribution plot indicates that nodes with a high degree exist in different regions of the graph, but only a few lie in the top region of the curve; the majority of the nodes settle at the bottom. Closeness centrality appears to have a uniform distribution, as it takes into account a node’s access to the other nodes in minimum distance. The curve of the betweenness centrality measure has a smooth increasing trend, which suggests that there are nodes after every local structure that communicate information from one local region to another. The same is suggested by eigenvector centrality, but its increasing trend is rapid, as there are a high number of nodes with the shortest path to the majority of nodes in the network. PageRank and HITS centrality have similar trends (Figs. 2.17, 2.18, 2.19, and 2.20).

Fig. 2.17 Scatter plot for eigenvector centrality

Fig. 2.18 Scatter plot for PageRank centrality

Fig. 2.19 Scatter plot for HITS centrality

Fig. 2.20 Scatter plot for betweenness centrality

Another analysis carried out over these centrality measures is how well they are correlated with each other for this graph. Table 2.2 presents the Spearman correlation matrix between the centrality measures. Each cell gives the correlation coefficient along with the p-value. The correlation between two factors under study lies in the range [-1, 1]. The strength of the correlation is defined as per the following rules [44]:

  • 0.00–0.19: very weak

  • 0.20–0.39: weak

  • 0.40–0.59: moderate

  • 0.60–0.79: strong

  • 0.80–1.0: very strong

Table 2.2 Spearman correlation matrix

Spearman correlation was chosen because the centrality distributions are observed to be not necessarily normal. The matrix values have been evaluated with the p-value being zero or approximately zero. Degree centrality has a strong association with the eigenvector and PageRank centrality measures (in the case of an undirected network). Similarly, closeness centrality has a very high correlation with HITS centrality, which suggests that as more nodes accumulate closer, there are more chances of having more hits. There is a strong correlation between betweenness and eigenvector centrality, which means that nodes having high betweenness in the network emerge as the most liked nodes. The graph being undirected, eigenvector and PageRank centrality turn out to be similar concepts, as the in-links and out-links are equated. However, there exists a very weak correlation between PageRank and HITS centrality.
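For readers who want to reproduce this kind of comparison, the sketch below computes a Spearman coefficient as the Pearson correlation of average ranks and maps it to the strength labels listed above. It is an illustration only: the function names are hypothetical, the score vectors are made-up numbers rather than values from Table 2.2, applying the labels to the absolute value of the coefficient is an assumption, and no p-value is computed.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def strength(rho):
    """Label from the rule list above, applied to |rho| (an assumption)."""
    r = abs(rho)
    if r < 0.20:
        return "very weak"
    if r < 0.40:
        return "weak"
    if r < 0.60:
        return "moderate"
    if r < 0.80:
        return "strong"
    return "very strong"


# Made-up scores for five nodes under two measures (monotonically related).
measure_a = [1.0, 2.0, 3.0, 4.0, 5.0]
measure_b = [2.0, 4.0, 8.0, 16.0, 32.0]
rho = spearman(measure_a, measure_b)
print(rho, strength(rho))
```

Because the coefficient depends only on ranks, the non-linear but monotone relation between the two made-up vectors still yields a coefficient of one, which is the property that motivates using Spearman correlation for non-normal score distributions.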

2.4 Conclusions

Social networks are one of the prime means of connecting the real world virtually; the information on them is vast and can be utilized in various ways to derive value. The information flow in any network is governed by its high-importance nodes, and the importance of a particular node is measured on the basis of its position, its links, and its capacity to deliver the information flow to the maximum number of nodes in the network. This notion gives rise to the concept of network centrality.

This chapter focuses on various centrality measures and the criteria used to certify a node’s importance. The centrality measures have been grouped into three categories depending upon the underlying idea of importance. A detailed investigation has been presented with algorithms and examples for all centrality measures. Further, the ways in which particular centrality measures have been investigated and used by various researchers to solve problems in various domains are mentioned where relevant. To convey the concept and significance of centrality, the chapter considers the graph (edge list) of a real-world network over which each centrality measure is evaluated, and the results are analyzed using the SNAP graph analysis tool. This detailed analysis and description of the concepts motivates the use of this knowledge in various domains, such as protein–protein interaction networks, road traffic networks, and social networks, to obtain significant results and identify the hotspots of the network. Further, as discussed previously, combinations of centrality measures and variations of the conventional measures can be exploited to identify nodes of high significance and to help in building a decision model.