1 Introduction

Different polarization metrics have been proposed in the literature from several vantage points, including network topology [13, 14, 17] and content semantics and sentiment [5, 7]. Current network-based polarization measures [13, 14, 17] are built on the assumption that a polarized network consists of two opposing communities. According to Esteban et al. [11], however, individuals in a polarized society can be grouped into multiple antagonistic communities. Because most efforts on measuring polarization assume exactly two antagonistic groups, they must either ignore the neutral nodes or assign them to one of the extreme groups. Our metric addresses this limitation by acknowledging the existence of multiple communities.

This paper proposes a heterophily-based polarization metric called “cross-community affinity” that can be applied to networks with two or more communities with conflicting positions, goals, and viewpoints. We assume that these communities are placed equidistantly in a one-dimensional space. This assumption is motivated by two considerations: first, it allows us to compare with other metrics in the literature; second, it reflects the datasets we use for our empirical evaluation. The cross-community affinity of a node represents the node’s affinity to communities with a different ideology than its own. The proposed metric is computed at the node level and can be aggregated to any higher level, such as the community level, the network level, or any sub-network level. With this approach, we can understand which nodes or communities contribute most to polarization, enabling a more detailed picture and the possibility of directing interventions to particular nodes.

The rest of the paper is structured as follows. Section 2 presents related work. Section 3 explains the metric we propose. Section 4 describes the datasets used in this study and reports the results of the experiments. Section 5 summarizes the results and discusses future work.

2 Polarization Metrics in the Literature

Measuring polarization using structural characteristics inferred from network representations of social or political systems is a common topic in the literature, along with two other approaches: survey-based approaches [9], which measure distributional properties of public opinion through surveys; and content-based approaches [4, 8, 21], which use NLP tools to identify opposing groups on the network.

Conover et al. [6] suggest that polarization has a significant impact on the structure of social networks because it results in the formation of two groups that are well connected within themselves but have few ties to one another. Guerra et al. [14] present a polarization metric that focuses on the nodes at the community boundary, which captures the concepts of antagonism and polarization. Another polarization metric, the Polarization index [17], measures how far apart two groups are in terms of ideology, assuming their populations are equal. Garimella et al. [13] introduced the Random Walk Controversy (RWC) metric, which uses random walks to estimate how likely information is to stay inside a group or reach other groups. Salloum et al. [20] examine the polarization measures mentioned above via simulations and demonstrate that all of them produce high polarization scores even for random networks with density and degree distributions close to those of typical real-world networks.

However, these metrics are developed based on the assumption that the polarized network consists of exactly two communities. In this paper, we propose a heterophily-based polarization metric called cross-community affinity, which measures the affinity of a node to other clusters rather than its own.

3 Cross-community Affinity: A Heterophily-Based Polarization Metric

We propose a new polarization metric called cross-community affinity to serve two specific objectives. First, it should adapt to a variable number of ideological groups connected by different antagonizing forces. Second, it should be applicable at different granularities, from a single node to the full network and other network-based groupings in between.

As in previous work [13, 14], the basis of this polarization metric is a node’s connectivity with groups other than its own. To capture this, we introduce a heterophily-based metric, consistent with the definition of heterophily [15], that captures how a node is connected to different groups via both direct and indirect links. We assume the polarization of a network is the negative of the average cross-community affinity, that is:

$$\text{Polarization} = - \text{Avg. cross-community affinity}$$

In order to define the metric, we use the following intuition. First, a network can have multiple communities. We assume that a constant between 0 and 1 represents the ideological distance between different groups. Intuitively, a connection with an ideologically opposite node should weigh differently than a connection with an ideologically similar node. In a political system, one could consider the difference between a far-right to far-left connection and a connection between a leaning-right and a center political position. To account for the ideological difference between communities, we define communities as being placed in a one-dimensional space and equally spaced apart. The datasets we looked at (described in Sect. 4.1) implicitly position themselves in a one-dimensional space. Specifically, our use of the VoterFraud2020 dataset labeled using Media Bias Fact Check considers political orientation in a uni-dimensional space. We use a weight factor to represent the ideological distance. For simplicity, we consider that the distance between consecutive communities is constant and equal to \(\frac{1}{|C|-1}\), where \(|C|\) is the number of communities (as shown in Appendix A.4). This assumption can, of course, be relaxed in a scenario in which, for example, the ideological distance between extreme left and leaning left is smaller than between leaning left and center.

Second, we assume that both direct and indirect connections can have an impact on a node’s cross-community affinity. However, as is well accepted in the literature [12], indirect connections have a much smaller impact on one’s beliefs than direct connections. Friedkin [12] observed empirically that people’s awareness of others’ actions is restricted to people who were either in direct contact or had at least one contact in common. Moreover, the impact of such connections is typically a function of the overall number of connections a node has: the more neighbors, the less the impact of any one neighbor may be. This is reflected in the normalization by node degree in Eqs. 2 and 4. Finally, we assume that the ideological difference between nodes from the same community is −1. This value was chosen such that a node’s affinity for its own community reduces its cross-community affinity.
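The weighting scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `ideology_weight` and the integer positions 0, 1, ..., |C| − 1 along the ideological axis are assumptions made here, and the weight between non-adjacent communities is taken to be the number of steps between them times \(\frac{1}{|C|-1}\), which follows from the equal-spacing assumption.

```python
def ideology_weight(pos_a: int, pos_b: int, n_communities: int) -> float:
    """Weight between two communities placed on a one-dimensional axis.

    Assumption: communities sit at integer positions 0 .. n_communities - 1,
    ordered along the ideological axis and equally spaced.
    """
    if n_communities < 2:
        raise ValueError("need at least two communities")
    if pos_a == pos_b:
        return -1.0  # same community: lowers cross-community affinity
    # Adjacent communities are 1 / (|C| - 1) apart; the extremes are 1 apart.
    return abs(pos_a - pos_b) / (n_communities - 1)


# Illustrative ordering of five communities along a left-to-right axis.
labels = ["left", "left-center", "center", "right-center", "right"]
weights = {
    (a, b): ideology_weight(i, j, len(labels))
    for i, a in enumerate(labels)
    for j, b in enumerate(labels)
}
```

With two communities the scheme reduces to −1 for same-community links and 1 for cross-community links; with five communities adjacent groups are 0.25 apart.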

We thus define the cross-community affinity (CCA) of a node i as the sum of the effects of its direct neighbors and indirect neighbors on its ideological openness:

$$\begin{aligned} CCA(i) = DNE(i) + \alpha \times INE(i) \end{aligned}$$
(1)

where DNE(i) is the direct neighbor effect on node i, INE(i) is the indirect neighbor effect on node i, and \(\alpha \) is the impact factor of the indirect neighbor effect. For simplicity we consider \(\alpha = 1/h\), where h is the number of social hops between node i and the given set of nodes (in this case h = 2, so \(\alpha = 1/2\)).

We consider the direct neighbor effect on node i as the sum of the relative impact of i’s direct neighbors as follows:

$$\begin{aligned} DNE(i) = \sum _{c \in C} w_{(s(i),c)} \times \frac{k_c(i)}{k(i)} \end{aligned}$$
(2)

where C is the set of communities in the network, s(i) is the community to which node i belongs, and \(w_{(s(i),c)}\) is the ideology-based distance between i’s community and community c. \(k_c(i)\) denotes the number of neighbors of i in community c, and k(i) denotes the total number of neighbors of i. Similarly, we consider the indirect neighbor effect on i as the average of the relative effects of its 2-hop neighbors over all different communities:

$$\begin{aligned} INE(i) = \frac{1}{|C_{N(i)}|} \sum _{c \in C_{N(i)}} ANE_c(i) \end{aligned}$$
(3)

where \(C_{N(i)}\) is the set of communities in i’s neighborhood and \(|C_{N(i)}|\) is the number of communities in i’s neighborhood. \(ANE_c(i)\) represents the average neighbor effect of i’s immediate neighbors, obtained by examining the neighbors’ own neighborhoods. We calculate the individual neighbor effect of each neighbor of node i to determine how that neighbor’s own neighbors are distributed across the communities. To determine the impact of neighbor j on node i, we calculate the neighbor effect (NE) of j on i as follows:

$$\begin{aligned} NE(j,i) = \sum _{g \in C} w_{(s(i),g)} \times \frac{k_g(j)}{k(j)-1} \end{aligned}$$
(4)

where g ranges over the communities to which node j’s neighbors belong, \(w_{(s(i),g)}\) is the ideology-based distance between i’s community and community g, \(k_g(j)\) represents the number of j’s neighbors in community g, and k(j) is the total number of j’s neighbors, from which we exclude i (hence the denominator \(k(j)-1\)).
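Equations 1 to 4 can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: \(ANE_c(i)\) is read here as the mean of NE(j, i) over i's neighbors j that belong to community c; node i is excluded from both \(k_g(j)\) and k(j); a neighbor whose only link is to i is skipped because its NE is undefined; and isolated nodes are given a CCA of 0. The function names and the adjacency-dictionary representation are likewise hypothetical.

```python
from collections import Counter, defaultdict


def weight(a, b, n):
    """Same community -> -1; otherwise equally spaced distance in (0, 1]."""
    return -1.0 if a == b else abs(a - b) / (n - 1)


def neighbor_effect(adj, comm, n, j, i):
    """NE(j, i): weighted community mix of j's neighbors, excluding i (Eq. 4)."""
    others = [u for u in adj[j] if u != i]
    if not others:
        return None  # assumption: NE is undefined when j's only neighbor is i
    counts = Counter(comm[u] for u in others)
    return sum(weight(comm[i], g, n) * k for g, k in counts.items()) / len(others)


def cca(adj, comm, n, i, alpha=0.5):
    """CCA(i) = DNE(i) + alpha * INE(i), with alpha = 1/h and h = 2 (Eq. 1)."""
    nbrs = adj[i]
    if not nbrs:
        return 0.0  # assumption: isolated nodes are treated as neutral
    # Eq. 2: direct neighbor effect.
    counts = Counter(comm[j] for j in nbrs)
    dne = sum(weight(comm[i], c, n) * k for c, k in counts.items()) / len(nbrs)
    # Eq. 3: group NE(j, i) by the community of neighbor j, average within
    # each group (assumed reading of ANE_c), then average across groups.
    by_comm = defaultdict(list)
    for j in nbrs:
        ne = neighbor_effect(adj, comm, n, j, i)
        if ne is not None:
            by_comm[comm[j]].append(ne)
    if not by_comm:
        return dne
    ine = sum(sum(v) / len(v) for v in by_comm.values()) / len(by_comm)
    return dne + alpha * ine


# Toy graph: two triangles (communities 0 and 1) joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
scores = {v: cca(adj, comm, 2, v) for v in adj}
```

In this toy graph the bridge node 2 receives a higher score (−1/3) than the interior node 0 (−1.25), matching the intuition that nodes with cross-community ties contribute less to polarization.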

CCA(i) has a value ranging from −1.5 to 1.5. CCA(i) is at its minimum (\(CCA(i) = -1.5\)) if all nodes in the immediate and two-hop neighborhood belong to the same community as node i. If all neighbors up to two hops away are in the node’s extreme opposite community, the cross-community affinity is at its maximum (\(CCA(i) = 1.5\)). Cross-community affinity can be aggregated at different granularities, from a single node to any grouping of nodes in the network, whether connected or not, by, for example, averaging the node-specific affinity. A node-specific cross-community affinity can tell whether the node contributes to the network polarization. The network-level polarization P can thus be obtained as the negative average cross-community affinity:
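As a quick check of these bounds, assume \(\alpha = 1/2\) (that is, h = 2), a weight of −1 for the node’s own community, and a weight of 1 for the extreme opposite community. In both extreme cases the direct and indirect neighbor effects take the same value, so Eq. 1 gives:

$$\begin{aligned} CCA_{\min } &= -1 + \tfrac{1}{2} \times (-1) = -1.5, \\ CCA_{\max } &= 1 + \tfrac{1}{2} \times 1 = 1.5. \end{aligned}$$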

$$\begin{aligned} P = - \frac{1}{|N|}\sum _{i \in N} CCA(i) \end{aligned}$$
(5)

where N is the set of nodes in the network. Appendix A.3 shows the different scenarios of a network and their respective CCA.
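The aggregation in Eq. 5 applies to any set of nodes, not only the whole network. The sketch below shows one way this could look in Python; the function names and the node-level scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def polarization(cca_values):
    """Eq. 5: negative mean of node-level CCA over a set of nodes."""
    values = list(cca_values)
    if not values:
        raise ValueError("cannot aggregate an empty set of nodes")
    return -sum(values) / len(values)


def polarization_by_community(cca_by_node, community_of):
    """Apply the same negative mean within each community."""
    groups = defaultdict(list)
    for node, value in cca_by_node.items():
        groups[community_of[node]].append(value)
    return {c: polarization(v) for c, v in groups.items()}


# Hypothetical node-level scores, for illustration only.
cca_by_node = {"a": -1.5, "b": -1.0, "c": 0.5, "d": -0.5}
community_of = {"a": "left", "b": "left", "c": "right", "d": "right"}
network_p = polarization(cca_by_node.values())
community_p = polarization_by_community(cca_by_node, community_of)
```

Keeping the node-level scores and aggregating afterwards is what allows the same values to be reported per node, per community, or for the whole network.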

4 Empirical Evaluation

We evaluate our proposed metric on networks with different numbers of ideological groups. We use three datasets: Polblogs [3] and the White Helmets Twitter interaction network [19], which each have two antagonistic communities, and the VoterFraud2020 domain network [18], which has five communities.

4.1 Datasets

The Polblogs network [3] is a publicly available network of hyperlinks between political blogs in the period leading up to the 2004 United States presidential election. Each node in this network is labelled as either conservative (right) or liberal (left). Edges represent interactions between blogs, such as citations and blogroll links. We treat the network as an undirected labelled network.

The White Helmets Twitter dataset is an interaction network [19] based on tweets about the White Helmets from April 2018 to April 2019. Each node in this network is labelled as either pro-White Helmets or anti-White Helmets.

The VoterFraud2020 domain network [18] is derived from the publicly available VoterFraud2020 dataset [2], a Twitter dataset related to voter fraud claims about the 2020 US presidential election. In this network, nodes are the web domains of URLs posted in tweets, and links connect domains that were tweeted by the same user. This network of websites is structurally divided into communities. Each node is labeled based on its media bias and credibility using the publicly available source Media Bias Fact Check (MBFC) [1]. The labels are: right, right-center, center, left-center, and left. However, after this labeling step, 75.6% of the nodes remained unknown because they are not included in the MBFC database. To assign labels to the ‘unknown’ nodes, we relabelled them with the dominant label in the node’s direct neighborhood. That is, we started with the unlabeled nodes with the largest proportion of labeled nodes in their one-hop neighborhood and gave them the majority label. We applied this procedure recursively until all nodes were labeled. The edge distribution of this network is depicted in Appendix A.1. Table 1 shows the network properties of the Polblogs, White Helmets Twitter, and VoterFraud2020 domain networks. Appendix A.2 visualizes these datasets.
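The relabelling procedure described above can be sketched as follows. This is a minimal illustration under assumptions that the text does not specify: nodes are labeled one at a time, ties are broken by sorted node order, and nodes with no labeled node reachable are left unlabeled. The function name `propagate_labels` and the toy graph are hypothetical.

```python
from collections import Counter


def propagate_labels(adj, known):
    """Fill in missing labels by repeated one-hop majority vote.

    adj: dict mapping node -> set of neighbors.
    known: dict mapping node -> label for the already labeled nodes.
    Returns a new dict; the input dict is not modified.
    """
    labels = dict(known)
    unlabeled = sorted(set(adj) - set(labels), key=str)

    def labeled_share(node):
        nbrs = adj[node]
        return sum(n in labels for n in nbrs) / len(nbrs) if nbrs else 0.0

    while unlabeled:
        # Pick the unlabeled node whose neighborhood is most fully labeled.
        best = max(unlabeled, key=labeled_share)
        if labeled_share(best) == 0.0:
            break  # assumption: nodes with no labeled neighbor stay unlabeled
        votes = Counter(
            labels[n] for n in sorted(adj[best], key=str) if n in labels
        )
        labels[best] = votes.most_common(1)[0][0]
        unlabeled.remove(best)
    return labels


# Toy example: "b" and "c" start without a label.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d", "e"}, "d": {"c"}, "e": {"c"}}
known = {"a": "left", "d": "right", "e": "right"}
completed = propagate_labels(adj, known)
```

Here "c" is labeled first because two of its three neighbors are already labeled, and its new label then takes part in the vote for "b".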

Table 1. Network properties of the Polblogs, White Helmets Twitter, and VoterFraud2020 domain networks.

4.2 Cross-community Affinity in the Polblogs Network

The Polblogs network has two communities: conservative (right) and liberal (left). Only 9% of the edges connect the two communities, and 50.9% of the nodes (623 nodes) have connections to the opposite community. We first define the ideology-based distances between these two communities. As discussed in the preceding section, the ideology-based distance within the same community is −1. In contrast, a connection to the most polar community gets the maximum weight of 1. In the Polblogs network, the conservative-conservative and liberal-liberal weights are therefore −1, and the weight between conservative and liberal is 1.

We compute the average cross-community affinity value across each community and the entire network to determine cross-community affinity at the community and network levels. Using Eq. 5, the polarization score of the conservative community is 1.13, that of the liberal community is 1.0, and that of the network is 1.07. These values indicate that the communities and the whole network are polarized. One of the key benefits of a metric that captures polarization at the node level is that we can determine which nodes contribute to the polarization (Appendix A.2, Fig. 3a).

Next, we evaluate how the metric works on a random graph. Our intuition is that randomizing the network should reduce polarization [20]. We generate a set of random networks using the dK-series [16]. The dK-series generates random graphs that preserve prescribed properties of the original graph. 0K (d = 0) creates an Erdős-Rényi network with the same average node degree as the original graph. 1K (d = 1) creates the configuration model, fixing the degree sequence of the original graph. Compared to the original polarization score of 1.07, the average polarization value is 0.58 for the generated 0K graphs and 0.02 for the 1K graphs. These networks have lower polarization scores than the original Polblogs network, which means that they are less polarized. This observation gives us confidence that the proposed metric reflects community structure rather than artifacts of density or degree distribution.
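The experiments above rely on dK-series generators [16]. As a rough standard-library stand-in, and not the tooling used in the paper, the two null models can be approximated as follows: a uniformly random graph with the same number of nodes and edges for 0K, and repeated degree-preserving double edge swaps for 1K. All function names and the toy edge list are hypothetical.

```python
import random
from itertools import combinations


def random_0k(n_nodes, n_edges, rng):
    """0K-style null model: same node and edge counts, hence same mean degree.

    Enumerates all node pairs, so it is only suitable for small graphs.
    """
    return set(rng.sample(list(combinations(range(n_nodes), 2)), n_edges))


def random_1k(edges, n_swaps, rng):
    """1K-style null model: double edge swaps that keep every node's degree."""
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return present


def degrees(edges):
    """Degree of every node that appears in the edge set."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


rng = random.Random(0)
original = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}
g0 = random_0k(6, len(original), rng)
g1 = random_1k(original, 100, rng)
```

Node labels are kept fixed while the edges are rewired, so the polarization score can be recomputed on each randomized graph and averaged over many samples.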

4.3 Cross-community Affinity in the White Helmets Twitter Interaction Network

We conducted a similar experiment on the White Helmets Twitter network. Around 73% of users are anti-White Helmets, and 27% are pro-White Helmets. The community sizes are thus far more imbalanced than in Polblogs, where the two communities are of almost equal size (52% and 48%). Only 0.3% of the connections are between anti-White Helmets and pro-White Helmets users, and only 0.2% of users interact with the opposite community. As in the Polblogs experiment, we used −1 for the ideology-based distance within the same community and 1 for the opposing communities. The polarization score for the network and for each community is 1.49. The score indicates that the network is highly polarized.

Table 2. Comparison of polarization value calculated by P, PI, and RWC

Next, we created a set of 0K and 1K graphs for the White Helmets Twitter dataset. The average polarization score is 0.94 for the 0K graphs and 0.35 for the 1K graphs. Consistent with the Polblogs results, the random graphs generated for White Helmets also yield lower polarization scores, indicating that they are less polarized.

4.4 Cross-community Affinity in the VoterFraud2020 Domain Network

Next, we conduct the experiment on the VoterFraud2020 domain network with five communities. Given that there are five communities in the network, the distance between two adjacent communities is defined as \(1/(|C|-1)\), or 1/4 (Appendix A.4). The polarization scores computed using Eq. 5 for each community are: right: 0.88; right-center: −0.34; center: −0.19; left-center: 0.62; left: 0.05; and for the entire network: 0.61. The right-center and center communities are less polarized than the other communities. That is because the right-center community has a comparatively higher number of edges to the right community. Similarly, the center community contains more links to left-center and right. Although the network-level polarization score shows that the network is polarized, the community-level polarization scores reveal that two communities do not contribute to the polarization state of the network. Using only a network-level polarization score, it is impossible to determine how different communities contribute to polarization, thus obscuring information that might be useful in limiting damage or directing intervention.

We also created random networks via dK-distributions, generating a set of random graphs with the same number of nodes and the same average degree (0K) and a set with the same degree sequence (1K). The average polarization score is \(-0.06\) for the 0K graphs and 0.02 for the 1K graphs, compared to 0.61 for the original network. The polarization value dropped for the random networks even when the network’s degree sequence was preserved. More experiments on the VoterFraud2020 domain network are shown in Appendix A.5.

4.5 Comparison with Existing Polarization Metrics

In this section, we compare our cross-community affinity metric with two widely used polarization metrics: Guerra’s polarization index (PI) [14] and the random walk controversy score (RWC) [13]. The RWC score has been described as state-of-the-art [10, 22]. The range of polarization values is −0.5 to 0.5 for PI and −1 to 1 for RWC. Our metric P ranges from −1.5 to 1.5. For all three metrics, the higher the value, the higher the polarization. Table 2 shows the polarization values calculated using P, PI, and RWC. We can see that polarization values decrease consistently for the Polblogs random networks. The PI value for White Helmets-0K, however, increased compared to the original White Helmets dataset. This shows that PI failed to capture the randomness of the network. P and RWC show a consistent drop in value, indicating that random networks show low or no polarization. The results also show that our metric behaves consistently with the current state-of-the-art metric, RWC. PI and RWC are reported as N/A for VoterFraud2020 because they do not apply to networks with more than two communities.

According to Salloum et al. [20], RWC displays a severe problem related to hubs. RWC captures how likely a random user on either side is to be exposed to an authoritative user (a higher-degree node) from the opposing side. Even in a non-polarized network, one or more hubs can keep the random walker confined to its community, producing a high polarization value. CCA calculates a polarity score for each node separately, so having one or more hubs will not affect our metric. Another issue with RWC is that we need to specify the parameter ‘k’, which represents the number of authoritative users in each group. In our experiments, we noticed that the same graph produces different polarization scores when the value of ‘k’ changes, so RWC needs to be used with care. Another limitation of RWC, acknowledged by its authors [13], is that it reports a low controversy score for the Karate Club network with 34 nodes and 78 edges. The authors mention that the graph may be too small for random-walk-based measures to function correctly. According to the literature, the RWC score for the Karate Club network is 0.11, whereas our polarization metric yields 1.02. Our polarization metric thus performs appropriately on small networks.

5 Summary

This paper proposes the cross-community affinity polarization metric as a new way to measure polarization. The cross-community affinity is a heterophily-based measure that captures the connectedness of nodes to groups other than their own. It has two specific goals. First, it adapts to a different number of ideological groups. Second, it applies to various levels of granularity, ranging from individual nodes to entire networks, as well as other network-based groups in between. The network-level polarization score can be obtained as the negative of the average cross-community affinity. We evaluated our proposed metric on networks with multiple ideological groups. In addition, we compared these networks to randomized versions generated using dK distributions. The results show lower polarization values for the randomized networks. We also compared our metric with two widely used existing polarization metrics.

Our work has limitations that merit mentioning. First, for simplicity we place ideological differences in a one-dimensional space. Second, our metric is currently tailored to undirected, unweighted networks. Addressing both is an essential agenda item for future research. With a metric that captures polarization at the node level, it is possible to determine which nodes or communities contribute to the polarization. Assessing how distinct communities contribute to polarization thus becomes feasible, providing knowledge that may be valuable for limiting damage or directing intervention.