1 Introduction

The study of socio-political dysfunctions or disorders unfolding in digital social media and social networks [7] has risen to prominence in the past decade, including studies of algorithmic bias [3], extremism [20], and echo chambers [5]. These studies hinge on ontologies for the political positions or stances of online users. Bakshy et al. [3], for example, classified users and content on Facebook as Democrat- or Republican-leaning to analyze cross-cutting recommendations, and Barbera et al. [5] positioned Twitter users on continuous liberal-to-conservative scales to investigate so-called echo chambers. These categorical (e.g., user classification) or geometrical (e.g., opinion scales) ontologies have been leveraged in several studies. In many settings, including that of the US, categorical approaches are limited to binary classifications, while geometrical ones are often limited to one-dimensional opinion scales. This stems from the fact that the US is in essence a two-party system, and that the US has undergone several decades of issue alignment [22]: the alignment of positions on several issues, resulting in highly correlated views on gun control, abortion, and racial issues, among several others [18]. Methods for embedding social networks in higher-dimensional opinion spaces (with dimensions standing for opinion indicators on different issues of social cleavage) have been developed only recently. These methods make it possible to tackle, for example, several European socio-political settings, known to require higher dimensionality [2] to account for observed social choice data, from roll call voting [10] to online social network activity [28]. Accordingly, recent methods have proposed to embed online social graphs in empirical geometrical spaces of several dimensions, where dimensions stand for indicators of opinions on immigration, income redistribution, or ideological stances such as left- and right-wing positions [27].
In settings that, like that of the US, display a dominant dimension of social cleavage, methods for mining additional dimensions are comparatively less developed. In this context, cleavages refer to historically determined political, social or cultural dimensions that divide individuals in society into groups that structure political debate [15]. The study of online social networks in higher opinion dimensions nonetheless calls for renewed efforts, as recent research highlights the importance of new additional cleavages, in particular those relating to anti-elite sentiment and populism, as driving forces in many relevant political and informational phenomena in the US [32].

This article builds on recent ideological scaling [14] and the aforementioned graph embedding methods for spatializing social graphs in multi-dimensional ideological spaces [27]. Exploiting graph embedding and NLP methods, it proposes a method to mine cleavage dimensions linked to cultural, policy, social, and ideological groups and preferences in social graphs. The method is applied to Twitter social graph data of nearly 2 million users strongly connected to the online political debate in the US. The proposed method shows that several cleavage dimensions traditionally considered in social network analysis (e.g., conservatism, gun control, patriotism, religion) are indeed strongly related, as most studies find, and it further makes it possible to quantitatively measure the relative alignment of these issues. More importantly, the method identifies non-aligned dimensions that research in political and social science has hypothesized and put forward in recent years as emerging cleavages, and it computes the positions of large numbers of users along them. These results extend previous research showing that it is possible to spatialize users in the US in several dimensions that are not highly correlated. These results also shed light on the ubiquitous practice of analyzing social networks with ideological scaling methods, by showing ways to assess its validity and to improve existing methods. While traditional ideological scaling methods yield a scale that is highly correlated with Democrat-Republican cleavages (resulting from an issue alignment process), the party cleavage specifically can be more accurately determined among many other aligned cleavages.

2 Related Work

Binary categorical classification of social media and network users has been addressed in numerous works, from the exploitation of self-reporting and surveys [3] to sophisticated methods using neural networks on heterogeneous graphs [33]. More related to this article, ideological spatialization of relational behavioral data can be traced back at least to the Nominal Three-Step Estimation (NOMINATE) method [21]. The NOMINATE method is used to position parliamentarians and bills on one-dimensional liberal-conservative scales. It assumes (1) the existence of unobservable geometrical ideological parameters or positions, and (2) an underlying homophilic generative process (i.e., parliamentarians vote to approve bills that are ideologically close to them in the unobservable space), and it uses roll call data for Bayesian inference of the unobservable parameters. This process, called ideological scaling, has been used in numerous settings involving social choice: court rulings, campaign donations, parliamentary votes, and social network activity, among others [14]. In its most widespread form, the homophilic probabilistic generative process is modeled as [4]:

$$\begin{aligned} P\left( i \leftarrow j | \alpha ,\beta , \mathbf {\phi }_i, \mathbf {\phi }_j \right) = \text {logit}^{-1}\left( \alpha -\beta \Vert \mathbf {\phi }_i - \mathbf {\phi }_j \Vert ^2 \right) , \end{aligned} \tag{1}$$

where the probability of observing user j interacting with user i (i.e., \(i \leftarrow j\)) depends on position and scale parameters \(\alpha \) and \(\beta \), and, most importantly, on the distance between the unobservable positions \(\mathbf {\phi }_i\) and \(\mathbf {\phi }_j\) of users i and j. Social choice data (i.e., pairs \(i \leftarrow j\)) can then be used to infer positions \(\mathbf {\phi }_i\) for all users i. A large number of works assume that one dimension is enough to retrieve the main social cleavage in the US, implicitly assuming it is the liberal-conservative one, and use social network choice data (e.g., Facebook likes [8] or Twitter following links [4]) to compute the position of users in this dimension. These works often rely on ex post validation using text cues to argue that the latent dimension indeed represents the political positions of users. Multi-dimensional inference for \(\mathbf {\phi }_i\) can be achieved in a computationally tractable manner with Correspondence Analysis [12], as it has been shown to approximate the inference of the unobservable parameters of (1), both theoretically [16] and empirically [5].
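As a concrete reading of (1), the following minimal sketch evaluates the interaction probability for two pairs of users. The positions and the values of \(\alpha \) and \(\beta \) are illustrative assumptions, not estimates from this article, and the function name `link_probability` is hypothetical.

```python
import math

def link_probability(phi_i, phi_j, alpha, beta):
    """P(i <- j) under Eq. (1): inverse logit of alpha - beta * ||phi_i - phi_j||^2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(phi_i, phi_j))
    return 1.0 / (1.0 + math.exp(-(alpha - beta * sq_dist)))

# Illustrative parameter values and positions (assumptions, not fitted values).
ALPHA, BETA = 1.0, 2.0
p_close = link_probability([0.1, 0.0], [0.2, 0.1], ALPHA, BETA)   # nearby users
p_far = link_probability([0.1, 0.0], [1.5, -1.0], ALPHA, BETA)    # distant users

print(f"close pair: {p_close:.3f}, distant pair: {p_far:.3f}")
```

With \(\beta > 0\), the probability decreases as the squared distance grows, which is the homophily assumption that ideological scaling inverts to recover positions from observed interactions.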

Because (1) depends on the unobservable parameters \(\mathbf {\phi }\) only through pairwise distances, their inference is invariant to isometric transformations. In particular, invariance to rotations means that the retrieved dimensions cannot be assured to be aligned with the social cleavages that might be structuring social choice. In European settings, using the position of referential users such as politicians of known political parties, and party positions in reference issue spaces (provided, e.g., by political polls or surveys), it has been shown how inferred dimensions display different levels of alignment with issues of public political debate [28]. This means that, in general, it cannot be assured that a single-dimensional ideological scaling model will yield a political opinion scale completely aligned with some presumed main left-right or liberal-conservative cleavage. Using the position of several political parties, this fact has been leveraged to embed large numbers of users in a multi-dimensional space where dimensions stand for identifiable and separate political issues, not requiring ex post interpretation or validation [27]. Because the US is in essence a two-party system, and because of the difficulty in obtaining reference users that have both (1) known positions \(\mathbf {\phi }\) in latent space and (2) positions known through political surveys (in particular for issues that might not be highly correlated with liberal-conservative cleavages), applying similar methods has proved more challenging in this setting. This article thus proposes a method for constructing such referential positions for large numbers of users, relating to hypothesized cleavages, based on text descriptions written by users in their online profiles on social network sites.
Using both multidimensional scaling and embedding based on (1) and NLP methods for constructing groups of referential users, this article proposes a method to mine several spatial opinion dimensions and, most importantly, emergent opinion dimensions that are not highly correlated with liberal-conservative cleavages.
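The rotational invariance discussed in this section can be checked numerically. The sketch below, using made-up two-dimensional positions, rotates every position by the same angle and compares the squared pairwise distances that enter (1); all names and values are illustrative assumptions.

```python
import math

def rotate(point, theta):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def sq_dist(p, q):
    """Squared Euclidean distance, the only quantity Eq. (1) uses from positions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical latent positions of three users.
positions = [(0.3, -1.2), (1.0, 0.4), (-0.7, 0.9)]
rotated = [rotate(p, math.pi / 5) for p in positions]

pairs = [(i, j) for i in range(3) for j in range(i)]
original_d = [sq_dist(positions[i], positions[j]) for i, j in pairs]
rotated_d = [sq_dist(rotated[i], rotated[j]) for i, j in pairs]

print(original_d)
print(rotated_d)
```

The coordinates of every user change, yet the distances, and therefore the likelihood of any observed data under (1), do not. This is why the inferred axes need external reference points before they can be read as specific cleavages.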

3 Social Network Data

Following multidimensional ideological scaling works in Europe [28] and in the US [5], we select a bipartite sub-graph of the Twitter social graph. To capture online social choices that might reveal several social and political preferences, we take members of the US Congress as reference users. We manually annotated the Twitter accounts of 550 members of the 116th United States Congress (looking for verified accounts corresponding to each congressperson), and collected their 17 952 824 followers (collection performed using Twitter’s API on October 27th, 2020; see the Acknowledgements section for privacy-compliance information and references). To minimize the probability of followers being bots, we follow criteria adopted by several studies [23, 25, 28, 29] and identify followers with more than 25 followers (7 325 940) and followers that have posted at least 100 tweets (7 471 365). See [4] for further details on the rationale for these parameters. To identify users that are strongly connected to political debate and that follow spatial preference models, we identify followers that follow at least three members of Congress (3 846 925) [4]. We select the 1 821 272 unique followers that satisfy all three conditions.
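The three selection criteria can be summarized in a few lines. The sketch below is only an illustration on toy records: the field names (`followers_count`, `tweet_count`, `followed_members`) are hypothetical and do not reflect the actual data schema used for the collection, and the tweet threshold is read here as "at least 100".

```python
def passes_filters(user, min_followers=25, min_tweets=100, min_members_followed=3):
    """Return True if a follower record satisfies all three selection criteria."""
    return (
        user["followers_count"] > min_followers       # more than 25 followers
        and user["tweet_count"] >= min_tweets         # at least 100 tweets (assumed reading)
        and len(set(user["followed_members"])) >= min_members_followed
    )

# Toy follower records (hypothetical field names and values).
followers = [
    {"id": "u1", "followers_count": 300, "tweet_count": 1200,
     "followed_members": ["m1", "m2", "m3"]},
    {"id": "u2", "followers_count": 10, "tweet_count": 5000,
     "followed_members": ["m1", "m2", "m3", "m4"]},   # too few followers
    {"id": "u3", "followers_count": 80, "tweet_count": 150,
     "followed_members": ["m1", "m2"]},               # follows only two members
    {"id": "u4", "followers_count": 26, "tweet_count": 100,
     "followed_members": ["m2", "m5", "m9"]},
]

selected = [u["id"] for u in followers if passes_filters(u)]
print(selected)
```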

To establish reference points in latent space, we collect the text self-descriptions written by users in their Twitter profiles (also on October 27th, 2020). Out of 1 821 272 users, 1 442 716 had written a text entry in their Twitter profiles. This collection, performed in the days leading up to the 2020 US Presidential Election, allows us to investigate cleavages in candidate preferences. The strategy presented in this article consists in finding spatial dimensions capable of dichotomizing pairs of groups of users, where groups are identified by keywords in their profiles that are revealing of social cleavages. Following several studies, including [6, 21, 27, 32], we identify several possible cleavages of interest: party, candidate, racial cleavages, cleavages in regional politics (urban vs rural, or state vs federal positions), religious cleavages, cleavages relating to gun control, the long-standing issue of communism in the US, cleavages related to liberal “life-styles” [1] (e.g., homosexuality, feminism), welfare, military, patriotism, globalization, conspiracism and mistrust (including mistrust of the establishment and experts), and cleavages around attitudes towards elites. Table 1 summarizes the proposed issues of partition according to the surveyed literature, with the keywords identified for the classification of users in binary partitions. User classification relies on a keyword-based strategy of minimalist choice: keywords are not intended to portray the diverse forms in which positions on each issue are expressed, but rather to identify groups of users with low ambiguity of position. In addition to keywords, we rely on sentiment analysis of profile text to distinguish positive and negative mentions of keywords (using a pre-trained BERT base model for uncased words [11]), assigning to each profile text a sentiment from 1 (very negative) to 5 (very positive). We label text profiles as negative [\(-\)] if sentiment is equal to 1, and as positive [\(+\)] if sentiment is equal to 5.
In Table 1 we also distinguish users whose profiles are not negative. This is needed, for example, to exclude users that use the word “republican” in their profiles only to voice criticism (e.g., “I hate republicans!”).
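The labeling rule combining keywords and sentiment can be sketched as follows. The sentiment score is taken as a precomputed input (an integer from 1 to 5) and no model is called; the function name, the polarity strings, and the use of case-insensitive substring matching are assumptions made for illustration.

```python
def matches_query(profile_text, sentiment, keyword, polarity):
    """Check a profile against one group query: a keyword plus a sentiment condition.

    sentiment: integer from 1 (very negative) to 5 (very positive), precomputed.
    polarity:  "+" (sentiment == 5), "-" (sentiment == 1), or "non-negative" (sentiment != 1).
    """
    # Case-insensitive substring match (an assumption about how keywords are applied).
    if keyword.lower() not in profile_text.lower():
        return False
    if polarity == "+":
        return sentiment == 5
    if polarity == "-":
        return sentiment == 1
    if polarity == "non-negative":
        return sentiment != 1
    raise ValueError(f"unknown polarity: {polarity!r}")

# Toy profiles with hypothetical, hand-assigned sentiment scores.
critic = matches_query("I hate republicans!", 1, "republican", "non-negative")
supporter = matches_query("Proud Republican, father of three", 4, "republican", "non-negative")
print(critic, supporter)
```

Under this rule the critical profile is kept out of the "republican" group even though it contains the keyword, which is the case the non-negative condition is meant to handle.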

Table 1. Summary of the proposed issue partitions of users into minimal groups for mining spatial cleavage dimensions capable of classifying them. For each issue we identify two disjoint groups defined by queries of the Twitter profile text descriptions, including keywords (case insensitive, here all written in lowercase), and sentiment: positive [\(+\)], negative [\(-\)], and non-negative.

4 Homophily Network Embedding

To identify dimensions that might be revealing of cleavages, we first produce a multi-dimensional space embedding in which these dimensions might emerge as directions. For this, we take the bipartite social subgraph of the 550 members of Congress and their 1 821 272 followers to produce a homophily embedding, using Correspondence Analysis on the adjacency matrix to compute the values \(\phi \) of (1) following the procedure in [28, Section IV]. In this multi-dimensional space, dimensions \(\delta _i\) (\(i=1,2,...\)) are ranked according to the information they contain about the choices represented in the bipartite social graph, as measured by the inertia. Figure 1 shows the inertia of each dimension and their relative gain, showing that at most the first three dimensions are relatively more informative than the rest. Figure 1 also shows the embedding positions of both congressional members and followers, and the marginal density on these first three dimensions, estimated for the purposes of visualization with kernel density estimation. We can compute party positions as the mean position of the congressional members of each party. As expected, the first (most explanatory) dimension, \(\delta _1\), stands as a good candidate dimension for attitudes towards parties (or related cleavages, such as liberal-conservative). However, the question remains whether isometric transformations can improve the distinction made by any dimension between Democrat- and Republican-leaning followers. In particular, we do not know if a rotation might improve the ability of a classifier to distinguish between Democrats and Republicans on the first dimension. We know that \(\delta _1\) stands for a latent cleavage, and we know that it is highly aligned with Democrat-Republican cleavages, but we do not know if it is the best spatial direction for distinguishing these two groups.
More broadly, it is not trivial to attribute an inductive meaning to what \(\delta _2\) and \(\delta _3\) might stand for, or to any other space direction for that matter.
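To make the embedding step more tangible, the following sketch applies a textbook Correspondence Analysis (singular value decomposition of the standardized residuals) to a toy follower-by-member adjacency matrix using NumPy. It is an illustration under stated assumptions, not the exact procedure of [28, Section IV]: the toy matrix, the function name, and the choice of principal coordinates for both rows and columns are all assumptions.

```python
import numpy as np

def correspondence_analysis(adjacency, n_dims=3):
    """Textbook CA: returns row coordinates, column coordinates and inertia per dimension."""
    N = np.asarray(adjacency, dtype=float)
    P = N / N.sum()                        # correspondence matrix
    r = P.sum(axis=1)                      # row masses (followers)
    c = P.sum(axis=0)                      # column masses (members of Congress)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sing ** 2                    # principal inertia of each dimension
    rows = (U * sing) / np.sqrt(r)[:, None]      # principal coordinates of rows
    cols = (Vt.T * sing) / np.sqrt(c)[:, None]   # principal coordinates of columns
    return rows[:, :n_dims], cols[:, :n_dims], inertia[:n_dims]

# Toy bipartite graph: 6 followers (rows) x 4 members (columns), two loose blocs.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
follower_pos, member_pos, inertia = correspondence_analysis(toy)
print(np.round(inertia, 3))
print(np.round(follower_pos[:, 0], 3))
```

In this toy case the first dimension places the two blocs of followers on opposite sides of the origin, analogous to the role played by \(\delta _1\) above. Every row and column must have at least one link, since masses appear in denominators.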

Fig. 1. Multi-dimensional homophily embedding of the collected Twitter network. Dimensions ranked by inertia, and relative gain of including each dimension (top left). Scatter plot and estimated density of the conditional probabilities for the position of users in the first three dimensions (top right). Density of followers and positions of members of Congress colored by party, and party positions as mean of groups shown in the first three dimensions (bottom).

5 Mining Cleavage Dimensions

To extract cleavage dimensions as directions in this latent ideological space, we leverage our proposed set of binary labels from Sect. 3. We first illustrate this on party cleavages. Among our users, 7895 use the word “republican” and 14 481 the word “democrat” in their profile without negative sentiment. To measure the degree to which \(\delta _1\), \(\delta _2\) or \(\delta _3\) might be good candidate directions for distinguishing these two groups, we fit a logistic regression model on each dimension. With this model, we can determine which users are rightly or wrongly classified, and thus compute precision, recall and F1-score metrics. Figure 2 shows these values and the distribution of these two groups along \(\delta _1\), \(\delta _2\) and \(\delta _3\), showing that \(\delta _1\) is indeed the only dimension among the three to produce a meaningful distinction. Alternatively, instead of using a given dimension, we can fit a multivariate logistic regression model, and identify the direction perpendicular to the decision hyper-plane boundary. In the case of our three-dimensional model, the decision boundary is a plane and the direction a three-dimensional vector (see Fig. 2). This discovered direction separating these two groups of users is well aligned with \(\delta _1\), but it does not produce an improvement in the F1-score. Still, ideological scaling cannot rely on the a priori assumption that this will always be the case, as is standard practice in many works in several disciplines, especially in light of research suggesting a decline in left-right cleavages structuring social choice [13, 28, 32].
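The multivariate step can be illustrated on synthetic data. The sketch below fits a logistic regression by plain gradient descent with NumPy and takes the normalized weight vector as the direction perpendicular to the decision plane. The two synthetic groups, the optimizer, and its settings are assumptions for illustration and stand in for whichever solver was actually used.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights w and intercept b of a logistic regression by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic stand-in for two labeled groups in a 3-D latent space.
rng = np.random.default_rng(0)
group_a = rng.normal([-1.0, 0.0, 0.0], 0.7, size=(200, 3))
group_b = rng.normal([1.0, 0.2, 0.0], 0.7, size=(200, 3))
X = np.vstack([group_a, group_b])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = fit_logistic(X, y)
direction = w / np.linalg.norm(w)   # unit vector normal to the decision plane
coordinates = X @ direction         # position of each user along the mined direction

# Precision-recall summary of the fitted classifier.
pred = (X @ w + b) > 0
tp = np.sum(pred & (y == 1))
fp = np.sum(pred & (y == 0))
fn = np.sum(~pred & (y == 1))
f1 = 2 * tp / (2 * tp + fp + fn)
print(np.round(direction, 3), round(float(f1), 3))
```

Because the synthetic groups differ mostly along the first axis, the recovered direction is close to that axis, mirroring the observation that the mined party direction is well aligned with \(\delta _1\).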

Fig. 2. Distribution of users labeled as Republican- or Democrat-leaning according to their Twitter profile text descriptions along the first 3 latent space dimensions, and the accuracy of logistic regression models fitted on each dimension (left). Conditional distributions and positions of labeled users in three dimensions, and distributions along the direction perpendicular to the boundary of a multivariate logistic regression (right).

This method can be further used to test all the proposed binary groups of Table 1, hypothesized to represent potentially relevant cleavages. We fit a multivariate logistic regression model for each pair of groups, and measure the classification accuracy of the model, reported in Table 2, highlighting the cases with an F1-score greater than or equal to 0.6. As some pairs are highly imbalanced (e.g., for religious cleavages), we systematically sub-sample the majority group with a Near-Miss strategy [17]. Figure 3 shows a selection of groups of pairs of labeled users, with the decision boundary and discovered orthogonal direction of the fitted multivariate decision model. This selection highlights the different levels of accuracy of the multivariate logistic regression classifier, corresponding to different strengths of cleavages for the pairs in each labeled group, under the assumption that the chosen criteria identify a relevant group of users.
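The sub-sampling step can be sketched as follows. The text does not state which Near-Miss variant was used, so this example assumes a NearMiss-1-style rule: keep the majority-group points whose mean distance to their k nearest minority-group points is smallest, until both groups have the same size. The function name and toy points are hypothetical.

```python
import numpy as np

def near_miss_1(majority, minority, k=3):
    """Sub-sample the majority group down to the size of the minority group."""
    majority = np.asarray(majority, dtype=float)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(majority[:, None, :] - minority[None, :, :], axis=2)
    k = min(k, minority.shape[0])
    # Mean distance from each majority point to its k closest minority points.
    mean_knn = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(mean_knn)[: minority.shape[0]]
    return majority[keep]

# Toy positions: 5 majority users and 2 minority users.
majority_pts = [[0, 0], [1, 0], [2, 0], [5, 0], [9, 0]]
minority_pts = [[3, 0], [4, 0]]
kept = near_miss_1(majority_pts, minority_pts, k=2)
print(kept)
```

Selecting the majority points closest to the minority group concentrates the balanced sample near the boundary between the two groups, which is where the logistic regression decision plane is determined.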

Table 2. Groups of pairs of labeled users (according to criteria of Table 1), naming of the mined dimension perpendicular to the decision boundary of a multivariate logistic regression classification model, and the accuracy of the fitted model.

Fig. 3. Selection of groups of pairs of labeled users, with the decision boundary and orthogonal direction of the fitted multivariate decision model.

The identification of different plausible dimensions of social cleavage presents us with several interesting possibilities. First, the coordinate of each user along these newly identified directions can be used to disentangle issue alignment: e.g., we can produce separate and explicit dimensions for party cleavages and attitudes towards gun control. In contrast, \(\delta _1\) (usually reported in works using ideological scaling) is a proxy for party cleavages, but also for other positions on correlated issues (e.g., racial or religious issues, see Fig. 3) in non-explicit ways. Similarly, by inspecting the alignment between different cleavage dimensions we can identify and quantify this issue polarization. Figure 4 shows the mined cleavage dimensions (i.e., with F1-score \(\ge \) 0.6) and their pairwise angular distance, with metrical clusters computed with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [31]. The mined cleavages can be organized in four ideologies (in the sense of issue alignment): (1) a dominant ideology comprising party, candidate, and other stances correlated with \(\delta _1\), (2) an ideology separating people defining themselves using the words “local” and “global”, (3) an ideology separating those defining themselves using the words “welfare” and “libertarian”, and (4) an ideology separating those with positive and negative mentions of issues relating to sexual diversity and feminism, and the use of the word “communism”. Four directions cannot be perfectly orthogonal in three dimensions, but any two cleavages belonging to two different identified ideological groups display enough angular distance so as not to be considered highly correlated. Being able to disentangle issues in separate dimensions allows us to conduct different investigations by mapping the positions of actors along now identifiable axes.
Because we can also measure the position of reference users (politicians) in mined cleavage dimensions, we can investigate intra-party diversity on separate issues: e.g., support for their presidential candidate, or attitudes towards welfare, religious diversity, or diversity of views on racial issues. Figure 5 shows, for example, that Republicans are more heterogeneous in their support for Donald Trump than Democrats are in their support for Joseph Biden, both among the members of Congress (shown as crosses in Fig. 5) and among the followers (density shown in light blue in Fig. 5).
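The pairwise angular distance between mined directions, as used for Fig. 4, can be computed from normalized direction vectors. The sketch below uses three made-up direction vectors, so the names and values are purely illustrative, and it treats directions as oriented (angles between 0 and 180 degrees), which is an assumption.

```python
import numpy as np

def angular_distance_matrix(directions):
    """Pairwise angles in degrees between direction vectors (one per row)."""
    D = np.asarray(directions, dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)   # normalize each direction
    cosines = np.clip(D @ D.T, -1.0, 1.0)              # guard against rounding error
    return np.degrees(np.arccos(cosines))

# Hypothetical mined directions in a 3-D latent space.
names = ["party", "gun control", "local-global"]
directions = [
    [1.0, 0.0, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 1.0, 0.0],
]
angles = angular_distance_matrix(directions)
print(np.round(angles, 1))
```

In this toy example the first two directions are separated by a small angle (aligned issues), while the third is orthogonal to the first. A matrix of this kind is the natural input for average-linkage hierarchical clustering such as UPGMA.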

Fig. 4. Mined cleavage dimensions (left) can be organized in four ideologies (in the sense of issue polarization), shown in four groups in blue in the angular distance matrix (right).

Fig. 5. Density of Twitter users and positions of members of Congress along mined dimensions. The distribution of members of Congress shows the intra-party diversity of stances towards presidential candidates and welfare.
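The intra-party comparison shown in Fig. 5 amounts to projecting the positions of members of Congress onto a mined direction and comparing the spread of the projections by party. The sketch below illustrates this on invented positions; the values are not taken from the data and do not reproduce the reported result, and the use of the population standard deviation as a heterogeneity measure is an assumption.

```python
import statistics

def project(position, direction):
    """Scalar coordinate of a position along a direction (normalized internally)."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(p * d for p, d in zip(position, direction)) / norm

def spread_by_party(members, direction):
    """Population standard deviation of projected positions for each party."""
    projected = {}
    for party, position in members:
        projected.setdefault(party, []).append(project(position, direction))
    return {party: statistics.pstdev(values) for party, values in projected.items()}

# Invented positions of six members of Congress in a 3-D latent space.
members = [
    ("D", (-1.0, 0.10, 0.0)), ("D", (-1.1, 0.00, 0.0)), ("D", (-0.9, 0.05, 0.0)),
    ("R", (0.5, 0.00, 0.0)), ("R", (1.5, 0.20, 0.0)), ("R", (1.0, -0.10, 0.0)),
]
candidate_direction = (1.0, 0.0, 0.0)   # hypothetical mined direction
spread = spread_by_party(members, candidate_direction)
print(spread)
```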

6 Discussion and Conclusions

This article argued for the importance of re-examining the assumptions implicitly leveraged by studies that use ideological scaling, showing that the dominant latent dimension cannot be assumed to be the direction that is most aligned with traditional Democrat-Republican cleavages in the US. The degree of alignment between latent dimensions and different cleavages can and must be determined and measured. This article further presented a way of mining dimensions of social cleavage with explicit meaning using both network embedding and NLP methods. Furthermore, using this combination of methods, this article analyzed the case of a political Twitter friend network in the US, identifying the main dominant cleavage, but also additional ones hypothesized as relevant by recent studies in social sciences [32]. In particular, four ideologies, or bundled groups of cleavage dimensions, were identified. These groups of dimensions are not highly aligned with one another, and represent new cleavage dimensions that can be used in further studies. This method also offers the possibility of developing new applications for explicitly measuring issue polarization as the alignment of bundled social cleavages, as well as a method for projecting large numbers of users onto space dimensions with explicit meaning in terms of the issues for which they measure positive and negative views. This new possibility, it was argued, opens interesting paths for research, which was illustrated with a brief example. By measuring the positions of Democrat and Republican congressional members on both party cleavage and candidate cleavage dimensions, with data collected in the days leading up to the 2020 US presidential election, this article showed that Republicans display higher heterogeneity than Democrats in their support for their candidate. Beyond this example, many other studies could leverage these results and methods.
In particular, having multidimensional distributions of political attitudes could be leveraged in the study of social mobilization by analyzing the determinants and characteristics of political movements (see for example [10, 24, 26]). Additionally, by leveraging information consumption practices and media diets, attitudinal positions could be attributed to news media articles and outlets, allowing for the study of diversity, or lack thereof, in information consumption patterns [19, 30]. This, in turn, presents interesting possibilities for large-scale analysis of wide news and informational ecosystems [9].