1 Introduction

Political inclination refers to the political stance of an individual. Polling and surveying to understand the political leanings of people within a particular community, geopolitical region, or context is a common approach. However, the manual polling mechanisms used today are hard to scale. There is also a significant chance of biased sampling, as the samples are often small and localized. If, on the other hand, a survey or poll is conducted on an online social platform, it is impossible to control the distribution of voters so that it resembles a random sample of opinions. The voters in such polls are typically limited to the active audience of the pollster and often share the pollster's political inclination; thus, the result of the same poll can be completely different when it is posted by a different pollster. Therefore, algorithmic labeling of people drawn from a controllable distribution is preferable to bias-prone active participation solicited by influencers of a particular political inclination, or to cost-inefficient manual polling.

Most existing approaches to political inclination detection (PID) on social networks focus on probabilistic models [8, 9, 13, 14], which in turn rely on the texts generated by users. Researchers have also tried to exploit the network structure by using Graph Convolutional Networks (GCNs) [24]. This method uses all second-degree features (neighbors of neighbors of the node/user to be classified in the graph) for a rich representation, which makes classification more accurate at the expense of speed. The data collection process involves collecting features of the followers of the followers of the user whose political inclination needs to be detected. Since collecting followers and all their tweets is itself a slow process rate-limited by Twitter (see footnote 1), the time required to collect the features of the second-degree neighbours grows quadratically in the average number of unique neighbors per node.

Also, the GCN-based models need to store the huge Twitter subnetwork involving political figures and their followers. This severely violates the users' right to erasure under Article 17 of the GDPR (see footnote 2), which reads as follows:

[Article 17 of the GDPR, "Right to erasure", reproduced as an image in the original.]

Further, these models are trained on a huge number of annotated examples. This makes the approaches hard to scale to newer settings and countries. In contrast, we show that certain easy-to-collect features plugged into a novel self-attentive framework can predict political inclination very accurately, even when trained on a handful of annotated examples.

Our main contributions are as follows:

  • (1). Graph-based methods used previously raise many ethical questions [15, 17, 21, 23]. Users of social media platforms have the right (see footnote 2) to deletion of their data from other storage systems that depend on social media as a data source, whenever their public profile on the platform is deleted. Graph-based methods violate this by storing information such as retweets, mentions, likes, and the follower-followee network. Building and updating such networks is also very time consuming, as it requires daily monitoring of (i) the continued existence of each connection and (ii) the arrival of new connections; so, the only way to use these features at inference time is to store them permanently in memory. We eliminate the need to store such large relational graphs built from the past social media data of a huge number of users. We achieve this by using richer first-degree features, collected directly at inference time, that also capture information about second-degree neighbors available directly from the tweets of the user/person to be classified (e.g., the hashtags used by a retweeted user are readily available with the retweet, and the same holds for replies). Using smart augmentation of these features, we beat the performance of the graph-based approaches [24] at a reduced inference time.

  • (2). We propose a novel Fast Self-attentive Semi-supervised Political Inclination Predictor, FSSPIP (Fig. 1). The experimental results show that even without any gold annotation, we can achieve a high accuracy of \(\sim \) 94% using weak supervision. The model is highly scalable and free from manual intervention, unlike Darwish et al. (2020) [10], which needs human supervision or cluster inspection.

  • (3). We bring on board multiple additional datasets to show that our model can be used in many other similar settings for political inclination detection with a handful of labeled examples (or even none). Specifically, we present several case studies on media bias and political polarization using our classifier in zero-shot settings.

2 Related Work

Stefanov et al. (2020) [20] and Baly et al. (2020) [2] used Wikipedia, Twitter, YouTube, and other channels of information to detect the political leanings of media houses. This approach is not scalable in the context of individual persons. Conover et al. (2011) [9] used a corpus of 1,000 annotated data points to test supervised approaches based on bag of words. Iyyer et al. (2014) [13] used advanced neural techniques like RNNs on a labeled corpus of sentences taken from speeches of Democratic and Republican parliamentarians. Chen et al. (2017) [8] used graph-based approaches to show the efficacy of an opinion-aware knowledge graph. However, these techniques fail to take the richer network features into account. They also rely completely on annotated data, failing to take advantage of domain knowledge of the task at hand.

Aldayel et al. (2019) [1] analyzed the features responsible for higher accuracy in stance detection setups using network features, tweet texts, and text-derived features. Darwish et al. (2020) [10], on the other hand, used a clustering-based unsupervised setup to detect the stance of users, relying mainly on three channels of features: retweeted tweets, retweeted accounts, and hashtags. Xiao et al. (2020) [24] approached the same task by manually annotating and collecting a large Twitter dataset of politicians and non-politician social media users. They relied on variants of relational GNNs coupled with multi-task learning. However, given the need to explicitly store information in graph structures even after the training phase, graph-based algorithms often violate the privacy rights of a large section of users.

Therefore, in this paper, we attempt to solve political inclination detection in a resource-constrained setup with no storage of user data after model training. We use several task-dependent augmentation techniques and unsupervised learning methods that have not been used in this context earlier, thus making our model robust, easily adaptable, and scalable without any human help/supervision. We only use public data available at the time of inference.

3 Model Architecture

The Base Architecture: Like previous state-of-the-art approaches [24], we use a GCN-like framework. In contrast to them, however, we neither store the user/feature graphs, nor need a list of politicians in the country of the users to be classified, nor require a huge set of labelled examples. We use a long list of feature types derived from follows, mentions, replies, retweets, tweets, and likes. We hypothesize that many important but easy-to-collect features, which can be retrieved from the web directly at inference time with no need for storage, suffice for a good representation of political inclination. We describe these features in detail below:

Fig. 1. The FSSPIP base architecture: the Twitter profile of a user is taken as input, from which 22 different feature types are extracted and processed to predict political inclination.

Base Features

User Descriptions: We collected the descriptions (bios) of the users retweeted and quoted, forming two separate documents. These user descriptions often contain key information such as the user's occupation, gender, religion, etc.

Hashtags: Hashtags are important as similar hashtags are used to express opinions for/against a polarizing topic by users of different leanings.

Mentions: IDs mentioned in tweets are used as features.

Media Domains: It is no secret that users of different political leanings share different sets of news items that fit their ideological perspective. Considering their importance in our task, we collect domain names and domain + co-domain names from users’ tweets. We use them as separate features.

Textual Content: We use pre-trained models like BERTweet [18], and Google’s Universal Sentence Encoder [4] to convert the content of tweets of a user into embeddings. In our experiments, we found that BERTweet performs better (possibly because BERTweet is trained on text with vocabulary more similar to ours). Thus we report BERTweet numbers only.
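As an illustration of this step, below is a minimal sketch of obtaining a single BERTweet embedding for a user's tweets using the Hugging Face checkpoint; the mean-pooling over tokens and over tweets is our assumption, since the paper does not specify the pooling strategy.

```python
# Hedged sketch: turning a user's tweets into one BERTweet embedding.
# Pooling (mean over tokens, then mean over tweets) is our assumption;
# the paper only states that BERTweet embeddings are used.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
encoder = AutoModel.from_pretrained("vinai/bertweet-base")
encoder.eval()

def embed_tweets(tweets, max_length=128):
    """Return one d_em-dimensional vector summarizing a list of tweets."""
    batch = tokenizer(tweets, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (n_tweets, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    per_tweet = (hidden * mask).sum(1) / mask.sum(1)     # mean over tokens
    return per_tweet.mean(0)                             # mean over tweets -> (768,)

# Example: user_vec = embed_tweets(["I voted today!", "Great rally in Ohio"])
```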

Features Connected to Neighborhood:

We have described a total of 6 features so far. We compute the same features separately for retweets and for replies. So, the total number of features becomes 6+6+6=18.

In addition to all these features, we also use friend ids, follower ids, mention ids, ids replied to, and ids retweeted as features collected at test time.

So, in total we use 18+4=22 features.

Attending to Different Modalities

We use \(|R|=22\) feature types in our architecture. For each feature type r and user i, we obtain an embedding \(e_{ir}\) of size \(d=8\) as follows.

$$\begin{aligned} e_{ir}&= W_r \times \textrm{BERTweet}(T_{ir}), \ \text {if} \ r \in T \\&= W_r A_{ir} H_r, \ \text {if} \ r \in T' \end{aligned}$$
(1)

where \(A_{ir} \in \{0,1\}^{1 \times Vlen_r}\) is the feature presence-absence vector for the \(r^{\textrm{th}}\) feature type and \(H_r \in \mathbb {R}^{Vlen_r \times d}\) is the embedding matrix containing the embeddings of all features of feature type r. During pre-processing, we keep only those features in the vocabulary that appear in at least five instances of the training data, to ensure enough training instances per feature. The vocabulary length of the \(r^{\textrm{th}}\) feature type is denoted \(Vlen_r\). \(T'\) and T are the sets of non-textual and textual feature types, respectively. For each textual feature type r, \(T_{ir}\) denotes the textual content of that feature for user i; \(W_r \in \mathbb {R}^{d \times d_{em}}\), where \(d_{em}\) is the embedding dimension of the output of BERTweet [18]. We then calculate \(h_i\), the final embedding for the \(i^\text {th}\) user, as follows.

$$\begin{aligned} \small h_i=\sum _r \alpha _{ir} \times \frac{e_{ir}}{|e_{ir}|} \end{aligned}$$
(2)

FSSPIP uses a dynamic dot-product self-attention mechanism to calculate the weights for each of the feature types and finally obtain a weighted sum of the normalized embeddings of all feature types. We use learnable parameters \(p,q,k \in [0,1]\) to allow some flexibility in the attention calculation. Learnable parameters \(q_r\) and \(k_r \in \mathbb {R}^d\) are the query and key, respectively, for each feature type r (here, a feature type is a specific social media attribute, so the collection of hashtags coming from tweets is a feature type different from the collection of hashtags coming from retweets/replies; please refer to the list of features mentioned at the start of the section for a broader understanding). So,

$$\begin{aligned} \alpha _{ir} =p*\frac{e^{q_{ir} \times k_{ir}}}{\sum _r e^{q_{ir} \times k_{ir}}}+(1-p)*|e_{ir}| \end{aligned}$$
(3)
$$\begin{aligned} q_{ir} =q \times e_{ir}+(1-q)\times q_r \end{aligned}$$
(4)
$$\begin{aligned} k_{ir} =k \times e_{ir}+(1-k) \times k_r \end{aligned}$$
(5)

An illustration of this base architecture is presented in Fig. 1. It shows how the input from each feature type goes through a different transformation function (BERTweet in the case of textual data, trainable embeddings in the case of follower ids, etc.) to become an embedding, and how these embeddings are then weighted by attention values calculated through the attention schemes described in this paper. The weighted sum of the embeddings (a vector of size 768) denotes the representation of the node/person to be classified. This embedding is further multiplied with a vector of size 768\(\,\times \,\)1 and passed through a sigmoid function to obtain the probability of the person being a Republican in a binary classification setup. We use binary cross entropy as the loss function for supervision.
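To make the fusion step concrete, the following is a minimal PyTorch sketch of Eqs. 2–5, assuming the per-feature-type embeddings \(e_{ir}\) have already been computed via Eq. 1; the clamping of p, q, k to [0, 1], the initialization, and the final linear head are our assumptions, not the authors' released code.

```python
# Minimal sketch of the FSSPIP fusion layer (Eqs. 2-5); dims and head are assumptions.
import torch
import torch.nn as nn

class FSSPIPFusion(nn.Module):
    def __init__(self, n_types, d=8):
        super().__init__()
        self.q_r = nn.Parameter(torch.randn(n_types, d))   # per-type query (Eq. 4)
        self.k_r = nn.Parameter(torch.randn(n_types, d))   # per-type key   (Eq. 5)
        self.p = nn.Parameter(torch.tensor(0.5))            # mixing scalars in [0,1]
        self.q = nn.Parameter(torch.tensor(0.5))
        self.k = nn.Parameter(torch.tensor(0.5))
        self.clf = nn.Linear(d, 1)                           # sigmoid classification head

    def forward(self, e):                    # e: (batch, n_types, d), one e_{ir} per type
        norm = e.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        e_hat = e / norm                                     # normalized embeddings
        p, q, k = (x.clamp(0, 1) for x in (self.p, self.q, self.k))
        q_ir = q * e + (1 - q) * self.q_r                    # Eq. 4
        k_ir = k * e + (1 - k) * self.k_r                    # Eq. 5
        scores = (q_ir * k_ir).sum(-1)                       # dot product per feature type
        alpha = p * scores.softmax(dim=-1) + (1 - p) * norm.squeeze(-1)   # Eq. 3
        h = (alpha.unsqueeze(-1) * e_hat).sum(dim=1)         # Eq. 2
        return torch.sigmoid(self.clf(h)).squeeze(-1)        # P(Republican)
```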

4 Augmented Semi-supervision for Superior Representation Learning

To make our architecture ready for few-shot learning, we make the model robust using regularization and multi-task learning. We also use weak supervision, which produces high accuracy without any labelled examples. Specifically, we use three categories of techniques, described below.

Dynamic Augmentation

Mixup: Mixup [25] is a technique that enforces a linear change in output given a linear change in input by training a neural network on convex combinations of pairs of examples and their annotated labels for a particular task. We adapt the method to our network data by mixing two random users for each channel (e.g., the hashtags, domains, and retweetees of both users are present in the augmented user), increasing the diversity of data points and regularizing the model for unseen data points.

Sampling: Twitter users can be imagined as generative agents who tweet on selected issues and follow/reply to/mention/interact with other users according to some implicit probability distribution. If some of the points drawn from this distribution are removed uniformly at random, the distribution itself does not change. We therefore uniformly sample out features from labeled examples for augmentation (the masking rate is chosen from a uniform distribution for each feature type), masking out 0–15% of the features at random during training.

Feature Channel Dropout: While some feature types may influence the result more than others, it is important to learn to predict from the available cues when an influential feature type (e.g., hashtags, followers, retweets, etc.) is absent. So, we randomly drop whole feature types during training for better performance through adversarial training.
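The following is a minimal sketch of the three dynamic augmentations above, assuming a user is represented as a dictionary mapping feature channels to lists of tokens; the exact masking rates per channel and the handling of labels under mixup are assumptions based on the text.

```python
# Hedged sketch of mixup, feature sampling, and feature channel dropout on a user
# represented as {channel name -> list of tokens such as hashtags or ids}.
import random

def mixup_users(user_a, user_b):
    """Merge two users' channels into one augmented user (label mixing is handled
    separately in the loss, as in standard mixup)."""
    return {ch: user_a.get(ch, []) + user_b.get(ch, [])
            for ch in set(user_a) | set(user_b)}

def sample_features(user, max_mask=0.15):
    """Uniformly mask out 0-15% of the tokens in every channel."""
    out = {}
    for ch, toks in user.items():
        rate = random.uniform(0.0, max_mask)        # per-channel masking rate
        out[ch] = [t for t in toks if random.random() >= rate]
    return out

def channel_dropout(user, drop_prob=0.1):
    """Randomly drop whole feature channels (e.g. all hashtags)."""
    return {ch: toks for ch, toks in user.items() if random.random() >= drop_prob}

# Example training-time pipeline:
# augmented = channel_dropout(sample_features(mixup_users(u1, u2)))
```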

Weak Supervision: We hypothesize that the followers/retweeters of a particular political party often share that party's political inclination. That is, they are statistically more likely to lean toward the party they follow on social media than toward any other. This provides silver labels in the Twitter space for weak supervision. We crawled the Twitter handles of each political party (i.e., the official Twitter handles of the Democratic & Republican parties in the case of the US, and of AAP, Congress & BJP in the case of India) to collect the last 75,000 followers (set heuristically to contain enough examples) and the last 75,000 retweeters for each party. We randomly selected 2,500 from each pool to obtain a sample representative of the timeline (since the most recent followers appear first, collecting a big pool and resampling helps) and collected their relevant data for training. Users following both parties were removed.
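A rough sketch of the silver-label construction is given below; `fetch_recent_followers` is a hypothetical placeholder for the Twitter API calls and the handle names are illustrative assumptions, while the 75,000/2,500 pool and sample sizes follow the text.

```python
# Hedged sketch of silver-label construction from party followers.
import random

PARTY_HANDLES = {"democrat": "TheDemocrats", "republican": "GOP"}   # assumed handles
POOL_SIZE, SAMPLE_SIZE = 75_000, 2_500

def fetch_recent_followers(handle, limit):
    """Hypothetical helper: ids of the `limit` most recent followers of `handle`."""
    raise NotImplementedError

def build_silver_labels():
    pools = {party: set(fetch_recent_followers(h, POOL_SIZE))
             for party, h in PARTY_HANDLES.items()}
    # Users following both parties carry ambiguous signal and are removed.
    overlap = set.intersection(*pools.values())
    silver = {}
    for party, ids in pools.items():
        usable = list(ids - overlap)
        for uid in random.sample(usable, SAMPLE_SIZE):   # resample to cover the timeline
            silver[uid] = party
    return silver
```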

Self-supervision: Self-supervision is a semi-supervised learning technique that trains the model on a dummy task of predicting part of the input data from the rest [18]. While masked language modelling and next sentence prediction [11, 18] are the most frequently used pre-training techniques for textual data, graph neural nets use the prediction of masked edges between nodes as the pre-training task. Following these methods, we pretrain our model on the task of predicting the non-textual features that are masked during the sampling step of the dynamic augmentation phase. We use self-supervision as a pre-training method when performing few-shot learning and later fine-tune on the annotated data points. Hyperparameter details and the loss function of the pretraining phase are given in the Appendix (available at https://tinyurl.com/icadlappendix).
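The pretraining objective can be sketched as a multi-label reconstruction of the masked non-textual features from the fused user embedding; the per-channel decoder heads and the binary cross-entropy form below are our assumptions, since the exact loss is specified only in the Appendix.

```python
# Hedged sketch of the masked-feature pretraining objective.
import torch
import torch.nn as nn

class MaskedFeatureHead(nn.Module):
    def __init__(self, d, vocab_sizes):          # vocab_sizes: {channel: Vlen_r}
        super().__init__()
        self.decoders = nn.ModuleDict(
            {ch: nn.Linear(d, v) for ch, v in vocab_sizes.items()})
        self.loss = nn.BCEWithLogitsLoss()

    def forward(self, h, masked_targets):
        """h: fused user embedding (batch, d);
        masked_targets: {channel: multi-hot (batch, Vlen_r) of masked tokens}."""
        total = 0.0
        for ch, target in masked_targets.items():
            total = total + self.loss(self.decoders[ch](h), target)
        return total
```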

5 Data Preparation

Dataset for the Main Task: As provided by Xiao et al. (2020) [24], we have 2,976 labeled data points (labelled Republican or Democratic) along with data for 583 politicians in the US setting. For a nuanced analysis, we retain the partition of the data points used in the dataset – PureP (see footnote 3), P50 (see footnote 4), P20\(\sim \)50 (see footnote 5), and P+all (see footnote 6).

Table 1. Descriptive statistics of the labeled dataset.

We collected the Twitter ids and labels provided by Xiao et al. (2020) [24]. We crawled the last 3,200 tweets (some tweets had been deleted; some were retweets, quotes, and replies), the follower ids, and the friend ids of each labeled id in November 2020 using the Twitter API (see footnote 7). We also collected the user objects (containing bios) for each id. So, after pre-processing, we have data for each feature type described in the previous section. We extracted the domain and co-domain names from the shared URLs using the tldextract (see footnote 8) library. Out of 2,976 labeled users, 2,665 were still available on Twitter at the time of crawling (November 2020). We report our results on this dataset. A major point to note here is that we do not store this data once training is over, nor do we need to collect neighborhood data at inference time, making inference faster and memory efficient. Detailed statistics of this dataset, with the count of unique features for some feature types, are provided in Table 1.
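For illustration, below is a minimal example of extracting the domain and domain + co-domain features with tldextract; the exact string format of the stored features is our assumption.

```python
# Small example of the domain / domain+co-domain features using tldextract.
import tldextract

def url_features(url):
    parts = tldextract.extract(url)                      # subdomain, domain, suffix
    domain = parts.domain                                # e.g. "nytimes"
    domain_codomain = f"{parts.domain}.{parts.suffix}"   # e.g. "nytimes.com"
    return domain, domain_codomain

# url_features("https://www.nytimes.com/2020/11/03/us/elections")
#   -> ("nytimes", "nytimes.com")
```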

Additional Datasets for Lateral Verification 

We collect several other datasets to demonstrate the usefulness of FSSPIP in zero/few-shot setting. The statistics of these datasets are detailed in Table 2.

Table 2. Descriptive statistics of the collected datasets. MB: MediaBias; C: Community; MP: Multiparty; S: Statewise; TPC: Topicwise; HTU: HashTagUsers (4-hashtag subset as mentioned in Fig. 3b, details in Appendix).

The Media Bias Dataset: Following Stefanov et al. (2020) [20], we use the crowdsourced labels (see footnote 9) for media bias prediction. There are 806 labeled instances in the dataset with the labels left, center-left, least biased, center-right, and right. In order to binarize the label space (to fit our classification model, which is a binary classifier), we first discard the instances labeled least biased; next, we merge left and center-left into a single label left, and center-right and right into a single label right. We collect the friend ids, follower ids, and the last 3,200 tweets of these media houses to employ the FSSPIP classifier for prediction.
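A minimal sketch of this label binarization is given below; the label strings follow the crowdsourced scheme described above, while the dictionary-based mapping is simply one convenient way to express it.

```python
# Hedged sketch of the media-bias label binarization.
BIAS_TO_BINARY = {
    "left": "left", "center-left": "left",
    "center-right": "right", "right": "right",
    # "least biased" instances are discarded
}

def binarize(label):
    return BIAS_TO_BINARY.get(label)    # None -> drop the instance

labeled = [("outletA", "center-left"), ("outletB", "least biased")]
binary = [(name, binarize(l)) for name, l in labeled if binarize(l) is not None]
# -> [("outletA", "left")]
```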

The Ethnic Community Dataset: Many post-poll surveys establish that different communities vote differently. We try to use our model to identify such divisions. We first sample recent tweets mentioning the name of any of the communities using the Twitter API (see footnote 7). Among the users tweeting, we select only those who mention one (or more) of the communities/ethnicities being probed ('black', 'white', 'hispanic/latino', 'asian') in their bio, and we assign a user to a particular community if that community is mentioned in their bio.

Multi-party Leaning Dataset: In order to collect a set of users residing in a multi-party democratic system, we filter the latest 10,000 tweets (and tweeters) containing the term 'Delhi Election' using the Twitter API on 17th March, 2021. We annotate a random sample of 1,000 Twitter users from this list as followers of three political parties: AAP: Aam Aadmi Party, BJP: Bharatiya Janata Party, and Congress/INC: Indian National Congress (AAP: 203 users, INC: 435 users, BJP: 362 users), to form the multi-party inclination dataset. While annotating each user, we check the residency of the user and confirm it to be India through the self-declared location tag on Twitter.

Statewise Inclination Dataset: Here, we use the Twitter API to collect tweets matching the politically neutral query 'election' (all data points are collected before 17.05.2021). If a user has a state's name mentioned in the Twitter location tag, we assign that user to that particular state. We collected 100 users for each state of India to obtain a representative sample.

Hashtags User Data: In order to find out the inclination distribution behind each hashtag, we collect tweets containing several trending hashtags (on or before 17.05.2021). For each trending hashtag, we collect 1,000 tweets (\(\times \) 30 hashtags) using the Twitter API, excluding retweets and replies. For manual verification, we annotate the 30 hashtags with the tags Congress, BJP, and Neutral (on the date of collection, we could not find hashtags that could be attributed as inclined towards AAP; moreover, politically unmotivated hashtags are termed 'neutral'). This annotation was done by a PhD student with expertise in Indian politics, by reading the tweets carrying each hashtag.

6 Main Task: Experiments and Analysis

Baselines: We use the best performing models provided by Aldayel et al. (2019) [1] and Darwish et al. (2020) [10] (UMAP+DBSCAN; results with tweets containing chosen hashtags are included in the Appendix). NTF [1] uses network/graph and textual features together, much like our model but without attention. UUS [10], on the other hand, uses weak supervision (a method quite different from ours) through dimensionality reduction and clustering, manual inspection (which also makes the algorithm less scalable), and labelling of the clusters, with only three features (retweeted tweets, retweeted accounts, and hashtags). We also add a modified version of the UUS algorithm for a fair comparison with our fine-tuned model, since the UUS algorithm is completely unsupervised and incapable of using any supervisory signal for few-shot learning: we took the unsupervised UUS model and fine-tuned it on annotated data points, terming it UUS+.

We also add non-neural baselines like SVM, Logistic Regression (LR), and Random Forest (RF), as we are interested in showing how simple algorithms with smaller inference times compare with our method. Here, we use the concatenation of each user's tweets as input. We used TIMME-hierarchical [24] and its two other variants as further baselines; these use self-supervision on graphs and have a higher inference time due to second-order data collection on a large graph. However, we only report the TIMME-hier results as it was the best-performing variant (hyperparameter statistics and details on the other TIMME variants are in the Appendix). A qualitative comparison of the baselines is given in Table 3.

Table 3. A pointwise comparison of the models used as baselines. {NNeur : Non Neural baselines}.
Table 4. Results of few-shot learning {NTF: Model proposed by [1]; UUS: Model proposed by [10]; UUS+: Model proposed by [10] fine-tuned on annotated data points; TIMME: TIMME-hier (other TIMME variants’ result in Appendix); FSSPIP: FSSPIP base architecture with the few-shot learning framework; #T: Number of training datapoints; TTI: Time Taken for Inference per datapoint with Twitter ids as inputs. For each framework, it also includes the time taken to collect the data}.
Table 5. Ablation study of different model variants {F1: FSSPIP-fixedattn; F2: FSSPIP-auto; FSSPIP- - -: FSSPIP base architecture; FSSPIP- -: FSSPIP without weak supervision and self supervision; FSSPIP-: FSSPIP without self supervision.}

Results: In Table 4, we show that our best performing model FSSPIP (see footnote 10) consistently beats the other baselines on all datasets. We gain the most over other models when very few training data points (50) are present (see footnote 11). In the case of the non-politician datasets, i.e., P50, P20–50 and P+all, the performance obtained by our model is significantly higher than that of the other baselines, even with only 50 training data points. This may be because the non-politician datasets, unlike the PureP dataset, do not contain purely political features, which makes the feature learning task less straightforward and demands finer features such as the domain names a user is interested in or the tweets of the users they retweet.

Our model also performs better than the other models in terms of the time required to predict for a single user. Compared to the networks using second-order relational data (TIMME), we are at least \(\sim \) 10x faster, as shown in Table 4.

Also, our model outperforms NTF [1] and UUS [10] by a significant margin by using weak supervision (see footnote 12) with better augmentation, while utilizing carefully extracted network features similar to the NTF inputs [1].

Ablation Study

Model Variants - In order to ablate our attention mechanism, we employ two other varieties of attention in place of ours in the FSSPIP base architecture (recall Eq. 2). FSSPIP-fixedattn (F1) uses fixed learnable attention to calculate the weighted sum of the embeddings of each feature type; here the \(\alpha _{r}\) values in Eq. 2 are learnable parameters and \(\alpha _{ir}=\alpha _{r}\), \(\forall i\). FSSPIP-auto (F2) simply sums up the normalized embeddings of each feature type, assuming equal attention to all feature types while computing the final embedding vector; here \(\alpha _{ir}=1\), \(\forall i,r\).

Table 6. Important features and feature types for the predictions.

To test the few-shot learning framework, we use incrementally more powerful models in Table 5, where FSSPIP- - - is the base architecture without the few-shot learning framework; each component of the framework is then added sequentially to the base model (terming the intermediate models FSSPIP- -, FSSPIP-, and finally FSSPIP).

We find that the dynamic attention mechanism produces significantly (see footnote 13) higher gains compared to the other two attention variants. The gains are higher when fewer data points are used, and weak supervision has a higher impact than additionally adding dynamic augmentation to the weakly supervised model. This can be explained by the fact that weak supervision already trains the model with a large number of real data points, which regularizes the model enough. Dynamic augmentation nevertheless helps in regularizing the model, especially in few-shot settings, to avoid over-fitting. Similarly, self-supervision also seems more useful when there are fewer training data points. Moreover, we can see that the attention variants of the model perform very close to the original model but fall short when the number of data points is low.

Most Important Feature Types - To determine the most important feature types, we drop each feature channel and measure the information loss by calculating the deviation in performance of the classifier (FSSPIP) trained on the combined dataset (train:test:validation datapoints = 80:10:10). The results are reported in Table 6. The highest drop is witnessed when the relevant hashtags are dropped.

Fig. 2. Distribution of political inclinations in the USA by topic/racial demographics.

Zero-Shot Gain: Inspired by the significant improvement from weak supervision shown in Table 5, we trained our model FSSPIP on the weak supervision dataset only, which is collected without any manual annotation. We then used the whole annotated dataset for testing this model. We obtain a zero-shot accuracy of 93.7% (TIMME models rely on lists of politicians of each party and thus cannot be zero-shot; UUS, which is not easily scalable due to its clustering, expert purity checking and soft labelling methodology, performed best among the other baselines at 91.9%). This tells us that the social media followers of a political party are indeed, most of the time, followers of the party in real life as well. So, training a model to classify a social media user as a follower of one party over the other on social media also trains it for the similar task of classifying the user as a follower of one political party over the other in real life. We verify this conclusion again in a multi-party scenario for a diverse, non-English-speaking democracy like India in the next section.

7 Additional Task: Experiments and Analysis

We use the additionally collected datasets to show the efficacy of the zero-shot classifier. The research questions selected for this section are easy to test but important for social scientists. They have mostly been analyzed through manual surveys until now.

Media Bias Prediction: We use the trained FSSPIP classifier on the media bias dataset collected by us, taking each media house's Twitter handle as the node to be classified. We obtain an accuracy of 72.6% on the task, even though we do not explicitly train for this task (see footnote 14) and rely on the assumption that \(\{\text {democrat}\equiv \text {left}\}\) and \(\{\text {republican}\equiv \text {right}\}\).

Topical Polarization - Bone(s) of Contention: In order to poll users on specific contexts and issues, we collect some hashtags (see Appendix) supporting each issue mentioned in Fig. 2a. We then use the model to classify each user and plot the percentage of users of each leaning in the US setting, i.e., the Democrats & the Republicans.

Multi-party Inclination Prediction: The US political system is dominated by two political parties: the Democrats & the Republicans. In principle, our system can work for other countries and other kinds of political systems as well. In this section, we test the zero-shot classification capability of our model on a diverse multi-party democracy like India. We take the Twitter handles of three national parties in India, namely, the Aam Aadmi Party (AAP), the Indian National Congress (INC), and the Bharatiya Janata Party (BJP). We use the weak supervision method to train our model with the sampling, mixup & feature channel dropout strategies discussed earlier. On a random sample of 1,000 Twitter accounts (AAP: 203, INC: 435, BJP: 362), we obtain an accuracy of 81.9%. The highest confusion scores between classes (see Appendix) were between AAP & INC. This is fairly intuitive since both these parties are left-leaning and in opposition, while BJP is known to subscribe to a right-wing leaning and is currently in power.

Statewise Leaning: In Fig. 3a, we plot the relative distribution of political leanings for each state of India on a scale of 0–1, signifying the fraction of users in a state leaning toward BJP (we average the political leanings predicted by the aforementioned classifier over all users in the state's data). This correlates quite well (Pearson's correlation coefficient: 0.52, significant at \(p<0.01\)) with the vote percentage received by BJP in each state in the 2019 general election (see footnote 15).
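This correlation check can be reproduced with a few lines of SciPy; the arrays below are placeholders, not the actual per-state predictions or vote shares.

```python
# Hedged sketch of the state-level correlation check (placeholder data).
from scipy.stats import pearsonr

predicted_bjp_share = [0.62, 0.41, 0.55, 0.48]   # fraction of sampled users leaning BJP per state
actual_bjp_vote_pct = [0.58, 0.37, 0.60, 0.45]   # BJP vote share in the 2019 general election

r, p = pearsonr(predicted_bjp_share, actual_bjp_vote_pct)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")        # the paper reports r = 0.52, p < 0.01
```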

A Leopard Cannot Change Its Spots: In order to check whether political inclination changes with time, we reuse the dataset described in the last paragraph with a temporal filtering strategy. We only use tweets and tweet-derived features for this experiment, which means the bio is always left blank, and the same is done for followers/retweeters. We collect the last 3,200 tweets (limit set by the Twitter API) of each user ID, directly available from Twitter. For reliable prediction, we keep only users who have tweeted at least 100 times before 2017 and at least 100 times after 2018 (see footnote 16). This leaves us with 2,893 users. We then predict the inclination of these Twitter users twice: once using the features collected from tweets before 2017, and once using the tweets after 2018. We observe that the predictions match in 91% of the cases, which tells us that political leanings are temporally (almost) invariant.

Hidden Agenda - Inclination Behind Promoted Hashtags: To find out the inclination behind each hashtag, we obtain the political leanings of the users in the collected hashtag-specific dataset using the zero-shot classifier trained on followers of Congress and BJP. We plot the percentage of users leaning toward each party for each hashtag. We correctly predicted the leaning in 25 of the 30 cases using the classifier (considering a percentage distribution of 40–60% as the neutral/apolitical zone). We plot the leanings for four different India-specific issues – #WeAreWithYouPmModiJi, #BengalBurning, #CycloneTaukte, #JusticeForAsif – in Fig. 3b. While we see that the disaster hashtag (#CycloneTaukte) is non-polarizing, the other trending hashtags are evidently promoted by people of particular ideologies. We include the list of the other hashtags in the Appendix.
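A minimal sketch of the hashtag-level decision rule with the 40–60% neutral zone follows; the function name and the exact boundary handling are our assumptions.

```python
# Hedged sketch of the hashtag-level decision rule: a hashtag is called for the
# majority party unless the split falls inside the 40-60% neutral zone.
def hashtag_leaning(bjp_fraction, neutral_band=(0.40, 0.60)):
    lo, hi = neutral_band
    if lo <= bjp_fraction <= hi:
        return "Neutral"
    return "BJP" if bjp_fraction > hi else "Congress"

# Example: hashtag_leaning(0.83) -> "BJP"; hashtag_leaning(0.52) -> "Neutral"
```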

Fig. 3. Inclination distribution in India.

8 Limitations and Future Work

Our work is limited by the availability of social media data. If a country does not have enough political participation on social media, then training a model will not be possible. Moreover, if a person's profile is kept private, the classifier will not be able to assign any label. We discuss the related ethical implications of our work separately in the Appendix.

Lastly, we only evaluated our method on a dataset of users ranging from a high degree of political connection to a very low degree of political connection. Collecting a dataset of users with no political links online but with an inclination toward a particular political party is a challenging task. In fact, Twitter data matched with voter registration records [3] also shows high partisanship evident in tweets and political connections. Research toward implicit (not explicitly tweeted/mentioned) political inclination detection (analogous to implicit hate speech detection [5, 12] or implicit aspect-specific sentiment detection [6, 16]) is an interesting future research direction.

9 Conclusions

We presented FSSPIP, an efficient, fast, and scalable few-shot learning framework for political inclination detection from Twitter profiles. We showed that our model is explainable and learns features that humans find meaningful. Moreover, unlike graph-based models, our model does not store any personal data of users, and it is also shown to be faster than graph-based methods. With the scalable representation learning framework, we achieve state-of-the-art accuracy, gaining significantly in unlabelled or few-shot learning setups on non-politician users. By enabling zero-shot political inclination detection with high fidelity, we provide a method that can easily be re-targeted to new countries and languages without any manual intervention/supervision, unlike previous methods. We believe this will make large-scale analysis of the political landscape across the globe easier and more accurate.