Introduction

It has been reported that, as of January 2017, approximately 2.8 billion people actively use social media worldwide (Kemp 2017). Social media sites such as Twitter, Facebook and YouTube serve as important platforms for people to communicate, interact, consume and disseminate information (Kalloubi et al. 2017).

Twitter is the most popular microblogging social media site of all the current online social media applications; it has 328 million active users per month (Escamilla et al. 2016). There are several features that the user can utilize and track. Twitter is a popular social media site for researchers to investigate, as the company allows developers to collect a portion of tweets freely through their API (application programming interface). Twitter data have been collected and analysed by various scientific studies, from predicting election results (Beauchamp 2017), surveying public health (Bates 2017) and identifying community sentiments corresponding to specific events (Jarwar et al. 2017) to analysing news reports during important events such as the Egyptian Arab Spring revolution (Lotan et al. 2011). In addition, Twitter has become a prominent platform used in altmetrics research to track the activities surrounding scientific documents. A recent study of over 1.1 million publications (with at least one citation count) shows that Twitter’s coverage exceeds 91% of the total social media activities received by altmetrics (Hassan et al. 2017).

Given the vital role of Twitter’s platform in generating altmetrics-related data, it is extremely important to learn more about the users who are sharing, liking, disseminating and communicating information related to scientific documents. Certain Twitter accounts have more influence in this context (Quercia et al. 2011), perhaps because the account holders are celebrities, politicians, social workers or companies and thus have more followers than others. Posts by these popular Twitter users may influence others within the network more than most (Hussain et al. 2012). Thus, in the context of altmetrics, it becomes extremely important to identify who, or what (e.g. journal, bot, lab, etc.), is tweeting about science.

The main objective of this paper is to detect the influential Twitter accounts among the tweeters in a certain discipline and to investigate their influence on the accumulation of scholarly citations. The research questions are as follows: (a) Who were the most influential tweeters in the field of information sciences from July 2011 to December 2015 (as present in the Altmetric.com dataset, version dataset-jun-4-2016.tar.gz)? (b) Are influential tweeters capable of discriminating highly cited articles from non-highly cited articles? and (c) More specifically, what is the role of influential tweeters in terms of discriminating highly cited articles?

The rest of the paper is organized as follows: the following section presents a literature review of altmetrics-related work. Next, we present our approach to the identification of influential tweeters, followed by the results and discussion section. Finally, we present our concluding remarks, along with future research directions.

Literature review

We review the literature in two parts: first, we briefly review altmetric studies, showcasing the importance of social media platforms, including Twitter, in disseminating research outputs. Second, we review the important studies on identifying influential tweeters or tweets on the Twitter platform, in chronological order.

A brief review of altmetrics as an early predictor of citations

During the initial phase of altmetric research, the focus of many studies was finding out whether social media acts correlated with citations. Priem et al. (2011) studied the potential of altmetric data using linear regression to assess whether altmetric counts contributed to the prediction of Web of Science citation counts. The authors found that altmetric events were a contributor to citation predictions. More recently, Costas et al. (2015) and Hassan et al. (2017) provided an extensive analysis of altmetric indicators and their relationship to citations across multiple disciplines; they found a weak correlation between citation and altmetric counts and that the number of altmetric acts for scientific publications was still catching up. Similar observations were made in further works, including by Thelwall et al. (2013), Sud and Thelwall (2014), Haustein et al. (2014) and Zahedi et al. (2014). Thus, altmetric researchers have turned their attention to the broader impact of research outside academia (Bornmann 2014, 2016; Bornmann and Haunschild 2016). To understand this broader impact, researchers must go beyond examining correlations with citations and start to examine which actors (both individuals and organizations) are sharing and discussing scientific publications on social media platforms, and why.

In addition to these studies, others (see Sugimoto et al. 2017, for a thorough review of the literature) have examined various topics, including the impact on scholarly communication from sharing scientific documents in online environments (Shrivastava and Mahajan 2016), the identification of communities engaging with scientific documents online (Tsou et al. 2015), the variation in activity across online platforms (Haustein 2016) and the formulation of a theoretical lens with which to view altmetric activities (Haustein et al. 2015).

A brief review of detecting influential Twitter users

In this section, we present a brief review of studies pertaining to the identification of influential users or influential tweets on Twitter.

Alonso et al. (2010) made use of crowdsourcing, along with machine-learning algorithms, to find the interesting tweets in a batch of a thousand tweets. With the help of workers on Amazon Mechanical Turk (AMT), who took part in the evaluation procedure by labelling the tweets as ‘interesting’ or ‘not interesting’, they analysed whether the presence of hyperlinks plays an important part in such classification. After training on the data, a decision tree classifier classified 85% of the tweets correctly; about 15% were misclassified. Anger and Kittl (2011), working on a dataset of Australian-based users, introduced a new measure of social network potential (SNP) to discover the most influential users. The authors analysed four features (retweet, mention, follower and following) as key factors to label users as influential or non-influential. They utilized the score from Klout, an online web-based service that tells users about their influence on social media sites, alongside three other major parameters that they defined as Follower/Following Ratio (rf), Retweet and Mention Ratio (rRT) and Interactor Ratio (ri). They found that the mention and retweet ratios were of great importance when focusing on content-oriented interactions, while the interactor ratio was important in the case of conversation-oriented interactions. The SNP value was calculated as the sum of the rRT and ri ratios divided by two, where a result of 100% meant that all the tweets of a user were either mentioned or retweeted.
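As a small worked illustration of the SNP arithmetic described above, the following sketch averages the two ratios. The function name and the input values are hypothetical; how rRT and ri themselves are derived from a user's timeline is not specified in this review, so they are simply taken as given fractions in [0, 1].

```python
def social_network_potential(rrt: float, ri: float) -> float:
    """Average of the retweet-and-mention ratio (rRT) and the interactor ratio (ri).

    Both inputs are assumed to be fractions in [0, 1]; their derivation is
    outside the scope of this sketch.
    """
    for name, value in (("rrt", rrt), ("ri", ri)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (rrt + ri) / 2.0


# Hypothetical user: rRT = 40%, ri = 10%
print(f"{social_network_potential(0.40, 0.10):.0%}")  # prints 25%
```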

Yang et al. (2012) took a different approach to the identification of influential tweets on Twitter by focusing on the ranking of interesting tweets. In their analyses, users and tweets were considered nodes and the retweet relationship was considered an edge between these nodes; this approach differed from the standard approach of using HITS, as it was more influenced by the retweet behaviour of users. They found that HITSprop demonstrated even better results than the standard HITSorig algorithm, and concluded that user authority is an important component of determining interesting tweets. More recently, Lee et al. (2017) used eigenvectors to find the influential users within a network. Working in the field of digital humanities, they formed a network of tweeters using the official AoIR (Association of Internet Researchers) conference hashtags from 2014, 2015 and 2016. Using degree centrality, PageRank and eigenvectors, they found the most influential users in the network. The authors then segmented tweets into replies, mentions and tweets. They found that mentions represented the highest percentage of tweets, indicating that members of the AoIR used Twitter primarily to converse among themselves at conferences.

The above review summarizes scientific studies in two ways: first, we discussed the literature related to altmetrics as an early predictor of scholarly citations, then we presented studies pertaining to the identification of influential tweeters or tweets. To the best of our knowledge, no existing studies comprehensively discuss the role of influential tweeters in relation to scholarly citations. Thus, our work contributes significantly to a long debate dating back to the manifesto on determining the relationship between social mentions and scholarly citations that was initiated by Priem et al. (2010).

Data and methodology

Dataset

The dataset used in this paper was obtained from Altmetric.com (version dataset-jun-4-2016.tar.gz). It contains over 4.5 million records, and the publications belong to various disciplines. Each record contains traces of the altmetric acts of a single scientific publication (article or dataset), as well as bibliometric information (such as DOI, number of authors, publication date, journal name, etc.). For this research, we selected publications from journals indexed in Scopus under the sub-discipline of Library and Information Sciences. Note that Scopus makes use of the All Science Journal Classification (ASJC) scheme to index journals. The ASJC classification maps journals and conferences across 27 broader disciplines, such as Agricultural and Biological Sciences, Chemistry, Computer Science and Social Sciences, along with more than 300 sub-disciplines including Artificial Intelligence, Human Computer Interaction, Safety Research, and Library and Information Sciences.

A total of 820 journal and conference publications (with at least one citation and associated tweet activity), indexed under Library and Information Sciences and published from July 2011 to December 2015, were retrieved from the Scopus database. Furthermore, all the tweets associated with these publications were procured. A combination of tweet ids (from Altmetric.com) and the Twitter API was then used to collect tweet text, resulting in a dataset of 10,345 tweets made by 5490 unique users. These tweets contained 8373 mentions and 4061 retweets. All tweet information was then stored in a relational database. A summary of the retrieved Twitter dataset is shown in Table 1.

Table 1 Descriptive statistics of the Twitter dataset

Identification of influential Twitter users

The first step was to identify the relationships between users by means of a social network graph. In our network, nodes represent Twitter users and edges represent three types of associations: mentions, retweets or follow relationships. The resulting undirected simple graph is represented by an adjacency matrix. A small subset of this graph is plotted in Fig. 1, and the mass of each node is indicative of its influence in the network. We found that institutional accounts, such as figshare, write4Research, SAGElibrarynew, and so on, lead in the chosen field of Library and Information Sciences.
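The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(user_a, user_b, kind)`, the function names and the sample accounts are all assumptions. Because the graph is simple and undirected, a mention, a retweet and a follow between the same two users collapse into one edge, and self-references are dropped.

```python
from collections import defaultdict


def build_undirected_graph(interactions):
    """Collapse (user_a, user_b, kind) records into adjacency sets.

    kind is 'mention', 'retweet' or 'follow'; all three become the same
    undirected edge, and self-loops are ignored (simple graph).
    """
    neighbours = defaultdict(set)
    for a, b, _kind in interactions:
        if a == b:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)


def adjacency_matrix(neighbours):
    """Return (sorted user list, symmetric 0/1 matrix) for the adjacency sets."""
    users = sorted(neighbours)
    index = {user: i for i, user in enumerate(users)}
    matrix = [[0] * len(users) for _ in users]
    for user, nbrs in neighbours.items():
        for other in nbrs:
            matrix[index[user]][index[other]] = 1
    return users, matrix


# Hypothetical interaction records
sample = [
    ("u1", "u3", "mention"),
    ("u2", "u3", "retweet"),
    ("u3", "u1", "follow"),   # same pair as the first record: no new edge
    ("u2", "u2", "mention"),  # self-reference: dropped
]
users, A = adjacency_matrix(build_undirected_graph(sample))
print(users)  # ['u1', 'u2', 'u3']
print(A)      # [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
```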

Fig. 1 Connection between tweeters in our dataset

It is well known that the network connectivity parameters and the spectral properties of the adjacency matrix of the corresponding graph are correlated. In particular, Chakrabarti et al. (2008) and Chung (1997) have reported that the maximum eigenvalue of a graph adjacency matrix is a good measure of its connectedness. This idea has been exploited by others to identify crucial nodes in a graph; if the removal of a node results in a significant difference in the largest eigenvalue, it may imply that the removed node was an influential entity in the original network. We used this observation and related algorithms in the literature to identify influential users in the Twitter social network who were tweeting about a particular set of publications.

For a given network G = (V, E) with adjacency matrix A, the task is to find the subset S of V given by \(\mathop {\text{argmax}}\limits_{{S \in \binom{V}{k}}} \left[ {\lambda_{1} \left( A \right) - \lambda_{1} \left( {A\left[ { - S} \right]} \right)} \right]\), where \(\binom{V}{k}\) is the set of all k-subsets of vertices. Here A[− S] is the adjacency matrix that remains once the rows and columns representing the nodes in the set S are removed, and \(\lambda_{1} \left( X \right)\) is the largest eigenvalue of a matrix X. Note that \(\lambda_{1} \left( X \right)\) may not be unique, though this bears no relevance to the problem presented here.
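To make the objective concrete, the sketch below evaluates the eigen-drop \(\lambda_{1}(A) - \lambda_{1}(A[-S])\) by exhaustive search on a tiny graph. It is only an illustration under stated assumptions: the helper names are hypothetical, the largest eigenvalue is estimated by power iteration on A + I (a choice made here to keep the example dependency-free, not a method taken from the paper), and the exhaustive search is feasible only for very small n and k.

```python
from itertools import combinations


def largest_eigenvalue(adj, iters=500):
    """Estimate lambda_1 of a symmetric 0/1 matrix by power iteration on A + I."""
    n = len(adj)
    if n == 0:
        return 0.0
    x = [1.0] * n
    for _ in range(iters):
        # y = (A + I) x; the identity shift avoids oscillation on bipartite graphs
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x for the unit vector x
    return sum(x[i] * adj[i][j] * x[j] for i in range(n) for j in range(n))


def remove_nodes(adj, removed):
    """Return A[-S]: drop the rows and columns of the removed vertices."""
    keep = [i for i in range(len(adj)) if i not in removed]
    return [[adj[i][j] for j in keep] for i in keep]


def best_subset(adj, k):
    """Exhaustively find the k-subset whose removal reduces lambda_1 the most."""
    base = largest_eigenvalue(adj)
    return max(
        combinations(range(len(adj)), k),
        key=lambda s: base - largest_eigenvalue(remove_nodes(adj, set(s))),
    )


# Star graph: vertex 0 is connected to vertices 1..4
star = [[0, 1, 1, 1, 1]] + [[1, 0, 0, 0, 0] for _ in range(4)]
print(round(largest_eigenvalue(star), 6))  # 2.0
print(best_subset(star, 1))                # (0,)
```

In the star graph, removing the hub leaves only isolated vertices, so the largest eigenvalue drops from 2 to 0, whereas removing a leaf only lowers it to the square root of 3.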

The straightforward algorithm to solve this problem of finding the k most influential nodes takes \(O\left( {n^{\text{k}} \cdot n^{\sigma } } \right)\) time, where \(O(n^{\sigma } )\) is the running time needed to evaluate the spectrum of an \(n \times n\) matrix. This running time is impractical for all useful values of k. Indeed, the problem was shown to be NP-Complete by a simple reduction from minimum vertex cover. To address this, an efficient approximation algorithm presented by Tariq et al. (2017) was used. For the sake of completeness, their algorithm/approximation technique is summarized below.

Given an undirected graph G = (V, E) with adjacency matrix A, associate a score \(\psi = {\text{tr}}\left( {A^{\text{p}} } \right) - {\text{tr}}\left( {A\left[ { - S} \right]^{\text{p}} } \right)\) with a subset S of vertices. Here p is a suitable constant (larger values give a better approximation but take more time to compute) and tr is the classical trace function, which is defined in Eq. 1.

$${\text{tr}}\left( X \right) = \mathop \sum \limits_{i = 1}^{n} X\left[ {i,i} \right]$$
(1)

It turns out that \({\text{tr}}\left( {A^{\text{p}} } \right)\) is just the count of closed walks of length p in the given graph. Recall that a closed walk is a sequence of vertices \(v_{1} ,v_{2} ,v_{3} , \ldots ,v_{l}\) where \(v_{1} = v_{l}\) and each consecutive pair is an edge; that is, \(\left( {v_{i} ,v_{i + 1} } \right) \in E\). Based on other results in graph theory, the following approximation is used, which can be computed for each vertex individually, as shown in Eq. 2.
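The closed-walk interpretation of the trace can be checked numerically on small graphs. The following sketch uses plain nested lists and hypothetical helper names; it is meant only as a sanity check of the statement above. The link to the eigenvalue objective is the standard identity that \({\text{tr}}(A^{p})\) equals the sum of the p-th powers of the eigenvalues of A, so for a large even p the trace is dominated by the largest eigenvalue.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as nested lists."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def trace_of_power(adj, p):
    """Return tr(A^p) for p >= 1, i.e. the number of closed walks of length p."""
    result = adj
    for _ in range(p - 1):
        result = mat_mul(result, adj)
    return sum(result[i][i] for i in range(len(adj)))


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

print(trace_of_power(triangle, 2))  # 6: each of the 3 edges walked there and back, from either end
print(trace_of_power(triangle, 3))  # 6: 3 starting vertices times 2 directions round the triangle
print(trace_of_power(path, 3))      # 0: a path has no closed walk of odd length
print(trace_of_power(path, 4))      # 8
```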

$$\psi_{\text{G}} \left( v \right)^{\prime } = 2 d_{\text{G}}^{2} \left( v \right) + 4 \left( {\mathop \sum \limits_{u \ne v} d_{\text{G}} \left( {u,v} \right)} \right)^{2}$$
(2)

Here, \(d_{\text{G}} \left( v \right)\) is the degree of the vertex \(v\) in the graph and \(d_{\text{G}} \left( {u,v} \right)\) is the co-degree of vertices \(u\) and \(v\), that is, the number of neighbours they share. The degrees \(d_{\text{G}} \left( v \right)\) are readily available and are traditionally stored along the diagonal entries of the adjacency matrix, while co-degrees \(d_{\text{G}} \left( {u,v} \right)\) can be computed by taking the intersection of characteristic vectors of corresponding rows in the matrix. Hence, the score \(\psi_{\text{G}} \left( v \right)^{\prime }\), which provides a measure of the influence of a node in the graph, can be computed in time linear in the number of vertices and edges.
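A direct transcription of Eq. 2 might look like the sketch below. It rests on assumptions that should be read carefully: \(d_{\text{G}}^{2}(v)\) is taken to mean the squared degree, the co-degree is computed as the size of the intersection of two neighbour sets, and the function name is hypothetical. The loop over all vertex pairs is quadratic and is written for clarity; it is not the linear-time procedure referred to in the text.

```python
def approx_influence_scores(neighbours):
    """Evaluate Eq. 2 for every vertex of a graph given as adjacency sets.

    score(v) = 2 * deg(v)^2 + 4 * (sum over u != v of codeg(u, v))^2,
    where codeg(u, v) is the number of common neighbours of u and v.
    """
    scores = {}
    for v, nbrs_v in neighbours.items():
        degree = len(nbrs_v)
        codegree_sum = sum(
            len(nbrs_v & nbrs_u)
            for u, nbrs_u in neighbours.items()
            if u != v
        )
        scores[v] = 2 * degree ** 2 + 4 * codegree_sum ** 2
    return scores


# Path graph 0 - 1 - 2: the middle vertex should score highest
path = {0: {1}, 1: {0, 2}, 2: {1}}
scores = approx_influence_scores(path)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)      # {0: 6, 1: 8, 2: 6}
print(ranking[0])  # 1
```

Ranking the vertices by this score in descending order gives the ordering from which a list of top tweeters can be read off.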

Experiments and results

In this section, we use descriptive analysis and classification modelling techniques to highlight the importance of influential Twitter users in distinguishing highly cited articles. We divided the dataset of 820 articles into two classes by citation count: the top 50% of articles were labelled ‘1’ (highly cited), while the rest of the articles were given the label ‘0’.
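The labelling step can be sketched as a simple rank-and-split. The function name and the sample citation counts are hypothetical, and the tie-breaking rule (by article identifier) is an assumption made here, since the text does not say how articles with equal citation counts at the boundary were handled.

```python
def label_top_half(citations):
    """Return {article_id: label}; label 1 marks the top 50% by citation count.

    Ties are broken by article identifier, which is an assumption of this sketch.
    """
    ranked = sorted(citations, key=lambda article: (-citations[article], article))
    cutoff = len(ranked) // 2
    return {article: int(position < cutoff) for position, article in enumerate(ranked)}


# Hypothetical citation counts
labels = label_top_half({"a": 10, "b": 3, "c": 7, "d": 1})
print(labels)  # {'a': 1, 'c': 1, 'b': 0, 'd': 0}
```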

Descriptive analysis

We first summarize the distribution of our data through its mean and median. We then constructed a histogram of our data to analyse the relationship between the influential Twitter users and the number of citations in publications, both for the top 50% highly cited articles (HC) and the rest.

Table 2 shows that the top 50% (HC) articles have a higher mean and median than the rest. This indicates that articles tweeted by influential users receive more citations. We found that the top 50% (HC) articles have, overall, greater mean and median scores than the total mean and median of the complete dataset. Furthermore, the histogram of user scores for both those publications in the top 50% (HC) and the rest (see Fig. 2) shows that for HC articles the sum of user scores is more spread out, indicating that articles tweeted by influential users achieve more citations than the rest, for which the sum of the user scores tends towards small values with a high frequency.

Table 2 Description of data used in study

Fig. 2 User scores for articles in both the top 50% (HC) and the rest

Supervised machine-learning models

Our goal was to design a supervised machine-learning model to distinguish the highly cited articles from the rest, using an array of known altmetric features along with the score for user influence. The features shown in Table 3 were chosen to train our classification model. The article-level features were extracted from the altmetric data: the number of authors (F2), the types of users in altmetrics (F3–F6) and the most significant social media post counts (F7–F11), which were also employed by Hassan et al. (2017), Costas et al. (2015) and others. The exception is User Influence (F1), which is not part of the altmetric data; it is an accumulated value of the rank scores of all the Twitter users who tweeted a given publication. The Twitter users were ranked 1 through n, where n is the total number of users; the most influential tweeter was given a rank score of n. Combining the list of top users with the mentioned and retweeted articles list obtained from Altmetric.com, we obtained a new dataset comprising all those articles that were retweeted or mentioned by the top tweeters.
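The construction of the User Influence feature (F1) can be sketched as follows. The function names and sample values are hypothetical, and two details are assumptions of this sketch rather than statements from the text: users with equal influence are ordered by name, and a user who tweeted the same article several times is counted once.

```python
def rank_scores(influence):
    """Map each user to a rank score in 1..n; the most influential user gets n."""
    ordered = sorted(influence, key=lambda user: (influence[user], user))
    return {user: position for position, user in enumerate(ordered, start=1)}


def user_influence_feature(tweeters_by_article, scores):
    """Sum the rank scores of the distinct users who tweeted each article."""
    return {
        article: sum(scores[user] for user in set(users))
        for article, users in tweeters_by_article.items()
    }


# Hypothetical influence values (e.g. scores from Eq. 2) and tweet records
influence = {"u1": 8, "u2": 6, "u3": 34}
tweeters = {"p1": ["u3", "u1", "u3"], "p2": ["u2"]}

ranks = rank_scores(influence)
f1 = user_influence_feature(tweeters, ranks)
print(ranks)  # {'u2': 1, 'u1': 2, 'u3': 3}
print(f1)     # {'p1': 5, 'p2': 1}
```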

Table 3 Features for classification

For training and testing, the data were divided using a 10-fold cross-validation technique. Further, ROC and PR curves were used to evaluate the performance of the model. Important features were identified by using the Extra-Tree classifier, along with the PR area value of individual features. The classifiers applied to this study were Support Vector Machine (SVM) and Random Forest. A grid search algorithm was used to find the best parameters (Hesterman et al. 2010). The primary objective was to identify the relationship between the network of influential users and highly cited articles. For this purpose, a baseline was calculated, followed by a measure of the effectiveness of the results compared to the baseline.

Baseline model For the baseline experiment, we used all our features from F2 to F11, excluding F1; that is, User Influence. The performance of the baseline was evaluated using both ROC and PR curves and compared with that of the proposed model.

Figure 3 (left) indicates the ROC curve results using two different classifiers. The SVM displays the lowest area (0.66), while Random Forest achieved a score of 0.74. Figure 3 (right) also displays the results of the PR curve. Similar to the ROC results, Random Forest performed the best, with a PR area of 0.84, followed by SVM, which gave an area under the curve of 0.84.

Fig. 3 Mean ROC and PR curves on baseline model

Proposed model with user influence In this experiment, using all the features F1 through F11, a 10-fold cross-validation technique was used to train the classifiers. The technique divided the data into k = 10 subsets. In each round, one of the subsets was used as test data and the remaining k − 1 subsets were used in training, so that the model was fitted to 90% of the data and tested on the remaining 10%. This procedure was repeated k times, so that each subset served as test data once. Two different classifiers, Random Forest and SVM, were implemented in a Python program using the scikit-learn machine-learning library. SVM ascertains a hyperplane, which was used in this case to separate the articles into the two classes, highly cited or not; an RBF kernel was used for SVM in this instance. The Random Forest classifier combines the outputs of multiple decision trees; a maximum depth of 10 was set in Random Forest, which limits the number of questions to be asked before reaching an answer.
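The 10-fold procedure described above can be sketched in a few lines. This is a dependency-free illustration with a hypothetical function name, not the code used in the study; the shuffling and the fixed seed are assumptions, and the text does not state whether the folds were stratified by class. In practice, scikit-learn provides ready-made splitters such as `KFold` and `StratifiedKFold` in `sklearn.model_selection`.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(820, k=10))
print(len(splits))                           # 10
print(len(splits[0][0]), len(splits[0][1]))  # 738 82
```

With 820 articles, each round trains on 738 articles and tests on 82, and every article appears in exactly one test fold.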

Again, the grid search algorithm was used to obtain the optimal parameters to train the classifiers (Hesterman et al. 2010). Figure 4 (left) displays the mean ROC curves of the two classifiers, averaged over the 10 folds, along with the mean AUROC value for each classifier. SVM showed an area of 0.67, whereas Random Forest gave an area of 0.79. The models performed better than the baseline, with Random Forest giving the best result (0.79).
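For readers less familiar with the AUROC values reported above, the following sketch computes the area under the ROC curve for one set of predictions using its rank interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy data are hypothetical; the per-fold values would then be averaged to obtain a mean AUROC.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = highly cited) and classifier scores
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

A value of 0.5 corresponds to a classifier that ranks articles at random, and 1.0 to a perfect ranking.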

Fig. 4 Mean ROC and PR curves of proposed models

To measure the effectiveness of the model at different recall levels, a PR curve was used. Figure 4 (right) displays the mean PR curves of the two classifiers. Similar to ROC, SVM gave an area under the curve of 0.66, and Random Forest demonstrated good performance, exhibiting an area of 0.90. The models performed better than the baseline, with Random Forest providing the best result (0.90).

We compared the baseline results with the results obtained in the above section. The new model provided better results for both PR analysis and ROC analysis. For PR, the best score of the baseline model (0.84) was achieved by the Random Forest classifier, and the best score of the new model (0.90) was also achieved by Random Forest. Similar results were obtained for ROC: Random Forest gave the best result in the case of the baseline (area = 0.74), and the best result from the new model was also with Random Forest (area = 0.79).
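The PR area can be estimated in several ways; one common estimate is average precision, the mean of the precision values at the rank of each positive example. The sketch below uses that estimate with a hypothetical function name and toy data; the text does not state which estimator was used to obtain the reported PR areas.

```python
def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for position, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / position
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Hypothetical labels (1 = highly cited) and classifier scores
ap = average_precision([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(round(ap, 4))  # 0.8333
```

For reference, a classifier that ranks at random has an expected PR area close to the fraction of positive examples, which is about 0.5 when half of the articles are labelled as highly cited.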

Discussion on feature importance

Based on the results obtained in this work, Random Forest was chosen to obtain the most important features as it yielded better results than SVM. We examined the features’ importance using the Extra-Tree Classifier. Along with PR curve analysis, it was employed to examine the effectiveness of individual features.

The Extra-Tree classifier was employed to rank the features by importance. Note that the Extra-Tree classification produces piece-wise multi-linear approximations (from a functional point of view), in contrast to the piece-wise constant ones of Random Forest. Our scikit-learn-based implementation assigned a score to each feature in such a way that the sum of all the feature scores is always equal to 1. Table 4 shows the importance of individual features. The User influence (F1) feature proved to be the most important feature, with a score of 0.27. Interestingly, the Twitter post count achieved the best result of all the social media post counts, topping that list at 0.12.

Table 4 Importance score for each feature of the extra-tree classifier

In addition to examining features’ importance using the Extra-Tree Classifier, PR curve analysis was employed to examine the individual features using the Random Forest classifier (see Fig. 5). Predictably, User influence gave the best PR area of 0.81, followed by Researcher and Twitter post count, each resulting in an area of 0.63. Four features achieved a similar performance (0.58): Number of authors; Practitioners; Blog post count; and News post count. Interestingly, both the analyses presented in Table 4 and Fig. 5 reveal User influence to be the most important feature in classification.

Fig. 5 PR curves of individual features with Random Forest

Concluding remarks

In this article, we investigated the relation between influential tweeters and highly cited articles. The tweets and user mentions, retweets and followers’ links were modelled as an undirected graph, and this was used to find the most influential Twitter users in the dataset. We discovered that the features around influential users were highly effective in discriminating between highly cited and non-highly cited articles. We found that the score for influential users was the most important feature in the dataset, using the Extra-Tree Classifier and PR curves, when tested on individual features. In future, we seek to examine the influence of time to explore the duration for which a user remains influential within a subject area. The impact of time can also be used to examine differences across disciplines in order to investigate how influential users contribute to predicting citations across disciplines. In addition, in future studies we will consider the effect of bot Twitter accounts in identifying influential users in Twitter networks.

Last but not least, instead of treating social media as a black box that generates social usage data, more research is needed to better study the underlying network structure of tweets and mentions that can directly or indirectly influence altmetric scores originating from the Twitter platform.