Introduction

Clustering scientific documents aims to organise a set of documents into groups, such that documents in a single group are more similar to each other than to the documents in other groups (Lawrence et al. 1999; Thijs and Glänzel 2018). The clustering of scientific documents is crucial for several tasks, such as summarisation (Karimi et al. 2018), recommendation systems (Habib and Afzal 2019), semantic understanding of scientific research (Shardlow et al. 2018), classification of scientific documents (Heffernan and Teufel 2018), and information retrieval systems for digital libraries (Safder and Hassan 2019). However, the clustering of related scientific documents in growing scholarly big data is a challenging task (Hassan and Haddawy 2013, 2015). Several classic approaches exist for clustering similar scientific documents, such as Bibliographic Coupling (Martyn 1964), co-citation (Small 1973) and the Amsler (1972) approach. These existing approaches cluster similar scientific documents using metadata of the documents such as references, venues, authors and keywords, among other features.

The limitations of the classic approaches are two-fold:

  1. They do not leverage the user perspective on the scientific literature. As a result, the documents that best match users’ perception of a cluster are often missed (Mesbah et al. 2017).

  2. The classic citation-based methods suffer from the inherent issue of publication and citation time lags.

We claim that these limitations can be addressed by clustering the publications based on the real-time usage of scientific publications or discussion of scientific literature on social media platforms. People are increasingly going online to find and share the information about science. Specifically, researchers are using the social media platforms to engage with each other. Altmetrics offers innovative tools for researchers to explore the public engagement with science in social media platforms. Consequently, new possibilities are emerging to analyse the interaction between researchers and research articles on social media platforms (Hellsten and Leydesdorff 2017; Hellsten et al. 2019; Joubert and Costas 2019; Robinson-Garcia et al. 2019).

In order to address these drawbacks, in this paper we present Tweet Coupling, a new methodology to measure the similarity of documents by leveraging the social usage of scientific documents on the Twitter platform. The main advantage of tapping user engagement with scientific publications on social media platforms is that such signals accumulate much faster than citation counts, which take at least a few years after the publication of an article to be ready for evaluation purposes (Costas et al. 2015; Haustein et al. 2015a, b; Shu et al. 2018; Ananiadou et al. 2013).

Tweet Coupling is similar to the classic Bibliographic Coupling approach. According to Martyn (1964), two scientific papers are bibliographically coupled if they have at least one common reference. If papers A and B both refer to paper C, this indicates a potential relationship between A and B; therefore, papers A and B are said to be bibliographically coupled. The more references two documents have in common, the higher their coupling strength. Similarly, Tweet Coupling is defined as follows: if a Twitter user mentions papers A and B, either in the same tweet or in two different tweets, then we assume this reflects a relationship between the papers and we call the papers ‘tweet coupled’. In other words, two papers are tweet coupled if they have at least one common Twitter user. Thus, a large number of common Twitter users reflects a high ‘Tweet Coupling’ strength.

Since our solution relies on analysing user engagement with scientific documents under the umbrella of altmetrics, we briefly describe altmetrics in the context of clustering similar scientific documents (see Sect. “A brief review on altmetric studies and social network analysis” for a detailed discussion). The term altmetrics was introduced in 2010 by Jason Priem as an abstraction of social web metrics (Priem 2010). Nowadays, altmetrics have become a novel source for measuring the social activities around scientific literature, and they provide metrics that complement conventional bibliometrics, which depend solely on citation counts, number of publications and peer review (Butler et al. 2017). Altmetrics use various social media as data sources, such as Twitter, Facebook, Google+, LinkedIn, etc. They track all relevant events, such as likes, comments, shares and retweets, on any research article, which yields usage metrics for that article (Priem and Costello 2010; Haustein et al. 2015a, b; Zahedi et al. 2014; Hassan et al. 2017; Said et al. 2019). As mentioned earlier, the major advantage of altmetrics is that they accumulate much faster than citation counts, which take at least a few years after the publication of an article to be ready for evaluation purposes (Costas et al. 2015; Haustein et al. 2015a, b; Shu et al. 2018).

Recently, Twitter has received significant attention as a venue for opinions about scientific documents. Specifically, researchers share their work on Twitter, discuss current topics and talk about research informally by commenting on, liking and retweeting posts (Adie and Roe 2013; Thelwall et al. 2013a, b). Note that among all the altmetric platforms, Twitter has the highest coverage, i.e. 87.1% (Robinson-García et al. 2014, 2017). This makes Twitter a significant and well-suited platform for obtaining user engagement statistics, although any other social media platform, e.g. Mendeley, could be used to conduct this investigation. To conduct experiments, we use a dataset of scientific documents from the field of Library and Information Sciences from Scopus. First, we cluster the scientific documents using Bibliographic Coupling and Tweet Coupling, respectively. Then, we compute the similarity between bibliographically coupled and tweet coupled documents. Next, we visualize and compare the relationship between Bibliographic Coupling and Tweet Coupling using VOSviewer. Finally, we discuss the implications of our Tweet Coupling measure and its use in scientific document search applications such as classification of scientific documents, recommendation systems, and information retrieval systems for digital libraries.

The contributions of this paper are:

  • The description of Tweet Coupling, which is a new methodology to measure the similarity of documents by leveraging the social usage of scientific documents on Twitter platform.

  • The study of the relation among Tweet Coupling and traditional citation-based metrics.

The rest of the paper is organized as follows: Sect. “Background” presents a detailed literature review, including existing coupling techniques for document clustering. Section “The Tweet Coupling methodology” presents our method for collecting data, the Tweet Coupling approach for document clustering, and the similarity between Tweet Coupling and Bibliographic Coupling. In Sect. “Results and discussion” we present the results of our experiments and a detailed comparison between Bibliographic Coupling and Tweet Coupling. Finally, Sect. “Concluding remarks” presents some concluding remarks and indicates future directions of this research.

Background

In this section, we review the relevant literature on Bibliographic Coupling in Sect. “A brief review on bibliographic coupling”, the use of altmetrics data in bibliographic studies in Sect. “A brief review on altmetric studies and social network analysis”, and other works related to our proposal.

A brief review on bibliographic coupling

The practicality and success of scientific work are often measured by the attention it receives from the scientific community as well as by the amount of scientific work that extends it (Garfield 1979; Batista-Navarro et al. 2013). Different approaches exist to determine the similarity of scientific documents in order to find related work. Most of the time, citation analysis gives excellent results for finding document similarity. A number of citation analysis techniques are used for the identification of similar scientific documents. Amongst them, co-citation, Bibliographic Coupling, citation proximity, and the Amsler method are the most widely and easily applicable; however, each has its own pros and cons. In co-citation, two documents are co-cited if both documents are cited by at least one paper in common (Small 1973). For example, papers A and B are co-cited if both A and B appear in the references of a third paper (Gipp and Beel 2009). Co-citation is used to estimate the semantic similarity between research publications. The more co-citations two papers receive, the higher their co-citation strength, and the more likely they are to be semantically related. Co-citation is a forward-looking assessment technique. Its drawback is that a recently published paper has no citations yet, so it is hard to find its semantic relationship with other papers using co-citation. This technique is therefore useful only for papers that have a high citation rate.

In contrast to co-citation, two documents are bibliographically coupled if they share at least one common reference in their bibliographies (Kessler 1963). For example, papers A and B are bibliographically coupled if paper C is in the bibliography of both A and B (Gipp and Beel 2009). As with co-citation, a number of studies have used Bibliographic Coupling as a measure of semantic similarity between scientific documents (Trueger et al. 2015; Zhao and Strotmann 2014). If two papers have a large number of common references, their bibliographic coupling strength is high and they are more likely to be semantically related. Bibliographic Coupling is a backward-looking assessment technique. The advantage of this method is that it can also find semantic relationships between newly published papers and others.
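The two counting schemes can be contrasted with a minimal sketch. The toy citation data and the function names below are hypothetical and only illustrate the definitions above: bibliographic coupling counts shared references between two citing papers, while co-citation counts the papers that cite both members of a pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: citing paper -> set of papers in its reference list.
references = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "F": {"A", "B"},
    "G": {"A", "B", "C"},
}

def bibliographic_coupling(references):
    """Coupling strength = number of references shared by two citing papers."""
    strength = {}
    for p, q in combinations(sorted(references), 2):
        shared = len(references[p] & references[q])
        if shared:
            strength[(p, q)] = shared
    return strength

def co_citation(references):
    """Co-citation strength = number of papers that cite both members of a pair."""
    strength = Counter()
    for cited in references.values():
        for p, q in combinations(sorted(cited), 2):
            strength[(p, q)] += 1
    return dict(strength)

bc = bibliographic_coupling(references)
cc = co_citation(references)
print(bc[("A", "B")])  # 2: A and B share references C and D
print(cc[("A", "B")])  # 2: A and B are cited together by F and G
```

In this toy example A and B are coupled as soon as they are published (backward-looking), whereas their co-citation strength only exists once F and G have cited them (forward-looking).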

Amsler (1972) proposed a measure of similarity between two documents that combines both co-citation and Bibliographic Coupling. According to Amsler, two papers A and B are related if A and B are cited by the same paper or if A and B cite the same paper. Let d be a document, let \(P_{d}\) be the set of parents (citing papers) of d, and let \(C_{d}\) be the set of children (citations) of d. The Amsler similarity between two documents is computed as shown in Eq. 1:

$${\text{Amsler}}\left( {D_{1} ,D_{2} } \right) = \frac{{\left| {\left( {P_{D_{1}} \cup C_{D_{1}} } \right) \cap \left( {P_{D_{2}} \cup C_{D_{2}} } \right)} \right|}}{{\left| {\left( {P_{D_{1}} \cup C_{D_{1}} } \right) \cup \left( {P_{D_{2}} \cup C_{D_{2}} } \right)} \right|}}$$
(1)
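As an illustration of Eq. 1, the following is a minimal sketch on hypothetical toy data. It assumes that parents are the papers citing a document and children are the papers it cites; because Eq. 1 only uses the union of the two sets, the result does not depend on which is which. Returning 0.0 for two documents with no links at all is a convention chosen here, not something stated in the text.

```python
def amsler(parents, children, d1, d2):
    """Amsler similarity: shared citation neighbours over all citation neighbours."""
    n1 = parents[d1] | children[d1]   # P_{D1} union C_{D1}
    n2 = parents[d2] | children[d2]   # P_{D2} union C_{D2}
    union = n1 | n2
    # Assumed convention: similarity is 0.0 when neither document has any link.
    return len(n1 & n2) / len(union) if union else 0.0

# Hypothetical toy data.
parents = {"D1": {"X", "Y"}, "D2": {"Y"}}            # papers citing each document
children = {"D1": {"R1", "R2"}, "D2": {"R2", "R3"}}  # papers each document cites

score = amsler(parents, children, "D1", "D2")
print(score)  # 0.4: shared {Y, R2} out of {X, Y, R1, R2, R3}
```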

Citation proximity analysis is an enhancement of co-citation analysis that considers the proximity of citations to each other within an article’s full text (Gipp and Beel 2009). The citation proximity index (CPI) can be calculated in three steps. In the first step, documents are parsed and the position of each citation in the document is analysed. In the second step, each citation is assigned to the corresponding item in the bibliography. In the last step, the proximity between each pair of citations is analysed: the closer two citations are to each other, the more likely they are to be related. For example, if two citations are given in the same sentence, their CPI is 1; if they are in the same paragraph, the CPI is 1/2; and if they are in the same chapter, the CPI is 1/4.
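The last step can be sketched as follows, using only the three proximity levels named above. Representing a citation position as a `(chapter, paragraph, sentence)` tuple and returning 0 for citations in different chapters are assumptions made for this illustration; levels beyond the chapter are not covered here.

```python
from fractions import Fraction

def cpi(pos_a, pos_b):
    """CPI for two citation positions given as (chapter, paragraph, sentence)."""
    if pos_a == pos_b:              # same sentence
        return Fraction(1)
    if pos_a[:2] == pos_b[:2]:      # same paragraph, different sentence
        return Fraction(1, 2)
    if pos_a[0] == pos_b[0]:        # same chapter, different paragraph
        return Fraction(1, 4)
    return Fraction(0)              # assumption: no score beyond chapter level

print(cpi((1, 2, 3), (1, 2, 3)))  # 1
print(cpi((1, 2, 3), (1, 2, 5)))  # 1/2
print(cpi((1, 2, 3), (1, 4, 1)))  # 1/4
```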

Yan and Ding (2012) explored the similarity between six types of scholarly networks: citation networks, co-citation networks, Bibliographic Coupling networks, co-authorship networks, co-word networks and topical networks. Cosine distance was chosen to measure the similarity between these networks. They found that the citation network and the co-citation network, the Bibliographic Coupling network and the co-citation network, and the co-word network and the topical network have high similarity, whereas the topical network and the co-authorship network have low similarity. They recommended using hybrid networks to analyze research interaction and scholarly communication. Since this investigation relies on user perception of scientific literature on social media, the following subsection reviews existing altmetric studies in the context of clustering scientific documents.

A brief review on altmetric studies and social network analysis

Citation counts are frequently used for the evaluation of scientific research. However, the disadvantage of using citation counts to evaluate scientific research is that they accumulate slowly. Altmetrics are alternative indicators that are derived from social media and provide a quicker signal of scientific impact (Mohammadi and Thelwall 2014; Nawaz et al. 2012). People are increasingly going online to find and share information about science (Hellsten and Leydesdorff 2017; Hellsten et al. 2019; Joubert and Costas 2019; Robinson-Garcia et al. 2019). Specifically, researchers have been urged to consider how they can use social media platforms to engage with each other. Altmetrics offer innovative tools for researchers to explore public engagement with science on social media platforms. Consequently, new possibilities are emerging to analyse the interaction between researchers and research articles on social media platforms.

Several studies focus on socio-semantic analysis of scientific publications (Hellsten and Leydesdorff 2017; Hellsten et al. 2019; Joubert and Costas 2019; Robinson-Garcia et al. 2019). Joubert and Costas (2019) conducted an investigation to expand the understanding of the relationships and interactions between social media users and scientific outputs. They explored the identities, characteristics and activities of South African science tweeters, i.e. Twitter users in South Africa who tweet about research articles. The growing number of science tweeters, both overall and in relative terms, suggests that Twitter users are increasingly using this social media platform as a tool to share and discuss scientific outputs. Science tweeters actively contribute to the sharing of information about new research articles. Moreover, several studies focus on identifying topics of interest and communities of users using altmetrics data (Hellsten and Leydesdorff 2017; Hellsten et al. 2019; Joubert and Costas 2019; Robinson-Garcia et al. 2019). For example, Robinson-Garcia et al. (2019) identified the topics of interest within the field of Microbiology and the main sources driving such attention. Specifically, they combined data from Web of Science and altmetric.com to conduct their investigation. They found that a central area of the network is formed by papers discussed by the three outlets. Their topic analysis shows that the thematic focus of the papers mentioned varies by outlet.

The applications of altmetrics and social networks are expanding significantly; the novelty of this work, however, is the use of altmetrics for the clustering of scientific documents. To the best of our knowledge, no attempt has been made to use social media content, such as tweets about scientific publications, as a proxy to measure the similarity among papers.

Among all the altmetrics data sources, Twitter is the platform most widely used by the scientific community. Priem et al. (2010) investigated 46,515 tweets from a sample of 28 scholars and concluded that Twitter citations are much faster than traditional citation measures (Melero 2015). In addition, Priem et al. (2012) analyzed the correlation of altmetrics with citation counts and showed that altmetrics contribute significantly to the prediction of citations. An analysis across more than 40 cross-metric validation studies reported a weak correlation between citation counts and altmetrics, ranging from 0.08 to 0.5 (Erdt et al. 2016).

Hassan and Gillani (2016) measured the impact of the altmetric field. They collected data from social media sites including Twitter, Facebook, Mendeley, CiteULike and Wikipedia for the years 2010 to 2014. The information gathered was related only to authors working in the field of altmetrics. All scholarly information was gathered from the Google Scholar database. The dataset consists of relevant information on a total of 47 distinct scholars. They introduced the alt-index, similar to the h-index, based on the altmetric counts of the scholars. They observed a Pearson’s correlation of ρ = 0.247 between h-index and alt-index. A relatively high correlation was observed between social citations and scholarly citations, with ρ = 0.646. Moreover, Peoples et al. (2016) examined the relationship between traditional metrics of research impact and modern altmetrics, specifically Twitter activity, to measure the impact of a research article. They used a dataset of 1599 research articles from 20 ecology journals published from 2012 to 2014 and found a strong positive correlation between citation count and unique tweet count. According to them, Twitter activity was not dependent on the impact factor of the journal: the highest-impact journals were not necessarily the most tweeted. Their results indicate that altmetrics and traditional metrics are both useful for assessing research impact and are closely related to each other, but not exactly the same.

Liu and Fang (2017) investigated 79,441 English-language tweets about the top 100 research articles published in 2015. They assigned the tweets to different categories and recommended that tweets written by those involved in the publication of a paper should not be considered when measuring the impact of the research article. They also proposed omitting tweets whose content is irrelevant to the paper and tweets with a negative opinion. Only tweets with positive sentiment, and neutral tweets that represent agreement with the paper to a certain degree, should be considered when evaluating Twitter impact. After analyzing the tweet text, they presented a comprehensive list of positive and negative words or phrases that are commonly used by researchers when sharing their opinions about research work. They verified its correctness by searching for these terms in a large dataset of tweets. These words were then also added to the SentiStrength lexicons (Thelwall et al. 2013a, b). More recently, Didegah and Thelwall (2018) presented a comparative study investigating network-level differences between citations, Mendeley saves, and tweets for research articles. Surprisingly, they found only minor overlap between these three phenomena.

Older publications have lower coverage of altmetrics scores due to the less prevalent use of the social web at the time of publication. Comparatively, more recent research publications have much higher altmetrics counts (Thelwall et al. 2013a, b; Haustein et al. 2016). Additionally, Holmberg and Thelwall (2014) examined the cross-disciplinary usage of Twitter: how and why scholars use Twitter, and whether there exists a common pattern of usage among different fields. Tweets from different disciplines were analyzed and categorized into different groups. Their results showed a clear difference in Twitter usage among scholars in these disciplines. Zahedi et al. (2017) examined the characteristics of scientific literature and the types of people who share and discuss their research work on social media. The dataset on which they worked contained 1.3 million records combining both scholarly and social information. Different document features (document type, number of pages, cited sources, characters in the title, number of authors, countries of origin, and affiliated institutions) were then computed. According to earlier results, social media coverage is very low, with 22.6% of papers receiving at least one tweet, 5.2% publicly shared on Facebook, 2.3% mentioned in a blog post and 1.1% discussed by mainstream media (Zahedi et al. 2014).

Summary and comparison with our work

The literature review presents an array of studies that use citations among scientific publications to determine their semantic relatedness. As discussed earlier in Sect. “Introduction”, the limitations of the Bibliographic Coupling techniques are twofold: (a) these methods do not leverage user engagement with the scientific documents. As a result, the documents that best match users’ perception of a cluster are often missed. (b) The classic citation-based methods suffer from the inherent issue of publication and citation time lags.

In this paper we introduce Tweet Coupling, a methodology for clustering scientific publications according to their real-time usage on Twitter. One of the main advantages of exploiting user engagement with scientific publications on the Twitter platform is that such signals accumulate much faster than citation counts, which take at least a few years after the publication of an article to be ready for evaluation purposes (Costas et al. 2015; Haustein et al. 2015a, b; Shu et al. 2018; Ananiadou et al. 2013). To the best of our knowledge, no attempt has been made to use social media content, such as tweets about scientific publications, as a proxy to measure the similarity among papers. The next section elaborates the Tweet Coupling measure and compares it with conventional Bibliographic Coupling.

The Tweet Coupling methodology

In this section, we describe the Tweet Coupling methodology that is depicted in Fig. 1. The methodology is composed of two steps, which are the building of a coupling incidence matrix described in Sect. “Coupling incidence matrices”, and the building of the adjacency matrix from the incidence matrices detailed in Sect. “Bibliographic and Tweet Coupling”.

Fig. 1 Flow diagram of data inputs and processing; \(P_{i}\) is the ith paper and \(R_{i}\) is the ith reference

Two papers A and B are tweet coupled if a Twitter user mentions both of them, either in the same tweet or in two different tweets; we assume this reflects a relationship between the papers. In other words, two papers are tweet coupled if they have at least one common Twitter user. Thus, a large number of common Twitter users reflects a high ‘Tweet Coupling’ strength. Formally, ‘Tweet Coupling’ is described as follows: Let \(U = \left\{ {u_{1} ,\; u_{2} , \;u_{3} , \ldots ,\;u_{n} } \right\}\) be the set of Twitter users, \(T = \left\{ {t_{1} ,\; t_{2} , \;t_{3} , \ldots ,\;t_{n} } \right\}\) be the set of tweet texts by these users, and \(D = \left\{ {d_{1} , \;d_{2} ,\; d_{3} , \ldots ,\;d_{n} } \right\}\) be the set of scientific documents mentioned in \(t_{i}\) by \(u_{i}\). Let \(D_{{u_{i} }} = \left\{ {d_{{u_{1} }} ,\; d_{{u_{2} }} ,\; d_{{u_{3} }} , \ldots ,\;d_{{u_{n} }} } \right\}\) be the set of documents that a given user \(u_{i}\) mentions in tweet \(t_{i}\). Formally, two sets of documents are tweet coupled iff \(D_{{u_{i} }} \cap D_{{u^{\prime}_{j} }} \ne \emptyset\) and \(u \ne u^{\prime}\).
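The informal definition (two papers are coupled through their common Twitter users) can be sketched as follows. The `(user, paper)` records and the function name are hypothetical; the sketch groups users by paper and reports, for each pair of papers, how many users mentioned both.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: one (user, paper) record per mention in a tweet.
mentions = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "C"),
]

def tweet_coupling(mentions):
    """Tweet Coupling strength = number of users who mention both papers."""
    users_of = defaultdict(set)
    for user, paper in mentions:
        users_of[paper].add(user)
    strength = {}
    for p, q in combinations(sorted(users_of), 2):
        common = users_of[p] & users_of[q]
        if common:
            strength[(p, q)] = len(common)
    return strength

tc = tweet_coupling(mentions)
print(tc)  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}
```

Because users are collected into a set per paper, it makes no difference whether a user mentions the two papers in one tweet or in separate tweets.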

The application of the Tweet Coupling methodology begins with the identification of the scientific papers that are tweet coupled and bibliographically coupled, using altmetric data and Scopus reference lists respectively. We obtain the reference lists of all 1537 papers from the Scopus database to compute bibliographically coupled papers. Next, we tap the social activity around these papers on the Twitter platform using the altmetric database to compute tweet coupled papers. Finally, we measure the Jaccard similarity between bibliographically and tweet coupled papers to study their relationship.

Coupling incidence matrices

In order to compute Bibliographic Coupling, we generate an incidence matrix between scientific papers and their references. Similarly, to compute Tweet Coupling we generate an incidence matrix between scientific papers and Twitter users. An incidence matrix gives the relation between two classes of objects: one class along the rows of the matrix (i.e. scientific papers) and the other class along the columns (i.e. references or Twitter users). Each row represents a single research article and each column represents a single reference or Twitter user. If a reference occurs in the bibliography of a given paper, or a Twitter user mentions the paper, then the intersection of the row (scientific paper) and the column (reference or Twitter user) is set to ‘1’; otherwise it is set to ‘0’.
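A minimal sketch of this construction, on hypothetical toy data, is given below. The same helper serves both cases, since only the meaning of the columns (references or Twitter users) differs.

```python
def incidence_matrix(items_by_paper):
    """Build a 0/1 paper-by-item matrix; items are references or Twitter users."""
    papers = sorted(items_by_paper)
    columns = sorted(set().union(*items_by_paper.values()))
    col_index = {item: j for j, item in enumerate(columns)}
    matrix = [[0] * len(columns) for _ in papers]
    for i, paper in enumerate(papers):
        for item in items_by_paper[paper]:
            matrix[i][col_index[item]] = 1
    return papers, columns, matrix

# Hypothetical toy data: paper -> references in its bibliography.
refs = {"P1": {"R1", "R2"}, "P2": {"R2", "R3"}, "P3": {"R4"}}
papers, columns, B = incidence_matrix(refs)
print(columns)  # ['R1', 'R2', 'R3', 'R4']
print(B)        # [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
```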

Bibliographic and Tweet Coupling

In the next step, we compute adjacency matrices from the incidence matrices, which gives us the Bibliographic Coupling and Tweet Coupling matrices. An adjacency matrix is a square matrix that gives the connections between objects of the same class. In the case of a graph adjacency matrix, rows and columns are labeled with the graph vertices, and each entry represents the connection, or edge, between the two corresponding vertices. For a simple graph, the diagonal of the adjacency matrix is conventionally 0. We construct the adjacency matrices from the corresponding incidence matrices defined in Sect. “Coupling incidence matrices”. Denoting an incidence matrix by \(B\), the square matrix is defined in Eq. 2.

$$A_{\text{squareMatrix}} = B*B^{T}$$
(2)

Entries in matrix A represent the relation between pairs of scientific papers. A value of 0 means that the two scientific papers are not bibliographically or tweet coupled. A value greater than 0 means that the two papers are bibliographically or tweet coupled. A larger value signifies a stronger semantic relation between the scientific papers. In our square matrix, the diagonal values represent the total number of references or Twitter users for each scientific paper.
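Eq. 2 can be illustrated with a minimal pure-Python sketch on a hypothetical 3 × 4 incidence matrix; in practice a numerical library would typically be used for matrices of realistic size.

```python
def coupling_matrix(B):
    """Compute A = B * B^T: entry (i, j) counts items shared by papers i and j."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in B]
            for row_i in B]

# Hypothetical incidence matrix: 3 papers (rows) by 4 references (columns).
B = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
A = coupling_matrix(B)
print(A)  # [[2, 1, 0], [1, 2, 0], [0, 0, 1]]
```

Here papers 0 and 1 share one reference (off-diagonal value 1), paper 2 is coupled with neither, and the diagonal holds each paper's own number of references.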

Further, in order to measure a meaningful correlation between the Bibliographic Coupling and Tweet Coupling square matrices, we convert the square matrices to binary matrices by replacing all their non-zero values with 1. Furthermore, in both the tweet coupled and bibliographically coupled matrices, we also connect all those papers that are not directly connected but are indirectly connected via another paper. Using the Jaccard measure, we calculate the similarity between the two matrices. The Jaccard measure computes a similarity score as the ratio between the number of common members and the number of distinct members of the Tweet Coupling and Bibliographic Coupling matrices. Given two scientific papers \(P_{1}\) and \(P_{2}\), their Jaccard similarity can be computed as shown in Eq. 3.

$${\text{SIM}}_{j} = \frac{{\left| {P_{1} \cap P_{2} } \right|}}{{\left| {P_{1} \cup P_{2} } \right|}}$$
(3)

The Jaccard similarity coefficient ranges from 0 to 1. It is 1 when \(P_{1}\) and \(P_{2}\) are identical and 0 when they are completely different (Huang 2008).
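The binarisation, indirect-connection and Jaccard steps can be sketched as follows. This is one possible reading, not necessarily the exact procedure used here: indirect connections are added with a transitive closure (Warshall's algorithm), and the Jaccard score is taken over the sets of coupled paper pairs in the two matrices. The toy matrices are hypothetical.

```python
def binarise(A):
    """Replace every non-zero coupling strength by 1."""
    return [[1 if v else 0 for v in row] for row in A]

def transitive_closure(M):
    """Warshall's algorithm: also connect papers linked via intermediate papers."""
    n = len(M)
    R = [row[:] for row in M]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    return R

def coupled_pairs(M):
    """Unordered pairs (i, j), i < j, that are marked as coupled."""
    n = len(M)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if M[i][j]}

def jaccard(s1, s2):
    """|s1 & s2| / |s1 | s2|; defined as 0.0 here when both sets are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Hypothetical coupling-strength matrices for four papers.
bc = [[2, 1, 0, 0], [1, 3, 1, 0], [0, 1, 2, 0], [0, 0, 0, 1]]
tc = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

bc_pairs = coupled_pairs(transitive_closure(binarise(bc)))
tc_pairs = coupled_pairs(transitive_closure(binarise(tc)))
print(sorted(bc_pairs))             # [(0, 1), (0, 2), (1, 2)]
print(jaccard(bc_pairs, tc_pairs))  # 0.25
```

In the bibliographic toy matrix, papers 0 and 2 share no reference but both are coupled with paper 1, so the closure adds the pair (0, 2).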

Results and discussion

This section presents the dataset (see Sect. “Data and pre-processing”), the evaluation measures (see Sect. “Evaluation measures”), and the comparison of the results between Bibliographic Coupling and Tweet Coupling (see Sect. “Bibliographic and Tweet Coupling comparison”).

Data and pre-processing

The data used in the experimentation was provided by altmetric.com on June 14, 2016. The dataset comprised a total of 4.5 million JSON files. Each file contains information about a single article, and each article can be identified uniquely by an altmetric id. Our dataset contains all altmetric data from July 2011 to June 2016 and there was a total of 3081 scientific publications. From this initial dataset, we filtered out the publications that belong to the Library and Information Sciences journals, using the All Science Journal Classification adopted by Scopus. Since altmetric data provides information about the online web indices, references were collected from Scopus via the Scopus API using the article DOI (or the article title in cases where a DOI was not available). To get the tweet details, we used the tweet-id, which is given in the altmetric data for each of the 3081 publications. We used the Twitter API to fetch the details of each tweet, such as tweet text, name, screen name, follower count, description, retweet count, favorite count, friends count, status count, etc. Using the screen name as a unique identifier, we found that a total of 8299 Twitter users tweeted about the 3081 publications. Table 1 shows the statistics of the dataset used in the experimentation.

Table 1 Descriptive statistics of the Twitter dataset

There are a significant number of papers for which we find no tweets in our selected dataset. We decided to keep only those papers which have at least one tweet. Based on our cross-matching between references and tweet data set, we were left with 1537 papers which have a complete reference list and at least one tweet user interaction. The final dataset consists of 6272 references that were cited in at least one paper and 1551 twitter users that interact with at least one paper.

Evaluation measures

In order to evaluate our methodology, we compute the confusion matrix. The confusion matrix is given in Table 2. A confusion matrix contains four entries including (1) True Negative (TN); (2) True Positive (TP); (3) False Negative (FN); and (4) False Positive (FP). In the context of Bibliographic Coupling and Tweet Coupling, we define these terms as follows (see Table 2).

Table 2 Bibliographic Coupling and Tweet Coupling comparison confusion matrix

When the publications are actually bibliographic- and tweet coupled (i.e., Actual “YES”) and:

  (a) True Positive (TP): Our methodology predicted “YES” (i.e., they are bibliographic- and tweet coupled);

  (b) False Negative (FN): Our methodology predicted “NO” (i.e., they are not bibliographic- and tweet coupled).

When the publications are actually not bibliographic- and tweet coupled (i.e., Actual “NO”) and:

  (c) True Negative (TN): Our methodology predicted “NO” (i.e., they are not bibliographic- and tweet coupled);

  (d) False Positive (FP): Our methodology predicted “YES” (i.e., they are bibliographic- and tweet coupled).

Once we obtained the confusion matrix, we evaluated the performance of our solution using the following seven evaluation measures which can be derived from confusion matrix.

  1. Accuracy: how often, overall, our methodology predicts correctly (i.e., (TP + TN)/Total).

  2. Misclassification Rate (*MR): how often our methodology is wrong (i.e., (FP + FN)/Total).

  3. True Positive Rate (*TPR): when the publications are actually bibliographic- and tweet coupled (i.e., yes), how often our methodology predicts yes (i.e., TP/Actual Yes).

  4. False Positive Rate (*FPR): when the publications are actually not bibliographic- and tweet coupled (i.e., no), how often our methodology predicts yes (i.e., FP/Actual No).

  5. Specificity: when the publications are actually not bibliographic- and tweet coupled (i.e., no), how often our methodology predicts no (i.e., TN/Actual No).

  6. Precision: when our methodology predicts yes, how often it is correct (i.e., TP/Predicted Yes).

  7. Prevalence: how often the yes condition actually occurs in our dataset (i.e., Actual Yes/Total).
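The seven measures follow directly from the four confusion-matrix counts. The sketch below uses hypothetical counts, not the values of Table 2, and returns 0.0 when a denominator is zero, which is a convention chosen for this illustration.

```python
def ratio(numerator, denominator):
    """Safe division; 0.0 for an empty denominator is an assumed convention."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, fn, tn):
    """Derive the seven evaluation measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    actual_yes = tp + fn
    actual_no = fp + tn
    predicted_yes = tp + fp
    return {
        "accuracy": ratio(tp + tn, total),
        "misclassification_rate": ratio(fp + fn, total),
        "true_positive_rate": ratio(tp, actual_yes),
        "false_positive_rate": ratio(fp, actual_no),
        "specificity": ratio(tn, actual_no),
        "precision": ratio(tp, predicted_yes),
        "prevalence": ratio(actual_yes, total),
    }

# Hypothetical counts for illustration only.
m = binary_metrics(tp=30, fp=10, fn=20, tn=40)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Accuracy and misclassification rate always sum to 1, as do specificity and false positive rate, which offers a quick check on any reported table.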

Bibliographic and Tweet Coupling comparison

Table 2 shows the confusion matrix that we obtain as a result of the Jaccard similarity between Bibliographic Coupling and Tweet Coupling. The Bibliographic Coupling matrix contains 2,346,508 zero elements, whereas the Tweet Coupling matrix contains 2,306,376 zero elements. The counts of non-zero elements are 15,861 and 55,993, respectively.
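As a quick consistency check, and assuming each coupling matrix is the full 1537 × 1537 square matrix over the final set of papers, the zero and non-zero counts of each matrix should sum to \(1537^{2}\):

$$1537^{2} = 2{,}362{,}369 = 2{,}346{,}508 + 15{,}861 = 2{,}306{,}376 + 55{,}993$$

Under that assumption, non-zero entries make up roughly 0.67% of the Bibliographic Coupling matrix and roughly 2.37% of the Tweet Coupling matrix, so both matrices are very sparse.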

Table 3 shows the values of the binary classifier measures derived from our confusion matrix. While the similarity results show a high accuracy of 97%, we observe a low True Positive Rate (TPR) and Precision. To further investigate the relation between bibliographic coupled and tweet coupled papers, we empirically apply different thresholds on the number of common Twitter users and references.

Table 3 Results of comparison between Bibliographic Coupling and Tweet Coupling

Table 4 shows the evaluation results of Bibliographic Coupling and Tweet Coupling for different thresholds. With at least 10 common references and 10 common tweet users between papers, the reported accuracy is 94% and 75% for Bibliographic Coupling and Tweet Coupling, respectively. We then set the threshold value to 5 for both Bibliographic Coupling and Tweet Coupling. As shown in Table 4, with a threshold value of 5 the accuracy increases from 94% to 96% for Bibliographic Coupling and from 75% to 89% for Tweet Coupling. To maximize accuracy and true positive rate, we finally set the threshold to at least 3 common Twitter users tweeting about the papers for Tweet Coupling and at least 3 common references for Bibliographic Coupling. Our empirical evaluation suggests that the best similarity match between bibliographic coupled and tweet coupled papers is achieved at a threshold of at least 3 common references and 3 common tweet users per coupled pair of papers. At this threshold, the accuracy of Bibliographic Coupling does not change, but its true positive rate drops to 1%, while its misclassification rate, false positive rate, and specificity do not change significantly. For Tweet Coupling, on the other hand, the accuracy increases to 92% with a true positive rate of 74%; the misclassification rate and false positive rate decrease to 7.9% and 7.6%, respectively, and the specificity increases to 92%.

Table 4 Evaluation results of Bibliographic Coupling (BC) and Tweet Coupling (TC) and different thresholds
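The thresholding described above can be sketched as follows: count the items two papers share (Twitter users for Tweet Coupling, cited references for Bibliographic Coupling) and treat a pair as coupled only when that count reaches the threshold. This is an illustrative sketch, not the authors' implementation. The function names, the paper and user identifiers, and the choice of "at least" (>=) as the comparison are assumptions.

```python
from itertools import combinations


def shared_item_counts(items_by_paper):
    """Number of shared items for every unordered pair of papers.

    `items_by_paper` maps a paper id to a set of items, e.g. the Twitter
    users who tweeted the paper or the references the paper cites.
    """
    counts = {}
    for paper_a, paper_b in combinations(sorted(items_by_paper), 2):
        shared = items_by_paper[paper_a] & items_by_paper[paper_b]
        counts[(paper_a, paper_b)] = len(shared)
    return counts


def coupled_pairs(counts, threshold):
    """Pairs whose shared-item count is at least `threshold` (assumed >=)."""
    return {pair for pair, n in counts.items() if n >= threshold}


# Made-up tweeters per paper, for illustration only.
tweeters = {
    "p1": {"u1", "u2", "u3", "u4"},
    "p2": {"u2", "u3", "u4", "u5"},
    "p3": {"u4", "u9"},
}

tweet_counts = shared_item_counts(tweeters)
for threshold in (1, 3, 5):
    print(threshold, sorted(coupled_pairs(tweet_counts, threshold)))
```

Raising the threshold can only shrink the set of coupled pairs, so each threshold setting yields a different binarised matrix and hence a different confusion matrix.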

Bibliographic and Tweet Coupling network comparison

Further, we create networks from the Bibliographic Coupling matrix and the Tweet Coupling matrix using the VOSviewer software (Footnote 3). VOSviewer is a software tool for constructing and visualizing bibliometric networks. These networks (clusters) can be constructed based on Bibliographic Coupling or Tweet Coupling. From our Bibliographic Coupling and Tweet Coupling matrices (see Sects. 3.3 and 3.4), we visualise the relationships among papers in Figs. 2 and 3, respectively. Note that each paper is represented by its source title (the journal or conference in which it was published) concatenated with a system-generated unique paper identification number. Using the dataset of 1537 publications that are both bibliographically and tweet coupled, the visualisation helps to understand how papers are clustered with respect to source titles.

Fig. 2 A visualization of the Bibliographic Coupling network of publications

Fig. 3 A visualization of the Tweet Coupling network of publications

Figure 2 shows the Bibliographic Coupling network, grouped into 26 clusters. The largest cluster contains 141 publications and the smallest contains 2. Fig. 3 visualises the Tweet Coupling network of publications, grouped into 17 clusters, where the largest cluster contains 201 publications and the smallest contains 4. Drilling down further into these networks, Figs. 4 and 5 show the clustering by journal as bar graphs for Bibliographic Coupling and Tweet Coupling, respectively. We removed a journal from a cluster if it had fewer than 4 papers in that cluster, which left 22 of the 26 clusters in Bibliographic Coupling and 16 of the 17 clusters in Tweet Coupling.
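The filtering step can be sketched as below, assuming the cluster assignments are available as one (cluster, journal) pair per paper. That input format, the function name, and the journal names are hypothetical, since the text does not describe how the VOSviewer output was processed. A cluster disappears when none of its journals reaches the minimum.

```python
from collections import Counter


def filter_clusters(assignments, min_papers=4):
    """Keep only journals with at least `min_papers` papers in a cluster.

    `assignments` is an iterable of (cluster_id, journal) pairs, one per
    paper. Returns {cluster_id: {journal: paper_count}}; clusters with no
    surviving journal are omitted.
    """
    paper_counts = Counter(assignments)
    kept = {}
    for (cluster_id, journal), count in paper_counts.items():
        if count >= min_papers:
            kept.setdefault(cluster_id, {})[journal] = count
    return kept


# Made-up assignments, for illustration only.
assignments = (
    [("C-1", "Journal A")] * 5
    + [("C-1", "Journal B")] * 2
    + [("C-2", "Journal C")] * 3
)

filtered = filter_clusters(assignments)
print(filtered)
```

In this toy input, "Journal B" is dropped from C-1 and the whole of C-2 is dropped, which mirrors how the number of clusters can fall after filtering.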

Fig. 4 Result of papers clustering by journals using Bibliographic Coupling

Fig. 5 Result of papers clustering by journals using Tweet Coupling

The analysis shows a number of clusters that are similar, in terms of the journals present, under both Bibliographic Coupling and Tweet Coupling: C-2, C-3, C-4, C-15, and C-23. In Tweet Coupling, scientific communities working in similar fields are more connected than in Bibliographic Coupling. In contrast to Tweet Coupling, Bibliographic Coupling-based clusters contain journals from different subfields within Library and Information Sciences. For example, in Bibliographic Coupling the “Collection Building” journal and the “Journal of Health Communication” fall together in cluster C-15, whereas in Tweet Coupling the “Collection Building” journal is grouped with core journals of Library and Information Sciences in cluster C-7. Similarly, bibliographic-based clustering places the “Journal of Health Communication” in cluster C-15, together with core Library and Information Sciences journals, whereas Tweet Coupling-based clustering groups the same journal with other journals in the subfield of health informatics, in cluster C-2. We also see that, with bibliographic-based clustering, the Electronic Markets journal appears in both C-8 and C-18 alongside journals from different subfields of Library and Information Sciences, whereas with Tweet Coupling-based clustering it appears in a single cluster.

Overall, the clustering results show that Bibliographic Coupling-based and Tweet Coupling-based clustering complement each other in grouping similar papers into clusters. However, Tweet Coupling-based clustering highlights an interesting phenomenon: tweet users on social media networks are interested in similar subfields within Library and Information Sciences, in contrast to bibliographic-based clustering, which groups cross-disciplinary journals within a cluster.

Concluding remarks

In this study, we have examined the similarity of documents based on their social usage by online communities on the Twitter platform and on the references cited by the authors of the publications. We propose the concept of Tweet Coupling, a methodology for clustering scientific documents that takes into account their social usage, whereas we use Bibliographic Coupling to find the similarity among publications from the authors’ perspective. Our analysis shows that journals within a subfield are strongly connected with each other in Tweet Coupling, whereas bibliographic-based clustering shows cross-disciplinary journals within a group. We believe that tapping into crowdsourced data provides a unique perspective on the online social media community engaged with scientific publications. More specifically, in contrast to conventional approaches such as Bibliographic Coupling or co-citation, which come with the inherent issue of publication and citation time lags, Tweet Coupling can determine the similarity between papers based on real-time usage or discussion of scientific literature on social media platforms. Moreover, Tweet Coupling is well suited to scholarly data management tasks such as clustering, classification or information retrieval, where user perception is important for grouping publications. Also, in contrast to traditional bibliographic-based approaches, the Tweet Coupling-based method can produce fine-grained clusters down to sub-disciplines within a broader discipline for improved document management.

Although the reported accuracy is high, this method has some limitations. We found that not all publications are discussed on Twitter, so a portion of the publication dataset has to be discarded before the comparison can be performed. The prevalence of corrupted DOIs in the altmetrics dataset also hinders wider application of this method. In the future, we plan to measure similarity between publications by incorporating tweet text together with document titles and abstracts when computing the Tweet Coupling and Bibliographic Coupling metrics. We believe that a co-word analysis of tweets and of paper titles and abstracts can reveal the semantic relation between social usage and bibliographic usage of references.

Further studies can also examine tweet sentiment (positive, negative and neutral). Papers whose tweets carry more positive sentiment could be assigned a higher weight when evaluating the research impact of publications, which may improve citation prediction results. It is possible that the most recent publications have received more attention on social media as the usage of social media has increased among scholars, while receiving fewer citations because less time has passed since publication. Therefore, accounting for the time span alongside tweet sentiment when predicting citation counts may improve the results.

Specific to a discipline, social usage helps us to determine the communication and writing style of that discipline. Semantic analysis of tweets that belong to influential network nodes may produce interesting results. Social network analysis can be used to establish a relationship between influential tweeters and the relational structure of social media. Last but not least, our current approach considered only Twitter to find the relationship between social citation and academic citation; we can expand on this by including multiple social media platforms such as Facebook, Google+, etc., and potentially improve the results.

We believe that Tweet Coupling can be further exploited in future studies for scientific document search applications such as classification of scientific documents, recommendation systems, and information retrieval systems for digital libraries.