Introduction

Social media has been recognized as “the most pervasive form of communication in all fields today” (McCaughey et al. 2014), profoundly changing the way people interact with one another. Social media is also changing the way science and academic topics are communicated (Sugimoto et al. 2017). According to estimates by the company Altmetric.com, around 15,000 unique research outputs are shared or mentioned online each day, and a research output is mentioned online every 1.8 s (Altmetric 2016). Some scholars (e.g., Rowlands et al. 2011; Van Noorden 2014; Haustein 2016) argue that social media can promote openness and transparency by making the peer-review process more visible and by allowing scholarly ideas and results to be discussed and scrutinized more openly. In addition, social media attention to scholarly research can help increase public attention to science. Academic social media users, especially researchers, can quickly disseminate their studies and publications, pushing knowledge directly to their audiences (Allen et al. 2013).

The transformative power of social media in scholarly communication opens up the study of the social media impact (i.e. popularity, attention, visibility, etc.) of scientific research, making it a new research area within Scientometrics (Bornmann 2014; Bornmann and Haunschild 2017). The analysis of the interactions between social media and scholarly agents and products (Haustein et al. 2016), popularly known as “altmetrics” and more specifically as “Social Media Metrics” (SMM) of science, has opened a new scientometric perspective. It has the potential to complement the more traditional citation-based indicators and to expand the understanding of how scientific ideas and topics are discussed and disseminated across diverse communities (Costas 2018).

An important characteristic of SMM of science is the large heterogeneity of their sources and metrics. Studies range from mentions of scientific articles on microblogging platforms like Twitter and Weibo, to posts about scientific research on social network sites such as Facebook and Google+, saves of scientific references in online reference managers like Mendeley and CiteULike, reviews on F1000Prime, Publons or PubPeer, and mentions in scholarly blogs, news and mainstream media (e.g., Haunschild and Bornmann 2015; Haustein et al. 2015; Thelwall 2017; Maflahi and Thelwall 2018; Robinson-Garcia et al. 2019). Previous research in the field has also focused on the most important sources providing altmetric data (e.g., Thelwall et al. 2013; Wouters and Costas 2012; Zahedi et al. 2014), the coverage of scientific publications across altmetric sources (i.e. the percentage of documents with at least one mention on a particular social media platform) (e.g., Alperin 2015; Costas et al. 2015; Haustein et al. 2015), and the correlation between these new metrics and traditional bibliometric indicators, particularly citation impact (e.g., Costas et al. 2015; Haustein et al. 2014; Thelwall et al. 2013).

In addition to the role of social media in increasing the visibility of scholars and their work, research on SMM of science has also attempted to trace the perceptions and opinions of online communities about specific scientific fields or topics, for instance “climate change” (e.g., An et al. 2014; Pearce et al. 2014; Haustein et al. 2014), “Rio+20” (Footnote 1; Hellsten and Leydesdorff 2017), and “migrant crisis” (Nerghes and Lee 2018). In a recent study, Haunschild and colleagues (2019) explored a novel network approach to compare topics between researchers and Twitter users based on author keywords and Twitter hashtags, and found that publications that are tweeted can be clearly distinguished from those that are not. Studies of this type emphasize the “inherently social” nature of altmetric sources such as blogs or Twitter (Walker 2006), where forwarding and commenting functionalities enable “the shift from public understanding to public engagement with science” (Kouper 2010; Sugimoto et al. 2017).

As highlighted by Sugimoto and colleagues (2017), broader social impact should not be conceived merely as a distinction between the audiences who receive the work, or as recognition of work that catches the attention of audiences, but rather as the amplification of the different voices that disseminate the work and attract attention to it. In fact, social media is more than just marketing for academic work. It can inform every step of the research process: helping researchers keep a pulse on developments in the fields or topics they are interested in, assisting in the promotion of published work, and helping to harvest useful feedback for further research (Alampi 2012).

Accordingly, we argue that, in addition to the potential role of social media as an alternative means of assessing research impact, their role in the dynamics and patterns of the cross-platform or cross-community shift of academic topics is also worth exploring. This paper contributes to this aim. Taking the research area of Big Data as a case study, we investigate the semantic similarity between the topics of publications and the topics in the discussions of audiences mentioning and disseminating those publications across different altmetric sources, including Blogs, News, Policy documents (Footnote 2), Wikipedia and Twitter. More specifically, we address the following questions:

  1. What are the most important academic topics represented by the high-frequency author keywords in Big Data publications?

  2. How do online audiences from different altmetric sources deal with the academic topics in their online discussions? In essence, how (dis)similar are the terms used by both communities (academic and online) in representing the same publications?

  3. More specifically, on which platform are the audiences’ terms more consistent with those of Big Data publications (i.e. author keywords)? And, in the online community, on which platforms do the online audiences use more similar terms in their discussions?

Methodology

We used Web of Science (WoS) and altmetric data from the in-house databases of the Centre for Science and Technology Studies (CWTS), derived from the Science Citation Index Expanded (SCI-E), the Social Sciences Citation Index (SSCI), and the Arts and Humanities Citation Index (AHCI), as well as from Altmetric.com (Footnote 3). A comprehensive list of 9596 scholarly documents (i.e. Article, Review, and Letter) related to the research area of Big Data was obtained by searching for the terms “big data” or “bigdata” in the title, abstract and keywords of publications; we refer to these documents as Big Data publications. About 90% of them (8626) have a Digital Object Identifier (DOI) in the WoS database, which enables us to match them with the altmetric data. Although this search strategy does not cover all publications related to the research area of Big Data, such a narrow but precise approach is the most efficient way to identify publications that are unambiguously aligned with the core concept of “Big Data”.
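The search strategy described above can be illustrated with a minimal sketch. The records, field names and placeholder DOIs below are hypothetical and do not reflect the WoS or CWTS database schema; the sketch only shows the idea of matching “big data” or “bigdata” in title, abstract and keywords and then keeping the records that carry a DOI.

```python
import re

# Hypothetical records; field names are illustrative and not the WoS schema.
records = [
    {"title": "Big Data analytics in health care", "abstract": "",
     "keywords": ["Hadoop"], "doi": "10.0000/placeholder-1"},
    {"title": "A survey of stream processing",
     "abstract": "We review bigdata platforms.", "keywords": [], "doi": None},
    {"title": "Large data sets in genomics", "abstract": "Sequencing at scale.",
     "keywords": ["Genomics"], "doi": "10.0000/placeholder-2"},
]

# Matches "big data" or "bigdata", case-insensitively, as whole words.
SEARCH = re.compile(r"\bbig ?data\b", re.IGNORECASE)


def is_big_data(record):
    """True if the search term occurs in the title, abstract or keywords."""
    fields = [record["title"], record["abstract"], *record["keywords"]]
    return any(SEARCH.search(field) for field in fields)


selected = [r for r in records if is_big_data(r)]
matchable = [r for r in selected if r["doi"]]  # only records with a DOI can be matched

# Share of publications with a DOI, using the counts reported in the text.
reported_doi_share = round(8626 / 9596, 3)
```

With the counts reported in the text, `reported_doi_share` comes out just under 0.9, in line with the “about 90%” stated above.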

From a social media metric point of view, once a publication is mentioned in a post on an altmetric platform, a publication-post linkage is established. The online user who published this post can be seen as the audience of the publication mentioned. We propose a conceptual model of the process of topic spreading from academia to different altmetric sources (see Fig. 1).

Fig. 1 Instance of topic spreading model across altmetric sources. (Color figure online)

In this model, the online audiences from the five platforms (i.e. Twitter, Blog, Policy, News, and Wikipedia) who mention at least one Big Data publication are the Big Data audiences of the publications. That is to say, these audiences wrote and posted online events referencing these publications, and these events constitute the online discussions (the blue circle) about the research area of Big Data. The online events can thus be seen as a channel through which academic topics are spread, and potentially amplified, from the academic community to the online community. To further explore the topic similarity between the two communities, the author keywords from the publications and the textual terms from the online events were extracted and processed. Because the text structure differs between platforms, title or summary terms (from blogs, news, policy documents, and Wikipedia articles) and hashtags (from tweets) are extracted separately, which also divides the online audiences into two groups. The concepts in the model are detailed as follows:

  • Big Data publications: scientific publications included in our dataset that directly use “big data” in the title, abstract or author keywords. The authors of these publications are referred to simply as Big Data authors.

  • Big Data audiences: users across the five platforms (Twitter, Blog, Policy, News, and Wikipedia) who have mentioned at least one Big Data publication. They are further divided into two groups:

    • Audiences on Twitter

    • Audiences on Blog, News, Policy and Wikipedia (Footnote 4)

  • Big Data topics: high-frequency author keywords (K) from publications and terms from social media events. Two approaches are applied to acquire the terms from the two audience groups:

    • Text terms (T): terms generated from the titles of blogs, news and policy documents, as well as from the first sentence of the summaries of Wikipedia articles.

    • Hashtags (H): terms starting with the # sign in tweets. Hashtags form a system of categorization within Twitter and have a function similar to that of author keywords in publications (Haustein 2016) (Footnote 5).
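As an illustration of the two kinds of audience text defined above, the following sketch pulls hashtags out of a tweet and the first sentence out of a summary. It is a simplified assumption of how such raw text might be prepared and not the extraction pipeline used in the study; the function names and the naive sentence split are hypothetical.

```python
import re

HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(tweet_text):
    """Return the hashtags of a tweet, lower-cased and without the # sign."""
    return [tag.lower() for tag in HASHTAG.findall(tweet_text)]


def first_sentence(summary):
    """Naive split: everything up to the first '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", summary.strip(), maxsplit=1)[0]


tags = extract_hashtags("New paper on #BigData and #AI in #healthcare")
lead = first_sentence("Big data is a field of study. It treats large data sets.")
```

Lower-casing the hashtags is a design choice made here so that variants such as “#BigData” and “#bigdata” count as one term; abbreviations in the summary (e.g. “e.g.”) would defeat the naive sentence split and need a proper sentence tokenizer.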

Of the 8626 publications with DOIs, 3563 (41.3%) have been mentioned at least once on at least one of the five altmetric sources: 3493 (40.5%) have been tweeted by Twitter users, 697 (8.1%) have been mentioned by users of the other four platforms, and 627 (7.3%) by audiences in both groups (Table 1).
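The counts in this paragraph are related by inclusion-exclusion: publications mentioned by either group equal those tweeted plus those mentioned on the other platforms minus those mentioned by both. The short check below uses only the numbers reported in the text; the variable names are illustrative.

```python
with_doi = 8626          # publications with a DOI
tweeted = 3493           # mentioned by the Twitter audience
other_platforms = 697    # mentioned on Blog, News, Policy or Wikipedia
both_groups = 627        # mentioned by both audience groups

# Inclusion-exclusion over the two audience groups.
mentioned_any = tweeted + other_platforms - both_groups

# Percentages relative to the publications with a DOI, rounded to one decimal.
shares = {
    "any": round(100 * mentioned_any / with_doi, 1),
    "twitter": round(100 * tweeted / with_doi, 1),
    "other": round(100 * other_platforms / with_doi, 1),
    "both": round(100 * both_groups / with_doi, 1),
}
```

It also follows that 3493 − 627 = 2866 publications were mentioned on Twitter only, and 697 − 627 = 70 only on the other four platforms.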

Table 1 Statistic description of data used in the study

According to the model, we divide our research process into several steps:

  • 1. Identification of topics of publications and online audiences. VOSviewer (Van Eck and Waltman 2009) was used to extract high-frequency author keywords, hashtags and textual terms as the topics of the three groups, respectively. Because the number of topics differs between groups, we uniformly selected the top-100 topics with the highest frequency in each. The text mining functionality of VOSviewer supports the creation of term maps from a corpus of documents in the following steps (Van Eck and Waltman 2011):

    1. Identification of noun phrases with an approach developed by Van Eck et al. (2010a). A linguistic filter that selects all word sequences consisting exclusively of nouns and adjectives and ending with a noun was used to identify noun phrases.

    2. Selection of the most relevant noun phrases. The selected noun phrases are referred to as terms. For each noun phrase, the distribution of (second-order) co-occurrences over all noun phrases is determined. This distribution is compared with the overall distribution of co-occurrences over noun phrases. The larger the difference between the two distributions, the higher the relevance of a noun phrase. Then, noun phrases with a high relevance are grouped together into clusters.

    3. Mapping and clustering of the terms. The unified framework for mapping and clustering (Van Eck et al. 2010b; Waltman et al. 2010) is used in this step.

    4. Visualization of the mapping and clustering results.

  • 2. Similarity measurement. Cosine similarity was applied to quantify the degree of (dis)similarity among the topic sets of different groups. It is formulated as follows:

    $$\text{Similarity} = \frac{A \cdot B}{\left\| A \right\| \left\| B \right\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} \tag{1}$$

    In Eq. (1), $A_i$ and $B_i$ are the components of vectors $A$ and $B$, respectively (different topic sets in our study). The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality or decorrelation; in-between values indicate intermediate similarity or dissimilarity (Huang 2008). For vectors with non-negative components, such as topic occurrences, the value lies between 0 and 1.

  • 3. Comparison of different types of topics. All topics can be classified into four non-overlapping types on the basis of the groups in which they occur:

    • KTH: topics that appear in all groups, as author keywords, terms, and hashtags, and can be considered the common topics of publications and online audiences;

    • K: topics that appear only as author keywords and can be considered the pure academic topics;

    • T/H/TH: topics that appear only as terms, only as hashtags, or as both, but not as author keywords. They can be regarded as the pure audience topics or, alternatively, as to some extent the amplification of academic topics (Footnote 6) in online communities;

    • KT/KH: topics that appear as author keywords and in exactly one of the two audience groups (i.e. hashtags or text terms).

The analysis of the different types of topics helps to interpret where publications and online audiences place their focus within the research area of Big Data, as well as how topics shift from academia to the online community.
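Steps 2 and 3 can be sketched together on toy data. The sketch assumes that each topic set is encoded as a binary presence vector over the combined vocabulary; the text describes the vectors only as “different topic sets”, so this encoding, the toy topics and the function names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors, as in Eq. (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty topic sets
    return dot / (norm_a * norm_b)


def to_binary_vector(topic_set, vocabulary):
    """1 if the topic occurs in the set, else 0, in vocabulary order."""
    return [1 if topic in topic_set else 0 for topic in vocabulary]


def classify_topics(keywords, terms, hashtags):
    """Label each topic by the groups it occurs in, e.g. 'KTH', 'K', 'TH', 'KH'."""
    groups = {"K": set(keywords), "T": set(terms), "H": set(hashtags)}
    all_topics = set().union(*groups.values())
    return {t: "".join(g for g in "KTH" if t in groups[g]) for t in all_topics}


# Toy topic sets (hypothetical, not the study data).
K = {"big data", "hadoop", "privacy", "social media"}
T = {"big data", "facebook", "privacy", "study"}
H = {"big data", "facebook", "social media", "genomics"}

vocabulary = sorted(K | T | H)
sim_KH = cosine_similarity(to_binary_vector(K, vocabulary),
                           to_binary_vector(H, vocabulary))
types = classify_topics(K, T, H)
```

Under the binary encoding, the cosine of two sets reduces to the number of shared topics divided by the geometric mean of the set sizes, so two sets of four topics sharing two topics give 0.5.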

Results

A number of different analyses are performed in order to answer the research questions stated above. This section presents the results of these analyses, including topic identification, similarity analysis, and comparison among topics of groups.

Identification of topics

Author keywords

Of the 8626 publications, 6689 (about 78%) carry author keywords: 19,065 distinct keywords with 36,362 occurrences in total. The top-100 author keywords, taken as the topics of Big Data publications, account for approximately 22.6% of all occurrences. Figure 2 shows the cluster map (Footnote 7) of these author keywords based on their co-occurrences in Big Data publications. Each item represents an author keyword. The size of an item indicates its total number of occurrences. The color of an item represents the main cluster to which it belongs. The distance between two items offers an approximate indication of their relatedness in terms of co-occurrences.
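The share of occurrences covered by the top-N keywords can be computed with a simple frequency count. The sketch below is a hedged illustration on toy keyword lists; lower-casing as the only normalization and the function name `top_n_share` are assumptions, and the study itself relied on VOSviewer for this step.

```python
from collections import Counter


def top_n_share(keyword_lists, n):
    """Return the n most frequent keywords and their share of all occurrences."""
    counts = Counter(kw.strip().lower() for kws in keyword_lists for kw in kws)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(c for _, c in top) / total if total else 0.0
    return top, share


# Toy author-keyword lists, one list per publication (hypothetical data).
toy_keywords = [
    ["Big data", "Hadoop"],
    ["big data", "Privacy"],
    ["Big Data", "Cloud computing", "Hadoop"],
]
top2, share2 = top_n_share(toy_keywords, 2)
```

In the toy data the two most frequent keywords cover five of seven occurrences; in the study the top-100 keywords cover about 22.6% of the 36,362 occurrences.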

Fig. 2 Cluster map of high-frequency author keywords of publications. (Color figure online)

This term map provides a clear overview of the main author keywords of the Big Data publications. Three clusters can be identified. The red cluster is the largest, containing 44 author keywords, many of which relate to social issues ranging from industrial development to social media, such as “Internet of things”, “Social media”, and “Industry 4.0”. The green cluster, the second largest with 31 terms, contains terms on the application of data analytics in bioscience and medicine, for instance “Bioinformatics” and “Precision medicine”. The blue cluster is the smallest and focuses on core technologies, especially machine learning and cloud computing-related techniques (e.g., “Cloud computing”, “Hadoop”, and “Mapreduce”) (Footnote 8). Although the keyword “Machine learning” is located in the green cluster, it lies quite close to the technology cluster. The top-100 author keywords thus seem to range from the core technologies of Big Data to its major applications and social impact.

Table 2 details the top-10 author keywords with the highest frequency. Remarkably, the search term “Big data” appears as an author keyword in only 44% of the publications, indicating that most Big Data publications mention it in the title or abstract rather than tagging themselves with it directly. The second to fifth places on the list are all technology-related terms (i.e. “Cloud computing”, “Machine learning”, “Data mining”, and “Mapreduce”). However, these four topics appear in only about 3.5% of publications on average, demonstrating how diverse and scattered the topics around the research area of Big Data are. The high frequency of “Social media”, “Internet of Things”, and “Privacy” implies that the opportunities and challenges brought by the explosion of massive data have aroused great concern and discussion among scholars, especially those in the social sciences.

Table 2 Top-10 high-frequency author keywords

Title or summary terms

A total of 3063 titles or summaries of posts mentioning Big Data publications in blogs, news, policy documents, and Wikipedia articles were obtained. Of these, 1855 (60.6%) are Blog titles and 973 (31.8%) are News titles, while Wikipedia and Policy account for only 4.8% (146 summaries) and 2.9% (89 titles), respectively. Altogether, 5512 terms with 9447 occurrences were extracted by VOSviewer using the same approach as for author keywords. Figure 3 shows the map of the top-100 high-frequency terms, divided into four clusters.
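The term extraction relies on the linguistic filter described in the Methodology: word sequences made up only of nouns and adjectives that end with a noun. The sketch below shows that idea on tokens that are already part-of-speech tagged. It is not VOSviewer's implementation: the tag names, the hand-tagged example and the choice to return only maximal sequences are simplifying assumptions.

```python
def noun_phrases(tagged_tokens):
    """Return maximal runs of adjectives/nouns, trimmed so that each ends with a noun."""
    phrases = []
    run = []

    def flush():
        # Drop trailing adjectives so the phrase ends with a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag in ("NOUN", "ADJ"):
            run.append((word, tag))
        else:
            flush()  # any other part of speech ends the current run
    flush()
    return phrases


# Hand-tagged toy title (hypothetical); a real pipeline would use a POS tagger.
title = [
    ("Mental", "ADJ"), ("health", "NOUN"), ("improves", "VERB"),
    ("with", "ADP"), ("social", "ADJ"), ("media", "NOUN"), ("use", "NOUN"),
    ("and", "CONJ"), ("big", "ADJ"),
]
phrases = noun_phrases(title)
```

The dangling adjective “big” at the end is discarded because no noun follows it, which is the role of the “ends with a noun” condition.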

Fig. 3 Cluster map of high-frequency terms from blog, news, policy, and Wikipedia. (Color figure online)

The largest cluster (red), containing almost half of the terms (45), relates to general issues, typically in medical science and health care (e.g., “Patient”, “Mental health”, “Disease”, and “Depression”). Some social media-related terms, such as “Tweet”, “App” and “Instagram”, have also received a lot of attention. The green cluster, the second largest with 42 terms, covers terms associated with scientists and research, for instance “Scientist”, “Study” and “Publication”. Terms about interpersonal relationships and political affairs are distributed across the two smaller clusters (blue and yellow). Moreover, “Facebook” has the most links in the network, far more than “Big data”, illustrating its popularity among the online audiences. Nonetheless, because of the skewed distribution of links, “Facebook” is the center of the cluster it belongs to but not of the whole network.

“Facebook” ranks first among the top-10 high-frequency terms (Table 3), appearing in 273 (8.94%) entries and surpassing “Big Data”, which ranks second (184, 5.89%). To some extent, this may signal a shift in the focus of the online community around Big Data publications compared with the focus of academic scholars. The high frequency of “Study”, “Research” and “Science” highlights the importance of scientific literature as a main information source for these posts. In addition, mental health-related terms such as “Depression” and “Emotion” have gained substantial attention from the audiences; mental health is one of the main application fields of Big Data analysis technologies that closely concerns individuals.

Table 3 Top-10 high-frequency text terms

The overlay maps in Fig. 4 further display the sources of these terms, as well as their occurrences on each platform. The overlay scores used in these maps are normalized by dividing by the mean, so that the four sources can be compared with each other. The color depth of a term is based on its overlay score: the higher the frequency, the darker the color. A term shown in gray does not appear in the corresponding source.
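The normalization of overlay scores can be sketched as follows. The sketch assumes that the mean is taken over all plotted terms within one source, including terms with zero occurrences; the text does not spell this out, and the counts and function name are hypothetical.

```python
def normalize_by_mean(counts):
    """Divide each term's occurrence count by the mean count of its source."""
    mean = sum(counts.values()) / len(counts) if counts else 0.0
    if mean == 0:
        return {term: 0.0 for term in counts}
    return {term: count / mean for term, count in counts.items()}


# Toy occurrence counts of the same three terms in two sources (hypothetical).
blog_counts = {"facebook": 6, "study": 3, "eu law": 0}
policy_counts = {"facebook": 0, "study": 1, "eu law": 2}

blog_scores = normalize_by_mean(blog_counts)
policy_scores = normalize_by_mean(policy_counts)
```

After normalization a score of 1.0 means “as frequent as the average term in that source”, so “eu law” with only 2 occurrences in the small policy source scores as high as “facebook” with 6 occurrences in the much larger blog source.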

Fig. 4 Overlay maps of terms from blog, Wikipedia, news, and policy. In brackets is the number of terms finally selected. (Color figure online)

Blogs and news contribute more terms because of their larger numbers of titles. Topics related to social media, health care, and science are of common interest to the users of these two platforms (e.g., “Facebook”, “Emotion” and “Research”). News has a wider focus than blogs, covering more diverse terms that range from medicine (e.g., “Mental health” and “Alzheimer”) to technologies and social issues (e.g., “Nanotechnology” and “Poverty”). By comparison, policy documents and Wikipedia entries have a more limited focus on Big Data publications, with fewer publications mentioned. The high-frequency terms in these two groups suggest quite different concerns on these platforms: Wikipedia entries are more oriented towards the research and application of internet and web technologies, while policy documents are clearly oriented towards more general issues related to social progress, such as “EU law” and “Climate change”.

In the “Appendix”, we also provide four separate cluster maps of the terms extracted from the titles of blogs, news and policy documents and from the first sentences of the summaries of Wikipedia articles (Fig. 13). Because the number of entries varies between sources, the minimum number of occurrences for a term to be plotted is 3 for Blogs and News, and 2 for Policy documents and Wikipedia articles. The results shown in these figures differ little from those obtained with the approach described above. Blogs and News media mentioning Big Data publications have a stronger semantic relationship with topics around medicine, health care, social media research, and technologies. Policy documents citing Big Data publications tend to focus more on political, legal or social issues related to Big Data (e.g., “eu law”, “privacy”, or “policy”), while mentions of Big Data publications on Wikipedia are more oriented towards academic, technical and more theoretical topics (e.g., “university”, “cloud computing”, or “theory”).

Hashtags of tweets

A total of 4566 hashtags with 41,412 occurrences were obtained from 42,341 distinct tweets. As for the other groups, a cluster map of the top-100 high-frequency hashtags is provided in Fig. 5; it contains four clusters. The red cluster contains various terms related to bioscience and medicine, such as “#Genomics”, “#Genetics”, “#Cancer”, “#Bioinformatics” and “#Precisionmedicine”. The green cluster covers not only core technologies like “#Machinelearning” and “#AI”, but also terms about health care (e.g., “#Healthit” or “#Digitalhealth”). The blue cluster contains topics mostly related to social media and social networks, typified by “#Facebook”. The yellow cluster focuses on economic development and social management.

Fig. 5 Cluster map of high-frequency hashtags. (Color figure online)

Table 4 lists the top-10 high-frequency hashtags and their occurrences. “#Bigdata” tops the list with over 4000 tweets (9.65%), contributing almost 10% of all the information provided by hashtags, far ahead of the others. It is followed by “#Datascience”, with a frequency of around 500, which has also been a popular concept in recent years. Data science primarily involves the processes of extracting and discerning valuable knowledge from complex data, as well as the development and use of related tools (Leek 2013; Waller and Fawcett 2013), and is therefore closely associated with “Big Data”. The third and fourth topics (“#MachineLearning” and “#AI”) are both technical terms for emerging and popular technologies for data mining and data analysis. Moreover, as mentioned above, health care-related topics (“#Health”, “#Genomics”, and “#Healthcare”) are also prominent among Twitter users. Compared with the top-10 text terms, the coverage of the top-10 hashtags in tweets is relatively low, indicating that the Twitter audiences discuss a broader range of topics around Big Data publications.

Table 4 Top-10 high-frequency hashtags

Similarity measurement

After simple integration (for example, unifying the plural and singular forms of words, replacing abbreviations with full names, and removing hyphens), the author keywords, textual terms, and hashtags appearing in Figs. 2, 3, 4 and 5 form a list of 235 distinct topics. In other words, the topic list covers all the top-100 author keywords, terms, and hashtags, and each topic occurs in one to three groups (one meaning that the topic appears in only one group, three that it occurs in all three groups as a common topic). All 235 topics and their occurrences in each group can be seen in Table 8 in the “Appendix”.
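The integration step can be sketched as a small normalization function. The mapping table below is hypothetical and stands in for whatever plural forms, abbreviations and fused hashtag spellings were actually unified; replacing hyphens with spaces is also an assumption, since the text only says that hyphens were removed.

```python
# Hypothetical mapping from variant spellings to one canonical topic label.
CANONICAL = {
    "ai": "artificial intelligence",
    "machinelearning": "machine learning",
    "tweets": "tweet",
    "scientists": "scientist",
}


def normalize_topic(raw):
    """Lower-case, strip the # sign, turn hyphens into spaces, then map known variants."""
    topic = raw.strip().lstrip("#").lower().replace("-", " ")
    topic = " ".join(topic.split())  # collapse repeated whitespace
    return CANONICAL.get(topic, topic)


variants = ["Machine learning", "#Machinelearning", "machine-learning"]
merged = {normalize_topic(v) for v in variants}
```

An explicit mapping is used here instead of rule-based singularization because naive rules such as “drop a trailing s” would damage terms like “genomics” or “internet of things”.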

The Venn diagram in Fig. 6 shows the 235 topics divided into seven parts with different colors. The number of topics in each part is marked in the figure. Taking the group of author keywords (red) as an example, the 100 author keywords are separated into four parts: 60 occur as keywords only, 11 are shared with both other groups (i.e. hashtags, blue, and terms, green), 27 also appear only in hashtags, and two only in terms. Table 5 provides the results of the similarity measurement between group pairs. Of all 235 topics, only 11 (4.7%) occur in all three groups; since these 11 are among the top-100 author keywords, about one in ten of the academic topics from Big Data publications are also of high concern to the online audiences. Hashtags and author keywords have the largest number of common topics and the largest cosine similarity (38, 0.38). They are followed by hashtags and terms (25, 0.25), whereas terms and author keywords are the least similar (13, 0.13).
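The reported overlaps and similarities can be cross-checked from the seven region sizes of the Venn diagram. The sketch below assumes binary presence vectors, in which case the cosine similarity of two topic sets equals the number of shared topics divided by the geometric mean of the set sizes. The region sizes for the audience-only parts (48, 73 and 14) are those given in the comparison of pure audience topics later in this section; the dictionary layout and function names are illustrative.

```python
import math

# Number of topics per Venn region; a label lists the groups the topics occur in.
regions = {"K": 60, "H": 48, "T": 73, "KH": 27, "KT": 2, "HT": 14, "KTH": 11}


def group_size(group):
    """Total number of topics in one group (K, T or H)."""
    return sum(n for label, n in regions.items() if group in label)


def overlap(g1, g2):
    """Number of topics shared by two groups."""
    return sum(n for label, n in regions.items() if g1 in label and g2 in label)


def cosine(g1, g2):
    """Cosine similarity of two groups under binary presence vectors."""
    return overlap(g1, g2) / math.sqrt(group_size(g1) * group_size(g2))


total_topics = sum(regions.values())
```

Each group contains exactly 100 topics, so the similarity is simply the number of shared topics divided by 100, which reproduces the pairs (38, 0.38), (25, 0.25) and (13, 0.13).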

Fig. 6 Venn diagram of topic sets. (Color figure online)

Table 5 Cosine similarity of topic pairs

By breaking down the second audience group into four sub-groups according to the platforms used (Blogs, News, Policy, and Wikipedia), we further investigated the topic similarity among them. The results are shown in Table 6 and Fig. 7. Blogs and News are the most similar pair (0.9587), partly because they contribute the largest numbers of topics, which increases the chance of sharing a topic. Overall, News covers all the terms in Blogs and Wikipedia, and almost all the terms in Policy (27/28). The similarity between Blogs and Wikipedia ranks second (0.7703), and all the terms in Wikipedia are covered by those in Blogs. Policy and Wikipedia are the least similar pair (0.4629), which means that the terms they use have different semantic orientations. Moreover, when all six groups are considered together, the topic sets from Blogs and News are also more similar to those from Twitter and publications (see Fig. 14 and Table 9 in the “Appendix”).

Table 6 Cosine similarity of topic pairs

Fig. 7 Venn diagram of topic sets. (Color figure online)

Comparison of topic sets

Common and different topics

The word cloud (Footnote 9) in Fig. 8 displays the 11 common topics (KTH) of the three groups, that is, the central part of Fig. 6. The size of each word (topic) is based on its total frequency of occurrence in the three groups: the bigger the word, the more frequently it appears and the more attention it has received from both academic authors and online audiences. The 11 common topics show that emerging technologies, especially “Artificial intelligence” and “Machine learning”, which are conspicuous in the figure, are highly relevant terms in both Big Data publications and online discussions. As new technologies that require a considerable volume of information in the form of big data to function, Artificial Intelligence (AI) and Machine Learning (ML) have seen their practical applications rise in all business areas and in daily life (Zhang et al. 2019), which explains why they are common topics in both academia and online communities. In addition, some general topics closely related to the development of human society (e.g., “Health care”, “Climate change” and “Privacy”) are also frequently used, highlighting the opportunities and challenges of the era of Big Data.

Fig. 8 Common topics of scholars and audiences (KTH, 11). (Color figure online)

A closer look at the rankings (i.e. importance) of the common topics in each group reveals that they receive different degrees of attention (Table 7). The numbers in the table give the rank of each topic within each group. Taking the ranking of author keywords as the baseline, the arrows indicate how the rank of each topic changes in the other two groups. Compared with the baseline, six topics (i.e. “Data”, “Artificial intelligence”, “Twitter”, “Health care”, “Technology” and “Climate change”) rank considerably higher among hashtags on Twitter: “Data” and “Artificial intelligence” jump from the middle of the keyword ranking to the top 10 of the hashtag ranking, while “Climate change” is the biggest mover in the list (from 95th to 49th). Three topics (i.e. “Data”, “Technology”, and “Climate change”) have also improved their positions in the term ranking. More topics have slipped, to varying degrees, in the ranking of terms (8) than in the ranking of hashtags (3); among them, “Social media”, “Data mining” and “Privacy”, which are frequent as author keywords, rank lower in the online discussions. Finally, “Big Data” and “Machine learning” stay at the top of the hashtag ranking with wide mention, but this is generally not the case on the other platforms.
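The comparison of rankings can be expressed as a rank shift relative to the author-keyword baseline. The sketch below uses hypothetical frequencies, since the values of Table 7 are not reproduced here; ties are broken alphabetically, which is an arbitrary choice made for the example.

```python
def ranks(frequencies):
    """Rank topics by descending frequency (1 = most frequent), ties alphabetical."""
    ordered = sorted(frequencies, key=lambda topic: (-frequencies[topic], topic))
    return {topic: position + 1 for position, topic in enumerate(ordered)}


def rank_shift(baseline_freq, other_freq, topics):
    """Positive values mean the topic ranks higher in the other group than in the baseline."""
    base, other = ranks(baseline_freq), ranks(other_freq)
    return {topic: base[topic] - other[topic] for topic in topics}


# Hypothetical frequencies of three common topics in two groups.
keyword_freq = {"big data": 50, "privacy": 20, "data": 5}
hashtag_freq = {"big data": 40, "data": 30, "privacy": 3}

shift = rank_shift(keyword_freq, hashtag_freq, ["big data", "privacy", "data"])
```

With this definition, the move of “Climate change” from 95th among author keywords to 49th among hashtags corresponds to a shift of +46 places.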

Table 7 Rankings of common topics in the three groups

As for the topics that differ between publications and audiences (i.e. K or H/T/HT), the pure academic topics (K) are more technical and professional (Fig. 9). Most of them are scientific jargon not easily understood by the general public, such as “Hdfs” (the Hadoop Distributed File System), “Surveillance” or “GPU” (the Graphics Processing Unit). Other, business-related topics have also been the focus of authors but not of online audiences, for instance “Business intelligence”, “Resource allocation” and “Supply chain management”, which may to some extent reflect the prosperity of the information economy that accompanies the development of Big Data applications and the Internet of Things.

Fig. 9 Pure academic topics of publications (K, 60). (Color figure online)

With regard to the pure audience topics (H/T/TH), we further divide them into three parts based on their occurrences in the two audience groups (Fig. 10): pure hashtags (H, 48, 35.6%, orange), pure terms (T, 73, 54.1%, blue), and the common ones (HT, 14, 10.4%, green). A comparison of pure hashtags and pure terms provides evidence that Twitter audiences discuss more topics related to academic research in various disciplines, such as “Neuro-science”, “Genetics”, “Plosbiology” and “Gahitec”, with biology and health being the most widely covered themes. As mass media that disseminate social hotspots and news anecdotes, Blogs, Policy, Wikipedia, and News tend to report general social events or technological advances, so their users’ concerns are generally less technical and more comprehensible (e.g., “Study”, “Researcher” and “Scientist”). The topics common to the two audience groups emphasize that, in addition to scientific research as an essential information source, mental health-related events currently draw great attention in the online community.

Fig. 10 Pure online audience topics (135) divided into three parts. (Color figure online)

Shift of academic topics

The relationship between the online posts and the Big Data publications they mention enables us to establish linkages between author keywords and audience terms (i.e. hashtags and text terms). In this section, we examine the top-5 highly mentioned author keywords and their linked audience topics on Twitter and on the other four platforms, respectively. Such one-to-many linkages reflect, from a thematic perspective, not only the diversity of the discussions but also the pattern by which specific topics shift among social media users.
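The keyword-to-term linkages can be sketched as a count over publication-post pairs: every post that mentions a publication links each of that publication's author keywords to each audience term in the post. The identifiers and data below are hypothetical, and counting each term at most once per post is an assumption made for the illustration.

```python
from collections import Counter

# Hypothetical data: publication id -> author keywords, and posts as
# (mentioned publication id, audience terms found in the post).
publications = {
    "pub1": ["social media", "big data"],
    "pub2": ["privacy"],
}
posts = [
    ("pub1", ["facebook", "emotion"]),
    ("pub1", ["facebook"]),
    ("pub2", ["medical privacy"]),
]


def link_counts(publications, posts):
    """Count how often each (author keyword, audience term) pair is linked by a post."""
    links = Counter()
    for pub_id, audience_terms in posts:
        for keyword in publications.get(pub_id, []):
            for term in set(audience_terms):  # count a term once per post
                links[(keyword, term)] += 1
    return links


def top_terms(links, keyword, n=5):
    """The n audience terms most often linked to one author keyword."""
    linked = Counter({term: c for (kw, term), c in links.items() if kw == keyword})
    return linked.most_common(n)


links = link_counts(publications, posts)
social_media_terms = top_terms(links, "social media")
```

The resulting counts correspond to the line thickness in a keyword-term network: the more posts connect a keyword with a term, the stronger the link.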

Figure 11 shows the top-5 highly mentioned author keywords (green) by audiences on Blogs, News, Policy and Wikipedia, as well as the top-5 text terms (red, marked with “T:”) with the most links to these keywords. The size of the nodes and the thickness of the lines are both based on frequency of occurrence: the larger a node and the thicker the lines connected to it, the higher the frequency of the topic. “T: Facebook” is the audience concern most closely tied to these academic topics, as reflected by its links to four of the five keywords (4/5). Specifically, “T: Facebook” contributes nearly 14% of the mention rate of “Social media”, and approximately 8% each of “Data mining” and “Big data”. Moreover, audiences more often use “Machine learning” and “Social media” when discussing topics related to mental health (e.g., “T: Mental health” and “T: Depression”), while “Privacy” is interpreted more concretely (e.g., “T: Preserving privacy” and “T: Medical privacy”).

Fig. 11 Top-5 highly mentioned author keywords and top-5 terms with most links to them

The same approach is applied to the top-5 highly mentioned author keywords in tweets (green) and their linked hashtags (red, marked with #). The result is displayed in Fig. 12. Compared with the network in Fig. 11, this network is better connected, with more items linked to each other. Moreover, “#Bigdata” takes the central position that “T: Facebook” holds among the terms, and is linked to all five academic topics. The frequently mentioned author keyword “Big data” is connected mostly with “#Bigdata” and “#Datascience” on Twitter. The relationship between big data and data science is also widely debated among scholars in various fields (e.g., Kacfah et al. 2015; Park and Leydesdorff 2013; Phillips 2017; Gupta and Rani 2018), and this analysis shows that both concepts are also popular among Twitter users. Besides the application of data analysis methods in biomedicine, and with growing calls for open data and data sharing, “#Privacy” is also a significant concern closely related to “Big data”. As technical terms, “Data mining” and “Machine learning” are usually connected with technologies via hashtags, for instance “#Machinelearning”, “#ArtificialIntelligence” and “#Deeplearning”, suggesting that Twitter audiences are also quite concerned about the development of core technologies. Discussions related to social media and social networks focus on specific platforms like “#Facebook” and “#Twitter”, as well as on general issues such as “#Healthcare” and “#Privacy”.

Fig. 12 Top-5 highly mentioned author keywords and top-5 hashtags with most links to them

Discussion and conclusion

Unlike most previous research on SMM focusing on the impact of publications on social media and their correlation with citation or mention counts, in this paper, we study how academic topics in the research area of Big Data have been transformed across different altmetric sources. More specifically, we examined and measured the degree of similarity between the sets of terms used by publication authors, and the terms used by their online audiences across different platforms. We argue that this approach can open up a new research window to study the role of online audiences in the dissemination of academic topics from academia to the online community from a more semantic perspective.

Based on high-frequency author keywords from publications and textual terms from online events, the main topics in Big Data publications across different communities have been identified separately. The results reveal different thematic tendencies among these groups. Big Data authors pay more attention to technology development than their online audiences. This is shown by a large cluster of technical terms among the author keywords, like “Cloud computing”, “Mapreduce”, and “Machine learning”. This technical orientation can also be observed among Twitter audiences. Terms used in blogs and news show an interest in popularizing scientific research and discovery, as well as in interpersonal relations. Terms in policy documents tend to focus on more general and political issues, while those on Wikipedia are more related to the application of data analysis technologies on the Internet. In addition, core technologies (i.e. “Artificial Intelligence” and “Machine Learning”) and some general issues (i.e. “health care”, “climate change” and “social media”) are the most important common topics among both authors and online audiences.

Similarity metrics provide a more quantitative description of the differences in user interests across platforms, showing that Twitter audiences and Big Data authors have more topics of interest in common than the other audience groups. There are several possible explanations for this stronger similarity between Twitter hashtags and author keywords. First, the substantial number of mechanical interactions with publications on Twitter, where tweets are easily generated by clicking the Twitter icon on journal article pages, greatly increases the share of original content from these papers in the online discussion among Twitter users (Robinson-Garcia et al. 2017). Second, the large number of retweets produced by simply copying the original tweets (Boyd et al. 2010) increases the repetition rate of hashtags used on Twitter. Third, there is a large group of scholars with publications included in the WoS database who are also active on Twitter (Costas et al. 2017; Yu et al. 2019), which means that these scholars may use the same academic terms as hashtags in their tweets.
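To make the notion of similarity between topic sets concrete, the comparison can be illustrated with the Jaccard index. This is an assumed stand-in chosen for its simplicity. The measure actually applied in this study is the one defined in its methods, and the group names and topic sets below are made-up placeholders rather than results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared items divided by all distinct items in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up topic sets per group; placeholders only, not the study's data.
topic_sets = {
    "Authors": {"big data", "machine learning", "cloud computing", "privacy"},
    "Twitter": {"big data", "machine learning", "privacy", "genetics"},
    "News": {"study", "researcher", "big data"},
}

# Pairwise similarity between every two groups.
scores = {
    (g1, g2): jaccard(topic_sets[g1], topic_sets[g2])
    for g1, g2 in combinations(topic_sets, 2)
}
for (g1, g2), score in scores.items():
    print(f"{g1} vs {g2}: {score:.2f}")
```

A set-based index like this ignores how often a topic occurs, so heavy repetition through retweets would only affect it indirectly, through which hashtags pass the frequency threshold for inclusion in a topic set.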

Among the other audience groups, Blog and News users show the highest degree of similarity in the terms they use to introduce and interpret Big Data publications, while Policy and Wikipedia show the lowest. One important reason is that science journalists are a large group of actors in science blogging, aiming to explain science broadly and educate readers (Bartling and Friesike 2014), so they may post the same or similar content in blogs and news (Fraumann et al. 2015). In our dataset, 97 events from Blog and News have the same headlines, which raises the degree of similarity between the two topic sets, while there are almost no identical titles between other platforms.

Further investigation into the pure academic topics offers insight into the lower adoption of the more technical and professional terminology by online audiences, probably because these terms are not easily understood by the public and non-specialists. On the other hand, the pure hashtags and terms that are not commonly used by the authors can be regarded as a form of expansion and reinterpretation by social media audiences of the academic topics around the research area of Big Data. More specifically, Twitter users have turned to discussing or linking to topics involving medicine, biology, humanities, social sciences and other disciplines, demonstrating to some extent the broad distribution of Twitter users and the diversity of their opinions and views. Blogs, policy documents, Wikipedia articles, and news tend to report more general topics with terms that are less specialized and easier to understand, bringing in a perspective closer to people's daily lives.

In conclusion, our case study shows that there are indeed (dis)similarities between the topics highlighted by authors in their papers and how these topics are discussed by online audiences. Overall, online users tend to mention topics that are more social and general. At the same time, they can help to further interpret, spread, and diversify academic topics, helping to relate scientific research to more practical problems.

Limitations of this study

The research presented in this study is also bound by some limitations that deserve further discussion. First, we only study papers indexed in WoS and limited to the document types article, review and letter, which means that a large volume of proceedings papers and other papers not included in WoS are excluded. Second, since a small part (about 10%) of the papers in our dataset do not have a DOI, additional identifiers, like arXiv ID or PMID, should be adopted for matching papers with those mentioned on the altmetric sources. Third, considering the difficulty of data analysis and processing, only English events (the overwhelming majority) are taken into account in this paper. In addition, since Wikipedia titles are just the name of the entry, we chose the first sentence of the entry for term extraction, under the assumption that this sentence tends to provide a preliminary definition of the entry. However, there may also be conceptual differences between the first sentence of a Wikipedia summary and the titles of blogs, news and policy documents that need to be studied in future research. Regarding Twitter, we only focused on comparing hashtags and author keywords. This choice has the advantage that we are comparing conceptually similar features in both articles and tweets (i.e. hashtags are keywords intentionally “selected” by Twitter users in order to frame the tweet, conceptually similar to the author keywords of publications). In future research, it would be relevant to also study the full text of tweets in order to better characterize the type of engagement of tweeters with the contents of the publications.

Finally, we would like to point out that there is wide variability in the use and uptake of social media tools across different communities. Much of the published research has sought to identify factors of differentiation, such as age, academic level, gender, discipline, country and language, as well as the technical level of scholars using such tools (e.g., Nicholas et al. 2014; Mansour 2015; Larivière et al. 2013; Priem et al. 2012; Cronin and Sugimoto 2015). Building on these factors, follow-up studies could further analyze the (dis)similarity in the degree of attention to, and promotion of, academic topics among different user groups in online communities.