1 Introduction

In science, it is imperative to identify the importance of research publications as well as the trends of research [3, 17, 55]. Bibliometrics is a tool that performs quantitative analysis of research publications [19, 33, 34, 55]. It identifies the most influential work in a specific field, mostly by using citation counts as a metric. Citations, in turn, help build new work on top of existing knowledge and form a connection between novel approaches and their predecessors. Other purposes of citations include crediting peers' work, providing background information, and contextualizing one's own work [53]. Citation count is also a good indicator of the influence and visibility of a scientific publication.

Bibliometrics and citation analysis are used in various fields to identify the most influential works and researchers and to analyze the evolution of a specific research theme [17, 55]. They are applied in various domains of science such as medicine [19], physics [27], social sciences [34], and computer science [24].

Big data research has attracted growing interest over the last decade, drawing researchers from transdisciplinary areas such as the physical sciences, natural sciences, social sciences, and biomedical sciences [2, 4, 13, 28, 38, 42, 49, 50]. The concept of big data originated from the information explosion caused by the widespread adoption of information and communication technologies, which resulted in the generation of massive amounts of data. For instance, the Australian Square Kilometre Array Pathfinder (ASKAP) acquires 7.5 terabytes/second of image data [47]. Likewise, in genomics it is estimated that the data size doubles every 7 months [52]. Stephens et al. [52] compared the big data phenomenon in three fields, namely genomics, astronomy, and online social platforms (YouTube and Twitter), discussing the data acquisition, storage, distribution, and analysis aspects of each. A major finding of that study was that genomics is one of the most demanding big data domains and requires technological development in many fields to meet its computational needs [52]. In the time frame 2008–2017, Scopus recorded over 35,000 publications related to big data. This motivated us to analyze the literature published in the field of “big data” via citation analysis.

The objective of this work is to investigate the evolution of big data literature using bibliometric analysis. The aim is to identify the most influential works, top publication venues, citation trends, geographical and institutional trends, as well as authorship trends in the literature published in the domain of big data. The study is based on articles published over a 10-year period (2008–2017) and covers over 35,000 records. The rest of the article is organized as follows: Section 2 presents a brief review of studies in the field of bibliometrics. Section 3 presents the research questions and the methodology for data extraction. Section 4 presents an in-depth analysis of the data. Section 5 discusses the limitations of the study, and finally, Sect. 6 concludes the work.

2 Literature review

There is a plethora of work dedicated to citation analysis in various fields [17, 19, 24, 27, 34]. We briefly discuss works in bibliometrics in relation to “big data” and then present major scholarly publications in citation analysis.

Nobre and Tavares [48] analyzed the literature related to the application of big data/IoT in the context of the circular economy indexed in Scopus for the time frame 2006–2015. The study found that China and the USA are the most active countries. Surprisingly, among countries with large greenhouse gas emissions, Brazil and Russia contributed relatively few big data publications. Kalantari [40] performed a bibliometric analysis of 6572 papers indexed in Web of Science from 1980 to March 19, 2015; using MS Excel, the general concentration, dispersion, and movement of the data in the selected pool were analyzed. Liao et al. [43] performed a bibliometric analysis of the literature published in the field of medical big data. The authors used the Science Citation Index Expanded and the Social Science Citation Index databases as data sources to extract 988 references, with no restrictions on the time span. The novelty of the work lies in the application of multi-regression analysis considering the number of authors, number of pages, and number of references. It was observed that the medical big data literature has risen after 2010. Keyword analysis indicated that medical care is shifting its focus from a disease-centered to a patient-centered model.

One of the earliest works in the field of citation analysis in computer science is that of Culnan [18], who analyzed and compared the citation patterns of academics and practitioners who published in the proceedings of a national computer science conference. The study identified that both groups (academics and practitioners) cited the same core journals, as well as documents belonging to the same age group. Goodrum et al. [29] analyzed computer science literature available on the web in the form of PDF and PostScript files using autonomous citation indexing (ACI). CiteSeer (now CiteSeerX) was used for ACI, and the Institute for Scientific Information (ISI) SCISEARCH was used for comparative analysis. Using these data, the profiles of source documents and the citation profiles of the two sources were discussed. Wohlin [54] used ISI Web of Science as the data source to identify the most influential journal articles in software engineering for the year 1999.

Hoonlor et al. [35] performed an in-depth analysis of citation data in computer science. They inferred that most publications mention the keyword “algorithm,” and most abstracts are related to databases, neural networks, and the Internet. The study also identified the web as an attractive source of data and application test beds, which resulted in more research in the areas of data mining, cloud computing, and information retrieval. The study further concluded that funding is essential to maintain research momentum and progress in a specific field.

Chadegani et al. [12] compared Scopus and ISI Web of Science based on a set of research questions. They concluded that ISI Web of Science has strong coverage of older publications, whereas Scopus covers high-quality journals and more recent articles. Both databases provide customizable search capabilities and are equally favored by scientists and researchers.

Ioannidis et al. [37] surveyed highly cited scientists in the biomedical field, asking them to score their publications in order to answer the question “Is your most cited work your best?”. The scientists scored their publications on six dimensions: publication difficulty, surprise, disruptive innovativeness, greater synthesis, broader interest, and continuous progress. Low average scores were observed for publication difficulty, surprise, and disruptive innovativeness. The authors concluded that, besides citation-based metrics, other measures must also be used to evaluate scientific works. Ding et al. [20] criticized the traditional way of citation analysis and proposed content-based citation analysis (CCA), which assesses the value of a citation at the syntactic and semantic levels.

Garousi and Mäntylä [24] performed a comprehensive bibliometric assessment by considering over 70,000 articles published in the field of software engineering and indexed in Scopus. The authors observed considerable growth in the number of publications per year; however, approximately 45% of the papers are not cited. Using text mining techniques, web services, mobile and cloud computing, industrial (case) studies, and source code and test generation were identified as the hottest research topics.

Garousi [25] reported the findings of a bibliometric study of the Turkish software engineering community based on research publications in software engineering outlets indexed in Scopus up to 2014. The author identified the top-ranked universities and scholars. The study also found the contributions made by the Turkish software engineering community to be very low in comparison with the rest of the world, and it identified a lack of diversity in the topics covered.

Effendy and Yap [22] performed a trend analysis of research areas in computer science using the Microsoft Academic Graph dataset. The authors proposed a new metric called the FoS score to measure the level of interest in a specific research topic. Using this measure, they discussed citation trends, trends in conferences, the evolution of research areas, and the relations between research areas.

3 Research questions and data set

3.1 Research questions

Following the pattern of Garousi and Fernandes [26], we formulate a set of research questions and base our analysis of the data on these key questions. The main objective of the research questions is to identify the top cited papers, the contributions of various countries to advancing big data research, and the key research areas. The research questions are as follows:

RQ 1. What are the top cited publications in the selected time frame?

RQ 2. What are the top cited publications for each year in the selected time frame?

RQ 3. What are the key topics/areas addressed in the publications?

RQ 4. What are the top venues for the most cited publications?

RQ 5. What is the citation landscape of the publications? What is the average number of citations per publication?

RQ 6. Which countries and institutions have contributed the most in terms of publication count?

RQ 7. What is the authorship trend? What is the average number of authors per publication?

3.2 Data set extraction

A key consideration in any citation-based study is the selection of the data source. A number of online databases provide access to citation data, including ISI Web of Knowledge, Scopus, Google Scholar, and dblp. Of these, ISI Web of Knowledge and Scopus are the two main sources used by the majority of researchers for citation data analysis [17, 24, 25, 33, 34]; other databases are also used, such as dblp by Hoonlor et al. [35]. Our choice for data collection is Scopus because of its service availability, ease of use, and the authenticity of its data [26].

After selecting the source, the next step is to extract the required data. Scopus provides a flexible and customizable way to extract data from its database using various criteria, such as author name, source name, affiliation, and keywords. We used \(\texttt{title}\) and \(\texttt{keywords}\) as our main search criteria, i.e., we queried the database to return all documents where the phrase “Big Data” appears either in the title of the document or in the associated keywords. We also restricted the results to English-language articles published between 2008 and 2017 (both inclusive). The query is given in Table 1.

Table 1 Query for data extraction

Scopus has indexed 13 documents prior to 2008 that satisfy our search criteria. Out of these, only a single document has received 19 citations. Likewise, no more than three papers per year are indexed by Scopus prior to 2008. Therefore, we chose 2008 as the starting year for our data extraction, so that the time frame spans a decade of research in the field. Our search criteria resulted in a total of 34,655 documents. Using the graphical user interface provided by Scopus, we excluded some document types, namely conference reviews, editorial notes, and letters. The main motivation for omitting these document types is that they mostly provide information about the outlet, such as the scope of the conference/special issue, the number of submissions, and the acceptance rate, and they normally have very low citation counts. This omission reduced the number of documents to 33,623. As stated earlier, big data is a transdisciplinary field; therefore, we did not limit the results to computer science literature and allowed results from diverse fields such as business, medical science, and social sciences. Note that the data were downloaded on January 27, 2018, and the citation counts might differ slightly at later dates. The dataset is stored in comma-separated values (CSV) format, and the analysis is performed in the R statistical environment.
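
To make the data preparation step concrete, the following R sketch loads a Scopus CSV export and removes the excluded document types. The file name and the column labels (e.g., “Document Type”, which read.csv renames to Document.Type) are assumptions based on a typical Scopus export, not the exact script used for this study.

```r
# Minimal sketch of the data preparation step (file and column names assumed).
records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

# Document types excluded from the analysis
excluded_types <- c("Conference Review", "Editorial", "Letter")
records <- records[!(records$Document.Type %in% excluded_types), ]

nrow(records)  # 33,623 documents remained after exclusion in our data
```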

4 Results

In this section, we present answers to the research questions designed in Sect. 3.1.

4.1 Identification of top 10 research publications

Table 2 presents the top 10 publications based on the absolute number of citations received. Rather unsurprisingly, most of the top 10 publications are surveys. One interesting exception is the work at rank 5, which reports the findings of an experimental study on Facebook [1]. The authors investigated whether emotional contagion occurs on Facebook by analyzing the content in users’ news feeds. Reducing positive content led to a reduction in positive posts and an increase in negative posts, and analogous trends were observed when negative expressions in the news feed were reduced. The authors concluded that emotional states are transferable to others via emotional contagion.

Table 2 Top 10 publications by citation count

It can be seen from Table 2 that all the publications (with one exception) are at least 4 years old, i.e., they were published in 2014 or earlier. These publications have had more time to accumulate citations than others. Relying on the absolute number of citations is therefore age-dependent and puts recent publications at a disadvantage compared with older ones. To overcome this issue, we calculated a normalized score by dividing the absolute number of citations by the number of years since publication. We believe that the normalized score is a better indicator for gauging the quality of recent publications.
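
The following R sketch illustrates the normalized score under one plausible convention, dividing the citation count by the number of years elapsed up to the 2018 download date; the file and column names follow a typical Scopus export and are assumptions.

```r
# Minimal sketch of the normalized score: citations / years since publication.
records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

records$normalized_score <- records$Cited.by / (2018 - records$Year)

# Top 10 publications by normalized score (cf. Table 3)
ranked <- records[order(-records$normalized_score),
                  c("Title", "Year", "Cited.by", "normalized_score")]
head(ranked, 10)
```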

Table 3 Top 10 publications by normalized score

We observe that the top publication, both in terms of the absolute number of citations and the normalized score, is that of Chen et al. [13], which highlights the importance of the work over the years. The work of Al-Fuqaha et al. [23] jumps from rank 4 to rank 2 in normalized score, highlighting the coverage of IoT in the recent literature. It can also be seen that the ranks in Table 3 do not change significantly in comparison with Table 2: the top 9 remain the same, with only ranks 2 and 4 swapping places. The only new entry in Table 3 is the work of Hashem et al. [31], which gained three ranks.

4.2 Most cited articles for each year

Table 4 identifies the top cited paper for each year based on the absolute citation count; the normalized score is also provided. The table portrays the evolution of big data research. In the early years, research mostly concerned the potential usage, structure, and applications of big data [6, 36, 39]. In the later years, the top cited papers covered different areas such as the development of big data analytics tools [32], predicting personal attributes (such as ethnicity, political views, and personality traits) from Facebook likes [41], and the use of deep neural networks in big data [44]. A key anomaly in Table 4 is the absolute and normalized citation score of [6]: the publication has a normalized score of 9.625, which is nearly four times lower than that of the next entry.

Table 4 Top cited papers for each year

4.3 Identifying research topics/trends

Unlike other scientific fields that have a clear taxonomy and classification of the subject area [8], big data has no established classification scheme. We therefore use the frequency of keywords as a metric to identify sub-areas and topics that gained the attention of researchers over the course of time. Our analysis found that the phrase “Big Data” is, unsurprisingly, the most widely used keyword in our collection of records. The other keywords, in order of occurrences, are Data Mining (4995), Data Handling (4242), Digital Storage (3185), Cloud Computing (2921), Information Management (2795), Artificial Intelligence (2769), Distributed Computer Systems (2515), Learning Systems (2261), and Algorithms (2027). To make the comparisons more realistic, we removed the keyword “Big Data” and generated a word cloud using the statistical software R (see Fig. 1).

Fig. 1 Word cloud of keywords
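
A minimal sketch of the keyword analysis behind Fig. 1 is given below. It assumes an “Index Keywords” column with entries separated by semicolons, as in a typical Scopus export, and uses the wordcloud package; it is a sketch rather than the exact procedure used for this study.

```r
# Minimal sketch: keyword frequencies and word cloud (column names assumed).
library(wordcloud)

records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

keywords <- unlist(strsplit(records$Index.Keywords, ";"))
keywords <- trimws(tolower(keywords))
keywords <- keywords[keywords != "" & keywords != "big data"]  # drop the dominant term

freq <- sort(table(keywords), decreasing = TRUE)
head(freq, 10)  # most frequent keywords

wordcloud(names(freq), as.numeric(freq), max.words = 100, random.order = FALSE)
```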

In the early phase of big data research, researchers focused mainly on the uses and applications of big data. For example, Jacob [39] discussed the challenges posed by big data and highlighted possible solutions to overcome them. Cohen et al. [16] noted that the cost of data acquisition and storage has decreased considerably and that sophisticated data analysis has become the norm; they introduced the Magnetic, Agile, Deep (MAD) data analysis practice, whose approach, design philosophy, and techniques were used to provide MAD analytics for the Fox Audience Network. In addition, some works, such as that of Brinkmann et al. [10], designed systems for large-scale data acquisition, processing, and storage. The authors reported collecting 3 terabytes of data per day by performing continuous electrophysiological recordings of patients undergoing evaluation for epilepsy surgery. The huge amount of data generated posed storage and processing challenges, and the authors designed a platform that facilitates the acquisition, compression, and storage of large amounts of data.

Subsequently, researchers focused on developing applications for big data analysis as well as on applications of big data in various domains. Major works in the area include Herodotou et al. [32], Chen et al. [14], Murdoch and Detsky [46], Hampton et al. [30], and Kramer et al. [1]. These diverse works focused on the development of new systems for big data analysis [13, 32], applications of big data in health care [46] and ecological science [30], and the study of emotional contagion in a large social network [1].

In later years, big data research focused on the integration of big data with emerging technologies such as IoT [15] and social big data [7]. As the volume of data increases exponentially, current research is focusing on improving the execution times of data analysis techniques [21, 45]. Similarly, the use of GPUs for complex big data analysis also needs further investigation [5]. Rodríguez-Mazahua et al. [51] identified data cleaning and data privacy as key issues in big data analysis.

4.4 Top venues for publications

We identify the top venues for publications in terms of the number of citations received; in addition, we also record the number of publications in each venue. Lecture Notes in Computer Science (LNCS) received the most citations (3064) and also published the most articles (2538), resulting in 1.20 citations per publication. LNCS is followed by Proceedings of the VLDB Endowment, which received 2369 citations for 99 publications. The top 15 venues based on absolute citation counts are given in Table 5, which also lists the number of publications and the citations per publication for each venue. In terms of citations per publication, MIS Quarterly: Management Information Systems ranks at the top. However, a closer inspection of the data reveals that MIS Quarterly: Management Information Systems received 1111 citations for 5 publications; out of these, 1098 citations are for [13], which means that the remaining 4 publications received only 13 citations. Information Communication and Society ranks second in citations per publication; however, one publication [9] accrued 999 citations, whereas the remaining 13 publications together received 86 citations. Other top venues based on citations per publication include Proceedings of the National Academy of Sciences of the United States of America (66.9), Nature (43.3), and IEEE Transactions on Knowledge and Data Engineering (31.78).

Table 5 Top venues by total number of citations
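
As an illustration of how the per-venue statistics in Table 5 can be derived, the following sketch groups the records by source title using dplyr; the column names (“Source title”, “Cited by”) are assumptions based on a typical Scopus export.

```r
# Minimal sketch: citations and publications per venue.
library(dplyr)

records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

venues <- records %>%
  group_by(Source.title) %>%
  summarise(publications = n(),
            citations    = sum(Cited.by, na.rm = TRUE)) %>%
  mutate(citations_per_pub = citations / publications) %>%
  arrange(desc(citations))

head(venues, 15)  # top 15 venues by total citations (cf. Table 5)
```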

4.5 Analyzing citation trends

The complete set of publications received 106,598 citations, which equates to 3.17 citations per publication. A majority (55%) have not received any citation; this figure is significantly higher than those reported in other studies [24, 26]. Garousi and Fernandes [24] reported that 43% of research publications have a zero citation count when analyzing the software engineering literature. In our data, 15% of publications have received only a single citation, which is in the same range as reported in [26]. Further analysis shows that 94% of publications have received no more than 10 citations, and 0.8% have received over 50 citations. Figure 2 presents an overview of the citation distribution.

Fig. 2 Distribution of citations
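
The citation-distribution figures quoted above can be reproduced with a few lines of R, as sketched below; missing citation counts are treated as zero, which is an assumption about the export format.

```r
# Minimal sketch of the citation distribution statistics.
records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

cites <- records$Cited.by
cites[is.na(cites)] <- 0  # treat missing citation counts as zero

c(total_citations  = sum(cites),
  mean_per_pub     = mean(cites),
  share_zero       = mean(cites == 0),
  share_one        = mean(cites == 1),
  share_at_most_10 = mean(cites <= 10),
  share_over_50    = mean(cites > 50))
```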

An important question regarding the citation landscape of a set of publications concerns the applicability of the power law [11]. It is often stated that citations of scientific publications follow a heavy-tailed distribution (see [11] and references therein); in citation networks, this implies that roughly 80% of citations are received by 20% of the publications. Although a full investigation of the fit of our data to a power law is beyond the scope of this work, we would like to highlight that in our collected data, 80% of citations are received by the top 12.74% of publications (see Fig. 3), i.e., the vast majority of citations are received by a relatively small number of publications.

Fig. 3 Power law in big data citations
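
The concentration statistic discussed above (the smallest share of publications accounting for 80% of all citations) can be computed as in the following sketch, again under the assumed Scopus export column names.

```r
# Minimal sketch: share of top publications that receive 80% of all citations.
records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

cites <- records$Cited.by
cites[is.na(cites)] <- 0
cites <- sort(cites, decreasing = TRUE)

cum_share <- cumsum(cites) / sum(cites)
top_share <- which(cum_share >= 0.8)[1] / length(cites)
sprintf("Top %.2f%% of publications receive 80%% of all citations", 100 * top_share)
```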

4.6 Geographical and institutional contributions

Table 6 presents the yearly contributions of the top 10 countries. Geographically, China has contributed the most publications (8901), followed by the USA (8568) and India (2342). It is interesting to note that during the years 2008–2011 China produced only 4 papers (with no publications prior to 2011), whereas the USA produced 36. During the same period, the remaining top-10 countries (excluding China and the USA) contributed 23 publications; thus, the cumulative total of these nine countries including China (27) is less than that of the USA alone (36). These numbers establish that the USA played a pioneering role in big data research and was later joined by other countries. It is also observed that during the period 2008–2011 Japan (one paper in each year) and South Korea (one paper in 2010) were the only other Asian countries to publish in the domain. It can also be seen that China overtook the USA only in the last two years.

Table 6 Yearly contributions from top 10 countries

In terms of authors’ affiliations, the Chinese Academy of Sciences, China, has produced the maximum number of papers (698), followed by Tsinghua University, China (422), and the Ministry of Education, China (279). The top 10 institutes all belong to China, except CNRS (Centre National de la Recherche Scientifique), France, which occupies the 9th position. The first entry from the USA is Carnegie Mellon University at position 15, followed by the Massachusetts Institute of Technology at 16. Table 7 presents a summarized view of the top 30 institutions in terms of the number of publications. It is important to note that a single publication may include authors from more than one country, or from more than one affiliation (for a publication with all authors from the same country).

Table 7 Statistics of the top 30 institutions

4.7 Authorship trends

For the complete time frame (2008–2017), the average number of authors per publication is 3.45, with a standard deviation of 2.34. In total, 13.67% of articles have a single author, and 41% of articles have more than the average number of authors, i.e., 41% of publications have at least 4 authors. A surprising aspect of the findings is the share of publications with more than 10 authors: we found that 1% of the papers have more than 10 authors. Figure 4 summarizes the distribution of the number of authors per publication.

Fig. 4 Distribution of the number of authors per publication
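
A rough sketch of the authorship statistics is shown below. It assumes an “Authors” column in which author names are separated by commas, as in a typical Scopus export; compound or corporate author names may require additional cleaning.

```r
# Minimal sketch of the authorship statistics (column names assumed).
records <- read.csv("scopus_big_data_2008_2017.csv", stringsAsFactors = FALSE)

n_authors <- lengths(strsplit(records$Authors, ","))

c(mean_authors  = mean(n_authors),
  sd_authors    = sd(n_authors),
  share_single  = mean(n_authors == 1),
  share_over_10 = mean(n_authors > 10))
```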

We also analyzed the evolution of the number of authors per publication over the selected time frame. We observe that during the initial years (2008–2012), 27% of publications had a single author. In later years, the single-authorship trend declined; in 2016, only 11.6% of papers had a single author, the minimum for the selected time frame. Figure 5 depicts the authorship trends over the time frame. It should be noted that, as the number of publications is significantly lower in the initial years, the data for 2008–2012 are combined for reporting purposes.

Fig. 5 Trend of authors per publication

5 Limitations of the study

It is important to present the limitations of the study, as the results might differ if the same analysis were repeated. We downloaded the data from Scopus on January 27, 2018. As Scopus has data download limits, the data were downloaded in an incremental manner and later combined. Readers may find the number of citations for various articles different from those reported here. There can be several reasons for this: for example, an article might have accumulated more citations over time, and various sources might report different statistics for the same article. A notable example is the publication “Data mining with big data.” According to Scopus, the paper has received 681 citations, whereas Google Scholar recorded 1320 citations for the publication up to 2017, and the publishing journal, IEEE Transactions on Knowledge and Data Engineering, recorded 524 citations. The difference between the citation counts of Scopus and other sources can be attributed to the fact that Scopus citations are based on Scopus-indexed publications only; Google Scholar citations, moreover, also include non-academic citations. As stated earlier, a variety of past studies have validated the authenticity of Scopus; therefore, our results are based on Scopus statistics only.

6 Conclusion

This study systematically analyzed the citations of publications in big data to answer a variety of research questions. Although our data start from 2008, we observe, rather surprisingly, that none of the top 10 publications were published between 2008 and 2011. As expected, the most cited publications include survey papers in the field. We also found that over 50% of publications are not cited, and 80% of citations are received by 12.74% of publications. Geographically, China has the most publications, followed by the USA; however, in the initial years China made no significant contribution, which confirms the leadership role played by the USA in the field. We also observed the average number of citations per publication to be 3.17, and the average number of authors per publication to be 3.45.

The current study is based on citation count as a metric and treats every citation equally. In other words, it does not differentiate between a citation included merely for the sake of completeness and one that forms the foundation of a research work. It would be interesting to perform content-based citation analysis of the field and to identify important works based on context as well. Another research direction would be to analyze the differences in citation counts between various sources and to compare the resulting rankings.