1 Introduction

Due to the sheer size and availability of multidimensional data, the rate of technological innovation has brought enormous potential to make an extraordinary impact on our daily life in many disciplines, especially the healthcare sector. This rapidly growing and widely exploited data has given rise to the term big data. Uncovering information from data of such complexity is often a difficult process. The development of tools and methods for analyzing such large quantities of data gives us an opportunity to make the transition into this new era far easier. Having data-driven, real-time insights accessible to an organization through analytics can be a critical enabler for executing the organization's strategies. Big data analytics' greatest asset is its potential to find new ways of providing the services we are looking for.

Big data analytics is especially promising in the healthcare sector and has received much attention in the last few years. Clinicians' decisions are becoming evidence-based, meaning that they rely more on large swathes of research and clinical data than solely on schooling and professional opinion. In healthcare, big data is defined as electronic healthcare datasets so large and complex that they are difficult or almost impossible to manage with common traditional methods, tools, or software [1,2,3]. Big data in healthcare are generated by healthcare records (such as patient records, disease surveillance, hospital, medicine, health management, doctor, clinical decision support, or patient feedback data [4,5,6,7]) and clinical data (such as imaging, personal, financial, genetic, and pharmaceutical data, Electronic Medical Records (EMR), etc. [8]). The generation and management of these enormous healthcare records are very complex; hence big data analytics has been introduced [9, 10]. With the rise of technological innovation and personalized medicine, big data analytics has the potential to make a huge impact on our lives, i.e., in how it helps predict, prevent, manage, treat, and cure disease. Furthermore, it helps government agencies, policy makers, and hospitals to manage resources, improve medical research, plan preventative methods, and manage epidemics.

With advances in information and communication technology, hard-copy medical data have moved toward Electronic Health Record (EHR) and Electronic Medical Record (EMR) systems, which have generated exponential growth in data [11, 12]. Healthcare data are collected not only from clinical records, tele-monitoring, and medical tests but also from a large number of healthcare apps with tremendous numbers of subscriptions. According to the Ericsson Mobility Report of 2019, there were a total of 7.9 billion mobile subscriptions, with 49 million new subscriptions added during the quarter, as a growing share of the people on this planet contribute new and valuable data about health and well-being every day [13]. These apps contain voluminous data thanks to the world of social media: more than 4 billion people [14] use the internet for mailing, downloading, surfing, blogging, entertainment, etc. These data, too, tend toward the concept of big data. Figure 1 depicts the ecosystem of healthcare assisted by big data and cloud computing approaches.

Fig. 1 Healthcare ecosystem assisted by big data and cloud computing [15]

Turning to the five characteristics of big data in the healthcare sector: Volume refers to medical records of personal data, clinical data, radiology images, genetics and population information, and resource-intensive applications such as 3D imaging, genomics, and biological sequences. Likewise, the rapid increase in diseases and medications produces exponential growth of data to be stored, manipulated, and managed. For effective capture, management, and manipulation of these data, modern techniques such as advances in data management, cloud computing, and visualization play a vital role in healthcare systems. Volume is rapidly increasing in biomedical informatics: ProteomicsDB [16], for example, has a data volume of 5.17 TB covering 92% of the human genes annotated in the Swiss-Prot database, and vast volume is produced by medical images, such as the Visible Human Project, whose female dataset comprises 39 GB [17]. It was estimated that the volume of big data in healthcare would reach 35 zettabytes by 2020 [18, 19].

Variety in healthcare reflects that the gigantic amount of healthcare records comes in structured, unstructured, and semi-structured formats. A variety of unstructured healthcare records is generated daily, such as patient information, doctor notes, prescriptions, clinical or official medical records, and MRI, CT, and radiographic film images. Structured and semi-structured variety, for example in EMR and EHR systems, comprises actuarial data, electronic app and automated database information (physician name, hospital name, treatment reimbursement codes, patient name, address, etc.), electronic billing and accounting information, and some clinical and laboratory instrument readings. For converting unstructured data into structured datasets, data analytics provides several facilities; one example is the natural language processing used in Health Fidelity.

Another important characteristic is velocity, which ranges from data at rest to data in motion. At-rest healthcare records encompass doctor or nurse notes, scripts, documentary files, rendered records, X-ray films, etc. Medium-velocity healthcare data include blood pressure readings, daily diabetic glucose measurements from insulin pumps, ECG/EKG, and the like. Sometimes, however, high velocity is required, as it can become a matter of life or death: this category involves real-time data such as intracardiac monitoring, blood pressure monitoring during anesthesia and trauma care, operating room telemetry, and the early detection of infections or diseases such as cancer.

Value describes how beneficial data are for the healthcare ecosystem. For example, raw data such as paper prescriptions, official records, or patient information are less valuable than diagnostic records, medication data, and laboratory instrument readings. Veracity describes the reliability and understandability of healthcare records, covering the accurate capture of diagnoses, procedures, and treatments and the verification of patient, hospital, and reimbursement-code information. Different domains of healthcare and medical care have been proposed in the literature; this review discusses five sub-disciplines (i.e., medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics, and medical signal analytics) that are directly or indirectly involved in healthcare and biomedicine.

In Sect. 2, we present the theoretical background of big data and data analytics. Different architectures of big data analytics deployed in the healthcare domain are explained in Sect. 3. We present the advantages of big data to healthcare in Sect. 4, giving insight into how healthcare can be improved by big data analytics. Section 5 presents the review methodology and the criteria for paper selection. Based on this methodology, big data in the five sub-disciplines of healthcare (i.e., medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics, and medical signal analytics) are comprehensively covered in Sect. 6. We summarize our main findings in Sect. 7, and Sect. 8 presents notable applications of healthcare analytics based on those findings. Section 9 discusses challenges and open research issues. Finally, Sect. 10 concludes the paper.

2 Background of big data and data analytics

The concept of big data was introduced in the 1990s by Cox and Ellsworth [20], who treated visualization as a big data problem. The first significant academic reference to big data in computer science is attributed to Weiss and Indurkhya [21]. In 2000, Diebold [22] introduced big data in statistics/econometrics, referring to the explosion in the quantity and quality of available data. The concept was enriched by Douglas Laney at Gartner in an unpublished 2001 research note [23]. In short, the term big data is attributed to Weiss and Indurkhya, Diebold, and Laney. Big data is the name given to datasets so large and complex that traditional information processing techniques cannot adequately deal with them. The main difficulties concern how to capture, store, share, and analyze data, and how to visualize, update, or query information while preserving privacy. In the view of Radar [24], big data are data that do not fit into conventional databases, so alternative ways must be chosen to extract and process them. According to ZDNet, big data involve techniques and procedures for the creation, manipulation, and organization of very large datasets and the facilities for their storage. Techopedia suggests that large, complex unstructured data are processed by massive parallelism on readily available hardware because relational database engines cannot process them. In short, the literature describes big data as very large datasets, enormous growth of data, massive data, and unstructured or complex data [25,26,27,28,29].

The main characteristics of big data are complexity and massive size [30,31,32]. Big data are often described by three characteristics known as the 3Vs: volume, variety, and velocity [33,34,35]. Two additional characteristics, value and veracity [26, 36, 37], extend these to the 5Vs of big data, as depicted in Fig. 2. Volume refers to the size or quantity of generated and stored data; when the volume is large enough, data become big data [15, 38]. Variety is the type or nature of data gathered from several sources. Data vary in storage format, such as CSV, text, or Excel, and also in form, such as video, audio, SMS, or PDF [15]; this variety is one of the decisive characteristics of big data. Velocity specifies the speed at which data are generated or processed. Value describes how beneficial or valuable the data are. Big data and value are strongly correlated: storing raw data alone is useless and inoperable, and large datasets become valuable when the benefits of collecting and evaluating them outweigh the costs [15]. Veracity is the quality and understandability of the data; the reliability, quality, and accuracy of big data depend on the veracity property because it guards against low-quality data.

In its early stages, big data was defined by three Vs: volume, velocity, and variety [27, 39]. Later, this framework was expanded to include two more dimensions: value and veracity [40, 41]. These five dimensions are frequently referred to as the 5Vs of big data, and the 5Vs framework has been a useful approach to addressing the issues and challenges of big data. Thus, big data can be defined as a holistic approach to managing, processing, and analyzing the 5Vs (i.e., volume, variety, velocity, veracity, and value) in order to create actionable insights for sustained value delivery, performance measurement, and competitive advantage.

Fig. 2 5Vs characteristics of big data

Data analytics is an amalgamation of two words: data refers to raw facts, figures, and information, and analytics means the use of tools to analyze data, whether the data are small or big. Analytics is an umbrella term for all data analysis applications [33]. Big data analytics is the process of analyzing large, voluminous data using different strategies. As mentioned above, big data are integrated from multiple sources, so big data analytics explores how to extract valuable, hidden patterns and connections from these integrated data. In other words, big data analytics is the analysis of data with the intention of extracting information and supporting decision-making through the overall procedure of inspecting, cleansing, transforming, and modeling big data.

Data can be analyzed by three general classes of methods: descriptive, predictive, and prescriptive analytics. Descriptive analytics is used to summarize big data; put simply, it answers the question "What has happened?". Descriptive analytics has found many applications in analyzing the implications that different healthcare decisions have had on service delivery systems and clinical outcomes. Its focus in healthcare organizations is to collect patients' operational data with respect to the health organization and to derive patterns of patient care, leading to evidence-based clinical practice, the identification of unnoticed trends in patients, and the detection of imbalances between cost, capacity, and patients' needs. Predictive analytics forecasts future outcomes by deploying a variety of machine learning, statistical, modeling, and data mining techniques on recent and historical data. Prescriptive analytics builds on predictive analytics to recommend actions and support business decisions; it is used by health organizations when a selection must be made from the available, feasible alternative solutions.
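To make the distinction concrete, the following minimal sketch contrasts a descriptive summary with a simple predictive model in Python; the patient records, column names, and model choice are illustrative assumptions, not drawn from any study cited here.

```python
# Illustrative sketch: descriptive vs. predictive analytics on a synthetic
# patient dataset using pandas and scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical patient records: age, systolic blood pressure, readmitted flag
df = pd.DataFrame({
    "age": [34, 61, 47, 72, 55, 29, 68, 50],
    "systolic_bp": [118, 145, 130, 160, 138, 112, 152, 127],
    "readmitted": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Descriptive analytics: "What has happened?" -- summarize the cohort.
print(df.describe())
print(df.groupby("readmitted")["systolic_bp"].mean())

# Predictive analytics: "What is likely to happen?" -- fit a simple model.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "systolic_bp"]], df["readmitted"], test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Predicted readmission risk:", model.predict_proba(X_test)[:, 1])
```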

The most extensively used approaches for predictive and descriptive analytics on big data are based on supervised, unsupervised, or semi-supervised learning. The exponential increase in data over time has made it difficult to extract valuable information from them. Despite the strong performance of traditional methods, their predictive power is limited because traditional analysis deals only with primary analysis, whereas data analytics also deals with secondary analysis. Data mining involves digging into data from many dimensions or perspectives using data analysis tools to find previously unknown patterns and associations that may serve as valid information [42,43,44,45]. Moreover, it uses this extracted information to build predictive models. It has been deployed intensively and extensively by many organizations, especially in the healthcare sector.

Data mining is not a magic wand; it is a powerful tool, but it does not discover solutions without guidance. Data mining is useful for the following purposes:

  • Exploratory data analysis to examine the data corpus and summarize its main characteristics.

  • Descriptive modeling to segregate the data into clusters based on their properties.

  • Predictive modeling to forecast information from existing data.

  • Pattern discovery to find frequently occurring patterns.

  • Content retrieval to find data items similar to a pattern of interest.

Several techniques are deployed for the reduction, optimization, regression analysis, etc. of big data. On account of its voluminous size, the dimensionality of big data is reduced by linear mapping approaches such as Principal Component Analysis (PCA) [46] and Singular Value Decomposition (SVD) [47]. Non-linear mapping methods for dimensionality reduction include Kernel Principal Component Analysis (KPCA) [48], Sammon's mapping [49, 50], and Laplacian eigenmaps [51].
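As a minimal sketch (synthetic data, illustrative component counts), these linear and non-linear reductions are a few lines with scikit-learn:

```python
# Minimal sketch: linear (PCA, SVD-based) and non-linear (kernel PCA)
# dimensionality reduction with scikit-learn on synthetic data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features (synthetic)

X_pca = PCA(n_components=5).fit_transform(X)            # linear mapping
X_svd = TruncatedSVD(n_components=5).fit_transform(X)   # SVD-based reduction
X_kpca = KernelPCA(n_components=5, kernel="rbf").fit_transform(X)  # non-linear

print(X_pca.shape, X_svd.shape, X_kpca.shape)  # each (200, 5)
```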

Mathematical optimization is another analytics tool, involving multi-objective and multi-modal optimization approaches such as Pareto optimization [52, 53] and evolutionary algorithms [54, 55]. Extracting meaningful information and developing and analyzing clusters are achieved with clustering algorithms such as Clustering LARge Applications (CLARA) [56] and Balanced Iterative Reducing using Cluster Hierarchies (BIRCH) [57].
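BIRCH, among the algorithms just listed, has a readily available implementation in scikit-learn; the following minimal sketch clusters synthetic data, with an illustrative CF-tree threshold:

```python
# Minimal sketch: BIRCH clustering via scikit-learn on synthetic data standing
# in for a large feature matrix.
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

birch = Birch(n_clusters=4, threshold=0.5)  # CF-tree threshold sets granularity
labels = birch.fit_predict(X)
print("Cluster sizes:", np.bincount(labels))
```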

3 Architectures for big data analytics

The general framework of big data analytics for healthcare presented here is an abstraction of several conceptual steps that describe the generic functionality of the domain. The first step is data collection, in which health and clinical data are collected from internal or external sources; the variety of data includes Electronic Health Records (EHRs), clinical images, health monitoring device logs, etc. The next step is data processing, in which healthcare data are stored, extracted, and loaded into data warehouses, middleware, or traditional formats such as CSV files and tables. Data transformation follows, in which data are transformed, aggregated, and loaded into database file systems such as a Hadoop cloud or the Hadoop Distributed File System (HDFS). The analytical phase examines the big data using tools and platforms such as Hadoop, MapReduce, Hive, HBase, Jaql, Avro, and several others. Finally, output is generated in the form of reports and queries using data mining and OLAP tools. This general conceptual architecture is elaborated in Figs. 3 and 4.

Fig. 3 Conceptual journey of data to information in a big data analytics environment

Fig. 4 Architecture of big data analytics platform

Based on this domain abstraction, several big data architectures have been proposed and developed by researchers for big data analytics. Some of the important architectures are Hadoop, MapReduce [58], streaming graph [59], fault-tolerant graph, etc. We present some of the renowned architectures along with their core components in detail. One of the major frameworks on the Apache platform is Hadoop, created by Doug Cutting and originating from the Apache Lucene project. It is a collection of open-source software utilities for the distributed computation, processing, and storage of large datasets. Figures 5 and 6 depict the core components and basic framework of Hadoop.

Fig. 5 Core components of Hadoop

Fig. 6 Framework of Hadoop

3.1 Hadoop Distributed File System (HDFS)

HDFS [60] is a master–slave architecture intended to run on commodity hardware. It provides high-throughput access to application data. It serves as the underlying storage for a Hadoop cluster and supports healthcare data analytics systems by splitting huge volumes of data into smaller blocks and distributing them across various servers/nodes. The architecture of HDFS is divided into a Name-node and Data-nodes, where the Name-node is the master and the Data-nodes are slaves. Files are stored in the Data-nodes as fixed-size blocks of 64 MB, a size that cannot be changed. Figure 7 illustrates the architecture of HDFS.

Fig. 7 Architecture of HDFS

As shown in Fig. 7, the client is an HDFS user. The Name-node is responsible for managing the namespace of the file system; it maintains the files and directories in a file-system tree. The Data-node is where the real data are saved and handled.

3.2 MapReduce

MapReduce, another cornerstone of Apache Hadoop, originates from the paper Google published in 2004 [58]. MapReduce is a functional programming model that processes and analyzes big data: it breaks a task into sub-tasks, gathers their outputs, and analyzes large datasets efficiently in parallel. Data analysis and processing employ two steps, namely the Map phase and the Reduce phase.

The architecture of a MapReduce operation is split into three main components: Client, Job-Tracker, and Task-Tracker. The client submits its job to the Job-Tracker as a JAR file. The Job-Tracker maintains all jobs executed on MapReduce and thus acts as the master service. The Task-Tracker executes the jobs assigned by the Job-Tracker and thus acts as a slave service. Figure 8 demonstrates the generic architecture of a MapReduce operation.

Fig. 8 MapReduce architecture
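As a minimal in-process sketch of the two phases (a toy count of diagnosis codes, not a cluster deployment), the Map and Reduce steps can be written in Python as follows:

```python
# Minimal in-process sketch of the Map and Reduce phases: counting diagnosis
# codes in a set of records. In a real Hadoop cluster the framework shuffles
# and sorts the map output between the two phases across many nodes.
from itertools import groupby

def map_phase(records):
    # Map: emit a (key, 1) pair for every diagnosis code in each input record
    for record in records:
        for code in record.strip().split(","):
            yield code, 1

def reduce_phase(pairs):
    # Reduce: sum the counts per key (input must be grouped/sorted by key)
    for code, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield code, sum(count for _, count in group)

records = ["I10,E11", "E11,J45", "I10"]   # toy input split (hypothetical codes)
for code, total in reduce_phase(map_phase(records)):
    print(code, total)                    # E11 2, I10 2, J45 1
```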

3.3 Apache Hive

Apache Hive [61] is an SQL-based Extract-Transform-Load (ETL) and data warehouse system on the Hadoop platform. It is a runtime Hadoop support framework providing the Hive Query Language (HQL), which converts SQL-like queries into MapReduce jobs. The main operations performed by Hive are data encapsulation, analysis, ad hoc querying, and the summarization of large datasets. Apache Hive has four major components: clients, services, the processing framework, and distributed storage. Hive clients (Thrift, JDBC, ODBC, etc.) can be written in any supported language such as C++, Java, or Python. Services are used to perform queries and may include the command-line interface (CLI), web interface (WI), Hive server, driver, and meta-store. Queries are processed, executed, and managed by the underlying Hadoop MapReduce framework. Finally, the distributed data are deposited in HDFS. The core components are illustrated in Fig. 9.

Fig. 9 Hive architecture
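As a hedged sketch of how an application might query Hive from Python, the snippet below uses the PyHive client; it assumes a reachable HiveServer2 instance, and the table and column names are hypothetical.

```python
# Hedged sketch: querying Hive from Python via the PyHive client, assuming a
# reachable HiveServer2 instance. The table and columns are hypothetical.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# The HQL below is compiled by Hive into jobs executed on the cluster
cursor.execute(
    "SELECT diagnosis_code, COUNT(*) AS n "
    "FROM patient_admissions GROUP BY diagnosis_code ORDER BY n DESC LIMIT 10"
)
for code, n in cursor.fetchall():
    print(code, n)

cursor.close()
conn.close()
```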

3.4 Apache HBase

Apache HBase follows a NoSQL, non-relational approach. It is a column-oriented database management system that lies on top of HDFS and uses key/value data to perform read/write operations on large HDFS-resident datasets. Apache HBase comprises three main components: the HMaster server, HBase Region Servers, and ZooKeeper. The HMaster server is the main component; it manages and monitors the HBase Region Servers and performs database operations using DDL to create, update, and delete tables. HBase tables are divided into several regions whose operations are managed, handled, and executed by the Region Servers. HBase is a distributed system coordinated by ZooKeeper. The components of Apache HBase are depicted in Fig. 10.

Fig. 10 HBase architecture
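The key/value access pattern can be sketched from Python with the happybase client; this assumes an HBase Thrift server is running and that a table 'patients' with column family 'vitals' already exists (both hypothetical).

```python
# Hedged sketch: basic key/value reads and writes against HBase from Python
# using the happybase client. Assumes a running HBase Thrift server and a
# pre-created table 'patients' with column family 'vitals' (hypothetical).
import happybase

connection = happybase.Connection("localhost")   # Thrift server host
table = connection.table("patients")

# Write: row key -> {column family:qualifier: value}
table.put(b"patient-001", {b"vitals:systolic_bp": b"138",
                           b"vitals:heart_rate": b"72"})

# Read the row back by key
row = table.row(b"patient-001")
print(row[b"vitals:systolic_bp"])                # b'138'
connection.close()
```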

3.5 Presto

Presto [62] is a distributed SQL query engine used to analyze large amounts of data, ranging in size from gigabytes to petabytes. The architecture of Presto is composed of coordinators and workers. User queries are submitted to the coordinator, which is responsible for parsing, planning, and scheduling query execution across the workers. The architecture is shown in Fig. 11.

Fig. 11 Presto architecture

3.6 Mahout

Mahout [63] is an Apache project that provides free, distributed, and scalable implementations of machine learning algorithms on Hadoop systems, supporting big data and healthcare data analytics on the Hadoop platform.

3.7 Avro

Avro [64] supports data serialization and encoding, improving data structure by specifying data types, meaning, and schema. It provides both serialization and version-control features. The Avro configuration is illustrated in Fig. 12.

Fig. 12 Avro architecture

4 Advantages of big data to healthcare

How can big data analytics improve healthcare? The simple answer is that analyzing big data can help healthcare stakeholders deliver efficient procedures and gain insights into patients and their health. Numerous benefits can be obtained with big data analytics. The main sources of healthcare data are Electronic Health Records (EHR), Laboratory Information Management Systems (LIMS), pharmacies, monitoring and diagnostic instruments (MDI), finance (insurance claims and billing), and hospital resources. With advances in data acquisition devices and analytics techniques, these sources are being enriched with newer forms of data; for instance, hospitals have started to collect genetic information in EHRs as well. Within this vast variety of patient data lie valuable insights for both patients and organizations which, when applied judiciously, can yield substantial results. Potential benefits include the following improvements to patient care:

Quality of care EHRs help assemble demographic and medical data such as clinical data, lab tests, diagnoses, and medical conditions. Discovering associations and patterns within these data helps healthcare practitioners provide quality care, save lives, and lower costs.

Disease prevention Spending more on health does not guarantee health system efficiency; investment in prevention can reduce costs as well as improve health quality and efficiency. Health systems face considerable challenges in promoting and protecting health at a time when the burden on finances and resources is substantial in many countries. Early detection and prevention of disease play a very important role in reducing deaths as well as healthcare costs. Thus, the core questions are: How can we diminish the level of ill health in the population? And how can we prevent disease from occurring based on a patient's early symptoms?

Efficiency Managing healthcare data using traditional analytical tools is nearly impossible due to the diversity and volume of data. Healthcare stakeholders use big data as a part of their business intelligence strategy to examine historical patient admission rates and to analyze staff efficiency.

Disease cures Healthcare practice has largely been reactive: the patient waits until the onset of disease, after which a treatment is prescribed that hopefully leads to a cure. However, no two persons in the world have the same genetic sequence, and the environmental factors associated with the onset of disease are not fully known, which is why a particular medication seems to work for some people but not for others. Since there are millions of factors to consider in a single genome, it is almost impossible to study them comprehensively. Big data in healthcare, however, has been revolutionizing genomic medicine: big data analytics can extract hidden patterns, unknown correlations, and insights by exploring large datasets, and scientists are banking on big data to discover the cure for cancer.

Cost Healthcare costs can be cut by analyzing big data; for example, predictive analytics can help detect disease at an early stage. Big data can also reduce medication errors and improve economic and administrative performance and re-admission rates. For example, patient groups affected by the same disease but treated with different drug regimens can be compared to determine which treatment plans work best for the same or similar diseases, saving resources and money.

Finding cures for diseases A particular medication seems to work for a few people but not for others, and there are numerous factors to be discovered in a single genome; it is not feasible to observe all of them individually. However, big data can help uncover unknown correlations, hidden patterns, and insights by analyzing large datasets. By applying machine learning to big data, practitioners can study human genomes and find the correct remedies or drugs to treat cancer [42, 65].

5 Review methodology

The review methodology is the systematic process of finding the relevant literature from different sources. The main objectives of this review are:

  • To present the definitions and concepts of big data in healthcare.

  • To explore the five sub-disciplines (i.e., medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics, and medical signal analytics) that are directly or indirectly involved in healthcare and biomedicine.

  • To illustrate the repositories and complex datasets of the five sub-disciplines.

  • To determine the big data analytical architectures and techniques used in healthcare.

  • To discuss the potential advantages and applications of big data in healthcare.

  • To present the open challenges and research issues of big data in healthcare and strategies for tackling the challenges faced in the domain.

The main steps of the review methodology are identifying information sources, defining selection criteria, and performing the search and selection procedure. Information sources: The first step in the systematic process is to collect the relevant articles. To search for relevant articles, we used Google Scholar, and we scanned references to present a thorough review. Selection criteria: In the second step, we selected the literature on the basis of the following inclusion–exclusion criteria:

  • Studies published as journal articles, conference papers, or reviews

  • Studies written in English

  • Studies related to big data analytics in healthcare

  • Studies published from 2000 to 2019

Search and selection procedure: In the third step, we searched the information sources for studies containing the keywords "big data", "big data analytics", "healthcare", "biomedical", and "healthcare analytics" to provide the background information, advantages, and architectures of big data analytics. As our goal is to examine healthcare research across five sub-disciplines, we used the additional keywords "medical", "medical image processing", "imaging informatics", "bioinformatics", "clinical informatics", "public health informatics", and "medical signal analytics". The initial search returned 47,130 papers; after scrutinizing titles, keywords, and abstracts, we excluded 28,280 papers. We then performed screening based on full-text reading and excluded a further 18,020 papers irrelevant to the big data or healthcare domain. We ended with 830 papers that are included in this review.

The schematic process of the review methodology is presented in Fig. 13.

Fig. 13 Schematic process of review methodology

6 Key applications in healthcare

Health professionals, just like business entrepreneurs, are capable of collecting massive amounts of data, and they look for the best strategies to use these data to reduce treatment costs, predict outbreaks of epidemics, avoid preventable diseases, and improve the quality of life in general.

Different domains of healthcare and medical care have been explored in the literature. A general overview, analysis, and examples of big data in healthcare analytics were presented in the studies of Raghupathi [2] and Ward et al. [10]. The meaning of big data in healthcare was examined in the literature reviews of Baro et al. [3] and Wamba et al. [66]. In 2017, Zhang and Li [67] presented a literature review of specialized healthcare and HIV self-management. Jacofsky [68] discussed the pitfalls, for physicians, of analytics over metadata sets in healthcare. Another case study of healthcare analytics was presented in 2018 by Wang et al. [69], covering IT-enabled procedures, advantages, and capabilities of big data analytics. Galetsi and Katsaliaki [70] reviewed articles on big data analytical techniques for healthcare from 2000 to 2016.

In this review, we discuss five sub-disciplines (i.e., medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics, and medical signal analytics) that are directly or indirectly involved in healthcare and biomedicine. As mentioned earlier, we cover the literature from 2000 to 2019, providing a comprehensive evaluation of big data techniques in healthcare domains. The literature of the five sub-disciplines is reviewed comprehensively in the following subsections.

6.1 Medical image processing and imaging informatics

Medical image processing and imaging informatics are major applications that play a vital role in healthcare and biomedicine. Accepted uses of medical imaging include detecting diseases such as brain and lung tumors [71, 72], artery stenosis detection, organ delineation [73], aneurysm detection, and the diagnosis of spinal deformity. Image processing and machine learning techniques are deployed in these applications for accurate and effective computer-aided medical diagnostics and decision-making. In complex healthcare and biomedical settings, imaging information is generated, managed, analyzed, exchanged, and represented using imaging informatics [75, 76].

Following this brief introduction, we elaborate on the related work, techniques, and applications of medical imaging and informatics deployed in big data healthcare.

Medical imaging begins with image acquisition. Magnetic Resonance Imaging (MRI), Computed Tomography (CT), photo-acoustic imaging, and ultrasound are used for single-dimensional medical data, such as visualizing the structure of blood vessels [73, 75, 77], whereas modalities such as 3D ultrasound, functional MRI (fMRI), and Positron Emission Tomography (PET) are used for multidimensional medical data, as shown in Fig. 14. Publicly available repositories containing medical images of patients in different sizes and modalities are listed in Table 1.

Fig. 14 Popular image modalities in healthcare, such as CT, MRI, and PET images

Table 1 Medical image modalities

Shackelford [78] used fMRI images and single-nucleotide polymorphisms (SNPs) for the classification of schizophrenia patients and healthy subjects, achieving 87% classification accuracy with a hybrid machine learning method. Chen et al. [79] introduced a computer-aided decision support system for the treatment of patients with traumatic brain injury (TBI). They predicted the intracranial pressure (ICP) level from CT scan images, combining features extracted from CT scans with medical records and patient demographics. They achieved 70.3% accuracy, 65.2% sensitivity, and 73.7% specificity.

Yao et al. [80] introduced a Hadoop-based system for the retrieval of medical images. They applied the local binary pattern algorithm and the Brushlet transform for feature extraction from medical images, and implemented MapReduce to store the features in HDFS. They reported a highest precision of 95.04% and recall of 92.21% on brain CT images, concluding that the retrieval efficiency of medical images improved while retrieval time decreased.
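As a hedged illustration of the local-binary-pattern descriptor used in this study, the sketch below computes an LBP histogram with scikit-image on a synthetic image; the neighborhood parameters are illustrative, not those of the original work.

```python
# Hedged sketch of local-binary-pattern (LBP) texture features, computed with
# scikit-image on a synthetic image standing in for a brain CT slice. The
# neighborhood parameters are illustrative, not those of the cited study.
import numpy as np
from skimage.feature import local_binary_pattern

image = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in CT slice
n_points, radius = 24, 3                                   # LBP neighborhood

lbp = local_binary_pattern(image, n_points, radius, method="uniform")

# A normalized LBP histogram is a compact per-image feature vector for retrieval
hist, _ = np.histogram(lbp.ravel(), bins=n_points + 2,
                       range=(0, n_points + 2), density=True)
print(hist.shape)  # (26,) feature vector per image
```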

Jai-Andaloussi et al. [81] employed MapReduce for computation and HDFS for storage in content-based image retrieval systems. Using a mammography image database, they applied the Bi-dimensional Empirical Mode Decomposition with Generalized Gaussian Density functions (BEMD-GGD) method and the Bi-dimensional Empirical Mode Decomposition with Huang–Hilbert Transform (BEMD-HHT) method, together with the Kernel Linear Discriminant (KLD) and Euclidean distance measures. They produced promising results supporting the hypothesis that MapReduce can be effectively employed for content-based medical image retrieval.

Dilsizian and Siegel [82] worked on cardiac imaging and medical data by integrating several techniques, including data mining, AI, and parallel computing. Their system applied AI and big data to diagnostic imaging across 55 sites participating in the Formation of Optimal Cardiovascular Utilization Strategies group, where the reported rate in such cases decreased from 10 to 5%.

Istephan et al. [83] conducted a feasibility study in the epilepsy domain using the distributed computation of Hadoop clusters. Their framework handles both structured and unstructured medical data.

6.2 Bioinformatics

Bioinformatics is a scientific discipline dealing with mathematical, computational, and IT-based methods, techniques, algorithms, and software tools for capturing, storing, analyzing, compiling, simulating, and modeling life-science and biological data. The role of big data in bioinformatics is to provide efficient data manipulation tools for investigators analyzing patients' biological information. Hadoop and MapReduce are currently used extensively for bioinformatics analytics.

Basically, bioinformatics is the combination of biology and computer science [84]. Biological analysis systems examine variation at the molecular level. Bioinformatics involves a variety of data types such as genomics (gene sequencing), RNA, DNA, proteomics (protein sequencing), gene ontology, protein–protein interactions, pathway data, disease–gene association networks, and human disease networks, as shown in Fig. 15. With current trends in personalized care, there is an increasing demand to analyze massive amounts of personalized patient data in a manageable time frame.

Fig. 15 Bioinformatics types

The size of bioinformatics data is increasing exponentially day by day. For example, a single human genome sequence can occupy up to 200 GB [85]. The database produced by the European Bioinformatics Institute (EBI) doubles in volume every year [86]. Genome sequencing data are now regarded as a big data problem in bioinformatics because the human genome comprises 30,000–35,000 genes [87, 88]. Genomics data typically concern gene sequencing, DNA sequencing, genotyping, gene expression, etc. [89, 90]. Genes are made of DNA, which comprises roughly 3 billion pairs of the four building blocks (bases) adenine, thymine, cytosine, and guanine; a single genome is about 3 GB in size. Genome analysis employing micro-arrays has been successful in examining traits across populations and has contributed widely to the treatment of several complicated diseases such as bipolar disease, hypertension, rheumatoid arthritis, diabetes, macular degeneration, coronary heart disease, and Crohn's disease [91]. This genomics information thus tends toward big data analytics.

Table 2 Bioinformatics databases

In bioinformatics, protein sequencing and protein–protein interaction are sophisticated problems in functional genomics. This is due to the huge number of features in the feature vector, which makes analysis computationally costly and complex and also reduces accuracy. Bagyamathi et al. [103] addressed this big data feature selection problem with a method combining an improved harmony search algorithm with feature selection to improve accuracy. Likewise, another feature selection methodology was introduced by Barbu et al. [104], who reduced the dimensionality of instances using an annealing technique for big data learning. Similarly, the adaptiveness or behavior of big data can be predicted by incremental learning approaches: Zeng et al. [105] implemented an incremental feature selection method called FRSA-IFS-HIS, applying fuzzy rough set theory to hybrid information systems and reporting better performance in big data feature selection.

Once features are extracted and selected, the next step is classification or clustering. Classification is the supervised learning procedure of finding a model that describes and discriminates data classes or concepts; the model is used to predict the class labels of test instances after training on labeled instances. Among the numerous models described in the literature, linear and non-linear density-based classifiers, neural networks, decision trees, Support Vector Machines (SVMs), Naive Bayes, and K-Nearest Neighbour (KNN) are the most frequently used [107,108,109]. For big data analytics, advanced models have been reported in the literature, such as neural network approaches, divide-and-conquer SVM [110], and the Multi-hyper-plane Machine (MM) classification model [111], for parallel and distributed learning on big data.

Giveki et al. [112] automatically detected diabetes using weighted SVM based on mutual information and a modified cuckoo search, conducting experiments on diabetes datasets with features selected by PCA. Haller et al. [113] classified Parkinson's disease patients using SVM: they pre-processed DTI fractional anisotropy data, selected the most discriminative voxels as features, and then classified with SVM. Son et al. [114] predicted heart failure patients using SVM. Likewise, Bhatia et al. [115] classified heart disease with SVM, selecting an optimal feature subset using an integer-coded genetic algorithm.
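The PCA-plus-SVM pattern running through these studies can be sketched as follows; the data are synthetic and the parameters illustrative, so this is not a reproduction of any cited pipeline.

```python
# Illustrative sketch of the common pattern above: PCA-based feature reduction
# followed by SVM classification, on synthetic stand-in clinical features.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           random_state=0)  # stand-in for clinical features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```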

Big data classification and regression can be performed effectively using advanced decision trees. In bioinformatics, Ye et al. [116] implemented Gradient Boosted Decision Tree (GBDT) techniques for distributed and parallel learning on big data. Calaway et al. [117] estimated the efficiency of decision trees on big data employing rxDTree. Hall et al. [118] modified decision tree learning by generating rules from large training datasets.

Clustering is unsupervised learning that analyzes data objects without labeled responses. To handle big data, CLARA [56], CLARANS [119], DBSCAN [120], DENCLUE [121], CURE [122], the k-mode and k-prototype methods [123], PDBCSCAN [124], and IGDCA [125] have been used in the literature. The literature also describes several bioinformatics repositories [126], summarized in Table 2.

In addition, several techniques and tools are employed in bioinformatics for specific tasks. One such task is microarray data analysis, for which caCORRECT [127] and omniBiomarker [128] are used. For gene–gene network analysis, FastGCN [129], UCLA Gene Expression Tool (UGET) [130], and WGCNA [131] serve specific purposes such as finding disease associations with genes and exploiting GPU parallelism. Several tools have been developed for protein–protein interaction (PPI) analysis, which is a complex and time-consuming process: NeMo [132], MCODE [133], ClusterONE [134], and PathBLAST [135]. For pathway analysis, GO-Elite [136], PathVisio [137], directPA [138], Pathway Processor [139], Pathway-PDT [140], and pathview [141] have been employed.

In protein–protein interaction and protein sequencing, sequencing data are mapped onto specific genomes for tasks such as analyzing genotype and expression variation. One major genomics task is DNA sequencing, where reads produced by millions of sequencing machine runs must be matched against a reference genome, and several techniques exist for this. CloudBurst [142] is a parallel computing model for matching genomes. Evaluated on a 24-core cluster, it ran 24 times faster than a single-core system, and its capability of mapping 7 million short reads improved the scalability of reading huge sequencing datasets. Building on CloudBurst, Contrail [143] was developed to assemble large genomes, and Crossbow [144] to identify single-nucleotide polymorphisms (SNPs).
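The core idea behind short-read mapping can be sketched as a k-mer index lookup; this is a highly simplified, exact-match illustration, whereas real mappers such as CloudBurst add mismatch-tolerant seed-and-extend logic and distribute the index across a cluster.

```python
# Highly simplified sketch of short-read mapping: index the reference genome
# by k-mers, then look up each read's leading seed to find candidate positions.
from collections import defaultdict

def build_kmer_index(reference, k):
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)   # k-mer -> positions in reference
    return index

def map_read(read, index, k):
    # Use the read's first k bases as a seed; exact-match lookup only
    return index.get(read[:k], [])

reference = "ACGTACGGACTTACGTAGG"              # toy reference genome
index = build_kmer_index(reference, k=4)
print(map_read("ACGTAGG", index, k=4))        # candidate start positions: [0, 12]
```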

Hydra [145] is a proteomic search engine software package built on the Hadoop distributed framework. In this distributed computing environment, it processes large peptide and spectra databases to support searching immense volumes of spectrometry data, performing 27 billion peptide scorings in approximately 40 min on a 43-node Hadoop cluster. Another query engine for bioinformatics and genomics researchers is SeqWare [146], built on Apache HBase [147]. SeqWare has an interactive interface to genome browsers and tools, and includes preloaded U87MG and 1102GBM tumor databases used for comparison with other prototypes.

Certain tools are used for error identification in sequencing data. SAMQA [148] is an error identification tool that provides scalable quality standards for large-scale genomic data. ART [149] handles three types of errors in sequencing data: base insertion, deletion, and substitution. CloudRS [150] is a parallel error-correction algorithm based on the RS algorithm [151]. For sequencing and genomic analysis, several frameworks and toolkits have been developed. CloVR [152, 153] is a distributed virtual machine package for sequencing analysis that supports both local and cloud systems. Another virtual machine tool is CloudBioLinux [154], which provides 135 bioinformatics packages for analysis. The Genome Analysis Toolkit (GATK) [155, 156] analyzes large sequencing and genomics data; it is based on a MapReduce-style programming framework and has been used in the 1000 Genomes Project. BlueSNP [157], an R package running on the Hadoop platform, analyzed 1000 phenotypes and found associations among them.

6.3 Clinical informatics

The clinical laboratory is a major source of data related to patients' diseases and health issues. Approximately 80% of clinical data are unstructured, such as clinical documents, radiology and pathology reports, patient discharge summaries, diagnostic testing reports, X-ray and radiological images, and transcribed notes, as shown in Fig. 16. Clinical informatics is the study of information technology (IT) in healthcare for organizing patients' clinical data, laboratory tests, reports, etc. into structured, computerized form to make data retrieval and extraction efficient, which in turn supports effective evaluation and reporting. It drives the development of electronic health informatics systems that improve patient care and management and let data be shared in seconds via computers and the internet. Increasingly, laboratory data are being integrated with other patient data to improve the efficiency of the diagnostic process and to increase their meaningful use in improving patient outcomes. IT-based systems replace manual data entry in records, reports, and documents, and save the time and cost associated with daily record-keeping such as billing and patient scheduling [158]. However, clinical informatics is currently not practiced in many small clinics, hospitals, and laboratories in rural areas because of the barriers to implementing clinical informatics technology [159]. To boost the implementation of Electronic Care Records (ECR) systems as clinical informatics across government hospitals in the USA, HITECH [160] introduced incentives for medical organizations, hospitals, and clinics: doctors and physicians use EHR systems for patient data, which they can share with other providers, make available to patients online, and access anywhere.

Fig. 16 Unstructured clinical informatics

In big data analytics, the first step is to store and manage data in some structured form. Clinical data are stored to record information on patients, hospitals, and other relevant structured and unstructured records; they can then be used to make clinical decisions, assess patients, and plan treatments. Data warehouses and relational databases are the traditional, structured methods of storing and retrieving data; however, clinical data integrated from multiple sources must first be transformed and classified [161, 162]. A detailed systematic review covering work up to 2011 was published in [163]; here we present subsequent related work. Dutta et al. [164] stored EEG data in data warehouses using Hadoop and HBase. Jin et al. [165] analyzed and stored distributed EHR data using big data tools such as Hadoop HDFS and HBase. Similarly, Nguyen et al. [166] stored clinical signal data using HBase. Jayapandian et al. [167] and Sahoo et al. [168] developed a system named 'Cloudwave' for storing and querying voluminous EEG clinical data. Mazurek [169] stored unstructured data in Not Only SQL (NoSQL) repositories to provide fast processing and data mining capabilities, combining relational and multidimensional technologies with NoSQL.

Clinical data are often retrieved and shared interactively for data integration and knowledge sharing, so cloud computing is usually considered for this purpose. Bahga and Madisetti [170] proposed a cloud-based system for interoperable EHRs. Chen et al. [171] examined the present and future informatics aspects of cloud computing. For multi-site clinical trials, researcher interaction was enhanced by the conceptual cloud-based software architecture developed by Sharp [172].

Clinical data are analyzed to predict disease, risk, diagnosis, and progression, and the literature reveals many data analysis strategies for prediction from clinical records. One predictive modeling platform is "PARAMO", designed by Ng et al. [173] for analyzing EHRs and for the generation and reuse of clinical data using a Hadoop cluster; they analyzed the EHRs of 5000 to 300,000 patients and reported promising, time-effective results. Chawla and Davis [174] formulated a patient-centered framework to explain big data approaches for personalized medicine. Similarly, big data for perioperative medicine was illustrated by Abbott [175]. Zolfaghar et al. [176] implemented big data techniques for predictive modeling, conducting an experiment on patient data from the National Inpatient Dataset and the MultiCare Health System for congestive heart failure; they reported a maximum accuracy of up to 77% and recall of up to 61%. Rangarajan et al. [177] proposed a data lake architecture using HDFS for storage: patients with similar health conditions were clustered using K-means, and within each cluster successful recommendations were found by deploying SVM. Wang and Hajli [178] examined 109 case descriptions from 63 healthcare organizations, modeling big data analytics for business transformation using resource-based theory (RBT) and a capability-building view; case occurrences, along with pair-wise connections, constructs, and path-to-value chains, were used to determine business value (see Table 3).

Table 3 Clinical informatics databases

6.4 Public health informatics

Informatics is an "applied information science". It synthesizes the theories and practices of information technology, computer science, management science, and behavioral science into concepts, tools, and methods for implementing public health information systems. Informatics transforms raw data into information effectively, according to users' requirements. Healthcare informatics research is a scientific endeavor that improves both health service organizations' performance and patient care outcomes, as shown in Fig. 17.

Fig. 17 Healthcare informatics research

Public health is monitored through epidemiology, the study of how frequently diseases arise in different groups of people and why. Epidemiological information is used to formulate and evaluate techniques to prevent illness, and it also serves as a guide to managing patients in whom disease has already developed. Traditionally, epidemiology has been based on data collected by public health agencies through health personnel in hospitals, doctors' offices, and out in the field.

The healthcare system is usually the first line of response to clinical events, whether of greater or lesser severity. Informatics is used to identify sentinel events, leading to analyses that can avert potentially devastating effects. One example of this response is the war on cancer announced in 1973, when programmers at the National Institutes of Health fed registry data into the Surveillance, Epidemiology, and End Results (SEER) system. This system provides information that allows public health planners and epidemiologists to analyze the distribution of cancer throughout the population [180]. After many years of monitoring and evaluation, age-adjusted cancer mortality rates have dropped steadily since the early 1990s, with important progress in areas including lung cancer, reflecting the success of public health efforts aimed at controlling precipitants of the disease [181].

Another example of this capacity can be seen in the response to the 2001 bioterrorism attacks. During September 2001, anthrax spores were traced to postal facilities in Trenton, New Jersey, and Brentwood, Washington. Epidemiologists faced a daunting task: the New Jersey facility covered 281,387 square feet, was staffed by 250 employees per shift, and processed over 2 million items of mail per day [182]. Informatics helped identify those who could have been exposed to anthrax, monitored the screening process, and recorded who received antibiotics as well as the distribution of known cases and deaths. Further analytical strategies and significant healthcare research are explained in [183, 184].

In recent years, innovative data sources have been introduced that collect data from individuals directly and instantly through electronic devices. Social media has changed society and created a globally connected world, producing data at an exponential rate daily. Public health (PH) information is produced at a scale and variety that can generally be characterized as big data. Public healthcare data are collected, analyzed, assured, and accessed, and big data analytics techniques are deployed to extract hidden informative patterns from them. Public or social media information is further used to predict, monitor, and diagnose diseases; the effective use of PH data determines the extent to which social health concerns can be identified. The literature includes several survey papers based on data mining [185, 186], deep learning [187, 188], and other approaches [189]. Here we present some of the public healthcare work using social media; the datasets forming the public healthcare data corpus are listed in Table 4.

Table 4 Public health databases

Young et al. [198] gathered 553,186,016 tweets from Twitter and extracted more than 9800 keywords and geographic annotations containing HIV risk words. They showed that social media can monitor global HIV occurrence and concluded that there is a significant positive correlation between HIV-related tweets and HIV cases. Hay et al. [199] facilitated public health surveillance by combining online social media with epidemiological information, developing an atlas for real-time disease monitoring.

Nambisan et al. [200] detected depression from social media messages and tweets, using big data analytic tools to extract hidden, valuable patterns for detecting mental disorders; they concluded that behavioral and emotional patterns in messages reveal symptoms of depression. Tsugawa et al. [201] implemented multiple regression models to detect depressive tendencies, extracting word frequencies from messages on Twitter, the popular micro-blogging service, and achieved a correlation of approximately 0.5.
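The word-frequency-plus-regression approach used in these studies can be sketched as follows; the messages and labels are toy data, and the model choice is an illustrative assumption rather than the cited authors' exact pipeline.

```python
# Illustrative sketch: detecting depressive tendency from message text via
# word frequencies and a linear model. Texts and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = [
    "feeling hopeless and tired all the time",
    "cannot sleep again, everything feels empty",
    "great run this morning, feeling energized",
    "lovely dinner with friends tonight",
]
labels = [1, 1, 0, 0]   # 1 = depressive tendency, 0 = not (toy annotations)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(messages, labels)

# Score a new, unseen message for depressive tendency
print(model.predict_proba(["so tired and hopeless lately"])[:, 1])
```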

Park et al. [202] analyzed depression in 60 participants based on their Twitter activity, using the sentiment words of depressed users; in further work, the same authors detected symptoms of depressive users through Facebook [203]. Choudhury et al. [204, 205] developed a large dataset from Twitter posts using a crowd-sourcing methodology and implemented a probabilistic model to indicate depression levels from social media. Postpartum depression affects up to 20% of mothers and has negative consequences for both mother and child [206]; the same authors [207] detected and predicted the onset of postpartum depression in 165 mothers through data shared on Facebook.

Sadelik et al. [208] predicted infectious diseases through social networks, using 1000 healthcare-related Twitter messages. They applied statistical models to geo-tagged Twitter postings to predict infectious diseases such as flu. Digital media are widely used to improve healthcare monitoring and its effectiveness: Ginsberg et al. [209] used trend models and Google search queries to detect influenza and flu-like illnesses. One of the earliest comprehensive reviews of public health informatics using social media was presented by Hagg et al. [210].

6.5 Medical signal analytics

Technology is advancing rapidly, bringing efficiency to every walk of life, especially healthcare. Healthcare systems currently use a variety of continuous monitoring devices that generate signals. Physiological signal monitoring devices and telemetry devices are pervasive [211] because they improve healthcare management and patient care [212, 213]. These devices produce discretized physiological waveform data and generate alerts in case of an overt event. Certain characteristics of medical signals push them toward big data; the most notable are the volume and velocity produced by the multitude of continuous, high-resolution monitors connected to each patient. The resulting alarm systems are unreliable and cause alarm fatigue for both caregivers and patients [214, 215]. The primary failure of these systems stems from their reliance on a single source of information.

The first step in streaming data analytics in healthcare is the acquisition of signals. Streaming signals from continuous acquisition devices are rarely stored, so accessing live streaming data from the devices is one of the foremost tasks for big data analytics applications. Streaming data collection poses many challenges to healthcare systems, such as network bandwidth, scalability, and cost [216]. Research communities are therefore developing continuous monitoring technologies [217] to capture live monitor signals. The next step is to store the signal data from monitoring devices using big data analytics tools like HDFS, MapReduce, MongoDB [218, 219], etc. Medical data, including signals, are complex because they are interconnected and interdependent across several sources; data integration and aggregation techniques are therefore deployed for effective performance [220, 221]. The workflow of generalized streaming healthcare is depicted in Fig. 18, the most notable signal data repositories in healthcare are shortlisted in Table 5, and a minimal sketch of such a pipeline follows Table 5.

Fig. 18
figure 18

Generalized work-flow of streaming healthcare

Table 5 Medical signal databases (Shortlisted from Physionet)
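To make the acquisition, aggregation and alert workflow of Fig. 18 concrete, here is a minimal sketch; the simulated streams, window size, threshold, and the rule of alarming only when multiple sources agree are all illustrative assumptions, not a clinical design:

```python
# Minimal sketch (illustrative, not a production system): acquire windows
# from several live signal streams, aggregate them, and raise an alarm only
# when multiple sources agree, addressing the single-source failure mode
# noted above. Stream names and the threshold are hypothetical.
import numpy as np

def device_stream(rng, drift=0.0):
    """Simulated monitor: yields one-second windows of a vital-sign signal."""
    while True:
        yield 70 + drift + rng.standard_normal(128)  # e.g. heart-rate samples

rng = np.random.default_rng(0)
streams = {"ecg_hr": device_stream(rng), "ppg_hr": device_stream(rng, drift=1.0)}

ALARM_LEVEL = 70.5                       # illustrative, not a clinical value
for second in range(5):                  # five seconds of streaming
    windows = {name: next(s) for name, s in streams.items()}
    means = {name: w.mean() for name, w in windows.items()}
    # Aggregate across sources: alarm only when every source agrees,
    # suppressing the single-source false alarms discussed above.
    if all(m > ALARM_LEVEL for m in means.values()):
        print(f"t={second}s ALARM", means)
```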

Having introduced medical signal analytics, we now present related work on big data analytics in medical signaling. Han et al. [224] developed a patient care management system on a scalable infrastructure that combined static and continuous data from monitored ICU devices and analyzed and mined the medical data in real time.

Bressan et al. [225] implemented an architecture for the neonatal ICU that used data from EEG monitors, infusion pumps, and cerebral oxygenation monitors. Their proposed system provides an effective decision-support system for clinicians.

Lee and Mark [226] conducted experiments on the MIMIC II database concerning therapeutic intervention for hypotensive episodes. Their system supported intensive care by predicting such episodes from blood pressure and cardiac time-series data.

Sun et al. [227] also used the MIMIC II database, extracting physiological waveform data along with clinical data. They selected cohorts and computed patient similarity within them, which is beneficial for healthcare: the similarity measure supported the treatment of similar diseases and helped derive effective decisions. In another MIMIC II study, Cao et al. [228] developed a system that combined multiple waveform data from the corpus to detect cardiovascular instability in patients at an early stage.

Roux et al. [229] discussed neuro-critical care of patient disorders using different physiological monitoring systems, providing researchers with a platform and guidelines by examining the potential and implications of neuro-monitoring. Rajan et al. [230] used a multi-channel signal acquisition method to develop a physiological signal monitoring system based on NI myRIO connected to a wireless network, also employing Internet of Things (IoT) techniques for better performance in healthcare.

Zhang et al. [231] recognized lung cancer from sensor-based wrist pulse signals using a cubic support vector machine (CSVM). They implemented an iterative slide window (ISW) algorithm for signal segmentation, extracted 26 features, and achieved 78.13% accuracy. Nanda et al. [232] distinguished essential tremor from Parkinson's tremor using non-invasive recording techniques, employing a neural network to classify tremor sEMG signals with 91.66% accuracy.
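The segment-then-classify pattern of [231] can be sketched as follows; the window sizes, the four toy features, and the derived labels are assumptions standing in for the paper's ISW parameters and 26-feature set:

```python
# Hedged sketch of sliding-window segmentation of a 1-D pulse signal
# followed by a cubic (degree-3 polynomial) SVM. Everything numeric here
# is illustrative, not the parameters of [231].
import numpy as np
from sklearn.svm import SVC

def sliding_windows(signal, width=256, step=128):
    """Segment a 1-D signal into overlapping fixed-width windows."""
    return np.array([signal[i:i + width]
                     for i in range(0, len(signal) - width + 1, step)])

def features(window):
    """A few simple time-domain features per segment (illustrative)."""
    return [window.mean(), window.std(), window.max() - window.min(),
            np.abs(np.diff(window)).mean()]

rng = np.random.default_rng(0)
X = np.array([features(w) for w in sliding_windows(rng.standard_normal(4096))])
y = (X[:, 1] > np.median(X[:, 1])).astype(int)   # placeholder labels

clf = SVC(kernel="poly", degree=3).fit(X, y)     # the "cubic SVM"
```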

7 Key findings

This survey presents the emerging landscape of big data and analytical techniques in five sub-disciplines of healthcare. We present the domains in which big data technology has played a significant role in the modern-day healthcare revolution, as it has fundamentally changed how people perceive healthcare activities. The big data analytical techniques deployed in the five sub-disciplines, namely medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics, and medical signal analytics, are explained comprehensively, drawing an integrated picture of how distinct healthcare activities are accomplished in a pipeline to serve individual patients from multiple perspectives. Existing reviews neither provide detailed explanations across multiple sub-disciplines of healthcare nor offer a comprehensive evaluation of studies.

Existing studies discuss different sources of healthcare big data, such as pharmaceutical firms, healthcare providers, diagnostic companies, laboratories, not-for-profit organizations, insurance companies and web-health portals [10, 235,236,237,238]. The big data techniques used for the analysis of healthcare data are machine learning, data mining, cluster analysis, pattern recognition, neural networks, deep learning and spatial analysis. Most studies processed patient data using Hadoop and its tools, which are batch processing tools [240,241,242,243]; some used newer tools like Spark, Storm, GraphLab, etc. for processing real-time and streaming data [242]. Most studies discussed applications of big data analytics in different fields of healthcare such as personalized medicine, clinical decision support, clinical operations optimization and cost-effectiveness. Healthcare analytics has been shown to improve quality of care and the early identification of at-risk patients, and research related to diabetes, gynecology, oncology, cardiovascular diseases and more has saved both time and cost [69, 245,246,247,248] (see Fig. 19).

Fig. 19
figure 19

Deep learning architecture for big data analytics

Given the rapid increase of publications in the biomedical and healthcare industry, we have conducted a detailed review of healthcare analytics in five sub-disciplines. Table 6 summarizes the usability studies of each discipline, covering image visualization, image classification, image retrieval, data and workflow sharing, data analysis, feature selection, bioinformatics classification and clustering, micro-array data analysis, protein–protein interaction, pathway analysis, protein sequencing, query and search engines, error identification in sequencing data, storage and retrieval of EHR, treatment recommendation, business transformation, disease prediction, diagnosis and progression, data security, infectious disease surveillance, population health management, mental health management, chronic disease management, signal acquisition, signal storage from monitoring devices, and signal integration and aggregation. We conclude from this survey that bioinformatics, owing to its complex and massive data, is the primary discipline in which big data analytics is currently evolving and playing a scientific role; many tools, techniques and platforms are used there to analyze biological, genomic, protein and gene-sequencing data. Big data applications remain less developed in the other disciplines, i.e., medical imaging informatics, clinical informatics, public health informatics and medical signal analytics.

Table 6 Comparative analysis of the literature

8 Big data analytics applications

The healthcare sector produces huge amounts of patient data on a daily basis. Traditionally, most of these data were kept as hard copies, but advances in data acquisition devices now let healthcare organizations gather data electronically. Healthcare data analytics has the potential to bring dramatic changes to the healthcare industry, smoothing processes and improving the quality of care. Data analytics researchers, healthcare providers, government agencies and pharmaceutical companies have identified a range of ways in which big data techniques can significantly improve patient outcomes through policy making and evidence-based decisions. Below are the major areas of the healthcare sector where big data analytics has a huge impact:

Strategic planning 'Management is based on early measures: you can't manage if you can't measure'. Healthcare is a time-critical service, and hospitals struggle with patient flow. Machine learning and data analytics play an important role in predicting patient flow, keeping it smooth and reducing waiting periods. Early prediction of hospital visits helps management take the necessary steps to reduce patient waiting time and deliver timely treatment. Applications such as Patient Flow Manager and Q-nomy provide a comprehensive graphical view of patient flow information, drawing on inpatient, elective, emergency, outpatient and other hospital systems. For example, care managers can analyze check-up results among patients in different demographic groups to identify the factors that discourage patients from taking up treatment. The classic example is staff management: how many clinicians and nurses should be on staff at a given time.

For our first example of big data in healthcare, consider the classic problem any shift manager faces: how many people do I put on staff in a given time period? Too many workers, and unnecessary labor costs add up; too few, and customer service suffers, which in this industry can be fatal for patients. As another example, admission trends can be predicted from admission history, i.e., ten years' worth of hospital admission records that data scientists crunch using time-series analysis techniques followed by machine learning to predict future admission trends.
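As a hedged illustration of this time-series-plus-machine-learning approach (not any specific hospital's pipeline), the sketch below forecasts daily admissions from ten years of synthetic history; the series, lag choices and model are all assumptions:

```python
# Illustrative admissions forecast: lag and calendar features feed a
# gradient-boosted regressor. The admissions series here is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

dates = pd.date_range("2010-01-01", periods=3650, freq="D")   # ~10 years
rng = np.random.default_rng(0)
weekly = 10 * np.sin(2 * np.pi * dates.dayofweek / 7)         # weekday cycle
yearly = 20 * np.sin(2 * np.pi * dates.dayofyear / 365)       # seasonal cycle
df = pd.DataFrame({"admissions": 200 + weekly + yearly +
                   rng.normal(0, 5, len(dates))}, index=dates)

for lag in (1, 7, 14, 365):                   # time-series lag features
    df[f"lag_{lag}"] = df["admissions"].shift(lag)
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

train, test = df.iloc[:-90], df.iloc[-90:]    # hold out the last 90 days
cols = [c for c in df.columns if c != "admissions"]
model = GradientBoostingRegressor().fit(train[cols], train["admissions"])
forecast = model.predict(test[cols])          # predicted future admissions
```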

Fraud detection 'Suspect, detect and protect'. Fraud, waste, and abuse impose significant costs, ranging from honest mistakes that result in erroneous billing, to inefficiencies that produce wasteful diagnostic tests, to over-payments due to false claims. Personal health data are extremely sensitive because of their value on black markets; the healthcare industry is thus 200% more likely to experience data breaches than any other. With that in mind, effective fraud detection is very important for reducing cost and improving the quality of the healthcare system, yet it remains a difficult problem: big data have inherent security issues that make healthcare organizations even more vulnerable. Many organizations use analytics to reduce security threats by analyzing changes in network traffic or suspicious behavior that may reflect a cyber-attack. Systems such as WhiteHatAI Centaur, NICE ACTIMIZE, NHCAA, SAS and Optum are used in medical claims processing to identify and detect healthcare fraud, waste, and abuse before they happen. Likewise, data analytics can help prevent fraud and inaccurate claims in a systematic, repeatable way by streamlining the insurance claims process. For example, the Centers for Medicare and Medicaid Services saved over $210.7 million in fraud in just one year.
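A common analytics pattern for claims screening is unsupervised anomaly detection; the sketch below (synthetic data, hypothetical features) flags outlying claims for human review rather than proving fraud:

```python
# Illustrative anomaly-detection sketch for claims screening: an isolation
# forest flags statistical outliers for manual review. Feature names and
# distributions are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Columns: billed amount, number of procedures, days between visits.
claims = rng.normal(loc=[800, 3, 30], scale=[200, 1, 10], size=(1000, 3))
claims[:5] *= 6                                  # inject a few outliers

detector = IsolationForest(contamination=0.01, random_state=0).fit(claims)
suspicious = np.where(detector.predict(claims) == -1)[0]
print(f"{len(suspicious)} claims flagged for manual review")
```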

Resource management 'How you use a facility: many factors push and pull'. Big data are making huge advances in reducing hospital waiting lists. Australia, for instance, has been trying to reduce hospital waiting times for more than two decades, yet despite expensive efforts by the government and healthcare organizations, waiting times have barely changed, with the median even increasing slightly. Efficient and timely resource utilization helps overcome patient-flow problems and reduces the financial burden on the organization. Data analytics continues to make inroads into managing hospital resources efficiently with respect to patient flow and risk; examples include readmission, ambulance and bed utilization.

The common example is 30-day patient readmission, or return visits to an emergency department: identifying patients who are highly likely to return to hospital within 30 days of discharge. A risk prediction model helps identify patients who would benefit from a disease management program, reducing not only readmissions but also healthcare costs.
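A minimal version of such a risk model might look like the following logistic-regression sketch; the three discharge features and the synthetic outcome are illustrative assumptions, not a validated clinical model:

```python
# Minimal 30-day readmission risk sketch: logistic regression on a few
# hypothetical discharge features. Real models add comorbidities, labs
# and prior-utilization variables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.integers(18, 95, n),        # age
    rng.integers(1, 15, n),         # length of stay (days)
    rng.integers(0, 6, n),          # prior admissions in the last year
])
logit = -5 + 0.02 * X[:, 0] + 0.1 * X[:, 1] + 0.5 * X[:, 2]
y = rng.random(n) < 1 / (1 + np.exp(-logit))      # synthetic outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]            # per-patient 30-day risk
```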

Personalized medicine 'A disease and its treatment are as unique as we are'. The promise of personalized medicine is the shift away from 'one size fits all' medicine. Through datafication and genomic fingerprints, far more information about each patient can be analyzed without requiring multiple rounds of testing, and the best treatment can be chosen on an individual basis at a faster rate using personalized data.

Genomics 'The more data you have, the better you can treat'. The human body contains an estimated 30,000–35,000 genes [88, 296], and from the structure of human DNA it is estimated that 23 chromosomes carry 3.2 billion base pairs [297, 298]. Sequencing inflates these data dramatically, to about 200 gigabytes. Big data analytics is therefore required for the genomics and sequencing practices used to treat complex diseases like Crohn's disease and age-related macular degeneration [91]. Genomic data analytics has great potential to improve healthcare outcomes, quality, and safety, as well as to save costs.
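The roughly 200-gigabyte figure can be sanity-checked with a back-of-the-envelope calculation; the 30x coverage and 2 bytes per base are our assumptions, and actual sizes vary with compression and pipeline:

```python
# Rough check on the ~200 GB figure quoted above (assumptions: 30x
# whole-genome coverage, ~2 bytes per base once quality scores are
# included, uncompressed).
genome_bases = 3.2e9            # base pairs in the human genome [297, 298]
coverage = 30                   # assumed typical sequencing depth
bytes_per_base = 2              # assumed: base call plus quality score
raw_bytes = genome_bases * coverage * bytes_per_base
print(f"{raw_bytes / 1e9:.0f} GB")   # 192 GB, on the order of 200 GB
```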

Disease prediction and prevention 'Precaution and care can help you live longer'. Many healthcare organizations, research labs and hospitals are leveraging big data analytics to change their models of treatment delivery. Big data analytics thus has tremendous applications in the healthcare domain for reducing cost overheads, detecting and curing diseases, predicting epidemics and enhancing the worth of human life by averting deaths. A number of projects from Google, DeepMind, IBM, the Royal Free London NHS Foundation Trust, Imperial College Healthcare NHS Trust and others have proved the importance of deep learning and machine learning for detection, identification, diagnostics and predictive analytics. DeepMind collaborated with Moorfields Eye Hospital to analyze anonymized eye scans, searching for early signs of diseases leading to blindness, and projects signed with the Royal Free London NHS Foundation Trust and Imperial College Healthcare NHS Trust are developing new clinical mobile apps linked to EHR.

Big data have transformed healthcare by putting data to work and revealing clinical and operational insights. IBM's most applicable offering here is IBM Content and Predictive Analytics for healthcare, the first industry-specific analytics solution that enables organizations to simultaneously analyze the past, see the present and predict the future. Another major application of big data analytics in the healthcare domain is medical image processing: healthcare produces enormous numbers of medical images, such as X-ray, CT and PET–CT images, MRI, ultrasound, fluoroscopy and photoacoustic imaging. These images constitute big data used for detection, diagnosis, assessment, therapy decision-making, etc. [299].

The heart is a fundamental organ of the body: if it stops working, the body cannot survive. Among its several disorders is the heart attack. Big data analytics facilitates early prediction of heart attacks through early detection systems based on medical biosensors [300, 301]. There are also online systems [302] and healthcare information systems [303] that provide guidance about heart disease using IoT and Hadoop techniques.

The brain is the vital organ that controls all the body's activities, much like the CPU of a computer; data mining and data analytics tools are therefore deployed to detect brain disorders such as Parkinson's disease [304], [?], [72, 305, 306]. Diabetes is one of the most common diseases in the world. Big data analytics tools like Hive and R are used for the analysis of diabetes using descriptive datasets [307, 308], and efficient predictive models have been established to reveal data relevant to the investigation of diabetes.

There are also online applications that facilitate the healthcare domain remotely. AmWell, Practo, Portea, Isabel, etc. are the most popular apps, used for purposes such as booking doctor appointments at hospitals and clinics, patient diagnosis, ordering medicines, and remote consultation with a doctor for treatment [309]. Summarizing the applications of big data analytics in healthcare [2, 310], we conclude that big data help identify and diagnose patients accurately and precisely, predict and manage health risks such as obesity, and detect fraud efficiently, thereby reducing cost and variation in care and eliminating duplicate care and improper claim submissions.

9 Challenges and open research issues

The healthcare sector suffers from multiple challenges, ranging from new disease outbreaks to preserving optimal operational efficiency. Data mining and data analytics have tremendous potential in the development of healthcare applications to overcome these challenges; however, success hinges on the availability of quality data, and there is no magic recipe for successfully applying data analytics methods to an arbitrary problem. The successful development of data analytics-based applications thus depends on how data are stored, prepared and mined. Healthcare analytics nevertheless poses a series of challenges when dealing with an enormous amount of complex data, involving data complexity, access to data, regulatory compliance, information security and efficient analytics methods, inter-operability, manageability, security, development, re-usability, open data, missing data and data heterogeneity.

9.1 Multiple source information management

In healthcare data analytics, the main goal is to analyze real-world medical data to perform prediction or classification tasks. One of the biggest hurdles in developing such applications is the data structure: how medical data are spread across many sources, and how they are stored, prepared and mined. One of the worst examples of lacking data sharing: a woman suffering from mental illness and substance abuse visited a variety of local hospitals in Oakland, California, USA more than 900 times in less than 3 years. The result was heavy cost, extensive use of hospital resources and, more importantly, far greater difficulty for the woman in getting good care.

Healthcare correlations are leveraged across longitudinal records, i.e., complex, heterogeneous, distributed and dynamic data; in the US alone, healthcare data reached 150 exabytes in 2011 and are expected to reach the zettabyte scale soon. Despite the rapid increase in EHR adoption, there are several challenges in making this information useful, readable and relevant to the physicians and patients who need it most. A key challenge for the healthcare industry is how to manage, store and exchange all of these data, and inter-operability is considered one of the solutions. The poor inter-operability of current EHRs makes big data analytics in healthcare challenging; integrating different data sources would require a new infrastructure in which all data providers collaborate to share data. Another challenge is data privacy, which limits data sharing by blocking out significant patient identification information such as MRN and SSN. Healthcare needs to catch up with other industries that have already moved from standard regression-based methods to more future-oriented techniques like predictive analytics, machine learning, and graph analytics. Big data technologies such as data ingestion, data modeling, and data visualization are integrated with existing tools to provide a supported enterprise solution.

Big data management is hard because a large cluster of data must be monitored and managed; most patients visit multiple clinics seeking a reason for their disease and a medical solution for their illness, and the many management tools integrated to address this can be overwhelming and costly. Proficiently handling large volumes of medical imaging data and extracting potentially useful information is another hard task. Hospitals have yet to achieve a sufficient level of inter-operability, and without it, improving patient care is almost impossible; the US Health Department is aiming for inter-operability between disparate EHRs by 2024. Medical stakeholders (physicians, administrators, patients, etc.) believe inter-operability will improve patient care, reduce medical errors and save costs. Imagine having the insight and opinions of hundreds of IVF/PGD patients to inform your decision before undergoing treatment, rather than relying only on a physician's recommendations. Given the importance of data integration, healthcare organizations are turning to implementing inter-operability. To achieve a high level of it, HL7, HIPAA, HITECH and other health standardization bodies have defined standards and guidelines that help organizations determine whether they meet inter-operability and security standards. The Authorized Testing and Certifying Body (ATCB) provides a sovereign, third-party opinion on EHRs, with two types of certification (CCHIT and ARRA) used to evaluate systems; the review process comprises standardized test scripts and exchange tests of standardized data.
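To illustrate what standards-based exchange looks like in practice, the sketch below parses an HL7 FHIR Patient resource; the record content is a hand-written example, and the endpoint mentioned in the comment is a hypothetical placeholder:

```python
# Minimal sketch of standards-based exchange: an HL7 FHIR Patient resource
# is plain JSON with a defined schema, which is what lets disparate EHRs
# share records across vendors.
import json

# In practice this JSON would come from a request such as
#   GET https://fhir.example-hospital.org/r4/Patient/12345   (hypothetical)
#   Accept: application/fhir+json
raw = """{
  "resourceType": "Patient",
  "id": "12345",
  "name": [{"family": "Doe", "given": ["Jane"]}],
  "birthDate": "1980-04-02"
}"""

patient = json.loads(raw)
assert patient["resourceType"] == "Patient"
print(patient["name"][0]["family"], patient["birthDate"])
```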

9.2 Security, privacy, and confidentiality

Every stakeholder in the health industry has a role to play in ensuring the security and privacy of patient information; it is a shared responsibility. Patient privacy and information security are fundamental components of a well-functioning healthcare system and help achieve better health outcomes, healthier people, and smarter spending. A patient may, for example, withhold certain information or ask a physician not to record health information out of a lack of trust and the perception that the information might not be kept confidential. This attitude puts the patient at risk, deprives physicians and researchers of important information, and puts the organization at risk in its clinical outcomes and operational efficiency analyses. To reap the benefits, providers and individuals must trust that patients' health information is kept private and secure. At the same time, providers face several challenges in managing privacy and security at a standard that meets patients' satisfaction, e.g., performing efficient data analysis without granting access to precise data in specific patient records. Security and privacy in data analytics pose several challenges, especially when information is drawn from multiple sources.

The major goal in healthcare is not merely to protect patient privacy but to save lives. The HIPAA (Health Insurance Portability and Accountability Act) of 1996 comes to mind whenever privacy is debated in the health sector: it grants patients legal rights concerning their personally identifiable information and establishes healthcare providers' responsibilities to protect it and restrict its use or disclosure. As the amount of healthcare data escalates, data analytics researchers foresee huge challenges in ensuring the anonymity of patient information to prevent its misuse or disclosure. Limiting data access, unfortunately, reduces information content that might be very important. Moreover, real data are not static but grow and vary over time, and none of the existing techniques yields useful released content in this scenario.
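One common building block for balancing analysis against privacy is pseudonymization; the sketch below replaces direct identifiers with keyed hashes (an illustration with hypothetical field names, not a complete HIPAA de-identification pipeline):

```python
# Illustrative pseudonymization sketch: direct identifiers are replaced
# with salted, keyed hashes so records can still be linked for analysis
# without exposing MRN/SSN. Field names are hypothetical.
import hashlib
import hmac
import os

SECRET = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymize(identifier: str) -> str:
    """Keyed hash: linkage stays stable across datasets but is
    irreversible without the secret key."""
    return hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"mrn": "00123456", "ssn": "123-45-6789", "age": 54, "dx": "I21.9"}
safe = {**record, "mrn": pseudonymize(record["mrn"]),
        "ssn": pseudonymize(record["ssn"])}
print(safe)
```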

9.3 Advanced analyzing techniques

Technological advancements (wearable devices, patient-centered care, etc.) are transforming the entire healthcare industry. The nature of healthcare data has progressed: EHRs have simplified data acquisition with the help of the latest technology but, unfortunately, lack the ability to aggregate, transform, or perform analytics on the data; their intelligence is restricted to retrospective reporting, which is insufficient for data analysis. A plethora of algorithms, techniques, and tools are available for examining complex data. Traditional machine learning deploys statistical analysis based on a sample of the total dataset; for data of this scale, such methods are inefficient and computationally infeasible. The combination of huge volumes of healthcare data and modern computational power lets analysts focus on analytics techniques that scale to the volume, velocity, and variety of complex data. The last decade has seen a dramatic change in the size and complexity of data, and several emerging data analysis techniques have been presented in response.

Healthcare needs to catch up with other industries that have already progressed from traditional methods to advanced ones like predictive analytics, deep learning, and graph analytics. Innovative analytics techniques need to be developed to interrogate healthcare data and gain insight into the hidden patterns, trends, and associations in the data. Such techniques deduce relationships without needing a specific model and enable the machine to identify patterns of interest in huge unstructured data. As one example, a deep learning algorithm that observed Wikipedia data learned on its own that California and Texas are both states in the U.S.; it did not have to be explicitly modeled to understand the concepts of country and state, and this is a fundamental difference between older machine learning and emerging deep learning methods.

9.4 Data quality: open data, missing data, and data heterogeneity

Gone are the days when healthcare data were small, structured and collected exclusively in electronic health records. With the tremendous advancements in IT, wearable technology and other body sensors, data have become very large (moving to big data), unstructured (80% of electronic healthcare data are unstructured) and non-standard, and arrive in multimedia formats. This variety makes the data both challenging and interesting to analyze. Currently, the quality of healthcare data is a cause for concern for several reasons: incompleteness (missing data), inconsistency (data mismatches within or across EHR sources), inaccuracy (non-standard, incorrect or imprecise data), heterogeneity, and data fragmentation. Data quality work involves a group of techniques: data standardization, verification, validation, monitoring, profiling, and matching. The problem of poor data in the health industry has reached epidemic proportions and has several pernicious effects, particularly for disease prevention; dirty data mostly involve missing values, duplication, outliers and stale records.

Although real-time data monitors (especially in ICUs) are partially used in most hospitals, real-time data analytics is not yet in practice. Hospitals are moving to real-time data collection, and in the near future real-time analytics will revolutionize the healthcare industry, enabling the early identification of infections, the continuous monitoring of treatment progress, the selection of the right drugs, etc., which could help reduce morbidity and mortality. Achieving real-time data processing requires data standardization and device inter-operability.

Another common issue is data standardization. Structuring even 20 percent of the data has shown its value, but clinical notes are still created by the billions because free text is how a physician can best describe the clinical encounter. Empowering physicians while maintaining data quality is quite challenging. So far, these data have been excluded from data analytics because they exist in natural language rather than in discrete form. Transforming this unstructured data into a discrete form requires efficient intelligent technology and has been a very difficult problem for medical IT until now; the only practical way to use such unstructured, non-standard data is NLP that translates the text into discrete data using ICD or SNOMED CT.
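As a toy illustration of mapping free text to discrete codes (a dictionary lookup standing in for real clinical NLP, which also handles parsing, ranking and negation), consider the following sketch; the term-to-code table is illustrative, not a complete vocabulary:

```python
# Toy illustration only: dictionary lookup standing in for clinical NLP.
# The ICD-10 mappings below are a tiny illustrative sample, not a full
# vocabulary, and no negation detection is attempted.
ICD10_TERMS = {
    "myocardial infarction": "I21.9",
    "type 2 diabetes": "E11.9",
    "hypertension": "I10",
}

def code_note(note: str) -> list[str]:
    """Return ICD-10 codes for terms found in a free-text clinical note."""
    text = note.lower()
    return [code for term, code in ICD10_TERMS.items() if term in text]

print(code_note("Pt with known hypertension, r/o myocardial infarction."))
# ['I21.9', 'I10']; real systems also rank and negate findings
```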

Heterogeneous data are any information with high variability of information types. They are low-quality and ambiguous owing to broad information access, data redundancy, missing values, and untruthfulness, and it is difficult to integrate heterogeneous information to satisfy business data needs; such information is, for instance, frequently produced by the Internet of Things (IoT). The challenges for big data algorithms concentrate on algorithm design to tackle the difficulties raised by big data volumes, distributed data, and complex and dynamic data characteristics. The process involves the following stages. First, heterogeneous, incomplete, uncertain, sparse, and multi-source data are pre-processed by data fusion techniques. Second, dynamic and complex data are mined after pre-processing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the pre-processing stage; the model and its parameters are then adjusted according to the feedback. Throughout this process, information sharing is not only a promise of the smooth development of each stage but also a purpose of big data processing [311].

10 Conclusion

Big data analytics has grown exponentially and plays a vital role in the progression of healthcare practice and research. It provides tools to collect, analyze, manage and store large volumes of structured, unstructured and complex data. Big data have brought dramatic change to healthcare, reducing the cost of treatment, accelerating the identification of diseases such as cancer, and improving quality of life. They have recently been applied to aiding healthcare personnel, care delivery, early disease detection, disease exploration, patient care, and community services.

In this paper, we have discussed big data analytics methods, tools, techniques and architectures in the healthcare domain. We have focused on five major sub-disciplines of healthcare, i.e., medical image processing and imaging informatics, bioinformatics, clinical informatics, public health informatics and medical signal analytics, along with the techniques, tools, and repositories deployed in each. These disciplines play a vital role in healthcare and biomedicine due to the enormous amounts of data involved.

Healthcare providers have had no direct incentive to share patient information with each other, which has made it harder to harness the power of analytics in the healthcare industry. We can change how healthcare providers use modern advances and sophisticated technologies to gain understanding from their clinical data warehouses and information repositories, extracting informative patterns for decision-making. In time, we will see the rapid, widespread implementation and use of big data analytics across healthcare organizations and the healthcare industry. To that end, several difficulties must be addressed: the potential is extraordinary, but multiple-source information management, privacy protection, security safeguards, standards and governance, advanced analysis techniques and data quality remain the notable challenges in the domain. Regardless, the future of big data in the healthcare system has the capability to enhance and accelerate interactions among clinicians, executives, logistics managers and analysts by reducing costs, lowering risks and improving personalized care.

Implementing big data analytics is the responsibility of all stakeholders in the healthcare industry, who must make and review big data policies to improve patient outcomes. Government agencies, healthcare professionals, hardware companies, pharmaceutical industries, the public, data scientists, researchers, and vendors must all be involved in developing the big data framework that will set the future direction of big data analytics in the healthcare industry.