1 Introduction

The expansion of digital knowledge is generating vast amounts of data, on the order of exabytes, which challenges researchers and scientists to design and develop automated technology. Large databases demand painstaking effort from researchers, who must handle them and build knowledge-based technology for future decision making. In fact, the EHR (Electronic Health Record) has undergone a dramatic shift that has permanently changed the global healthcare application domain. The shift is due to novel ICT (Information and Communication Technology), which has replaced traditional tools with a new generation of sensor-based technology, imaging, scanning and other technological advancements. The focus is to retrieve hidden patterns and information from big databases using new technology that can demonstrably improve clinical decision making while maintaining the privacy and security of patients’ data. Big data analytics is therefore an anticipated technology that can be widely applied in healthcare, in areas that include insurance fraud detection, treatment, prediction of disease and identification of factors related to healthcare costs [1,2,3,4,5]. The ultimate goal is to deliver effective and efficient treatment that benefits end users in future decision making.

Big data analytics can be seen as a newly adopted IT-based approach that offers healthcare practitioners wide benefits by supporting better clinical decision making. However, concern arises wherever the privacy and security of big data are at stake in healthcare. The challenge is thus to handle the complex and voluminous nature of the data, and the next level of difficulty is to establish and protect patient-level data. Such complex tasks are addressed with novel algorithmic technology and standards [5,6,7].

Big data analytics combined with healthcare technology can vastly improve the efficiency and effectiveness of patient care by providing insight into data with utmost security and privacy. It can support various healthcare processes, including patient data flow, length of patient stay during hospitalization, insurance data and other cost-related features, to improve the quality of care. In general, big data analytics is an important technology for producing outcomes that reduce the global burden of disease, with a focus on data privacy and security. Data analytics techniques therefore have a broad vision: to develop secure and sustainable tools for varied application domains, which include healthcare, GIS (Geographical Information Systems), imaging, industry, banking and others [8,9,10,11]. In the technological context, electronic databases have advanced hugely in volume and complexity, and knowledge discovery has emerged as a powerful tool for analysing such data. Knowledge discovery aims to detect hidden patterns and knowledge that can benefit end users in varied application domains, subject to privacy and security concerns.

Big data analytics discovers hidden patterns using various algorithmic techniques. Artificial Neural Networks are based on a biological model in which brain neurons send signals to different parts of the body to trigger appropriate actions; in a similar way, the predictive model learns from training datasets and estimates the likelihood of disease prognosis for future outcomes. Decision trees are widely used to classify data by building a tree-like structure from the training datasets. Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID) are two of the best-known techniques for creating decision trees. The K-Nearest Neighbour method identifies the closest neighbours of a record with respect to a distance metric and classifies each record according to the group of its k nearest records. Genetic algorithms represent a newer technique, in which candidate solutions evolve through combination, natural selection and mutation to discover hidden patterns with utmost security and privacy [12,13,14,15,16].

A major concern facing big data in healthcare is security and privacy, where emerging threats can arise from gaps and disclosures in adaptive information systems. This has created an immense challenge for researchers and scientists around the globe, who must deal with the complementary issues of security and privacy. Healthcare is a very sensitive domain in which patient records should be kept strictly private regardless of any changes in policy and guidelines [17,18,19,20,21]. Patients suffer from various critical diseases, including HIV, AIDS, cancer, tuberculosis and others, and may not want to share details with any organization, since disclosure can harm their basic social needs and well-being [8, 22,23,24,25,26].

This invasion of data privacy and security should be treated as a persistent threat and discussed as a critical issue in the context of patient privacy. The threat has prompted several questions and studies in which privacy and security were the foremost challenge for researchers and scientists. Privacy is therefore the major concern in healthcare databases and must be handled with the highest priority in decision making. As a result, developers are complementing their studies with big data analytics to assess the nature of the data and provide comprehensive technology for analysis.

In general, big data analytics offers varied opportunities that can help healthcare practitioners transform applications to retrieve effective and efficient patterns, but it faces multifaceted challenges, which include privacy and security, complexity, high dimensionality and others. Privacy and security are pressing issues that raise serious concern about protecting patients’ records from cyber threats. Healthcare practitioners are increasingly concerned with finding an appropriate technology that can adequately safeguard the privacy and security of the organization. Indeed, current gaps in technology have motivated efforts to control security breaches and to make effective use of healthcare data. In the current study, we discuss work related to big healthcare databases and present the risks associated with the security and privacy of data.

We then address the big data analytics approach while maintaining the privacy of healthcare databases for future knowledge discovery. The objective was to design and develop a novel framework that integrates big data with privacy and security concerns and determines knowledgeable patterns for future decision making. The framework is implemented using an unsupervised learning technique in STATA and MATLAB 7.1 to develop patterns for the knowledge discovery process. In the current research we applied big data analytical techniques to healthcare databases of patients suffering from HIV (Human Immunodeficiency Virus) and TB (Tuberculosis) coinfection, to develop trends and detect patterns involving socio-economic factors while dealing with privacy and security risks in healthcare databases. The focus of the study was to retrieve hidden patterns and information from big databases using new technology that can demonstrably improve clinical decision making while maintaining the privacy and security of patients’ data. Big data analytics is thus an anticipated technology that can be widely applied in healthcare, in areas that include insurance fraud detection, treatment, prediction of disease and identification of factors related to healthcare costs.

The rest of the paper is organized as follows. Sect. 2 gives an overview of the impact of big data on healthcare and of the associated privacy and security concerns, with a focus on past literature. Sect. 3 presents a novel framework designed to capture big data effectively and efficiently from various sources while maintaining the privacy and security of the data, and to detect hidden patterns in large-scale databases for clinical decision making. The implementation of the framework is discussed in Sect. 4. Finally, Sect. 5 presents the conclusion.

2 Privacy as Concern in Big Healthcare Databases

In the current high-technology era, seamlessly available resources have driven an unprecedented growth of databases, described by the term “Big Data”. The data stored in varied sources include structured and unstructured data, text and images; such complexity is difficult to manage with traditional statistical technology. Analytical techniques therefore need to keep pace with the upcoming challenges of big data in varied application areas.

In general, big data is widely applied in diverse areas of healthcare, enabling healthcare practitioners and researchers to gain insight into data for better decision making. The challenge is to build a more efficient patient model that improves overall cost and patient privacy. A similar approach was taken in the project “WORLDII”, an initiative of the New Zealand Privacy Commissioner that aims to legally protect the privacy of data flowing from one location to another [27]. Patients’ data is a sensitive matter in which big data must be prioritized together with privacy needs. A data directive was issued by the European Council for data flow in compliance with e-health, to ensure that personal data is processed with utmost security and that credentials and private data are not freely accessible.

Foremost, to provide a platform that organizations should follow for data privacy, a guideline was issued by the Organisation for Economic Co-operation and Development (OECD). The policy ensures that individual patient data, or personal data (PD), is secured so that patients do not suffer arbitrary losses [28]. These policies are minimum standards laid down for the eHealth sector and should be followed for the protection and privacy of patient data.

In general, security and privacy for big data in healthcare is an open-ended research area in which access to and control of data are complex. Healthcare organizations must therefore have security measures that protect data flow and that can be embedded in the integrated hardware and software systems handling big data. The data lifecycle is a new security context for big data, referred to in three aspects: data security, control of and access to the data, and the relevant information security [9]. The data lifecycle was established to enable effective and efficient decision making about data.

Moreover, the technological need is to determine effective and efficient patterns from big healthcare data while maintaining the privacy and security of the data. Big data analytics is a widely anticipated technology with the potential to retrieve hidden patterns and information from raw datasets. The detected patterns can be used in real-world application domains where decision support mechanisms benefit healthcare practitioners. Medical databases have changed vastly over the past two decades owing to IT-based technology: new types of datasets such as EHR (Electronic Health Records), imaging and radiation data are generated at a speed that is difficult to handle with traditional tools. Big data analytics has the ability to handle the complex nature of such data and detect knowledgeable patterns for decision making [5,6,7, 12].

The healthcare sector transmits data at great speed but lacks a delivery support system that generates validated predictive results while respecting the privacy and security of big data. The matter is complicated because healthcare practitioners are often unaware of the threats that can compromise patients’ personal data. Implementing big data with security concerns in mind therefore remains an open area of research for scientists and healthcare practitioners [29,30,31,32,33,34].

In the past, several data mining techniques were used to breach the security and privacy of data, gathering sensitive data and publicly revealing secure or personal details. Security therefore remains a complex task for which sophisticated technology should be adopted to analyse big data. In the current study, a big data analytical approach is applied with privacy to healthcare data, providing minimal access to patients’ personal data while retrieving effective and efficient patterns for knowledge discovery [21, 35,36,37,38,39].

In the current approach we applied big data analytic techniques to patients suffering from Tuberculosis (TB) and HIV while maintaining the privacy of patients’ personal data. TB affects people of all ages around the globe. In 2012, however, about 80% of TB cases were reported from just 22 countries, showing a greater prevalence of the disease in some countries than in others. This increased prevalence can have multiple causes, including population, lifestyle, major occupation, gene pool of the region and other socio-economic factors. HIV and tuberculosis have always been linked to the economic and financial status of individuals [15].


Several data analytics studies are thus interlinked in determining the cause of TB and its prognosis with HIV [1,2,3,4], but privacy remains a separate subject of study. In healthcare, several procedures are discussed for maintaining the privacy of data, which include:

  1. Authentication

Authenticating both the data and the user, and verifying that an application behaves authentically, is a complex matter in which every organization needs to establish and verify the claims presented. It is among the most vital concerns in any organization, where the focus is to authenticate each user while detecting fraudulent behaviour among users.


In the past, several threats have emerged that led to specific problems, notably eavesdropping on patient health records. Eavesdropping is unauthorized access to communication at the network layer, in which the attacker illegally sniffs the communication network and obtains patient data through unlawful interception. Another security breach, the man-in-the-middle attack, is a commonly known attack in which communication between two parties is breached by a third party, who sits in the information channel and gains access to the entire data flow of the communication protocol. To deal with these threats, endpoint authentication processes are used, which include cryptographic protocols.

  2. Encrypting Data

Data encryption means that the data as a whole is encrypted to minimize security breaches in data flow. In a healthcare organization, data moves between healthcare practitioners, patients and the hospital, so the devices are connected to each other via a network. Encrypting the data can effectively limit what sniffed packets reveal and so minimize the threat. Further, the number of keys held by each node should be minimized to reduce the causes of privacy and security breaches. Several encryption algorithms have been developed in the past, but their success on big data is still an open subject of study.

  3. Integrity

The integrity of data should be maintained, meaning that transferred information must not be modified by an attacker. In a typical attack, the attacker replaces original values with modified ones. This is among the most popular attacks on data, in which the user’s personal data is tampered with; such data may include social security number, date of birth, address and other values. Several data anonymization techniques, including k-anonymity, are discussed to protect values from being replaced with modified ones. These methods suffer from various drawbacks in big databases, so a more effective technique needs to be developed and deployed with privacy and security needs in view.

  4. Auditing

Secure data auditing is required to detect security and privacy breaches in the network, or any intrusion. Auditing can summarize user activity by examining the log records of healthcare databases, so as to detect any modification of or access to data. Several past intrusion detection studies measure traffic flow or data flow. One solution is to store the data in a distributed network, so that the healthcare system remains protected if a security breach occurs. In the context of big data flowing over a network, the system should be able to find abnormalities in the network and raise alerts in a heterogeneous environment. Several integrated frameworks are therefore discussed for deployment in real-world scenarios.

  5. Availability

Data must be provided to legitimate users whenever required, since any delay in information can affect the overall patient diagnosis and lead to clinical implications. However, control of and access to the data should be verified for the authentic user, and the control policy should also be governed by prioritized user access. Such a system can meet patients’ privacy requirements, since specific privilege permissions are granted to users while control remains with the administrator.

3 A Novel Framework for Privacy in Big Data Analytics

Healthcare organizations face a day-to-day challenge in managing their data and safeguarding it from cyber-attacks. Meeting the growing need for data privacy and knowledge discovery from big databases is a challenging task, in which the focus is to generate patterns for future diagnosis and prognosis of disease with privacy and security. In the current approach we applied big data analytic techniques to patients suffering from Tuberculosis (TB) and HIV while maintaining the privacy of patients’ personal data. The approach focuses on detecting patterns in healthcare databases for future decision making. The novel framework is designed to capture data from varied heterogeneous sources for pattern discovery. Figure 1 shows the framework for retrieval of information and decision making.

Fig. 1 Novel framework for privacy in big data analytics

3.1 Data Capture

In this study, the entire population of the US across all 50 states, around 323.1 million, was taken into consideration. The population mainly comprised people working in the service sector. The US population includes people of different races, such as Hispanic, Non-Hispanic, African American, Asian, Native Hawaiian, American Indian, Latino, White and Pacific Islander. The US economy is classified as a developed economy by categorising organisations such as the World Bank and the United Nations (UN). Data was obtained from OTIS (Online Tuberculosis Information System), a data repository of the CDC (Centers for Disease Control and Prevention), which is a major operating component of the Department of Health and Human Services. The CDC conducts extensive research and provides valuable information about different health issues. The data obtained consist of information on 179,625 patients, subdivided into the following categories:

  • Year: the data contained information on TB patients from 1993 to 2014.

  • Age group: the database was classified by age into the groups 0–4, 5–14, 15–24, 25–44, 45–64 and 65+ years.

  • Race/ethnicity: the TB patients were classified under the subcategories “Asian, Non-Hispanic”, “Multiple Race, Non-Hispanic”, “American Indian or Alaska Native, Non-Hispanic”, “White, Non-Hispanic”, “Native Hawaiian or Other Pacific Islander, Non-Hispanic”, “Black or African American, Non-Hispanic” and “Hispanic or Latino”.

  • HIV status: the data contained information about the HIV status of the patients.

  • Socio-economic practices: the data recorded whether the TB patients consumed alcohol or other types of drugs (injecting or non-injecting).

  • Vital status: the data recorded whether the TB patients were alive or had died due to TB.

3.2 Data Pre-processing

Data pre-processing is among the major steps in the evaluation of patterns. Real-world datasets contain missing values, noisy values and inconsistent records, which need to be handled effectively for diagnosis and prognosis of disease. Data pre-processing comprises varied steps, which include data cleaning, data transformation, data selection, data integration and others. These steps are pathways to high-quality data that can deliver meaningful results for future prediction of disease. After pre-processing, the dataset is ready for further investigation. In the current approach, we applied data cleaning, data transformation and data selection for discovery of knowledge from the large-scale database [25].

Data Cleaning Real-world datasets contain missing values and noisy values, which can produce erroneous results and thereby affect overall decision making. Data cleaning was therefore applied to replace missing values with mean values, so that effective and efficient patterns could be retrieved for the knowledge discovery process. Several techniques were examined for this step: replacing missing values with NULL values; entering the values manually, which is very time-consuming; and replacing the missing value with the mean or median, which proved the most appropriate technique and yielded relevant results for knowledge discovery.
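As a minimal sketch of the mean-replacement step described above, the following Python fragment fills missing entries of one numeric column with the mean of the observed entries. The function name `impute_mean`, the use of `None` to mark a missing value and the sample column are assumptions made for illustration; the study itself reports using STATA and MATLAB, and this is not the authors' code.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the column has no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical numeric column with two missing entries (not study data).
ages = [34, None, 52, 46, None]
imputed_ages = impute_mean(ages)
print(imputed_ages)
```

Replacing `mean` with `statistics.median` would give the median variant mentioned in the text, which is less sensitive to outliers.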

Data Transformation This technique rescales data values of varied ranges to a common scale from 0.0 to 1.0. Transformation is used to normalize dataset values for several techniques, including classification, neural networks and clustering. Well-known data transformation techniques include Z-score normalization, decimal scaling, which scales values by powers of ten, and min–max normalization. In the current approach, we used the min–max technique for transformation of data.
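The min–max technique maps each value v to (v − min) / (max − min), so that the smallest value becomes 0.0 and the largest 1.0. A small Python sketch is given below; the function name `min_max_scale` and the sample values are illustrative assumptions rather than part of the original implementation.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical attribute values (not study data).
scaled = min_max_scale([10, 20, 40])
print(scaled)
```

Because distance-based methods such as k-means are sensitive to attribute scale, applying this rescaling to every numeric attribute keeps any single attribute from dominating the distance computation.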

Data Selection Feature selection is an essential step for identifying the most appropriate features on the basis of their correlations and for removing irrelevant and superfluous attributes. The dimensionality of the data is thereby reduced, which can considerably improve the efficiency of the algorithm used for discovery of patterns and knowledge. Several techniques, such as gain ratio, information gain and correlation-based techniques, are applied to determine the most effective attributes for knowledge discovery. We applied a correlation-based technique for data selection, as the features were more correlated with the class values than with each other and no other technique gave better results.
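One simple way to realise a correlation-based selection is to keep the features whose absolute Pearson correlation with the class value exceeds a threshold. The sketch below illustrates that idea only; the paper does not state which correlation-based method or threshold was used, so the threshold of 0.5, the function names and the toy columns are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (sd_x * sd_y)


def select_features(columns, target, threshold=0.5):
    """Keep feature names whose |correlation with target| is at least threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]


# Hypothetical columns and class values (not study data).
columns = {"age": [20, 35, 50, 65, 80], "noise": [3, 1, 4, 1, 5]}
target = [0, 0, 1, 1, 1]
selected = select_features(columns, target)
print(selected)
```

A fuller correlation-based method would also penalise features that are strongly correlated with each other; this sketch covers only the feature-to-class part.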

3.3 Medical Data and Privacy Preserving Data mining

Medical data, and patient health data in particular, require strong privacy protection to safeguard patients’ details. Medical records can be snooped on by insurance companies, medical laboratories or advertising firms to further their commercial interests. Since clustering does not rely on class labels, it is the first choice of data analytics method in conjunction with privacy preservation: patient attributes need not be mapped to specific class labels for predictive analysis. A patient can be identified by primary attributes such as name, age or social security number, or by a set of secondary attributes such as occupation, history of diagnosis and so on. Together, these attributes make up a dataset that can uniquely identify a person. To provide privacy to the patient, we tested a hybrid solution comprising vertical partitioning of patient attributes, anonymization through generalization of some specific attributes (for instance, generalization of age into a range of age groups), and finally analysis through K-means clustering.
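The first two stages of this hybrid solution can be sketched as follows, under stated assumptions: the record layout, the field names and the choice of which fields count as identifiers are hypothetical, while the age groups are those listed in Sect. 3.1. Vertical partitioning here simply means splitting each record by column, so that identifying attributes are stored apart from the attributes passed on to clustering.

```python
# Closed age ranges from Sect. 3.1; ages above the last range fall into "65+".
AGE_GROUPS = [(0, 4), (5, 14), (15, 24), (25, 44), (45, 64)]


def generalize_age(age):
    """Map an exact age to its age-group label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "65+"


def vertical_partition(record, identifier_fields):
    """Split one record by column into (identifying part, analysis part)."""
    identifiers = {k: v for k, v in record.items() if k in identifier_fields}
    analysis = {k: v for k, v in record.items() if k not in identifier_fields}
    return identifiers, analysis


# Hypothetical patient record (not study data).
record = {"name": "Jane Doe", "ssn": "000-00-0000", "age": 37, "hiv_status": "negative"}
identifiers, analysis = vertical_partition(record, {"name", "ssn"})
analysis["age"] = generalize_age(analysis["age"])
print(analysis)
```

Only the `analysis` part, with the generalized age, would be handed to the clustering stage; the `identifiers` part can be held under stricter access control.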

3.4 Predictive Data Analytics

Data analytics techniques are the major step in the decision-making process for large-scale databases, where algorithmic power is used for predictive modelling. Determining the appropriate data mining technique for discovering patterns in large-scale databases is essential for decision making. Data mining techniques fall into two categories: descriptive and predictive. In the current approach we used the predictive technique to discover clusters of variable size and number. The overall results were obtained with MATLAB 7.1 and STATA. The K-means clustering technique was applied to retrieve results for future decision making. K-means clustering works by repeatedly recomputing the mean value of each cluster, with K as the number of clusters. The initial centres must be chosen thoughtfully, as different locations of centres lead to different results; a good choice is to place them as far away from each other as possible. The next step takes each point of the dataset and links it to the nearest centre. Once all the points have been associated with centres, the primary grouping is complete. New centroids are then computed from the clusters obtained in the previous step, and the same data points are linked to the new centroids. This step is repeated until no assignments change, that is, until the centres no longer move. The procedure minimizes the following objective function:

$$M(O) = \sum_{i=1}^{D} \sum_{j=1}^{D_i} \left\| p_i - n_j \right\|^{2}$$

  • \(\left\| p_i - n_j \right\|^{2}\) = squared Euclidean distance between \(p_i\) and \(n_j\).

  • \(D_i\) = number of data points in the \(i\)th cluster.

  • \(D\) = number of cluster centres.
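The assignment and update steps described above, together with the objective M(O), can be sketched in a few lines of Python. This is an illustrative implementation of the standard k-means iteration with caller-supplied initial centres, not the MATLAB or STATA code used in the study; the function names and the toy points are assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centres):
    """Link every point to its nearest centre; returns one list of points per centre."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: squared_distance(p, centres[i]))
        clusters[nearest].append(p)
    return clusters


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, centres, max_iter=100):
    """Alternate assignment and centroid update until the centres stop moving."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        clusters = assign(points, centres)
        # An empty cluster keeps its previous centre.
        updated = [mean_point(cl) if cl else c for cl, c in zip(clusters, centres)]
        if updated == centres:
            break
        centres = updated
    return centres, assign(points, centres)


def objective(centres, clusters):
    """Sum of squared distances from each point to the centre of its cluster."""
    return sum(squared_distance(p, c) for c, cl in zip(centres, clusters) for p in cl)


# Hypothetical 2-D points and initial centres (not study data).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centres, clusters = k_means(points, [(1.0, 1.0), (9.0, 8.0)])
print(centres, objective(centres, clusters))
```

Running the function from several different initial centres and keeping the result with the smallest objective is a common way to reduce the sensitivity to initialisation noted in the text.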

The simple k-means clustering algorithm clusters data using different distance-based methods, which include the Euclidean, Manhattan, Chebyshev, Filtered and Minkowski distance methods. Each method calculates the distance between two points in a different way. Each therefore gives different clusters, and the quality of the clusters obtained differs from dataset to dataset.
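For reference, four of the distance measures named above can be written directly from their textbook definitions, as in the sketch below. The function names are illustrative, and the Filtered distance is omitted because the text does not define it.

```python
from math import sqrt


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def minkowski(p, q, r):
    """Generalised distance of order r; r=1 gives Manhattan, r=2 gives Euclidean."""
    if r < 1:
        raise ValueError("order r must be at least 1")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q), minkowski(p, q, 3))
```

For the same pair of points the measures give different values, which is why the nearest centre, and hence the resulting clusters, can change with the chosen distance.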

Knowledge Discovery and Decision making: This is the final process, in which the discovered knowledge is interpreted together with the predicted results for future intervention policy or decision making. The outcome of the research shows whether the identified model is capable of determining realistic knowledge or is unable to achieve the expected results. Retrospective factors can then be identified to support iterative and judgment-based decision making. If the results are incompatible or inconsistent, the process can be repeated.

4 Results

The data was collected for HIV and TB patients to determine the overall correlated patterns underlying the prognosis of disease. The data consist of records of 179,625 patients, which were correlated with socio-economic factors including race, age, gender, HIV status and others from 1993 to 2014 [23, 24, 40].

To implement privacy preservation, the dataset to be analysed was first partitioned vertically. The attributes which were found relevant for clustering were year of diagnosis, age, race, HIV status and vital status. The attribute year of diagnosis was generalised under the ranges 2000–2002, 2003–2005 and 2006–2010. The attribute age was generalised under the age groups 15–24, 25–44, 45–64 and 65+, which were also in conjunction with the clustering strategy.

To evaluate the weight of the shortlisted attributes, we computed the precision value, given by Precision = \(C_i / T\), where \(C_i\) is the measure of correctly extracted information and \(T\) is the total information that was extracted without any partitioning. The precision values are shown in Table 1.

Table 1 Precision accuracy of extracted attributes
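The precision measure defined above is a simple ratio and can be computed as in the sketch below. The counts used in the example are hypothetical and are not the values reported in Table 1.

```python
def precision(correct, total):
    """Precision = C_i / T: correctly extracted information over total extracted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Hypothetical counts for one attribute (not values from Table 1).
example = precision(correct=45, total=50)
print(f"{example:.2f}")
```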

To test and compare the K-anonymization, the attributes were queried at different levels of generalization. Fig. 2 shows the varying levels of precision in the query results for anonymity threshold K, where 0 ≤ K ≤ 90. As can be seen, the precision of the results falls sharply at high levels of anonymization, reflecting the information loss. For maximal results, we limited the anonymity level to an optimal level of L = 2.

Fig. 2 Precision at different levels of privacy anonymization
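A table satisfies k-anonymity when every combination of quasi-identifier values is shared by at least k records. The check can be sketched as below; the record layout and field names are assumptions made for illustration, and the generalized ranges follow those given earlier in this section.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical generalized records (not study data).
records = [
    {"age_group": "25-44", "year_range": "2003-2005", "hiv_status": "negative"},
    {"age_group": "25-44", "year_range": "2003-2005", "hiv_status": "positive"},
    {"age_group": "45-64", "year_range": "2006-2010", "hiv_status": "negative"},
    {"age_group": "45-64", "year_range": "2006-2010", "hiv_status": "negative"},
]
print(is_k_anonymous(records, ["age_group", "year_range"], 2))
print(is_k_anonymous(records, ["age_group", "year_range"], 3))
```

Raising k generally forces coarser generalization before the check passes, which is consistent with the loss of precision at high anonymization levels described above.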

The data was then analysed for HIV and tuberculosis co-infection to determine the interrelated factors for discovery of knowledge. The prognosis of disease was measured as the number of occurrences relative to the total population under consideration. A similar approach was used for the prognosis of TB with HIV: to calculate the proportion of patients suffering from both HIV and TB, the incidence of TB with HIV was divided by the total population under consideration. The relative incidence rate of TB with HIV with respect to TB without HIV was calculated by dividing the former incidence rate by the latter. Mortality rates were calculated for all the above variables in the same way. Finally, using this refined data, classification models were constructed using the J-48 decision tree, and the data was grouped into clusters.
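The rate calculations described above amount to two divisions, sketched below. The scaling to cases per 100,000 population is a common reporting convention assumed here and is not stated in the text, and the counts are hypothetical.

```python
def incidence_rate(cases, population, per=100_000):
    """Cases divided by population, expressed per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population


def relative_incidence(rate_a, rate_b):
    """Ratio of two incidence rates, e.g. TB with HIV over TB without HIV."""
    if rate_b == 0:
        raise ValueError("reference rate must be non-zero")
    return rate_a / rate_b


# Hypothetical counts (not study data).
population = 1_000_000
rate_with_hiv = incidence_rate(20, population)
rate_without_hiv = incidence_rate(80, population)
ratio = relative_incidence(rate_with_hiv, rate_without_hiv)
print(rate_with_hiv, rate_without_hiv, ratio)
```

Mortality rates follow the same pattern, with the number of deaths in place of the number of cases.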

Table 2 lists clusters of varied shapes and sizes; 4 clusters were obtained through the simple k-means clustering algorithm using the Euclidean distance method. Cluster 0 mostly contained population between 45 and 64 years of age, of Hispanic or Latino race, with negative HIV status and most of them alive. Cluster 1 contained population of age group 65+, of White race, most of whom were dead. Cluster 2 consisted of population 25–44 years of age with negative HIV status, most of whom were also alive. Finally, cluster 3 consisted of population 15–24 years of age, with negative HIV status and also alive.

Table 2 Clusters obtained through simple k-means clustering analysis

About 179,625 tuberculosis cases were reported in the United States from 1993 to 2014, of which 18% (i.e. 32,636) were diagnosed with HIV. The incidence of TB cases initially increased from 1993 (7329 cases) to 1996 (8744 cases), an increase of about 19.3%. From 1996 to 2000, TB cases declined by about 7.6%, with 8072 cases reported in 2000. From 2000 to 2004, TB incidence increased by about 5%, with 8476 cases reported in 2004. From 2004 to 2009 the trend was downward: TB cases decreased by 12.8%, with only 7383 cases reported in 2009. The period 2009–2011 witnessed a surge of 18.9%, with 8784 cases reported in 2011. From 2012 to 2014, TB cases declined gradually, with only 8173 cases reported in 2014, a 6.9% decrease. Figure 3 represents the overall clusters retrieved.

Fig. 3 Number of TB cases in each year
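The year-to-year percentages quoted above follow from a simple percent-change calculation, sketched below using the case counts stated in the text. The function and variable names are assumptions introduced for illustration. Because the 2012 count is not stated, the last interval is computed from 2011. The computed values agree with the quoted percentages to within about 0.1 percentage point.

```python
def percent_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# TB case counts for the years quoted in the text.
cases = {
    1993: 7329,
    1996: 8744,
    2000: 8072,
    2004: 8476,
    2009: 7383,
    2011: 8784,
    2014: 8173,
}

years = sorted(cases)
changes = {
    (start, end): percent_change(cases[start], cases[end])
    for start, end in zip(years, years[1:])
}

for (start, end), change in changes.items():
    print(f"{start}-{end}: {change:+.2f}%")
```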

In Fig. 4, the overall increase in the prognosis of TB from 1993 to 2014 is measured using STATA; the data show a steep rise in 2011. The study then examined age-wise trends in TB patients by year. The results showed that different age groups have different susceptibility to TB, with a clear difference in the number of cases reported for each age group. The 0–4 years age group accounted for only 2% of the TB cases reported in the US from 1993 to 2014. The 5–14 years age group represented an even smaller proportion, only 1% of all TB patients. The 15–24 years age group (young adults) comprised about 11% of all TB patients. The 25–44 years age group (the middle-aged population) comprised the largest proportion, with almost 42% of all reported TB cases. The 45–64 years age group (older adults) comprised about 30% of all TB cases. People aged 65+ comprised 14% of TB cases.

Fig. 4 Clusters correlated age and year wise

In Fig. 5, the graph shows the high percentage of patients in the 25–44 age group, for whom substantial policies should be defined at the managerial level to lower the prognosis of the disease.

Fig. 5 Clusters of TB cases in each race

Further, the data were analysed by race, since the races inhabiting the US are affected unequally, with some races affected by TB more than others. The results showed that the Asian, Non-Hispanic; Black or African American; and Hispanic or Latino populations suffered from TB at a higher rate than the Multiple Race, Non-Hispanic; Native Hawaiian or Other Pacific Islander, Non-Hispanic; and White, Non-Hispanic populations. The American Indian or Alaska Native, Non-Hispanic population suffered at a lower rate than the others. Fig. 4 presents a race-wise illustration.

Fig. 6 depicts socio-economic trends in relation to drug use among TB patients. The results indicated that drug use (injecting or non-injecting) or alcohol use does not have a significant effect on the vital status of TB patients, as the graph obtained from the cluster analysis did not show any clear pattern indicating such an effect. The distribution of patients who died and those who were alive appeared random across all years.

Fig. 6 Vital status of TB patients with drug use and alcoholism in each year

Similarly, the study aimed to determine the factors correlating TB with HIV and the socio-economic factors associated with vital status (dead or alive). The study was therefore extended to identify synchronized patterns for knowledge discovery in large-scale databases. The outcome showed a definite trend in the number of HIV–TB cases from 1993 to 2014: the data exhibited a regular decline in the number of HIV–TB patients over this period. In 1993, HIV–TB patients comprised about 50% of all TB-infected patients. With each passing year the number of HIV–TB patients kept decreasing steadily, reaching only 6% in 2014. Table 3 presents the overall rate of TB and HIV patients with their alive and dead status by year.

Table 3 Percentage of HIV–TB and non-HIV TB cases, with vital status for both groups
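The size of the decline in the HIV–TB share can be made concrete from the two endpoints quoted above (about 50% in 1993 and 6% in 2014). The sketch below assumes a straight-line change between those endpoints purely for illustration; the function names are hypothetical and the intermediate years are not modelled.

```python
def average_annual_change(start_year, start_value, end_year, end_value):
    """Average change per year between two observations (straight-line assumption)."""
    if end_year == start_year:
        raise ValueError("years must differ")
    return (end_value - start_value) / (end_year - start_year)


def relative_change(start_value, end_value):
    """Overall change as a percentage of the starting value."""
    if start_value == 0:
        raise ValueError("start value must be non-zero")
    return (end_value - start_value) / start_value * 100.0


# HIV-TB share of all TB cases, in percent, as quoted in the text.
share_1993 = 50.0
share_2014 = 6.0

per_year = average_annual_change(1993, share_1993, 2014, share_2014)
overall = relative_change(share_1993, share_2014)

print(f"Average change: {per_year:.2f} percentage points per year")
print(f"Overall relative change: {overall:.1f}%")
```

Under this assumption the share falls by roughly two percentage points per year, an overall relative reduction of 88%.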

Likewise, Fig. 7 presents the year-wise cluster results for the total number of HIV and non-HIV patients suffering from TB.

Fig. 7 HIV and non-HIV TB cases

However, mortality rates in HIV–TB patients were observed to be high compared with non-HIV TB patients. The data show a mortality rate of 6% in patients suffering from both HIV and TB, compared with a death rate of only 1% in non-HIV TB patients. In Fig. 8, year-wise clusters were derived, with two clusters determined by dead and alive status: cluster 0 represents the percentage of patients who were alive, and cluster 1 the percentage who were dead, by year. The results indicated that with each passing year death rates in HIV–TB patients declined in a linear fashion, with only a 2% death rate recorded in 2014, while the death rate of non-HIV TB patients declined from 2% in 1993 to 1% in 2014.

Fig. 8 Cluster year wise vital status in HIV and non HIV TB cases

In Fig. 9, the statistical representation of HIV–TB patients shows the decline in mortality rate from 1993 to 2014, during which it fell sharply from 6% to 2%.

Fig. 9 Year wise vital status of HIV TB cases

Table 4 depicts the trend for patients suffering from both HIV and TB across age groups. The results suggested that a high percentage of HIV–TB cases belonged to two age groups: 25–44 (middle-aged adults), with 28.2% of HIV–TB cases out of total HIV–TB cases, and 45–64 (older adults), with 17.2% of HIV–TB cases out of total HIV–TB cases.

Table 4 Age wise % of HIV and non-HIV TB cases with vital status

In Fig. 10, the cluster analysis shows that the age group most affected by HIV–TB is 25–44 years.

Fig. 10 Cluster age wise total HIV and non HIV TB cases

Additionally, results were obtained for age-wise vital status in both HIV–TB and non-HIV TB cases. Overall, the data suggested a low mortality rate among TB patients. Mortality nevertheless varied across age groups, with the highest rates in the 65+ age group for both HIV–TB (mortality = 9.3%) and non-HIV TB (mortality = 2.1%) cases, and the lowest rates in the 5–14 age group. In general, HIV–TB patients showed much higher mortality than non-HIV TB patients in all age groups. Figure 11 presents two clusters: cluster 0 (blue) represents the age-wise status for alive patients, and cluster 1 represents the dead status by age group.

Fig. 11 Cluster age, vital status in HIV and Non HIV TB cases
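One way to compare the two groups is a simple rate ratio, sketched below with the mortality percentages quoted in this section (9.3% versus 2.1% for the 65+ age group, and 6% versus 1% overall). The function name is an assumption introduced here, and the ratio is a descriptive illustration only, not a statistical test reported by the study.

```python
def rate_ratio(group_rate: float, reference_rate: float) -> float:
    """How many times higher group_rate is than reference_rate."""
    if reference_rate == 0:
        raise ValueError("reference rate must be non-zero")
    return group_rate / reference_rate


# Mortality percentages quoted in the text (HIV-TB vs non-HIV TB).
ratio_65_plus = rate_ratio(9.3, 2.1)
ratio_overall = rate_ratio(6.0, 1.0)

print(f"65+ age group: {ratio_65_plus:.1f} times higher in HIV-TB cases")
print(f"Overall:       {ratio_overall:.1f} times higher in HIV-TB cases")
```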

For non-HIV TB cases, the age-wise breakdown of vital status (dead or alive) shows a considerable rise in the mortality rate in the 65+ age group.

In Fig. 12, mortality in HIV–TB cases appears higher across all the age groups discussed. However, patients older than 45 years are more likely to have a higher mortality rate than other age groups.

Fig. 12 Age wise vital status of HIV TB cases

Table 5 presents a discriminant analysis in which the incidence of HIV–TB differs significantly among the races living in the US, with some races affected by HIV–TB more often than others. The results show that HIV–TB is more common in “Black or African American” (29.3%), “Hispanic or Latino” (17.1%) and “White, Non-Hispanic” (12.8%) populations, whereas it is much less common in “Asian, Non-Hispanic” (2.6%), “Multiple Race, Non-Hispanic” (5.0%), “Native Hawaiian or Other Pacific Islander, Non-Hispanic” (2.6%) and “American Indian or Alaska Native, Non-Hispanic” (5.3%) populations.

Table 5 Race, vital status of total HIV and non-HIV TB cases

In Fig. 13, clustering analysis was performed for different races to measure the mortality rate among HIV–TB and non-HIV TB cases. Cluster 0 (blue) represents the race-wise status for alive patients, and cluster 1 represents the dead status by race. The analysis of TB cases and survival rates revealed that mortality due to the disease differed between races. As the previous results suggested, HIV–TB patients again showed higher mortality than non-HIV TB patients. The race-wise analysis also showed that, among non-HIV TB cases, the “American Indian or Alaska Native, Non-Hispanic” race showed the maximum mortality (1.5%), whereas the “Asian, Non-Hispanic” and “Native Hawaiian or Other Pacific Islander, Non-Hispanic” races showed the minimum mortality (0.3%).

Fig. 13 Cluster with race, vital status of HIV and non-HIV TB cases

In Fig. 14, HIV–TB cases showed altogether different trends in mortality rates: “White, Non-Hispanic” showed the maximum mortality (5.3%), followed by “Black or African American, Non-Hispanic” (4.6%), “American Indian or Alaska Native, Non-Hispanic” (4.4%) and “Hispanic or Latino” (3.7%), whereas “Native Hawaiian or Other Pacific Islander, Non-Hispanic” showed the minimum mortality (0%), followed by “Asian, Non-Hispanic” (2.4%) and “Multiple Race, Non-Hispanic” (2.8%).

Fig. 14 Vital status of HIV and TB cases
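The ordering of races by HIV–TB mortality described above can be reproduced with a short sort over the percentages quoted in the text. This is only an illustrative sketch; the variable names are assumptions and no additional data are introduced.

```python
# Mortality percentages for HIV-TB cases by race, as quoted in the text.
hiv_tb_mortality = {
    "White, Non-Hispanic": 5.3,
    "Black or African American, Non-Hispanic": 4.6,
    "American Indian or Alaska Native, Non-Hispanic": 4.4,
    "Hispanic or Latino": 3.7,
    "Multiple Race, Non-Hispanic": 2.8,
    "Asian, Non-Hispanic": 2.4,
    "Native Hawaiian or Other Pacific Islander, Non-Hispanic": 0.0,
}

# Sort from highest to lowest mortality.
ranked = sorted(hiv_tb_mortality.items(), key=lambda item: item[1], reverse=True)

for race, rate in ranked:
    print(f"{rate:4.1f}%  {race}")

highest = ranked[0]
lowest = ranked[-1]
```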

5 Conclusion

Big data offers potential opportunities in healthcare application areas owing to IT-based technological interventions. The technical challenge arises from obstacles caused by lapses in the security and privacy of data. In this paper, we discuss big data analytics with privacy and security in the context of healthcare databases for patients suffering from HIV and TB. The paper briefly discusses related work on big healthcare databases in the context of predictive data analytics with privacy and security. The current study focuses on detecting patterns in healthcare databases while maintaining the privacy of the data, and on generating patterns for future clinical decision making. Further, the study detects patterns that can generate knowledge while maintaining data privacy, so that no personal information was analysed to generate the patterns.

A novel framework was designed to capture big data effectively and efficiently from various resources while maintaining the privacy and security of the data, and to detect hidden patterns for clinical decision making. In general, the study focuses on big data analytics as an imperative technology for generating outcomes that reduce the global burden of disease while preserving patients’ personal information. Hence, we found that big data analytics techniques are bound to have an extensive vision, in which the paradigm is to generate and develop sustainable tools for varied application domains; the focus of this study is on healthcare databases. In the technological context, electronic databases have advanced hugely in volume and complexity, and knowledge discovery has emerged as a powerful tool for analysing big data.

The objective of this study was to observe big data analytical trends for patients suffering from TB–HIV in the US population from 1993 to 2014, and to enable techniques that preserve data privacy at each level, thereby providing a secure platform for knowledge discovery. We therefore observed correlated patient patterns without compromising patients’ personal data: personal information was hidden, and the associated patterns were analysed by socio-economic factors, age groups and races in order to understand the factors underlying the prognosis of HIV–TB. The results showed that incidence levels across years follow no uniform trend. However, a clear trend was observed in incidence levels across age groups, as the 25–44 and 45–64 age groups were the most affected of all ages. Race-wise analysis showed that “Black or African American” and “Hispanic or Latino” populations had the maximum incidence compared with other races in the US.

The HIV–TB coinfection analysis suggests that HIV–TB coinfection has decreased significantly, from 50% of total TB cases in 1993 to only 6% of all TB cases in 2014. Although most patients infected with TB received treatment in time and survived, the results also showed that HIV–TB coinfection caused more mortality than non-HIV TB cases. It was also inferred from the results that mortality in HIV–TB cases declined from 1993 to 2014.