
1 Introduction

The term big data refers to extremely large data sets that may be structured, semi-structured, or unstructured. Their sheer size makes processing with traditional databases and software technologies difficult; instead, massively parallel software running on thousands of servers can be used for processing [1].

Big data was initially characterized by four Vs: volume, variety, velocity, and veracity. Over time, further characterizations have been added, as the volume of data keeps growing with each passing day [2, 3].

1.1 Ten Vs of Big Data

Big data is mainly characterized by a set of Vs that define the different traits of the data one is dealing with. The figure below describes the ten significant Vs that distinguish big data from traditional data.

Volume. The quantity of generated data is referred to as volume. The value and potential of the data are examined through its size, and the size also determines whether the data can be categorized as big data at all. This huge increase in scale makes analysing the data with traditional tools a difficult process.

Variety. The category to which the data belongs is known as variety. Data can come in any form: structured, unstructured, or semi-structured, originating from various sources such as e-mails, videos, audio, transactions, etc.

Velocity. Velocity refers to how fast the data is generated and processed, i.e. the speed with which information is gathered and processed to meet the demands of its intended users.

Veracity. Veracity refers to the trustworthiness of the data being used. The correctness of any analysis depends heavily on the veracity of the source data, and the quality of captured data can differ immensely.

Variability. Variability refers to several things. It concerns adequately understanding and interpreting the correct meaning of raw data, which depends on its context. It can also be defined as the inconsistency of the speed at which data is generated and stored.

Validity. Validity means how accurate the data is for its specified use. If the results are to be used in decision-making, the subsequent analysis must be sufficiently accurate.

Vulnerability. Vulnerability means a flaw that can leave a system open to attack. It may also refer to any lapse in a computer system, in a set of procedures, or in anything else that undermines the security of the system.

Volatility. In the world of real-time analysis, it is important for decision-makers to know for how long the available data remains relevant. This duration of data validity is known as volatility.

Visualization. Data visualization refers to presenting data in a graphical format that is easily understood and interpreted by its users. It includes complex representations such as heat maps and fever charts that help decision-makers identify hidden patterns and correlations.

Value. Value is the most important trait of the data: the other characteristics are of no use if no business value can be deduced from the data. Big data supports organizational decision-making by measuring the importance of the data.

1.2 Big Data Analytics

Big data analytics means exploring large data sets containing a variety of datatypes to reveal hidden data patterns, unknown interactions, market trends, customer preferences, and other useful business information. Its first goal is to make companies more knowledgeable by enabling data scientists. Big data comprises structured, semi-structured, and unstructured data. Tools used for advanced analytics, such as predictive analysis, data mining, and text analysis, can be used in big data analysis as well. Data visualization tools, along with some mainstream BI tools, can also be very effective in the analysis process of big data [4].

Big data analytics life cycle—Since vast data repositories are used to gain information for analytics purposes, the available data needs to be refined. The refinement process includes the steps defined in Fig. 1.

Fig. 1 Phases of big data analytics

Multiple threats need to be taken care of in all the above phases.

Data Collector. Data comes from various sources and in different formats, i.e. structured, semi-structured, and unstructured. In this phase, information is gathered that an organization can use for various purposes. From the security point of view, securing big data from this first phase onwards is very important. Limited access control and encryption of data fields can be applied here to ensure privacy [5].
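As a minimal sketch of field-level encryption at the collection phase, the snippet below encrypts only the sensitive fields of an incoming record before it is forwarded to storage. The field names and the use of the Fernet construction from the Python cryptography package are illustrative assumptions, not part of any scheme cited in this chapter.

```python
# Minimal sketch: encrypt selected sensitive fields at the data-collection phase.
# Field names are illustrative; keys would come from a key-management service.
from cryptography.fernet import Fernet

SENSITIVE_FIELDS = {"name", "phone", "email"}   # assumed sensitive fields

key = Fernet.generate_key()                     # illustrative in-process key
cipher = Fernet(key)

def collect(record: dict) -> dict:
    """Encrypt sensitive fields of a collected record before it is stored."""
    protected = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            protected[field] = cipher.encrypt(str(value).encode()).decode()
        else:
            protected[field] = value
    return protected

incoming = {"name": "Alice", "phone": "555-0100", "reading": 42.7}
print(collect(incoming))                        # sensitive fields appear as ciphertext
```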

Data Storage. Data storage mainly addresses the volume challenge by making use of distributed, shared-nothing architectures. Data is stored and prepared here for use in the next phase. The data produced here may be sensitive, so it is vital to protect it. Data anonymization, permutation, data partitioning, etc., are some techniques that can be applied to ensure security [6].
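To illustrate how a shared-nothing store can spread records across nodes, the following sketch hashes a record key to pick a storage node deterministically. The cluster size and the partition key are assumptions made purely for illustration.

```python
# Minimal sketch of hash-based data partitioning for a shared-nothing store.
import hashlib

NUM_NODES = 4   # assumed cluster size

def partition(record_key: str) -> int:
    """Map a record key deterministically to one of NUM_NODES storage nodes."""
    digest = hashlib.sha256(record_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

for key in ["user-101", "user-102", "user-103"]:
    print(key, "-> node", partition(key))
```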

Data Analytics. The primary aim of the big data analysis process is to extract the significant data from the mass of collected data and to provide decisions and recommendations based on the findings after investigating the whole data set. This phase is used to create knowledge, and various data mining methods can be applied here. Data miners use powerful algorithms that can extract sensitive data, so a security breach may also happen here [7].

Knowledge Creation. This is the final phase, in which the data is converted into useful information. If data capture and sensing are done right, big data repositories can be turned into knowledge repositories used by decision-makers. New and valuable information is created here, and the resulting knowledge is sometimes itself considered sensitive [8].

Threats Associated with Big Data Life Cycle

Various threats associated with different phases of the big data life cycle have been summarized in Table 1 [2].

Table 1 Threats on various phases of big data analytics

1.3 Applications of Big Data

In the present era, the use of the Internet has expanded rapidly. Because the Internet is used anywhere and everywhere, big data applications have also increased thanks to their decision-making ability. Big data is no longer just a buzzword; it is used everywhere today. Credit goes to technology that is nowadays not confined to urban areas: rural and underdeveloped areas are also taking advantage of it. Big data applications range across water supply, smart cities, crime, health care, education, electricity, etc. (Fig. 2).

Fig. 2 Various applications of big data

1.4 Challenges with Big Data

Generated by different devices at a very fast pace, big data brings the following challenges with it:

Security. Data is generated at a high pace and in huge quantities every day. Big data analytics will not be considered a reliable system if security algorithms are not taken into account. Security issues can be further categorized into input, data analysis, output, and system communication.

Inconsistent Data. More inconsistent and incompatible data will easily appear since data is gathered from different systems, so handling it is also a challenge when doing big data analytics.

Privacy. Privacy is different from the issue of security: it deals with whether personal information can be reconstructed with the help of big data analytics even though the input variables are anonymous. With big data analytics being widely used, it is quite possible that private information may get exposed to other people after the analysis process, so this is also a challenge in big data analytics.

Heterogeneity. Insights can be obtained from the richness and nuances of the data. However, machine algorithms cannot understand nuances, as they expect comparable data, so structuring the data carefully is the first step of big data analysis. Even after applying data cleaning and data correction methods, some incompleteness may remain, and managing this is a great challenge.
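As a small illustration of the structuring step just described, the sketch below normalises field names and marks missing values explicitly before analysis. The expected fields and the fill strategy are assumptions for the example only.

```python
# Minimal sketch of structuring heterogeneous records before analysis:
# normalise field names and mark missing values explicitly.
EXPECTED_FIELDS = ["patient_id", "age", "temperature"]   # assumed schema

def clean(record: dict) -> dict:
    """Lower-case field names and fill absent fields with None."""
    normalised = {k.strip().lower(): v for k, v in record.items()}
    return {field: normalised.get(field) for field in EXPECTED_FIELDS}

raw_records = [
    {"Patient_ID": "P1", "Age": 34, "Temperature": 37.2},
    {"patient_id": "P2", "temperature": 38.9},            # age missing
]
print([clean(r) for r in raw_records])
```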

Timeliness. In a dynamic and rapidly growing world, a second or even a microsecond between one reading and the next may lead to readings that no longer match each other, so timeliness is a fundamental concern when dealing with real-time data.

Communication Between the Systems. Since most tasks of a big data analytics system are designed for parallel computing, the communication between the big data analytics system and other systems immensely impacts overall performance. Managing the cost of communication and making connections reliable are therefore two open challenges to deal with [2].

2 Big Data Security: A Multifaceted Challenge

Big data security is a cumulative term for all the techniques and tools used to secure data against malicious activities such as data theft, attacks, or any activity with a negative effect. Other threats include DDoS attacks, ransomware, and the stealing of information stored online [9, 10].

The prime reason for security concerns in big data is that big data can be accessed widely nowadays. Data is shared on a large scale by scientists, doctors, business officials, government agencies, and ordinary people. The current approaches are inadequate when dealing with big data security: present technology has weak security capabilities for maintenance, so intruders can easily breach it. Thus, reassessment and updating of current approaches should be performed to prevent data leakage [11]. There are various challenges when dealing with big data security; a few of them are mentioned below.

2.1 Issues

Vulnerability to fake data generation—Before dealing with all the operational security challenges of big data, the concern of counterfeit data generation should be kept in mind. To purposively undermine the quality of big data analysis, cybercriminals can forge the data. For instance, if a manufacturing company uses sensor data to detect malfunctioning production processes, cybercriminals can penetrate that system and make sensors report fake results, say, wrong temperatures. This way, one can fail to notice alarming trends and miss the opportunity to solve problems before they cause severe damage. Such issues can be addressed by applying a fraud detection approach [12].
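One simple way to realise such a fraud detection approach is to flag sensor readings that deviate sharply from recent history. The sketch below uses a z-score test; the threshold and the history window are illustrative assumptions rather than a prescribed method.

```python
# Minimal sketch of an anomaly check for incoming sensor readings:
# flag values more than three standard deviations from recent history.
import statistics

def is_suspicious(history, new_value, z_threshold=3.0):
    """Return True when new_value deviates strongly from the recent readings."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) / stdev > z_threshold

recent_temperatures = [70.1, 69.8, 70.3, 70.0, 69.9]
print(is_suspicious(recent_temperatures, 70.2))   # False: plausible reading
print(is_suspicious(recent_temperatures, 20.0))   # True: likely forged or faulty
```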

Untrusted mapper's presence—After collection, big data first undergoes parallel processing. The MapReduce paradigm is used here: when the data is split, a mapper processes each split and allocates a storage position to the data. If an outsider knows your mapper's code, they can change it and thereby corrupt the processed information very effectively. Outsiders can thus get inside and access sensitive information [13].
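A lightweight mitigation is to verify the mapper's code against a known checksum before the map step runs. The sketch below shows the idea with a toy word-count mapper; in practice the expected digest would come from a trusted registry rather than being computed at startup as is done here purely for the demo.

```python
# Minimal sketch of guarding against a tampered mapper: verify a checksum of the
# mapper's source before running the map step. Toy example, not a full MapReduce job.
import hashlib
import inspect

def mapper(line: str):
    """Toy mapper: emit (word, 1) pairs, as in word count."""
    return [(word, 1) for word in line.split()]

# Assumed to be supplied by a trusted registry; computed here only for the demo.
EXPECTED_DIGEST = hashlib.sha256(inspect.getsource(mapper).encode()).hexdigest()

def run_map(lines, map_fn):
    digest = hashlib.sha256(inspect.getsource(map_fn).encode()).hexdigest()
    if digest != EXPECTED_DIGEST:
        raise RuntimeError("mapper code has been modified; refusing to run")
    results = []
    for line in lines:
        results.extend(map_fn(line))
    return results

print(run_map(["big data security", "big data analytics"], mapper))
```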

Mining of sensitive information—Perimeter-based security ensures data protection at the entry and exit levels, but inside the system the work of IT professionals is opaque. Such a lack of control over big data solutions can allow corrupted IT professionals to mine the data and sell it for their own benefit, and as a result the organization will suffer huge losses. Here, data can be made more secure by adding additional safeguards to it. Anonymization can also benefit the system's security: private details with names, telephone numbers, etc., removed will cause practically no harm if someone acquires this information with malicious intentions [14].
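The sketch below illustrates the anonymization idea mentioned above: direct identifiers are dropped and a quasi-identifier is coarsened before the record reaches analysts or miners. The field lists are assumptions chosen for the example.

```python
# Minimal sketch of anonymising records before they reach data miners:
# drop direct identifiers and coarsen a quasi-identifier.
DIRECT_IDENTIFIERS = {"name", "telephone", "email"}   # assumed identifier fields

def anonymise(record: dict) -> dict:
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip_code" in out:                              # coarsen a quasi-identifier
        out["zip_code"] = str(out["zip_code"])[:3] + "**"
    return out

patient = {"name": "Bob", "telephone": "555-0111",
           "zip_code": "47906", "diagnosis": "flu"}
print(anonymise(patient))   # {'zip_code': '479**', 'diagnosis': 'flu'}
```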

Real-time protection of data—It is hard for organizations to maintain regular checks because data is generated at great volume in real time. However, security checks performed in real time, or nearly in real time, will prove beneficial [15].

Access control granularly—Granular access control allows people to access the required data sets while letting them view only the part of the data they are allowed to see; the rest of the valuable content is not visible to them. It can be very useful in health care, where sensitive information such as names and phone numbers remains hidden while other information stays available to medical researchers looking for new insights [16].
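A minimal sketch of such granular, field-level access control is shown below: each role sees only the fields it is permitted to view. The roles and the field permissions are assumptions made for illustration.

```python
# Minimal sketch of granular access control: filter record fields by role.
FIELD_PERMISSIONS = {
    "medical_researcher": {"age", "diagnosis", "treatment"},
    "billing_clerk": {"name", "phone", "invoice_total"},
}

def view(record: dict, role: str) -> dict:
    """Return only the fields the given role is allowed to see."""
    allowed = FIELD_PERMISSIONS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

record = {"name": "Carol", "phone": "555-0122", "age": 51,
          "diagnosis": "hypertension", "treatment": "ACE inhibitor",
          "invoice_total": 120.0}
print(view(record, "medical_researcher"))   # identifiers stay hidden
print(view(record, "billing_clerk"))        # clinical details stay hidden
```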

Privacy protection of non-relational databases—Datastores such as NoSQL face various security vulnerabilities that lead to privacy threats. They are unable to encrypt the data at the time of logging and tagging, and the same holds for the distribution of data to different groups while it is streamed and collected [16].

3 Security-Based Literature Survey

Three major security considerations have been taken into account while dealing with big data: anonymization, encryption, and access control [17] (Fig. 3).

Fig. 3 Approaches towards big data security

Diversified data sources, data streams, data formats, and infrastructures may impose unique security vulnerabilities (Table 2).

Table 2 Summary of the literature survey

3.1 Existing Approaches to Handle the Big Data Security

Listed below are the different approaches to managing security as discussed by various authors.

Security by encryption—Encryption means encoding information so that only authorized users can access it. Li et al. proposed an algorithm to prevent cloud operators from reaching the user's sensitive data. It is an amalgamation of the AD2, SED2, and efficient data conflation algorithms, entitled the security-aware efficient distributed storage model [18]. Aljawarneh et al. proposed a system to protect multimedia big data against real-time tampered-data attacks. The proposed scheme merges a Feistel network, AES, S-Boxes, and a genetic algorithm, and it is applied over a data set from the JUST university hospital [19]. Yan et al. proposed a scheme based on deduplication of encrypted data and proxy re-encryption. Deduplication is an important practice for achieving successful cloud storage, especially for big data storage. The scheme allows only authorized users to access the information and supports flexible data updates offline as well [20]. Dong et al. presented a scheme for heterogeneous ciphertext transformation. It is a proxy algorithm that works on a virtual-based monitor which supports the realization of the system's functions. It is designed to protect and secure users' data effectively and also gives the data owner total control over their data for modern information security [21].
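As a generic illustration of the encryption-based direction, and not an implementation of any of the cited schemes, the sketch below encrypts a data block with authenticated AES-GCM before it leaves the data owner, using the Python cryptography package; key management is assumed to be handled elsewhere.

```python
# Minimal sketch of protecting a data block with authenticated AES-GCM encryption.
# Generic illustration only; key management is out of scope here.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # in practice, held by the data owner
aead = AESGCM(key)

def protect(plaintext: bytes, associated_data: bytes):
    nonce = os.urandom(12)                  # unique nonce per message
    return nonce, aead.encrypt(nonce, plaintext, associated_data)

def recover(nonce: bytes, ciphertext: bytes, associated_data: bytes) -> bytes:
    return aead.decrypt(nonce, ciphertext, associated_data)

nonce, ct = protect(b"patient record #17", b"owner=hospital")
print(recover(nonce, ct, b"owner=hospital"))
```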

Security by access control—Access control systems are one of the most important security components. Due to misconfiguration of access control policies, the security and privacy of a system are often compromised. Hu et al. have proposed a scheme for distributed big data processing clusters that aims to protect big data processing from internal attacks through authorization [22]. Wnorong et al. have introduced a mechanism for content access that is very suitable for content sharing of information in big data; CBAC is used to make access control decisions based on the semantic similarity between the requester's credentials and the content [15]. Siffah et al. proposed an off-chain-based sovereign blockchain in which transactions are made between parties through a virtual container, and the blockchain network is then used to store the output [23]. Kumar et al. proposed a scheme based on ciphertext-policy attribute-based encryption with low computation overhead [24]. Khuntia et al. proposed a scheme for privacy preservation in the cloud to ensure big data access control; to reduce computational overhead, the authors use the concept of multi-sharing [25]. Jasim et al. proposed a three-tier approach including cloud architecture, a transaction manager, and clients, with zero trust as the basis of communication between the models [26].
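In the spirit of CBAC, though not the cited algorithm itself, the sketch below grants access when the overlap between the requester's credential terms and the content's descriptive terms exceeds a threshold; the Jaccard measure and the threshold value are assumptions made for illustration.

```python
# Minimal sketch of a content-based access decision using set overlap.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two term sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def grant_access(credential_terms: set, content_terms: set, threshold: float = 0.3) -> bool:
    return jaccard(credential_terms, content_terms) >= threshold

requester = {"oncology", "research", "imaging"}
content = {"oncology", "imaging", "mri", "tumour"}
print(grant_access(requester, content))   # True: sufficient semantic overlap
```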

Security by anonymization of the data—Information privacy is control over the gathering and usage of private information, i.e. the ability of a group or an individual to stop information from becoming public. The assimilation of private information over the Internet during its transmission is one of the issues faced by users. Privacy protection is one of the most troubling issues in big data and cloud applications, so there is an urgent need for strong customer privacy-preservation techniques, and data anonymization is one of the efficient and effective ways towards privacy preservation [27]. Zhang et al. proposed a technique based on MapReduce on the cloud. A combination of a highly scalable median-finding algorithm and a histogram technique is used to achieve cost effectiveness, and scalability is also measured using multivariate partitioning [28]. Zhang et al. have pointed out the scalability issue in the cloud over big data. For this, a hybrid approach of top-down specialization and bottom-up generalization is used, and the k-anonymity parameter with workload sharing is used for selecting the component, achieving a highly scalable environment compared with existing approaches [29]. Al Zobi et al. have proposed a novel framework, MDSBA. According to the authors, the loss of important information results from avoidable generalization of identical details. Through the proposed scheme, the authors extend k-anonymity and apply a bottom-up approach to handle identical widespread records more methodically and efficiently [30]. Ferrer et al. have focused their work on two important issues that arise when using k-anonymity, i.e. the quasi-identifier attributes and the data controller's attributes, by proposing a k-anonymity algorithm that avoids the dimensionality problem and by using the mean and median, in place of the generalization method, as an alternative aggregation method that is comparatively less sensitive to the risk of disclosure [31]. Cui et al. proposed a deduplication-based system for attribute-based storage in a hybrid cloud. The authors also discuss ways to achieve semantic security while keeping in mind the context of confidentiality when sharing the data with other users [32]. Mehta et al. proposed a scheme named the improved scalable l-diversity approach based on k-anonymization, whose run-time is very low and whose information loss is also low in comparison with other schemes [33].
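The sketch below illustrates the basic k-anonymity idea underlying several of the cited works: quasi-identifiers are generalised (ages into bands, ZIP codes into prefixes) and the data set is accepted only when every generalised combination occurs at least k times. The attributes and generalisation rules are illustrative assumptions, not any of the cited algorithms.

```python
# Minimal sketch of checking k-anonymity after generalising quasi-identifiers.
from collections import Counter

def generalise(record: dict) -> tuple:
    """Coarsen age into a decade band and ZIP code into a 3-digit prefix."""
    decade = (record["age"] // 10) * 10
    return (f"{decade}-{decade + 9}", record["zip_code"][:3] + "**")

def is_k_anonymous(records, k: int) -> bool:
    counts = Counter(generalise(r) for r in records)
    return all(count >= k for count in counts.values())

data = [
    {"age": 34, "zip_code": "47906"},
    {"age": 36, "zip_code": "47901"},
    {"age": 52, "zip_code": "60601"},
    {"age": 57, "zip_code": "60605"},
]
print(is_k_anonymous(data, k=2))   # True under this generalisation
```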

4 Conclusion

Data is increasing with each passing moment over the Internet, making it impossible for traditional approaches to deal with it. Extracting the relevant information from the available bulky and raw data is the important task of big data analytics. However, while dealing with the data, security is the major threat faced by analysts. The present paper discusses some of the novel approaches that can be used to ensure the security of big data. Moreover, we have noted that not all present traditional schemes can be applied over big data as they stand, but with certain advancements the schemes can be improved and applied in the future.