1 Introduction

Modern e-Science infrastructure allows researchers to address new large-scale problems whose solutions were previously out of reach, e.g. genomics, climate and global warming [1, 2]. E-Science typically produces a large amount of data that must be supported by a new type of e-Infrastructure, which can store, distribute, process, preserve, and curate these data. We refer to this new infrastructure as the Scientific Data e-Infrastructure (SDI) [1, 3]. In e-Science, the scientific data are complex multifaceted objects with complex internal relations: they become an infrastructure of their own, which must be supported by corresponding physical or logical infrastructures to store, access and manage these data [4–7].

Typically, scientists must analyse terabytes or even petabytes of scientific data, comprising the existing input data, the intermediate data generated during processing, and the resulting data [8]. As reported by Szalay et al. in [9], the total amount of scientific data will double in the next decade. The production of scientific datasets involves large numbers of computationally intensive tasks, e.g. scientific workflows [10]. The generated datasets contain important intermediate or final results of the computation and must be stored as valuable resources [11]. Currently, most popular e-Science applications are deployed in grid systems, which offer high computational capacity and massive storage. However, building a grid system is extremely expensive, and the existing grid systems are devoted to their own specific applications, thus failing to serve more general scenarios [12].

Recently, the emergence of cloud computing technologies has offered a new way to develop scientific workflow systems [13]. This approach has been successfully employed in many areas [14]. Foster et al. make a thorough comparison between grid computing and cloud computing [15] and conclude that cloud computing has several advantages over grid computing. First, cloud computing can provide the high performance and massive storage required by data-intensive applications such as scientific workflows, as grid systems do, but at a relatively low infrastructure construction cost [16]. Second, it creates a new paradigm for gathering worldwide scientists for collaboration [13]: scientists can upload their data and set up their applications on the scientific workflow systems through the Internet [17].

With the development of cloud computing, the widely used scientific workflow systems can be adapted to cloud computing for even more promising applications. Cloud computing offers customers a more flexible way to obtain computation and storage resources on demand [18, 19]. Migrating organisational services, data and applications to the cloud is an important strategic decision for organisations because of the many benefits introduced by cloud computing, such as cost reduction and on-demand resources [20]. However, cloud adoption raises challenges and risks related to data security when data are dynamically placed among globally distributed data centres while minimising the user-perceived latency [20, 21]. Because the data centres are connected with limited bandwidth in the cloud, some scientific data may be insecure when they are transmitted from one data centre to another. This challenge is further complicated by the security constraints on potentially sensitive data. What is more, new threats may arise because the resources are shared with others [22]: in this scenario, one's sensitive data are more likely to be exposed to others, and data stored in data centres may even be leaked by malicious cloud providers. The loss of sensitive information may cause serious damage to interests, property and personal safety. Consequently, data security has been widely recognised as one of the major technical obstacles to deploying data in openly distributed systems, e.g. clouds [23, 24].

In this paper, we propose a security-aware intermediate data placement strategy for scientific workflows in the cloud. Disregarding the placement of the input data, which is decided by the users, we mainly focus on the placement of the intermediate data. In our strategy, we attempt to ensure data security based on three aspects: data confidentiality, data integrity and access authentication. A security model is used to quantitatively measure the security services provided by the data centres. Then, we use an ACO (ant colony optimisation)-based algorithm to dynamically select the appropriate data centres for the intermediate data to improve the data security. This strategy is notably effective at improving data security during the execution of the scientific workflows.

The remainder of the paper is organised as follows. Section 2 describes the related work. Section 3 introduces the security problem of scientific workflows. Section 4 proposes the security model of scientific workflow systems in cloud. Section 5 presents the data placement strategy. Section 6 demonstrates the experimental results and the analysis. Finally, we draw our conclusions.

2 Related work

The growing trend towards cloud computing has spawned new data management systems such as the Google File System [25] and Hadoop [26]. These systems act as data storage infrastructures in the cloud and also provide some basic data management functions, e.g. protection against unauthorised access, data block storage, and disaster recovery. Big data management technologies [3, 27] have thus become an important issue. Currently, several works have begun to concentrate on the data placement of scientific workflows. Guo et al. [28] propose a multi-objective data placement model and use a particle swarm optimisation algorithm to optimise the time and the cost in cloud computing. To use resources more effectively, Guo et al. [29] devise an optimal data placement strategy that minimises the processing cost and the transfer time. For data-intensive scientific workflows, the data movement time seriously affects application efficiency. Ma et al. [30] propose a data placement method based on the Bayesian network for data-intensive scientific workflows, which can effectively reduce the data movement time among different data centres. Er-Dum et al. [31] propose a data placement strategy based on a heuristic genetic algorithm to reduce the data movements among the data centres while balancing their loads. Moreover, Shao-Wei et al. propose a two-stage data placement strategy and a task scheduling strategy for efficient workflow execution with data dependencies [32]. These data placement strategies mainly consider the data transfer time among data centres. However, when sensitive data are transferred across data centres, security threats may arise during transmission. Consequently, data security is a significant and challenging problem for data placement.

Because resources are shared among users in the cloud, data are exposed to multiple users and to cloud providers when we place them in data centres. This paradigm introduces security risks when sensitive data are present. Thus, data security is widely regarded as a major barrier in the cloud. Several recent works have considered data security. Xi et al. [33] prove that the compression is asymptotically lossless because the aggregated estimator deviates from the true model for data-intensive computing. Xiong et al. [21] briefly introduce the security problems in the cloud, particularly when deploying data. In [34], Peng et al. propose a data placement approach that places the data on the cloud side or the client side according to the privacy requirements and adjusts the placement according to the control flow to minimise the data transfer time. However, most of the latest works fail to describe a strategy for solving the data security problem.

The studies closest to ours are [16] and [35]. In [16], Yuan and Yang propose a two-phase data placement strategy based on \(k\)-means clustering for scientific workflows to reduce the data movements. The strategy contains two algorithms: one groups the existing datasets into \(k\) data centres during the build-time stage, and the other dynamically clusters newly generated datasets to the most appropriate data centres during the runtime stage, based on their dependencies. However, reducing the data movements does not necessarily improve performance, because this strategy considers neither the size of the datasets nor the bandwidths among the data centres. This limitation is partly addressed in [35], where Zheng Pai and Cui Li-Zhen propose a genetic algorithm-based data placement strategy for three problems: reducing the time cost of data movements across the data centres, handling the data dependencies, and maintaining a relative load balance among the data centres. The strategy adopts a genetic algorithm to obtain solutions that reduce the data transmission in the first stage; in the second stage, those solutions are readjusted using the data dependencies; finally, the most appropriate solution is selected according to the load balancing performance in the third stage. However, this strategy may fail to obtain the optimal solution: because the final solution is selected by a different goal in each stage, some appropriate solutions may be missed. Unfortunately, these works only focus on reducing the data transmission and fail to consider data security. In fact, the security problem is even more critical in the openly shared network environment of the cloud.

Therefore, it is necessary to deploy security services to protect the sensitive intermediate data in scientific workflows. Because snooping, alteration, and spoofing are three common attacks in cloud computing, in this paper we consider three security services (authentication, integrity, and confidentiality) to guard against these common threats to sensitive data. Snooping, the unauthorised interception of information, can be countered using confidentiality services. Alteration, the unauthorised change of information, can be handled using integrity services [36]. Spoofing, the impersonation of one entity by another, can be handled using an authentication service [37]. With these three security services, users can flexibly select security services to form an integrated protection against a diversity of threats and attacks in the cloud. In this paper, we first build a security model to measure the security overhead incurred by deploying the three services for the sensitive data. Then, we propose a security-aware data placement strategy that can improve data security while guaranteeing the data transfer time.

3 Problem analysis of the data security for scientific workflows

3.1 Scientific workflow model

Following the literature [16], we first describe the scientific workflow applications as follows.

Definition 1

The scientific workflow applications can be expressed as a triple \(w = {<} T,\,C,\,DS {>}\), where \(T\) is the set of all tasks in \(w\), and \(C\) is the set of control flows among the tasks. In this paper, the control flow is reflected through the data flow among the tasks, and \(DS\) is the collection of all datasets in \(w\).

Our work mainly relates to the input and output datasets of the scientific workflows tasks. Thus, the tasks are described as in Definition 2:

Definition 2

The task is defined as \(T_{i}={<} IDS_{i},\,ODS_{i},\,s_{Ti}, s_{Twi} {>}\), where \(IDS_{i}\) indicates the input datasets, \(ODS_{i}\) refers to the output datasets, \(s_{Ti}\) represents the task security service requirements, and \(s_{Twi}\) denotes the importance of the security services. These parameters will be introduced in Sect. 4.2.

In this paper, we divide the datasets into fixed-location and flexible-location datasets according to where the data are stored. Because of the limitations and constraints on intellectual property, equipment and processing capacity, the fixed-location datasets are stored in specific data centres. Furthermore, these fixed-location datasets consume storage resources at their fixed storage locations, so they affect the data layout produced by the overall strategy. A detailed description and analysis is given in Sect. 5.

3.2 Example analysis

It is known that the data centres are connected together through the Internet. The users must use computing resources and storage resources from different data centres during the execution of the scientific workflows.

From Fig. 1a, we learn that the scientific workflow includes tasks, input datasets, generated datasets and fixed-location datasets. These data are placed in \(DC_{A}\) and \(DC_{B}\) in Fig. 1b. When executing the scientific workflow, it is necessary to access other data centres to obtain the required data.

Fig. 1 Example of a scientific workflow

The data, particularly sensitive data, may be subjected to security threats in data centres because of the sharing feature of the cloud. As Fig. 1b shows, the sub-tasks \(\{T_{1}, T_{2}\}\), input data \(\{d_{1}, d_{3}\}\) and intermediate data \(\{dT_{1}, dT_{2}\}\) are placed in data centre \(DC_{A}\), whereas the sub-tasks \(\{T_{3}, T_{4}\}\), input data \(d_{2}\), fixed-location data \(d_{4fix}\) and intermediate data \(dT_{3}\) are placed in data centre \(DC_{B}\). For example, \(T_{3}\) must call the input data \(\{d_{2}, dT_{1}\}\) during its execution. However, these input datasets are located in different data centres, so \(dT_{1}\) must be obtained from \(DC_{A}\). Furthermore, \(\{d_{1}, d_{3}\}\) is also placed in \(DC_{A}\), and this information is likely to be leaked while acquiring \(dT_{1}\). If these data are strongly sensitive, their disclosure will cause irreparable consequences. In conclusion, data security is an important issue that should not be overlooked when running scientific workflows in the cloud.

3.3 Problem description

With the increasing demand for computing and storage resources, the cloud computing model has emerged. However, it also raises new challenges for placing data in globally distributed data centres. When a scientific workflow is deployed in the cloud, data security is seriously threatened. The security problems may arise from two sources: the cloud providers and the other users.

First, we analyse the threat posed by the cloud providers who offer the security services. In Fig. 1b, for example, the data \(\{d_{1},\,d_{3},\,dT_{1},\,dT_{2}\}\) are located in \(DC_{A}\). Thus, the cloud providers are certainly aware of the related information of the users' data. These data may be leaked or even sold to others by some malicious cloud providers, and the loss of sensitive information may cause loss of interest, property or personal safety. In addition, data belonging to other users reside in the same data centre. Because the resources are shared with other users, a user's data could easily be stolen or tampered with by those users, which ultimately results in losses for the user. The details are given in Sect. 3.2.

To address this security issue, we should protect the security of the users' data. The users can request different data security services according to the degree of sensitivity of the data when running scientific workflows in the cloud. Accordingly, the data centres mainly adopt measures such as certification services, encrypted data storage, data recovery, security management, security logs, and audit services to satisfy the users' data security requirements.

In addition, we must consider whether a data centre can satisfy the data security service requirements when placing the data. If it cannot, the data should be deployed in other data centres that can provide the required services.

The next section will introduce the security model of the scientific workflow systems.

4 Security model of scientific workflow systems

In this section, we illustrate the security model for scientific workflow systems from the perspectives of service providers, service consumers and service evaluation. This model includes the security service model of the data centres, the security service models for tasks and data, the degree of data security deficiency (DDSD), the degree of task security deficiency (DTSD) and the security deficiency degree of a scientific workflow (SDDSW).

4.1 Security service model for data centres

First, we introduce the security services of the data centres from the perspective of the service providers. It is known that the distributed data centres are connected together through the Internet. These data centres provide the certification service, encryption for data storage, data recovery, security management, security logging, auditing and other measures to guarantee data security [22]. Confidentiality, integrity, and authentication are the three basic services that are used to ensure data security. Thus, we use the following definition to denote the security services provided by the data centres.

Definition 3

Let the vector \(p_j =\{p_j^1, p_j^2, p_j^3\}\) represent the security service capability of data centre \(dc_{j}\), where \(p_j^1\) represents the confidential service coefficient, \(p_j^2\) represents the integrity service coefficient, and \(p_j^3\) represents the authentication service coefficient. Each coefficient quantifies the strength of the corresponding security service.

Each service can be implemented using several strategies, as shown in Tables 1, 2, and 3, according to the literature [36, 38, 39].

Table 1 Confidential service parameters
Table 2 Integrity service parameters
Table 3 Authentication service parameters

The parameters in these tables show that different security service factors correspond to different capabilities of providing security services: a greater parameter corresponds to a higher level of security service. Take the confidential service as an example. In Table 1, both DES and IDEA provide encryption, but IDEA has a larger coefficient than DES, which implies that IDEA guarantees data confidentiality better than DES.

The service factors in those tables are derived from the performance of the corresponding algorithms. For example, the encryption efficiencies of SEAL, DES and IDEA on a 90 MHz processor are 168.75, 15 and 13.5 KB/ms, respectively [36]. The IDEA algorithm has the maximum security factor and the lowest efficiency [38]. Therefore, the encryption efficiency of IDEA is divided by the encryption efficiency of each algorithm (SEAL, DES and IDEA) to obtain the corresponding confidential security service parameters. The security service factors are only reference values.
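As a minimal illustration of this normalisation (the function name and the exact rule are our assumptions, not part of [36, 38]), the coefficients can be computed as follows:

```python
# Hypothetical sketch: confidential service coefficients obtained by normalising
# each algorithm's encryption efficiency against IDEA, the slowest and most secure
# algorithm. The efficiencies (KB/ms on a 90 MHz processor) come from the text;
# the exact normalisation rule is an assumption.

efficiencies = {"SEAL": 168.75, "DES": 15.0, "IDEA": 13.5}  # KB/ms

def confidential_coefficients(eff):
    slowest = min(eff.values())                  # IDEA's efficiency
    return {alg: round(slowest / e, 2) for alg, e in eff.items()}

print(confidential_coefficients(efficiencies))
# {'SEAL': 0.08, 'DES': 0.9, 'IDEA': 1.0}
```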

4.2 Security service requirements of tasks

We now describe the security service requirement factors from the perspective of the consumers. When scientific workflows are deployed in the cloud, different tasks have different security service requirements. We first introduce the security service requirements for a task in the following definition.

Definition 4

The security service requirement for a task is represented as \(s_{Ti}=\{s_{Ti}^1, s_{Ti}^2, s_{Ti}^3\}\), where \(s_{Ti}^1\) denotes the confidential service factor, \(s_{Ti}^2\) denotes the integrity service factor, and \(s_{Ti}^3\) denotes the authentication service factor.

The weights \(s_{Twi}=\{s_{Twi}^1, s_{Twi}^2, s_{Twi}^3\}\) indicate the relative importance of the three services for task \(T_i\): the more important a security service is, the larger its weight \(s_{Twi}^k\). The three components satisfy the formula \(\sum \nolimits _{k=1}^3 {s_{Twi}^k } =1\).

The task security model for a scientific workflow is shown in Fig. 2.

Fig. 2 Security service requirement factors of tasks

As shown in Fig. 2, the security service requirements of each task are indicated by \(s_{Ti}\) and \(s_{Twi}\), where \(s_{Ti}\) represents the security service demands for the task, and \(s_{Twi}\) denotes the importance of the security services. Let us take \(T_{5}\) as an example: 0.42, 0.16 and 0.38 are the coefficients of the confidential service, the integrity service and the authentication service, respectively. Additionally, 0.2, 0.6 and 0.2 represent the weights of the task security services, and their sum equals 1.

4.3 Security service requirements of data

According to Definition 2, the security service requirements of the generated data are determined by the task that produces it. For example, we know that the security service requirements of \(T_{5}\) are \(s_{T5}=\{0.42,0.16,0.38\}\) and \(s_{Tw5}=\{0.2,0.6,0.2\}\). Thus, the security service requirements of the data generated by \(T_{5}\) are identical to that of \(T_{5}\). Then, the security service requirements for the intermediate data are described as follows.

Definition 5

The data security service requirements are represented by \(s_i =\{s_i^1, s_i^2, s_i^3\}\), where \(s_i^1\), \(s_i^2\), and \(s_i^3\) represent the confidential service coefficient, the integrity service coefficient and the authentication service coefficient of the data, respectively.

Similarly, the importance weights of the data security services, \(s_{wi}\), are defined analogously to \(s_{Twi}\), and the security service weights satisfy the formula \(\sum \nolimits _{k=1}^3 {s_{wi}^k } =1\).

In this paper, we mainly consider the security service demands of the intermediate data. The data security service requirements describe the security services that are required when moving or storing the data. We must decide whether the distributed data centres can provide the required security services when we schedule tasks; if not, the tasks cannot be scheduled to these data centres.

Table 4 describes the security services that are provided by several data centres in cloud.

Table 4 Security service factors of data centres

Table 4 illustrates the highest level of security services that the data centres can provide. For example, the maximum confidential service factor that \(DC_{A}\) can offer is 0.64. Hence, \(DC_{A}\) can provide any confidential service coefficient that is less than or equal to 0.64. If the confidential service coefficient that a task requires is 0.9, this data centre obviously cannot meet the requirement, and another data centre should be selected for this task. The same principle applies to the integrity service and the authentication service.
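This feasibility check can be sketched as follows; the capability vector of \(DC_{A}\) is only partly given above (the confidential factor 0.64), so the remaining components are assumed values for illustration:

```python
def can_satisfy(s_i, p_j):
    """True if data centre capability p_j covers requirement s_i component-wise."""
    return all(req <= cap for req, cap in zip(s_i, p_j))

# Capability of DC_A: the confidential factor 0.64 is given in the text; the
# integrity and authentication factors below are assumed values.
p_DC_A = (0.64, 0.46, 0.70)   # (confidentiality, integrity, authentication)

print(can_satisfy((0.42, 0.16, 0.38), p_DC_A))   # True:  all requirements covered
print(can_satisfy((0.90, 0.16, 0.38), p_DC_A))   # False: 0.90 > 0.64 (confidentiality)
```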

4.4 Degree of data security deficiency

To evaluate the security services, we propose the DDSD. The DDSD is regarded as a quantitative indicator to describe the deficiency degree for data security.

In this paper, we define the DDSD as the ratio of the VDSD (value of data security deficiency) to the BVDSD (base value of data security deficiency). The DSD (degree of security deficiency) has been proposed in [38], but the size of the datasets also affects the transfer time because the data centres are connected with limited bandwidth: the transmission time increases with the size of the datasets, and a security problem becomes more likely to occur. Thus, the VDSD is defined in formula (1). In this formula, \(d_{si}\) represents the size of \(d_{i}\). When the security services \(p_j\) provided by the data centre satisfy the data security service requirement \(s_i\), the value of \(g(s_i^k ,p_j^k)\) is 0; otherwise, \(g(s_i^k ,p_j^k)\) is the absolute value of the difference between \(s_{i}^k\) and \(p_{j}^k\).

$$\begin{aligned} VDSD(d_i)&= DSD(s_i){*} d_{si}\nonumber \\&= \sum _{k=1}^3 {s_{wi}^k {*} g(s_i^k ,p_j^k )} {*} d_{si}, 0 \le s_{wi}^k \le 1\nonumber \\ \sum _{k=1}^3 s_{wi}^k&= 1\;,\;g(s_i^k ,p_j^k )= \left\{ \begin{array}{ll} 0,&{}\hbox { if }s_i^k \le p_j^k\\ s_i^k -p_j^k, &{}\hbox {otherwise}\\ \end{array}\right. \end{aligned}$$
(1)

BVDSD denotes the value of the data security deficiency when \(p_{j}=\{0,0,0\}\). This relationship can be described using the following formula.

$$\begin{aligned} BVDSD(d_i)&= VDSD(d_i) |_{p_j =\{0,0,0\}}\nonumber \\&= DSD(s_i){*} d_{si} |_{p_j =\{0,0,0\}}\nonumber \\&= \sum _{k=1}^3 {s_{wi}^k {*}s_i^k } {*}d_{si}\nonumber \\ 0\le s_{wi}^k&\le 1, \sum _{k=1}^3 {s_{wi}^k =1\;\;} \end{aligned}$$
(2)

Then, the DDSD is shown in formula (3), where the value of \(DDSD (d_{i})\) ranges from 0 to 1. If that value equals 0, the data security service requirements can be completely satisfied. Otherwise, a smaller value corresponds to more secure data.

$$\begin{aligned} DDSD(d_i)&= \frac{VDSD(d_i)}{BVDSD(d_i)}\nonumber \\&= \frac{VDSD(d_i)}{VDSD(d_i) |_{p_j =\{0,0,0\}}}\nonumber \\&= \frac{\sum _{k=1}^3 {s_{wi}^k {*}g(s_i^k ,p_j^k )} {*}d_{si} }{\sum _{k=1}^3 {s_{wi}^k {*}s_i^k {*}d_{si}} }\;,\;0\le s_{wi}^k \le 1\nonumber \\ \sum _{k=1}^3 s_{wi}^k&= 1, g(s_i^k ,p_j^k )= \left\{ \begin{array}{ll} 0, &{}\mathrm{if}\,\,s_i^k \le p_j^k\\ s_i^k -p_j^k, &{} \hbox {otherwise} \end{array}\right. \end{aligned}$$
(3)

For example, in Fig. 1, we assume that the size of the generated data \(dT_{1}\) is 80 GB. The security service demand is \(s_{i}=\{0.63,0.4,0.6\}\), and the importance of the security services is \(s_{wi}=\{0.2,0.5,0.3\}\). During the execution of a scientific workflow, \(dT_{1}\) is moved to the other data centre \(DC_{B}\), which can provide the security service \(p_{B}=\{0.46,0.46,0.9\}\). Thus, we can obtain the following value of \(DDSD (d_{i})\).

$$\begin{aligned} DDSD(d_i)&= \frac{[0.2{*}(0.63-0.46)+0.5{*}0+0.3{*}0]{*}80}{[0.2{*}0.63+0.5{*}0.4+0.3{*}0.6]{*}80}\\&= 2.72/40.48\\&= 0.067 \end{aligned}$$
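A minimal sketch of formulas (1)–(3), reproducing the worked example above, is given below; the helper names are ours, and the size factor is kept explicit even though it cancels in the ratio:

```python
def g(s_k, p_k):
    """Per-service deficiency: 0 if the requirement is satisfied, the gap otherwise."""
    return 0.0 if s_k <= p_k else s_k - p_k

def vdsd(s, s_w, p, size):
    """Value of data security deficiency, formula (1)."""
    return sum(w * g(sk, pk) for w, sk, pk in zip(s_w, s, p)) * size

def ddsd(s, s_w, p, size):
    """Degree of data security deficiency, formula (3): VDSD / BVDSD."""
    return vdsd(s, s_w, p, size) / vdsd(s, s_w, (0.0, 0.0, 0.0), size)

# Worked example from the text: dT_1 (80 GB) moved to DC_B.
print(round(ddsd((0.63, 0.4, 0.6), (0.2, 0.5, 0.3), (0.46, 0.46, 0.9), 80), 3))
# 0.067
```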

4.5 Degree of task security deficiency

In this part, we derive the DTSD, which will then be used in Sect. 4.6 to obtain the SDDSW. For the task \(T_{i}={<} IDS_{i}, ODS_{i}, s_{Ti}, s_{Twi}{>}\) described in Sect. 3.1, the DTSD is calculated as the sum of the DDSDs of its input and output datasets. In this paper, we only consider the security of the datasets generated during the execution; the security requirements of the intermediate datasets are those of the tasks that generate them. During the execution, the data centre should attempt to satisfy the maximum coefficient of the security services for all input datasets and the task itself.

According to the definition of \(DDSD (d_{i})\), the \(DTSD\) is defined by formula (4).

$$\begin{aligned} DTSD(T_i)&= DDSD(IDS_i) + DDSD(ODS_i)\nonumber \\&= \sum _{i=1}^{|{IDS_i}|} {DDSD(id_i )} +\sum _{i=1}^{|{ODS_i}|} {\sum _{l=1}^t {DDSD(od_i )}}\nonumber \\ \hbox {where}\,\, id_i&\in IDS_{i},\; od_i \in ODS_i,\; t= \left\{ \begin{array}{ll} 1, &{} dc=dc(d_o)\\ 2, &{} dc\ne dc(d_o) \end{array}\right. \end{aligned}$$
(4)

The DDSDs of the input and output datasets are calculated differently. When an output dataset remains at the data centre where the task runs, its DDSD is calculated only once. Otherwise, we calculate the DDSDs at both the data centre that generates the dataset and the data centre that stores it, and then sum them. Because the data centre where the task runs and the data centre where the data are placed may be different, these DDSD values are not necessarily identical.

We use the scientific workflow in Fig. 1 as an example.

The security service requirements for the tasks and the generated datasets are shown in Table 5, and Table 6 describes the security services provided by the data centres. To calculate the DTSD, we use \(T_{2}\) as an example. From these tables, we know that the data location and the execution location of \(T_{2}\) are different. Thus, we must calculate the DDSDs for the datasets that are generated and located in different data centres. During the execution, \(DC_{A}\) must satisfy the maximum coefficient of the security services for \(dT_{1}\) and \(T_{2}\).

Table 5 Security service requirements for the tasks and the generated data
Table 6 Security service capability of the data centres

First, we calculate the DDSD of the input data of \(T_{2}\), which is named \(DDSD(i)\).

$$\begin{aligned} DDSD(i)&= \frac{[0.2 {*}(0.63-0.36) + 0.5 {*}0 + 0.3 {*}0.1] {*}80}{[0.2 {*}0.63 + 0.5 {*}0.46 + 0.3 {*}0.7] {*}80}\\&= 0.084/0.566\\&= 0.15 \end{aligned}$$

Then, the DDSD for data generated by \(T_{2}\) in \(DC_{A}\) is called DDSD(g):

$$\begin{aligned} DDSD (g)&= \frac{(0.3 {*}0 + 0.4 {*}0 + 0.3 {*}0.1) {*}100}{[0.3 {*}0.36 + 0.4 {*}0.46 + 0.3 {*}0.7] {*}100}\\&= 0.03/0.502\\&= 0.06 \end{aligned}$$

However, the generated data are located in another data centre \(DC_{B}\), so the DDSD in \(DC_{B}\) is calculated as follows:

$$\begin{aligned} DDSD (gs)&= \frac{(0.3 {*}0 + 0.4 {*}0 + 0.3 {*}0) {*}100}{[0.3 {*}0.36 + 0.4 {*}0.46 + 0.3 {*}0.7] {*}100}\\&= 0\\ DTSD (T_2)&= DDSD (i) + DDSD (g) + DDSD (gs)\\&= 0.15 + 0.06 + 0\\&= 0.21 \end{aligned}$$

The DTSDs for the other tasks can be calculated using the same method.

4.6 Security deficiency degree of scientific workflow

A scientific workflow is a collection of interdependent tasks, so the definition of SDDSW is:

$$\begin{aligned} SDDSW(w)=\sum _{i=1}^{|T|} {DTSD(T_i )}, T_i \in T \end{aligned}$$
(5)

Using (5), \(SDDSW (w)\) can be calculated by summing the DTSDs of all tasks, which are obtained as in the example of Sect. 4.5.

When scheduling the tasks of a scientific workflow [40], we should attempt to guarantee the security services for each task. Thus, one goal of scientific workflow scheduling is to minimise the SDDSW.
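A minimal sketch of formulas (4) and (5) is given below. The dictionary-based task and data model is an illustrative assumption; input datasets are evaluated at the data centre where the task executes, and an output dataset stored at a different data centre is counted at both the generating and the storing data centre (\(t=2\)), as in the example of Sect. 4.5:

```python
def g(s_k, p_k):
    return 0.0 if s_k <= p_k else s_k - p_k

def ddsd(d, p):
    """DDSD of dataset d = {'s': requirements, 'w': weights, 'size': GB} under p."""
    num = sum(w * g(sk, pk) for w, sk, pk in zip(d["w"], d["s"], p)) * d["size"]
    den = sum(w * sk for w, sk in zip(d["w"], d["s"])) * d["size"]
    return num / den

def dtsd(task, centres):
    """Degree of task security deficiency, formula (4)."""
    p_task = centres[task["dc"]]
    total = sum(ddsd(d, p_task) for d in task["inputs"])
    for d in task["outputs"]:
        total += ddsd(d, p_task)                 # counted where it is generated
        if d["dc"] != task["dc"]:                # stored elsewhere: t = 2
            total += ddsd(d, centres[d["dc"]])
    return total

def sddsw(tasks, centres):
    """Security deficiency degree of the scientific workflow, formula (5)."""
    return sum(dtsd(t, centres) for t in tasks)
```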

5 Security-aware data placement strategy

Because the architectures and technologies adopted in data centres differ, the data centres of different organisations show heterogeneous characteristics in some aspects, such as security heterogeneity. In this article, this heterogeneity is manifested mainly in that the same security services are implemented with different encryption algorithms and technologies. For example, some data centres use SSL to ensure the security of data transmission, whereas others use VPNs. To improve data security, we fully consider this heterogeneity and the DDSD when placing the data. We use the following formulas to describe the problem, which is a combinatorial optimisation under multi-dimensional constraints.

Formula (6) describes the minimisation of the SDDSW, where \(d_{sid}\) and \(d_{sod}\) denote the sizes of the input and output datasets, respectively. Formula (7) expresses the minimisation of the data transfer time, where \(n\) is the number of datasets, \(dc_{orig}\) is the original data centre where \(d_{i}\) is located, and \(dc_{des}\) is the destination data centre of \(d_{i}\). The transfer time of a single dataset is described in (8), where \(d_{si}\) represents the size of \(d_{i}\) and \(b\) is the bandwidth between \(dc_{orig}\) and \(dc_{des}\); if \(dc_{orig}\) is identical to \(dc_{des}\), the transmission time is 0. In addition, (9) indicates that the space used in a data centre cannot exceed \(cs\), its maximum storage capacity.

$$\begin{aligned} Min(SDDSW(w))&= Min\left( \sum \limits _{i=1}^{|T|} {DTSD(T_i )}\right) \nonumber \\&= Min\left( \sum \limits _{i=1}^{|T|} \left( \sum \limits _{i=1}^{| IDS_i |} DDSD(id_i)+ \quad \sum \limits _{i=1}^{| ODS_i|} \sum _{l=1}^t DDSD(od_i)\right) \right) \nonumber \\&= Min\left( \sum \limits _{i=1}^{|T|} \left( \sum \limits _{i=1}^{| IDS_i |} \left( \frac{\sum _{k=1}^3 s_{wid}^k {*} g(s_{id}^k ,p_j^k ) {*} d_{sid}}{\sum _{k=1}^3 s_{wid}^k {*}s_{id}^k {*} d_{sid} }\right) \right. \right. \nonumber \\&\left. \left. + \sum \limits _{i=1}^{|ODSi|} \sum \limits _{l=1}^t \left( \frac{\sum _{k=1}^3 s_{wod}^k {*} g(s_{od}^k ,p_m^k) {*} d_{sod}}{\sum _{k=1}^3 s_{wod}^k {*}s_{od}^k {*}d_{sod}}\right) \right) \right) \end{aligned}$$
(6)
$$\begin{aligned} Min (TimeCost(w))&= Min\left( \sum \limits _{i=1}^n {TimeCost(d_i, dc_{orig}, dc_{des})}\right) \end{aligned}$$
(7)
$$\begin{aligned} TimeCost (d_{i}, dc_{orig}, dc_{des})&= d_{si}/b\end{aligned}$$
(8)
$$\begin{aligned} ConstrainSpace (dc_{ava})&\le cs \end{aligned}$$
(9)

We obtain the data placement strategy from (6) and select the optimal strategy using (7). Thus, our strategy can improve the intermediate data security while guaranteeing the data transfer time.

Because the ACO algorithm is suitable for solving such combinatorial optimisation problems, we translate the constraints into a heuristic function and a fitness function and then solve the data placement problem using ACO.

5.1 Initial setup

In this strategy, we set the number of ants to be the number of datasets. Assuming that the number of data centres is \(m\) and the number of datasets is \(n\), the initial value of the pheromone is set to \(1/(m{*}n)\).

5.2 Heuristic function

We select a data centre for each dataset according to formula (10):

$$\begin{aligned} P_{ij} (t)=\left\{ \begin{array}{ll} \tau _{ij}^\alpha (t){*}\eta _{ij}^\beta (t), &{} \hbox {if } p<p_0\\ \frac{\tau _{ij}^\alpha (t){*}\eta _{ij}^\beta (t)}{\sum _{l\in DC} {\tau _{il}^\alpha (t){*}\eta _{il}^\beta (t)} }, &{} \hbox {otherwise} \end{array}\right. ,\quad j\in DC \end{aligned}$$
(10)

where \(DC\) represents the collection of data centres in the cloud, \(P_{ij}(t)\) is the probability that an ant chooses \(dc_{j}\) for \(d_{i}\) in the \(t\)-th iteration, \(\tau _{ij}(t)\) is the pheromone between \(d_{i}\) and \(dc_{j}\) in the \(t\)-th iteration, \(\eta _{ij}\) is the value of the heuristic function \(f(i,j)\), \(\alpha \) is the importance of the residual pheromone, and \(\beta \) is the importance of the heuristic function value. \(P\) is a random value in the interval [0, 1], and \(P_{0}\) is a preset value in the interval [0, 1]; these two values are used to avoid premature convergence to a local optimal solution in the search process.
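The selection rule of formula (10) can be sketched as follows; the parameter names are assumptions, and the case \(p<P_{0}\) is interpreted as exploiting the data centre with the largest \(\tau _{ij}^\alpha (t){*}\eta _{ij}^\beta (t)\) value:

```python
import random

def choose_data_centre(scores, p0):
    """Selection rule of formula (10). 'scores' maps a data centre j to
    tau_ij(t)**alpha * eta_ij(t)**beta; with probability p0 the best-scoring
    centre is exploited, otherwise a roulette-wheel choice is made."""
    if random.random() < p0:
        return max(scores, key=scores.get)
    r = random.uniform(0.0, sum(scores.values()))
    acc = 0.0
    for j, s in scores.items():
        acc += s
        if acc >= r:
            return j
    return j                                    # numerical fall-through
```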

The heuristic function \(f(i,j)\) can be represented by formula (11)

$$\begin{aligned} f (i,j) = c (d_i) {*} DDSD (d_i) \end{aligned}$$
(11)

where \(c(d_{i})\) represents the number of tasks that need \(d_{i}\) or the usage frequency of \(d_{i}\). In this paper, we assume that the frequently used datasets are important, and we prioritise the security requirements of these datasets. In (11), \(DDSD (d_{i})\) is the DDSD of \(d_{i}\) in \(dc_{j}\).

5.3 Pheromone update

When every ant completes a search, an evaluation function is used to evaluate the data transfer time in every iterative process and to select the optimal solution. Then, we use (12) to update the pheromone on the optimal path.

$$\begin{aligned} \tau _{ij} (t+1) = (1-\rho ) {*} \tau _{ij} (t) + \varDelta \tau _{ij} (t) \end{aligned}$$
(12)

In (12), \(\tau _{ij}(t+1)\) represents the pheromone between \(d_{i}\) and \(dc_{j}\) after the \(t\)-th iteration, and \(\rho \) is the decay parameter of the pheromone. In this experiment, \(\varDelta \tau _{ij}(t)\) is the amount of pheromone increase. We adopt a fixed pheromone increment, as shown in (13).

$$\begin{aligned} \varDelta \tau _{ij} (t) = 1/(m {*}n) \end{aligned}$$
(13)

where \(m\) is the number of data centres, \(n\) is the number of datasets, and \(\tau _{ij}(t)\) is the total amount of pheromone before updating.

5.4 Data placement strategy

Prerequisites: the number of data centres is \(m\), \(n\) represents the number of datasets, and \(n_{fix}\) denotes the number of fixed-location datasets. Each data centre \(dc_{j}\) can provide the security services \(p_{j}\). The security service requirement for \(d_{i}\) is \(s_{i}\), and the importance of the security services is \(s_{wi}\). The algorithm terminates when the maximum number of iterations is reached or the optimal solution no longer changes. \(S_{Final}\) is used to store the optimal solution at each stage. In addition, data with higher security requirements are placed first: before the iterations, the datasets are ranked in descending order of their DDSDs computed with \(p_{j}=\{0,0,0\}\).

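A minimal, illustrative sketch of the placement procedure described above is given below. The data and data-centre models, the parameter values and several details are assumptions: fixed-location datasets are omitted, the heuristic uses the inverse of the DDSD (so that data centres with a smaller security deficiency are more attractive) instead of the literal form of formula (11), the usage count \(c(d_i)\) is dropped for brevity, and a pheromone floor is added in line with the convergence analysis of Sect. 5.6:

```python
import random

def g(s_k, p_k):
    return 0.0 if s_k <= p_k else s_k - p_k

def ddsd(d, p):
    """DDSD of dataset d = {'s': requirements, 'w': weights, 'size': GB} under p
    (the size factor cancels in the ratio and is therefore omitted)."""
    num = sum(w * g(sk, pk) for w, sk, pk in zip(d["w"], d["s"], p))
    den = sum(w * sk for w, sk in zip(d["w"], d["s"]))
    return num / den

def pick(scores, p0):
    """Formula (10): exploit the best centre with probability p0, else roulette."""
    if random.random() < p0:
        return max(scores, key=scores.get)
    r = random.uniform(0.0, sum(scores.values()))
    acc = 0.0
    for j, s in scores.items():
        acc += s
        if acc >= r:
            return j
    return j

def transfer_time(d, dest, bandwidth):
    """Formula (8): zero if the dataset already resides at the destination.
    'bandwidth' is a matrix of inter-data-centre bandwidths."""
    orig = d.get("origin", dest)
    return 0.0 if orig == dest else d["size"] / bandwidth[orig][dest]

def sai_place(datasets, centres, bandwidth, iters=100,
              alpha=1.0, beta=2.0, rho=0.1, p0=0.9, tau_min=1e-4):
    m, n = len(centres), len(datasets)
    tau = [[1.0 / (m * n)] * m for _ in range(n)]       # initial pheromone (Sect. 5.1)
    # Place data with higher security demands first: rank by the deficiency value
    # under p = {0,0,0}, i.e. the BVDSD, which accounts for weights and size.
    order = sorted(range(n),
                   key=lambda i: sum(w * sk for w, sk in
                                     zip(datasets[i]["w"], datasets[i]["s"]))
                                 * datasets[i]["size"],
                   reverse=True)
    best, best_time = None, float("inf")
    for _ in range(iters):
        for _ant in range(n):                           # one ant per dataset (Sect. 5.1)
            used, placement = [0.0] * m, [None] * n
            for i in order:
                d = datasets[i]
                feasible = [j for j in range(m)         # storage constraint, formula (9)
                            if used[j] + d["size"] <= centres[j]["capacity"]]
                # Heuristic: inverse of the DDSD, preferring low security deficiency.
                scores = {j: (tau[i][j] ** alpha) *
                             ((1.0 / (1e-6 + ddsd(d, centres[j]["p"]))) ** beta)
                          for j in feasible}
                j = pick(scores, p0)
                placement[i] = j
                used[j] += d["size"]
            cost = sum(transfer_time(datasets[i], j, bandwidth)
                       for i, j in enumerate(placement))
            if cost < best_time:                        # evaluation, formula (7)
                best, best_time = placement, cost
        # Pheromone update, formulas (12)-(13): evaporate, then reinforce the best path.
        for i in range(n):
            for j in range(m):
                tau[i][j] = max(tau_min, (1 - rho) * tau[i][j])
        for i, j in enumerate(best):
            tau[i][j] += 1.0 / (m * n)
    return best, best_time
```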

5.5 Complexity analysis

Theorem 1

The time complexity of the security-aware intermediate (SAI) data placement strategy is \(O(m{*}n) + O(t{*}(c{*}m{*}n+c))\), where \(c\) is the number of ants, \(n\) the number of datasets, \(m\) the number of data centres and \(t\) the maximum number of iterations.

Proof

At the initial stage, the pheromone must be initialised between every dataset and every data centre, so the time complexity of this stage is \(O(m{*}n)\). During the search, a data centre is chosen for each dataset according to formula (10).

The ants must traverse all data centres to obtain the location of the datasets in a searching procedure, so the time complexity to complete one iteration for one ant is \(O(m*n)\). The time complexity of all \(t\) iterations for all ants is \(O(t*(c*m*n))\). Furthermore, each searching procedure must use the evaluation function to choose the best solution to update the pheromone. The time complexity of each procedure is \(O(c)\). Thus, the total time complexity of SAI is \(O(m*n)+O(t*(c*m*n+c))\).

5.6 Convergence analysis

This paper adopts an ACO-based strategy, which can largely avoid premature stagnation and speed up the search [41]. To validate the effectiveness and feasibility of the strategy, the proof of convergence is given as follows:

Theorem 2

The ACO converges to the optimal solution with probability approaching 1; that is, for an arbitrarily small \(\varepsilon > 0\) and a sufficiently large iteration number \(t\), the probability \(p^{*}(t)\) of finding the optimal solution at least once in the first \(t\) iterations satisfies \(p^{*}(t)\ge 1-\varepsilon \), and

$$\begin{aligned} \mathop {\lim }\limits _{t\rightarrow \infty } p^{{*}}(t)=1 \end{aligned}$$
(14)

To prove the convergence, we must first ensure that the optimal solution can be reached within a finite number of iterations, which is guaranteed by the corresponding theorem in [41]. Therefore, we only need to demonstrate that, in subsequent iterations, there is more pheromone on the optimal path than on the non-optimal paths. The formal description is as follows:

We must show that \(\tau _{ij}(t)> \tau _{kl}(t)\) for all \((i,j)\in s^{*}\) and \((k,l)\notin s^{*}\), and that

$$\begin{aligned} \mathop {\lim }\limits _{t\rightarrow \infty } \tau _{kl} (t)=\tau _{\min } \end{aligned}$$
(15)

where \(s^{*}\) is the optimal solution, \(\tau _{ij}(t)\) is the pheromone between \(d_{i}\) and \(dc_{j}\), and \(\tau _{kl}(t)\) is the pheromone between \(d_{k}\) and \(dc_{l}\) in the \(t\)-th iteration. In (15), \(\tau _{min}\) is the minimum pheromone.

The following is the proof:

In the worst-case scenario, we assume that \((i,j)\in s^{{*}}\) with \(\tau _{ij} (t^{{*}})=\tau _{\min }\) and \((k,l)\notin s^{*}\) with \(\tau _{kl} (t^{{*}})=\tau _{\max }\). Starting from the \(t^{{*}}\)-th generation, the pheromone of \((k,l)\) becomes:

$$\begin{aligned}&\tau _{kl} (t^{{*}}+1)=\max (\tau _{\min } ,(1-\rho ){*}\tau _{\max } )\\&\tau _{kl} (t^{{*}}+2)=\max (\tau _{\min } ,(1-\rho )^{2}{*}\tau _{\max })\\&{\ldots }{\ldots }\\&\tau _{kl} (t^{{*}}+t^{{\prime }})=\max (\tau _{\min } ,(1-\rho )^{t^{{\prime }}}{*}\tau _{\max } )\\&\mathop {\lim }\limits _{t\rightarrow \infty } \tau _{kl} (t^{{*}}+{t}')=\max \left\{ {\;\mathop {\lim }\limits _{t\rightarrow \infty } \tau _{\min } ,\mathop {\lim }\limits _{t\rightarrow \infty } [(1-\rho )^{{t}'}{*}\tau _{\max } ]} \right\} =\tau _{\min } \end{aligned}$$

\(\square \)

Hence, in the subsequent generation, there is more pheromone on the optimal path than any other non-optimal path, and the pheromone on the non-optimal path gradually decreases. Therefore, an approximate optimal solution can be obtained using the ACO-based strategy.

6 Experimental results and analysis

6.1 Simulation parameters

In this paper, randomly generated scientific workflows are used as the input of the simulation. The size of a single dataset is randomly distributed in the interval [80, 300] GB, there are 40–120 datasets, the proportion of fixed-location datasets among the input datasets ranges from 0.1 to 0.5, and there are 10–30 data centres. The storage capacity of a data centre is set according to the sizes of all datasets. To simulate the heterogeneity of the data centres' resources, we find an average storage capacity and let the variation be randomly distributed between 10 and 30 % of the total capacity.
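For illustration, the randomised simulation inputs could be generated as in the following sketch; only the parameter ranges come from the description above, while the uniform distributions, the capacity variation applied around the average, the randomly generated security capabilities and the fixed weights are assumptions:

```python
import random

def generate_setup(n_datasets=80, n_centres=20, fixed_ratio=0.3):
    """Generate random datasets and data centres within the stated parameter ranges."""
    sizes = [random.uniform(80, 300) for _ in range(n_datasets)]       # GB
    n_fixed = int(fixed_ratio * n_datasets)                            # fixed-location data
    avg_capacity = sum(sizes) / n_centres
    centres = [{"capacity": avg_capacity * (1 + random.choice([-1, 1])
                                            * random.uniform(0.1, 0.3)),
                "p": tuple(round(random.random(), 2) for _ in range(3))}
               for _ in range(n_centres)]
    datasets = [{"size": s,
                 "fixed": i < n_fixed,
                 "s": tuple(round(random.random(), 2) for _ in range(3)),
                 "w": (0.2, 0.5, 0.3)}                                 # assumed weights
                for i, s in enumerate(sizes)]
    return datasets, centres
```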

In the experiments, we evaluate the proposed strategy by varying the proportion of fixed-location datasets, the number of data centres, the number of datasets and the maximum usage of the datasets. We then assess the experimental results in terms of the SDDSW, the total data transfer time and the number of data movements to verify the validity of the SAI strategy.

6.2 Experimental results

Based on the above four inputs, we analyse and compare the experimental results primarily using three indicators: the SDDSW, the total data transfer time, and the data movements. In the analysis, SAI represents the security-aware intermediate data placement strategy. TS (Three Stages) represents the data placement strategy that addresses three problems in three stages: reducing the time cost of data movements across the data centres, handling the data dependencies, and maintaining a relative load balance of the data centres [35]. BR (Build-time stage and Run-time stage) denotes the data placement strategy that contains two algorithms: one groups the existing datasets into \(k\) data centres during the build-time stage, and the other dynamically clusters newly generated datasets to the most appropriate data centres during the runtime stage based on their dependencies [16].

6.3 Effect of the proportion of fixed datasets

As shown in Fig. 3a, the SDDSW of SAI is much lower than those of the TS and BR placement strategies. However, the SDDSW is not a fixed value because the proportion of fixed datasets varies, and TS and BR sometimes fluctuate in terms of the SDDSW; different proportions of fixed datasets may affect the performance of our strategy differently. Thus, our strategy guarantees better data security than the other two strategies when the proportion of fixed datasets changes. In Fig. 3b, the SAI strategy incurred 15,666.33 Sim. Units (a quantitative unit of time in CloudSim [42]) of transmission time on average across the different settings. Compared with BR, our strategy reduces the transmission time by 5.6 %. TS is superior to BR in transmission time, mainly because the TS strategy takes minimising the data transfer time as its optimisation objective. In Fig. 3c, when the proportion of fixed datasets increases, more datasets cannot be moved, so there are generally more movements of the flexible datasets. As shown in Fig. 3c, BR has fewer data movements than SAI and TS in most cases because the main purpose of BR is to reduce the data movements. Overall, our strategy outperforms the other two strategies in data security when the proportion of fixed datasets changes, while keeping a comparable data transfer time.

Fig. 3 Effect of the proportion of fixed datasets. a SDDSW, b transmission time, c data movements

6.4 Effect of the number of data centres

Figure 4 shows the performance of the three strategies for different numbers of data centres in terms of the SDDSW, the total transfer time and the data movements. In Fig. 4a, there is no steady variation trend, and SAI has a lower SDDSW than the TS and BR strategies, mainly because the number of data centres that can provide the required security services increases when the number of data centres grows. The security services are randomly generated, so the SDDSW does not show a particular trend. In general, SAI is better than BR and TS in terms of the SDDSW. In Fig. 4b, the SAI strategy reduces the transfer time by 3.8 % compared with the TS strategy, whereas the transfer time of the BR strategy is lower than that of the SAI strategy when the number of data centres is 10. Because SAI takes security-aware placement as its main objective, some data are placed across links with narrower bandwidth, which increases the transmission time. Overall, SAI is superior to BR and TS in terms of the total transfer time; only when the number of data centres is 10 is SAI slightly worse than BR. In Fig. 4c, SAI shows no advantage over BR regarding the data movements because SAI and TS aim to reduce the data transfer time rather than the number of data movements. Thus, our strategy is better than the other two strategies in terms of both the security and the performance of scientific workflows.

Fig. 4 Effect of the number of data centres. a SDDSW, b transmission time, c data movements

6.5 Effect of the number of datasets

Figure 5 presents the statistical results of the SDDSW, the total transmission time and the data movements for the three data placement strategies with different numbers of datasets. Regarding the SDDSW, SAI has obvious advantages, particularly when there are 120 datasets; in that case, the SAI strategy can better guarantee the security of scientific workflows than the other strategies. As shown in Fig. 5b, when the number of datasets increases, the transmission time correspondingly increases and fluctuates more; more datasets result in more data transfer time. Our strategy is better than TS when the number of datasets is 100 and 120. In Fig. 5c, BR has fewer data movements than TS and SAI when there are 40 datasets. With more datasets, SAI gradually produces fewer data movements than TS and BR, particularly when the number of datasets is 120. Thus, our strategy is more suitable for data-intensive applications.

Fig. 5 Effect of the number of datasets. a SDDSW, b transmission time, c data movements

6.6 Effect of the maximum data usage

From Fig. 6a, we observe that the SDDSW changes with the maximum usage of the datasets. As the maximum data usage increases, the SDDSW grows correspondingly because the probability of transmitting data across data centres increases. The results of our strategy are clearly lower than those of the other two strategies, particularly when the maximum data usage is 2, in which case the scientific workflow is not too complex. Thus, our strategy is more effective at deploying data securely. Similarly, the data transfer time increases with the maximum data usage, as shown in Fig. 6b. When the scientific workflow becomes more complex, more data must be transmitted among the data centres; data transfers therefore occur more frequently, which results in more transmission time. Our strategy produces less transmission time than the other two strategies in some cases. However, SAI cannot effectively reduce the data movements in Fig. 6c because our strategy mainly aims to optimise the data security.

Fig. 6 Effect of the maximum usage of datasets. a SDDSW, b transmission time, c data movements

Therefore, we can conclude that our strategy improves the data security of scientific workflows in the cloud while guaranteeing the data transfer time. Compared with BR and TS, SAI better ensures data security, and its data transfer time remains guaranteed. In some cases, SAI incurs more data transfer time than BR, mainly because SAI must satisfy the security service requirements when placing the data.

7 Conclusions

In this paper, we proposed a data placement strategy that automatically allocates datasets to data centres to improve data security. The strategy is built on a security model for scientific workflow systems, analysed from the perspectives of service providers, service consumers and service evaluation, which quantitatively measures the security services provided by the data centres. We then utilised an ACO-based algorithm to dynamically select the appropriate data centres for the intermediate data, improving data security while considering the data transfer time. This strategy is effective at improving data security, and the data transfer time is guaranteed during the execution of the scientific workflows.

Currently, there are only a few studies on data placement for scientific workflows. Furthermore, these studies mainly focus on reducing the data transmission across the data centres and rarely consider the security requirements, cost, and other factors. Cloud computing provides a pay-as-you-go computing model in which users can access applications and data from anywhere in the world and pay for what they use. Therefore, in the foreseeable future, we will exploit more efficient tactics to reduce the cost.