1 Introduction

Cloud computing is a powerful solution that delivers services to customers over the Internet. It provides on-demand computing services and a pool of resources on a pay-per-use basis. In addition, it offers broad network access and rapid elasticity. Cloud providers can deploy their services over public, private, community, or hybrid environments. In the case of the public cloud, the infrastructure is located on the premises of the cloud provider and is open for use by the general public. Private and community cloud infrastructures are intended for use by a single organization or by multiple organizations respectively, and they may be located either on or off premises. The hybrid cloud combines aspects of both the public and the private cloud. The services delivered over a cloud platform are Infrastructure as a Service, Platform as a Service, or Software as a Service [17].

These characteristics make cloud technology beneficial for various scientific fields, such as biology, astronomy, and chemistry, where it allows a set of computational tasks, known as a scientific workflow, to be executed with high performance and minimal cost and time. In fact, a scientific workflow is a model that makes it easier for scientists to conceptualize and manage the scientific analysis process in terms of both tasks and data [5]. It is generally modeled as a Directed Acyclic Graph (DAG) that arranges a set of computational tasks (nodes) exposing data dependencies between them (edges) in order to solve a scientific problem [9].

Cloud computing offers exceptional opportunities for complex scientific workflows in terms of cost, performance, and reliability. However, these opportunities bring various challenges. For instance, it is critical to achieve high performance and reliability with an optimal use of resources and minimal cost and execution time. In the literature, numerous studies have investigated this field by selecting different Quality of Service (QoS) parameters according to their targeted cloud platform and workflow system, trying to achieve the best performance, cost, and deadline under different constraints. A closer look at the literature reveals that most researchers have overlooked the data security problem in cloud platforms. In fact, this problem is still insufficiently explored, although it is very important for customers to use secure platforms, especially for their sensitive scientific workflows. To that end, we propose in this paper a novel workflow scheduling system for hybrid cloud environments, which consists of an economical distribution of tasks between the various cloud service providers in order to provide customers with shorter execution times and highly secure services within their budget.

Our proposed scheduling system is composed of three modules:

  • Pre-Scheduler, wherein each task or dataset is assigned to be executed or stored either in the private or in the public cloud.

  • Security Enhancement Module, which adds the security services required by each dataset while minimizing the cost overhead generated by these services.

  • Post-Scheduler, where each task or dataset is assigned to be executed or stored in the suitable Virtual Machine (VM) while meeting the budget and deadline constraints.

The performance of our proposed system is evaluated using the Cloud Workflow Simulator (CWS) [14], an extension of the CloudSim simulator [3]. The results show that our proposed strategy slightly increases the deadline but does not affect the cost.

The rest of this paper is organized as follows. Sect. 2 presents the related work; Sect. 3 describes the system model and our main assumptions; Sect. 4 explains the proposed system; Sect. 5 details the performance evaluation and the extensive simulations, highlighting the main results; and finally, Sect. 6 concludes the paper.

2 Related Work

Nowadays, cloud computing has become a popular solution for many information system issues. In the area of scientific workflow systems, many studies have demonstrated the important benefits offered by the cloud in terms of both performance and cost [8, 9].

Thus, with the emergence of this new computing paradigm, many approaches have been developed to address the problem of workflow scheduling on cloud platforms. This challenging problem aims to find an appropriate orchestration of workflow tasks onto the resource pool that satisfies the QoS requirements. It is a multi-objective problem because the amount of allocated resources affects the execution time, cost, and performance, and it becomes more complex when multiple QoS constraints are considered [6, 15, 22].

Many studies have tackled the latter problem by considering several aspects under different constraints. For example, the authors in [13] proposed an approach that aims to reduce the cost of the workflow by optimizing the use of virtual resources and network bandwidth. In [10], the authors focused on finding an optimized solution that achieves a better cost-makespan trade-off while maximizing the reliability of executing workflows under user-specified deadline and budget constraints. In [24], the authors proposed a workflow scheduling approach that minimizes cost and makespan at the same time under a deadline constraint.

In this area, some systematic reviews analyze and describe a large variety of workflow scheduling approaches and classify them according to different aspects. For example, in [23] the classification was made with regard to the scheduling process, tasks, and resources. In [15], the authors classified the existing approaches based on the type of algorithm utilized, its objectives, and its properties. Kaur et al. [11] categorized them into heuristic, meta-heuristic, and hybrid schemes, while in [16] the authors classified the scheduling schemes according to cloud environments and described their architecture, key features, and advantages.

In fact, most workflow scheduling studies have focused on deadline, budget, and other constraints while ignoring the security requirements. Yet the majority of companies have private data, and they risk irreparable consequences and the loss of their clients' trust if these data are disclosed. Therefore, in the case of applications that handle sensitive data, the main concern is how to provide a secure scheduling strategy [7, 11, 16].

In the literature, only a few works have dealt with the secured workflow scheduling problem. Among them, we can cite that of Zeng et al. [27], who presented a security-aware and budget-aware workflow scheduling strategy in order to provide customers with shorter makespans while meeting their security requirements. Li et al. [12] proposed a security- and cost-aware scheduling algorithm for heterogeneous tasks of scientific workflows that minimizes the total workflow execution cost while meeting deadline and risk-rate constraints; it is based on the particle swarm optimization meta-heuristic. Shishido et al. [20] examined the effect of both Particle Swarm Optimization (PSO) and Genetic Algorithms (GA) on attempts to optimize workflow scheduling. A hybrid cloud optimization approach, which combines the firefly and bat algorithms, was proposed by Arunarani et al. [2].

Chen et al. [4] investigated the problem of scheduling workflows with security-sensitive intermediate data, with the objective of minimizing both the makespans and the monetary execution costs. A cost- and energy-aware data placement method for privacy-aware big-data applications in the hybrid cloud was proposed by Xu et al. [26]. Wen et al. [25] modeled the problem of scheduling workflows with data privacy protection constraints while minimizing both execution time and monetary cost. In [1], the authors formulated a model for task scheduling and proposed a heuristic algorithm based on task completion time and security requirements, in which task interactions are considered a security threat. Hammed and Arunkumar [19] proposed a scheduling method that enables users to incur less execution time and cost, based on a multi-populated genetic algorithm with a secured framework for sensitive data.

Table 1 presents a brief comparison between the mentioned works that treat the workflow scheduling problem under security constraints. The comparison is made according to the objective function, the targeted constraints, and the cloud environment model (single-cloud or multi-cloud). We observe that none of these studies addressed the budget, deadline, and security constraints together in multi-cloud environments. For that reason, we propose in this paper a security-, deadline-, and budget-aware workflow scheduling strategy for the hybrid cloud environment.

Table 1. Secured workflow scheduling works.

3 System Model and Assumptions

In this section, we extend the system model proposed by Makhlouf and Yagoubi [13], relying on the same application model and data transfer model, and we highlight our main assumptions.

A scientific workflow application consists of a set of computational tasks that expose dependencies between them in order to solve a scientific problem. These dependencies are mainly data, where the output of a task can be the input of other tasks. Scientific workflows are generally modeled as a Directed Acyclic Graph DAG(V,E), where each vertex (V) represents a task and the edges (E) represent the task dependencies, which indicate the precedence constraints. For example, edge e(ij) indicates that \( task_{j} \) should start its execution only after \( task_{i} \) finishes; in this case, \( task_{i} \) is called the predecessor of \( task_{j} \) and \( task_{j} \) the successor of \( task_{i} \). A task can have one or more predecessors, and it cannot start execution until all of its predecessors have finished and all its input files are available. The weight of edge e(ij) indicates the amount of data transferred from \( task_{i} \) to \( task_{j} \).

In order to protect sensitive data, we assume that each dataset is accompanied by a security level which reflects its degree of sensitivity. For example, \(SL_{i}\) represents the security level of dataset \(D_{i}\), where \(SL_{i}\) ranges from the lowest security level 0 to the highest one 5. A dataset can be manipulated by one or more tasks, and a task can manipulate one or more datasets.
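To make the model concrete, the following is a minimal Python sketch of the workflow structures described above; the class names and fields are our own illustration, not part of the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    name: str
    size_mb: float
    sl: int  # security level, from 0 (lowest) to 5 (highest)

@dataclass
class Task:
    name: str
    predecessors: List["Task"] = field(default_factory=list)

def can_start(task: Task, finished: set) -> bool:
    """A task starts only when all of its predecessors have finished."""
    return all(p.name in finished for p in task.predecessors)

# Edge e(1,2): task T2 consumes the output of T1; the edge weight is the
# amount of data transferred (here, the size of the hypothetical dataset D1).
d1 = Dataset("D1", size_mb=120.0, sl=4)
t1 = Task("T1")
t2 = Task("T2", predecessors=[t1])
edge_weight = {("T1", "T2"): d1.size_mb}

print(can_start(t2, finished=set()))   # False: T1 has not finished yet
print(can_start(t2, finished={"T1"}))  # True: all predecessors done
```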

We target the hybrid cloud system as an execution environment, and we assume that the client owns a private cloud and needs to allocate more resources in the public cloud to run his scientific workflow application. A cloud offers an unbounded set of VMs that can be rapidly provisioned and released; it is characterized by on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service, according to the NIST definition. We suppose that a VM can run one task at a time, and our pricing model is rounded to \(1\,\$/h\) in the public cloud. We aim to secure the sensitive data without exceeding the budget and deadline specified by users.

4 Proposed System

To address the security-, deadline-, and budget-aware workflow scheduling issue in the hybrid cloud environment, we propose a three-module system, as shown in Fig. 2. The first module is the Pre-Scheduler, wherein each task or dataset is assigned to be executed or stored either in the private or in the public cloud. The second module is the Security Enhancement Module, which adds the security services required by each dataset while minimizing the cost overhead generated by these services. The third module is the Post-Scheduler, where each task or dataset is assigned to be executed or stored in the suitable VM while meeting the budget and deadline constraints. Figure 1 illustrates an example of the whole proposed system, where datasets {D1, D2, D3} and tasks {T1, T2, T4, T6} are assigned to the private cloud and the rest to the public cloud.

Fig. 1. System model

Fig. 2. Principal system components

4.1 Pre-Scheduler

Since the private cloud is lower in cost compared to the public cloud, in this module we start by assigning the sensitive data (from the highest security level to the lowest one) and the tasks that manipulate them to the private cloud, as long as the required resources are available.

In order to reduce the data transfer cost, we assign to the private cloud all the datasets manipulated by tasks located in the private cloud.

At the end of this phase, only the datasets and tasks whose required resources are unavailable in the private cloud remain, so we assign them to be stored or executed in the public cloud. In short, this module takes as input the workflow and the list of available resources in the private cloud, and it produces an execution plan as output. The execution plan refers to the list of tasks which must be executed in the private cloud and those which must be executed in the public one, and similarly for the datasets. Algorithm 1 details the functioning of our scheduler, and Table 2 lists the symbols used and their descriptions. A rough sketch of this phase is given below.
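The sketch below illustrates, under our reading of the three steps above, how such a pre-scheduling pass could look; the data shapes and the single-figure capacity model are simplifying assumptions, and Algorithm 1 remains the authoritative description.

```python
def pre_schedule(datasets, private_capacity):
    """Illustrative pre-scheduling pass. `datasets` is a list of dicts
    {"name": str, "size": float, "sl": int, "tasks": list of task names};
    private-cloud resources are abstracted to a single capacity figure."""
    plan = {"private": {"tasks": set(), "datasets": set()},
            "public": {"tasks": set(), "datasets": set()}}
    # Step 1: sensitive datasets (highest SL first) and their tasks go to
    # the private cloud while resources remain.
    for d in sorted(datasets, key=lambda d: d["sl"], reverse=True):
        if d["sl"] > 0 and d["size"] <= private_capacity:
            plan["private"]["datasets"].add(d["name"])
            plan["private"]["tasks"].update(d["tasks"])
            private_capacity -= d["size"]
    # Step 2: co-locate datasets used by private tasks to cut transfer cost.
    for d in datasets:
        if (d["name"] not in plan["private"]["datasets"]
                and set(d["tasks"]) & plan["private"]["tasks"]
                and d["size"] <= private_capacity):
            plan["private"]["datasets"].add(d["name"])
            private_capacity -= d["size"]
    # Step 3: whatever could not be placed privately goes to the public cloud.
    for d in datasets:
        if d["name"] not in plan["private"]["datasets"]:
            plan["public"]["datasets"].add(d["name"])
        plan["public"]["tasks"].update(
            t for t in d["tasks"] if t not in plan["private"]["tasks"])
    return plan
```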

4.2 Security Enhancement Module

Considering that the private cloud is secure and that threats come from the public one, in this module we aim to secure the sensitive data assigned to be stored in the public cloud, their dependent tasks, and the data flowing between the public and private clouds. To this end, we propose adding certain security services and cryptographic functions. Since these services add a significant overhead to the total cost and may increase the execution time, we choose them based on the data security level and on the budget and deadline specified by the user.

This module takes as input the execution plan generated by the Pre-Scheduler and specifies the suitable encryption/decryption algorithms for each sensitive dataset that is either located in the public cloud or transferred from the private cloud to the public one, or vice versa. It also calculates the total overhead generated by these algorithms in order to preserve the user constraints (budget and deadline).
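As an illustration, a selection routine might map each security level to a cipher and charge per-megabyte overheads against the remaining budget and deadline. The cipher table and rates below are placeholders of our own; the paper does not specify which algorithms are used.

```python
# Hypothetical SL -> (cipher, $ cost per MB, seconds per MB) table;
# the entries are illustrative placeholders, not taken from the paper.
CIPHERS = {
    1: ("RC4",      0.0005, 0.001),
    2: ("Blowfish", 0.0010, 0.002),
    3: ("AES-128",  0.0020, 0.004),
    4: ("AES-192",  0.0030, 0.006),
    5: ("AES-256",  0.0040, 0.008),
}

def select_security_services(public_datasets, budget_left, deadline_left):
    """Pick an encryption service per sensitive public dataset while the
    accumulated overhead stays within the user's budget and deadline."""
    chosen, cost, time = {}, 0.0, 0.0
    for d in public_datasets:
        if d["sl"] == 0:
            continue  # non-sensitive data need no encryption
        cipher, cost_mb, time_mb = CIPHERS[d["sl"]]
        c, t = cost_mb * d["size"], time_mb * d["size"]
        if cost + c > budget_left or time + t > deadline_left:
            raise RuntimeError("security overhead violates budget/deadline")
        chosen[d["name"]] = cipher
        cost += c
        time += t
    return chosen, cost, time
```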

Algorithm 1. Pre-Scheduling algorithm
Table 2. Descriptions of symbols used in the Pre-Scheduling algorithm

4.3 Post-Scheduler

In this phase, we aim to schedule tasks inside each cloud platform (private cloud, public cloud). For that, we considered two schedulers (a private scheduler and a public scheduler), each relying on the scheduler proposed by Makhlouf and Yagoubi [13], given its good performance. In addition, we extended them to take into consideration the cost overhead produced by the Security Enhancement Module.

The Private Scheduler schedules the tasks assigned to the private cloud; it works based on the principle proposed in [13], while for the total execution cost it considers the additional cost generated by:

  1. Encryption algorithms for tasks that send sensitive data to the public cloud;

  2. Decryption algorithms for tasks that receive sensitive data from the public cloud.

The Public Scheduler schedules the tasks assigned to the public cloud; it is based on the scheduler proposed in [13], while for the total execution cost it considers the overhead generated by the following (a cost-accounting sketch is given after the list):

  1. Encryption/decryption algorithms for tasks that manipulate data stored in the public cloud;

  2. Encryption algorithms for tasks that send sensitive data to the private cloud;

  3. Decryption algorithms for tasks that receive sensitive data from the private cloud.
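A minimal sketch of this cost accounting, assuming per-megabyte encryption/decryption rates and hypothetical task fields, could look as follows; the base cost computation itself comes from [13].

```python
def total_public_cost(base_cost, tasks, enc_rate=0.002, dec_rate=0.001):
    """base_cost is the execution cost computed as in [13]; the loop adds
    the three security overheads enumerated above (assumed field names
    and illustrative per-MB rates)."""
    overhead = 0.0
    for t in tasks:
        # 1. encrypt/decrypt data the task manipulates in the public cloud
        overhead += (enc_rate + dec_rate) * t["public_data_mb"]
        # 2. encrypt sensitive data sent to the private cloud
        overhead += enc_rate * t["to_private_mb"]
        # 3. decrypt sensitive data received from the private cloud
        overhead += dec_rate * t["from_private_mb"]
    return base_cost + overhead
```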

5 Performance Evaluation and Results

In order to evaluate our scheduling model, we implemented it by extending the work in [13]. To do this, we assumed a security level (SL) for each dataset in the synthetic workflows [21, 28]. Since synthetic workflows do not support security levels, we generated the security level of each dataset following a normal distribution:

$$\begin{aligned} SL \leftarrow \sqrt{\frac{\sum _{i=1}^{n}\left( x_{i}-x_{m}\right) ^{2}}{n}} \end{aligned}$$

where \(x_{i}\) is the size of dataset i and \(x_{m}\) is the average of the n data sizes. In statistics, the normal distribution is the most important probability distribution. It is used to model unbiased uncertainties, random errors of additive type, and symmetric distributions of processes and natural phenomena [18].
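Concretely, the formula computes the standard deviation of the data sizes; a small worked example (hypothetical sizes in MB) is shown below. Note that mapping the resulting value onto the 0-5 SL scale of Sect. 3 would require an additional normalization step, which the paper does not detail and we leave out here.

```python
import math

def security_level(sizes):
    """Standard deviation of the data sizes, as in the formula above."""
    n = len(sizes)
    x_m = sum(sizes) / n  # mean of the n data sizes
    return math.sqrt(sum((x - x_m) ** 2 for x in sizes) / n)

print(security_level([10.0, 20.0, 30.0, 40.0]))  # ~11.18
```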

The selected applications include LIGO (Laser Interferometer Gravitational-Wave Observatory), a data-intensive application, in Fig. 3a, and MONTAGE, an I/O-bound workflow, in Fig. 3b.

Fig. 3. Scientific workflow structures

As in [13], we simulated workflows whose size does not exceed 200 tasks. We fixed the budget to 20 virtual machines and measured the impact of the security level on the workflow cost and deadline. We compared our approach (SLp) with the standard approach (SDp) of [13], which does not support security levels.

5.1 Impact of the Security Level on the Cost

Fig. 4. Impact of the security level on the cost

In Figs. 4a and 4b, we simulated the execution of the MONTAGE and LIGO workflows respectively and measured the costs in number of VMs. We note that, regardless of the size of the workflow, our policy gives the same results as the standard policy. These results can be explained by the fact that, in our cloud resource model, a virtual machine can execute only one task at a time and is charged $1 for each 60-minute (one-hour) interval of operation, with partial usage of a billing interval rounded up to a full interval. The proof of this assumption is shown in the next simulation.
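The rounding effect can be checked with a line of arithmetic: under hourly billing, a task that runs 50 min and the same task with, say, 8 min of added cryptographic overhead are both charged one full interval. The minute figures below are hypothetical.

```python
import math

INTERVAL_MIN = 60  # billing interval: $1 per started hour (Sect. 3)

def vm_charge(runtime_min):
    """Partial usage of a billing interval is rounded up to a full interval."""
    return math.ceil(runtime_min / INTERVAL_MIN)  # dollars

print(vm_charge(50), vm_charge(50 + 8))  # 1 1: same cost, longer runtime
```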

5.2 Impact of the Security Level on the Deadline

Fig. 5. Impact of the security level on the deadline

In Figs. 5a and 5b, we simulated the execution of the MONTAGE and LIGO workflows respectively and measured the deadline. We note that, regardless of the size of the workflow, our policy gives worse results than the standard policy, which reduces the deadline whatever the size of the workflow. This result is due to the overhead of the data encryption/decryption operations performed when data are transferred to the public cloud. This overhead increases the deadline by consuming the entirety of the last 60-minute interval; this consumption does not affect the cost but increases the deadline.

6 Conclusion

This paper introduced a novel workflow scheduling strategy for hybrid cloud environments, which consists of an economical distribution of tasks between various cloud service providers in order to provide customers with shorter execution times, lower cost, and highly secure services within a limited budget and deadline. The results show that our strategy increases the deadline but does not affect the cost. We have shown that the deadline increase is due to the encryption/decryption operations. This increase does not affect the cost because our strategy exploits the virtual machines to the maximum, especially their last one-hour billing slot.

Future studies in this field could add security services at both the data and task levels at the same time, devise a more economical cryptographic strategy to ensure the security requirements, and design a scheduling strategy that takes into consideration more parameters and constraints, such as energy.