1 Introduction

Cloud computing has received increasing attention in both research and business for about a decade because it provides a great deal of benefits, e.g., elastic resource provisioning, pay-per-use, economies of scale, high reliability, and dynamic customization [175]. Although resources appear infinite to users, a real cloud has limited resources. Thus, a cloud should have enough resources to satisfy the peak demand of its users' requests, so that all Quality of Service (QoS) requirements are met and its reputation is preserved [45].

When a local cloud has insufficient resources to meet the peak demand of its users, three methods can be exploited. The first is to reject some unimportant requests, e.g., cheap ones, to make room for important requests whose rejection would cost much more [106]. However, this method reduces the cloud provider's reputation [45] and may further result in the loss of potential users. The second is for the cloud provider to expand its infrastructure to cover the peak demand. However, in real production environments, the peak resource demand of users is usually much higher than the average demand but transient [14, 99], which leaves plenty of resources idle most of the time under this method. Besides, most small to medium enterprises have insufficient capital for such infrastructure investments. The third method, the hybrid cloud (a.k.a. cloud bursting), is a cost-efficient way to address the problem: it elastically scales the cloud capacity up or down based on demand by combining local infrastructures (clusters, grids, or private clouds) with one or more public clouds. According to a survey by the European Network and Information Security Agency (ENISA), most small to medium enterprises prefer a mixture of cloud computing models (public cloud, private cloud) [26]. Nowadays, both commercial and open-source virtualization tools support basic cloud bursting functionalities, e.g., VMware [182], OpenNebula [150], and OpenStack [151]. As a new computing paradigm, hybrid cloud computing plays a crucial role not only in the development of cloud computing, but also in the integration of cloud computing and the Internet of Things (IoT) [25], like other emerging paradigms such as edge computing [165], fog computing [51], and mobile cloud computing [62].

To provide services, cloud providers must solve the problems of provisioning the optimal amount of resources based on demand, i.e., resource provisioning, and of mapping users' workloads to the available resources efficiently, i.e., workload scheduling [130]. Workload scheduling decides the order (priority) in which workloads are executed on available resources, for various scheduling units, e.g., virtual machines (VMs), tasks, or user requests, while resource provisioning decides which resources, and in what amounts, should be allocated to the scheduled workloads. To provide services optimally, the cloud provider should schedule workloads based on the characteristics of the available resources, e.g., heterogeneity [55, 89] and reliability [18], and provision resources considering the features of the workloads to be run, such as various QoS requirements [72], interdependences between service components [100], interference between workloads [195], etc. Thus, workload scheduling and resource provisioning are intimately related, and both are essential to cloud management.

However, for a service provider, hybrid cloud resource management not only has all of the challenges of provisioning services on a private cloud, e.g., server consolidation [183], and on public clouds, e.g., elasticity management [10], but also introduces new ones, such as the heterogeneity between clouds in terms of the various resources [131, 187], the decision of which services, or which parts of a service, should be outsourced to the public cloud [100, 145], the performance overhead caused by the much lower bandwidth of the network connection between clouds [73, 131], and so on.

In this paper, we survey articles on workload scheduling and resource provisioning for hybrid clouds published in the last 10 years. We first present a comprehensive taxonomy of workload scheduling and resource provisioning in hybrid cloud environments and use it to investigate these 146 related research works in detail. Then, based on this detailed investigation, we discuss the challenges that have not yet been addressed and suggest several promising directions for future research. To the best of our knowledge, no previous work has extensively and thoroughly reviewed workload scheduling and resource provisioning in hybrid clouds. We believe our review is helpful to academia and industry concerned with hybrid clouds.

The rest of this paper is organized as follows. Section 2 presents the hybrid cloud architecture, which is helpful for understanding the remainder of the paper, the survey works related to resource provisioning and workload scheduling in hybrid clouds, and the method used to collect the reviewed works. Section 3 introduces in detail the comprehensive taxonomy of workload scheduling and resource provisioning in hybrid cloud management and investigates related works in depth. Section 4 summarizes the challenges and opportunities for future work. Finally, Sect. 5 concludes the paper.

2 Background

In this section, we first provide a brief overview of the hybrid cloud architecture, which is helpful for understanding the remainder of the paper. Then, we present previous surveys of hybrid cloud resource management and the search method for the related literature reviewed in this paper.

2.1 Hybrid cloud architecture

In a hybrid cloud environment, as shown in Fig. 1, users request and pay for services with various QoS requirements from a service provider that owns local resources and can rent public resources, including computing, storage, and network resources. The service provider provisions its local resources, on which it schedules user requests, and rents resources from public clouds when the local resources are not enough to satisfy all QoS requirements.

Fig. 1 The hybrid cloud architecture

The local resources can be provisioned either as physical resources, e.g., clusters and grids, or as virtualized resources managed with virtualization tools, e.g., Xen [16], KVM [103], and Docker [64]. Virtualization introduces many benefits, e.g., better isolation and manageability of resources, but it can incur significant performance overheads [120, 185]. Public computing resources are usually provisioned in the form of VMs, e.g., by Amazon EC2 [2], Alibaba Cloud [1], etc. Local and public resources differ in various characteristics, which are explained in detail in Sect. 3.3, and resource provisioning policies should be designed with their respective peculiarities in mind. In a hybrid cloud, the public resources may be provisioned by more than one public cloud to avoid vendor lock-in. A multi-cloud service composition broker may then be introduced to spare the service provider the complexity of choosing the best-fit public cloud [12]. However, using a broker may cost the service provider some optimization opportunities, e.g., reducing the communication cost between public clouds, because the execution of workloads on the public clouds is then transparent to the provider.
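To make the bursting decision described above concrete, the following is a minimal Python sketch of a provider that fills its local capacity first and outsources the overflow to a public cloud. The request model, the single-dimension (CPU core) capacity, and the security flag are illustrative assumptions, not the policy of any particular surveyed work; real policies also weigh cost, data transfer, and QoS constraints, which are the subject of the taxonomy in Sect. 3.

from dataclasses import dataclass

@dataclass
class Request:
    cpu_cores: int           # cores required by the request
    sensitive: bool = False  # True -> must stay on the private cloud (see Sect. 3.1.5)

def place(requests, local_free_cores):
    """Serve requests from the private cloud while capacity lasts and burst the
    rest to a public cloud; sensitive requests are never outsourced."""
    placement = []
    for r in requests:
        if r.cpu_cores <= local_free_cores:
            placement.append((r, "private"))
            local_free_cores -= r.cpu_cores
        elif not r.sensitive:
            placement.append((r, "public"))    # rent public capacity on demand
        else:
            placement.append((r, "rejected"))  # no local room and cannot leave the private cloud
    return placement

if __name__ == "__main__":
    reqs = [Request(4), Request(8, sensitive=True), Request(16), Request(2)]
    for req, target in place(reqs, local_free_cores=12):
        print(f"{req.cpu_cores:2d} cores -> {target}")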

In this article, we focus on hybrid cloud resource management, i.e., workload scheduling and resource provisioning on hybrid clouds.

2.2 Related survey work

Although there have been plenty of studies surveying resource management on clouds, only a few concern the hybrid cloud.

Bittencourt et al. [22] compared the performance of seven scheduling algorithms on hybrid clouds, only two of which were specially designed for hybrid clouds. They also assessed the impact of communication links on schedules, concluding that increasing the bandwidth reduces both cost and makespan, and that HCOC [21, 23], a hybrid cloud scheduler, outperforms MDP [196], which was designed for utility computing.

Fadel and Fayoumi [68] surveyed 19 works, published over 5 years ago, tackling the issue of cloud bursting, whose main challenges are choosing the best workload to burst and the best resource to provision.

Chopra and Singh [42] investigated only six task scheduling methods for hybrid clouds, considering only three aspects: the optimization criteria, multi-core processor awareness, and the number of workflows supported. The optimization criteria involved only the cost of renting public computational resources and the workflows' finish times.

Manikandan and Suguna [134] reviewed 10 papers about resource provisioning for clouds, only one of which [217] focused on hybrid clouds.

Bhosale and Bandari [19] reviewed three works [27, 173, 191] on the Aneka cloud platform, developed by the CLOUDS Laboratory at the University of Melbourne, which provisions resources in hybrid clouds for multiple non-interactive workloads involving large numbers of files.

de Assunção et al. [53] investigated whether a local cluster can benefit from using clouds to improve the performance of its requests, by evaluating seven scheduling strategies based on conservative [143], aggressive [121], and selective [172] backfilling, in terms of various performance metrics, e.g., the average weighted response time, job slowdown, number of deadline violations, number of rejected jobs, and the cost of using clouds.

In this article, we review 146 research works studying workload scheduling and/or resource provisioning for various service deliveries, e.g., scientific computing, web services, and infrastructures (i.e., VMs), in hybrid clouds. We present a comprehensive taxonomy to categorize these related works, which helps us summarize the challenges that have not yet been addressed and propose promising directions for future research. We hope that our work is helpful for both research and business in hybrid cloud management.

2.3 Literature search

The literature we reviewed includes the following:

  1. the relevant papers obtained by querying the Engineering Village Compendex database [67] and the Web of Science Core Collection [190] with the search condition (expressed as a query statement for the Engineering Village Compendex database) (“hybrid cloud” OR “hybrid clouds” OR “cloud bursting”) AND (“task scheduling” OR “job scheduling” OR “application scheduling” OR “request scheduling” OR “service scheduling” OR “task management” OR “job management” OR “request management” OR “application management” OR “service management” OR “service migration” OR “resource scheduling” OR “resource provision” OR “resource provisioning” OR “resource management”), to cover all high-quality research papers;

  2. the relevant references of the papers obtained in (1), (2) and (3) (a recursive procedure);

  3. the relevant literature citing the papers obtained in (1), (2) and (3), retrieved via Google Scholar [77] (a recursive procedure).

3 Taxonomy

This section presents the detailed taxonomy for workload scheduling and resource provisioning in hybrid cloud environments. We classify related works in six ways according to the properties of the hybrid cloud optimization problem they solve, as shown in Fig. 2. This classification helps us review related works in detail and summarize them, leading to the challenges and opportunities of optimizing the use of resources in hybrid clouds. The taxonomy is detailed as follows.

  1. Requirement. When exploiting cloud resources, users have various QoS requirements for their workloads, e.g., response time, security, etc. In the reviewed works, these requirements are treated either as objectives, optimizing one or more QoS metrics, or as constraints restricting some metric values, as detailed in Sect. 3.1.

  2. Workload types. Various types of workload are executed on clouds. Distinct workload types require different resource characteristics and thus should be managed by different methods to achieve the best result [37]. The workload type is therefore a useful factor for differentiating the related literature and understanding the applicability of a scheduling/provisioning method to a type of workload. The workload types concerned are detailed in Sect. 3.2.

  3. Resource characteristics. Private and public resources differ in various aspects, e.g., cost-performance ratio, security, reliability, etc., as described in Sect. 3.3, leading to different types and amounts of hybrid resources being required by workloads with diverse QoS requirements. In general, cloud resources are heterogeneous because of continuous infrastructure updates during a cloud's operation, which is one of the basic challenges complicating scheduling in clouds [56, 57]. Mixing private and public clouds makes the resources even more heterogeneous, which brings further challenges. However, existing related works simplify the resource optimization problem by ignoring, to a greater or lesser extent, the heterogeneity or diversity of hybrid cloud resources, e.g., the heterogeneity between private and public resources, as illustrated in Sect. 3.3.

  4. Private resource cost model. The total cost of cloud operation is mainly composed of the investment costs for buying infrastructure and the operational costs, which consist of electricity costs, software copyright costs, hardware/software maintenance costs, and so on [171, 184]. In general, investment costs are considered "sunk costs", and operational costs are evaluated by the consumed power, which accounts for the largest part of them [71]. When considering the cost of executing workloads on hybrid clouds, one should account for the costs of both the private and the public resources used. The ways related works deal with private resource costs are shown in Sect. 3.4.

  5. Public resource costs. When renting resources from public clouds, service providers, which are also public cloud users, are charged based on the amount of rented resources and the rental time. In the existing literature, three types of resources are considered to be charged by the public cloud: computing (VMs), storage, and network bandwidth. The cost model for each resource type in each work is presented in Sect. 3.5.

  6. Factors of workload processing time. The performance of workloads, e.g., finish time or response time, is one of the foremost concerns in clouds. However, many factors affect performance, and no single work can address all of them. Therefore, each work focuses on some of these factors when solving the scheduling or provisioning problem in a hybrid cloud environment, making simplifications believed to be reasonable. The performance factors concerned by related works are shown in Sect. 3.6.

Fig. 2 The taxonomy of workload scheduling and resource provisioning in hybrid cloud environments

3.1 Requirements

Service providers have various requirements in different hybrid cloud environments, concerning various QoS aspects, when managing hybrid resources. These requirements can be handled as optimization objectives or as constraints in a hybrid cloud resource management problem. The requirements concerned by current works are as follows. Table 1 summarizes the requirements concerned by each work.

Table 1 Classifying the literature based on requirements

3.1.1 Profit or cost

For a service provider, the profit is the foremost concern. The profit is the difference between the revenue and the cost (Fig. 3),

$$ Profit = Revenue - Cost. $$
(1)

A user pays for its required services according to the service level agreement (SLA) contracts with the service provider. The revenue of a service provider is the accumulated payment from all of its users for their service requests. Thus, when the user requests and the payment for each request are known, the revenue of the service provider is a constant (C), and profit maximization is equivalent to cost minimization for the service provider.

$$ Profit = C - Cost \Rightarrow \max {Profit} = C - \min {Cost} . $$
(2)

In a hybrid cloud environment, the cost of a service provider includes the costs of operating the local resources and of renting the public resources, which are illustrated in Sects. 3.4 and 3.5, respectively, as well as the penalty cost due to SLA violations,

$$ Cost = Cost_{pri} + Cost_{pub} + Cost_{penalty}. $$
(3)

The service provider must pay a penalty whenever it breaches an SLA contract with a user. Moreover, SLA violations reduce the provider's reputation, so potential users may be lost and the revenue and profit may be further reduced.

Fig. 3 The taxonomy by concerned requirements

In most related works, the profit or the cost is considered as an optimization objective (Profit—profit maximization, Cost—cost minimization) or as a constraint (Budget—the upper limit on the cost).

From Table 1, we can see that the profit/cost is one of the foremost concerns when providing services in hybrid clouds, as it is one of the most important factors cloud providers consider for increasing their income.

3.1.2 Application performance

A user always wants to get the results of its requests as soon as possible. Thus, the turnaround time, the time between submitting a request instance and receiving the result, is usually considered as a QoS metric. For a batch job, the turnaround time is generally expressed as the finish time, or the makespan, which is the period elapsed from submission to completion, while for a web service it is expressed as the response time, which is influenced by many factors, e.g., the communication delay between the user and the private cloud, the queuing delay, the processing delay, and the communication delay between the private and public clouds.

For batch tasks/jobs, the finish time and the makespan are equivalent when the turnaround time is considered as the optimization objective (Finish/Makespan—finish time/makespan minimization). The finish time can also be considered as a constraint metric, Deadline—the time before which the job must be finished—which depends on various factors, e.g., the start time (Start), the execution time, the transfer time of input data (Transfer), etc.

For web services, some works consider the response time as an optimization objective or a constraint metric (Response). However, most works on hybrid cloud resource management for web services consider only a part of the response time, e.g., the queuing delay (Queuing), the processing delay (Delay), or the communication delay between the private and public clouds (Communication), again as an optimization objective or a constraint.

As shown in Table 1, performance (turnaround time) is one of the foremost concerns in hybrid clouds, as it is one of the most important factors for cloud users paying for qualified services. Failing to satisfy performance requirements increases the provider's cost through contract violations and hurts its reputation, leading to the loss of potential users.

3.1.3 Public resource amount

The amount of public resources is considered by several works in two ways. Some works note that, in practice, the number of VMs a user can rent from a public cloud may be limited; for example, a user can run at most 20 instances at a time under the default plan of Amazon EC2. Other works require that the rented public resources be sufficient to satisfy users' requirements, since the private resources are fixed.

In the first case, the public resource amount is considered as a constraint expressing the limit on rented public resources, either as a VM number (VM Number) [33, 40, 41, 74, 119, 138, 162, 213, 214] or as an amount of each resource type (Amount Limit) [131].

In the second case, the public resource amount is considered as an objective or a constraint metric. When considered as an objective, the amount is minimized while satisfying the performance requirements of the provided services (Amount Minimum) [3, 28, 29, 34, 35, 176, 178], which indirectly reduces the cost of renting public resources. When considered as a constraint, a lower bound is set, representing the minimum requirements of users (Amount Required) [170], to indirectly guarantee the application performance.

3.1.4 Resource utilization

Resource utilization is another metric of concern, since utilization and resource efficiency are generally positively correlated. A few works thus try to maximize the resource utilization (Utilization) [33, 213]. However, high utilization may reduce the reliability of the infrastructure [17], so some works restrict the utilization to an upper bound (Utilization Limit) [24].

3.1.5 Security

Security is an essential factor in whether users, especially enterprise users, exploit the public cloud, because of the security and privacy issues of internal data or code. Thus, several works treat security as a constraint (Security) restricting where workloads with high security requirements may be processed. There are usually multiple levels of security requirements, and most related works consider a simple two-level model: a high security level, requiring the workload to be processed in the private cloud, and a low security level, allowing the workload to be processed in either the private or the public cloud.

3.1.6 Reliability/availability

Due to the increasing functionality and complexity of hybrid cloud computing, failures are inevitable and may degrade the performance of processing workloads. For example, a task may meet its requirement if no failure occurs, but violate it if a failure interrupts its execution. Thus, a few works improve the reliability or the availability of processing workloads, which has a great influence on performance (Reliability [Arantes2017, Choi2015, Ben-Yehuda2012, Liu2015] and Availability [90, 91]).

Usually, both the reliability and the availability can be improved by redundant execution of workloads. However, redundant execution requires more resources and thus costs more, so there is a tradeoff between reliability (or availability) and cost.

3.1.7 SLA violation

SLA violations occur when no resource is powerful enough to satisfy a workload, or when the service provider puts profit first and rejecting some requests costs less than accepting them. Thus, some works treat SLA violations, e.g., request rejections/failures, as an objective or a constraint metric (SLA), either minimizing the number of violations [30, 50, 137, 159, 164, 216] or restricting it to an upper bound [94, 107, 118, 158, 177, 202–204].

3.1.8 Others

A very few works consider the load balance between the private and public clouds (Balance) [80, 125, 192], which may improve the turnaround time by letting the workloads in the two clouds finish close to one another; the length of request queues (QLength), which is bounded to control the queuing time [119]; or the resource contention between services (Contention) [86, 87], e.g., conflicting library and operating system version requirements or conflicting communication ports.

If all rented (homogeneous) VMs have the same price, minimizing the cost and minimizing the total rental time of public VMs are equivalent. Under this assumption, a very few works optimize the total rental time instead of the cost of public VMs (RentTime) [213] (Fig. 4).

Fig. 4 The taxonomy by workload types

3.1.9 Single- or multi-objective

Cloud services usually come with multiple QoS requirements. There are two ways to handle several of the metrics presented above simultaneously. The first is to take one metric as the optimization objective and the others as constraints, formulating hybrid cloud resource management as a single-objective optimization problem. Alternatively, some works take multiple metrics as objectives and model hybrid cloud resource management as a multi-objective problem.

For a single-objective optimization problem, there is always an optimal solution not inferior to any other solution, while there usually is no such solution for a multi-objective problem. Therefore, the Pareto-efficient solutions should be found to provide candidate solutions for service providers, where a Pareto-efficient solution is one not dominated by any other solution, i.e., one in which no objective can be improved without sacrificing another.
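As an illustration of Pareto efficiency, the following minimal Python sketch extracts the non-dominated schedules from a set of candidates evaluated on two objectives to be minimized (cost and makespan); the candidate values are hypothetical.

def pareto_front(solutions):
    """Return the solutions not dominated by any other solution.
    A solution dominates another if it is no worse in every objective
    and strictly better in at least one (all objectives are minimized)."""
    front = []
    for s in solutions:
        dominated = any(
            all(o <= so for o, so in zip(other, s)) and other != s
            for other in solutions
        )
        if not dominated:
            front.append(s)
    return front

if __name__ == "__main__":
    # (cost in $, makespan in minutes) of candidate schedules
    candidates = [(10, 50), (12, 40), (9, 70), (15, 40), (11, 45)]
    print(pareto_front(candidates))  # [(10, 50), (12, 40), (9, 70), (11, 45)]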

Most works optimizing multiple objectives focus on the tradeoff between cost and performance, two of the most important factors, where performance is generally improved by allocating more resources, which in turn increases the cost of a workload. Little research considers other factors, e.g., security, reliability, or availability, in such multi-objective tradeoffs.

3.2 Workloads

Related works mainly focus on two types of workloads, batch jobs and long-running web services, explained as follows. A few works also focus on infrastructure (VM) service delivery in hybrid clouds regardless of the workloads.

3.2.1 Batch jobs

Batch jobs, e.g., recommendation computing, financial analysis, and weather forecasting, generally require complex calculations, which take from a few seconds to a few days to complete, and thus are mostly insensitive to short-term performance fluctuations. A batch job is usually divided into multiple tasks that are dispatched to the available resources so that all tasks of the job complete as soon as possible or within its deadline.

Batch jobs can be classified into two categories: jobs with independent tasks and workflow jobs. There are plenty of jobs consisting of a number of trivially parallel tasks (called Jobs with Independent Tasks), such as parallel image rendering and data analysis. In such jobs, no two tasks depend on each other, so all tasks can be executed in parallel. Such jobs are a very common kind of application in parallel and distributed systems [75, 88]. Thus, a number of researchers focus on completing jobs with independent tasks on hybrid clouds.

In contrast, many scientific computing jobs consist of multiple tasks with logic or data dependencies (called Workflow Jobs), i.e., a task can start only after all tasks it depends on have finished. A workflow job is usually abstracted as a directed acyclic graph (DAG) whose nodes are tasks and whose edges represent the dependencies between them. Scheduling workflow jobs on hybrid clouds is harder than scheduling jobs with independent tasks, since task dependencies must be taken into account to maximize the degree of task parallelism and thus the performance of workflow jobs.
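The following minimal Python sketch illustrates list scheduling of such a DAG over a hybrid cloud under strong simplifying assumptions: one worker per cloud, known per-task runtimes, and a fixed inter-cloud transfer penalty. It illustrates the dependency handling described above and is not the algorithm of any particular surveyed work.

def schedule_dag(tasks, deps, runtime, xfer=5.0):
    """tasks: task ids in topological order.
    deps: dict task -> list of predecessor tasks.
    runtime: dict (task, cloud) -> execution time, cloud in {'private', 'public'}.
    Yields (task, (cloud, start, finish))."""
    finish, placement = {}, {}
    cloud_free = {"private": 0.0, "public": 0.0}  # one worker per cloud, for brevity
    for t in tasks:
        best = None
        for cloud in ("private", "public"):
            # a task is ready once every predecessor has finished and, if it ran
            # in the other cloud, its output has crossed the inter-cloud link
            ready = max(
                (finish[p] + (0.0 if placement[p] == cloud else xfer) for p in deps.get(t, [])),
                default=0.0,
            )
            start = max(ready, cloud_free[cloud])
            end = start + runtime[(t, cloud)]
            if best is None or end < best[2]:
                best = (cloud, start, end)
        placement[t], finish[t] = best[0], best[2]
        cloud_free[best[0]] = best[2]
        yield t, best

if __name__ == "__main__":
    tasks = ["a", "b", "c", "d"]
    deps = {"c": ["a", "b"], "d": ["c"]}
    runtime = {("a", "private"): 10, ("a", "public"): 8,
               ("b", "private"): 6,  ("b", "public"): 7,
               ("c", "private"): 4,  ("c", "public"): 3,
               ("d", "private"): 5,  ("d", "public"): 5}
    for task, (cloud, start, end) in schedule_dag(tasks, deps, runtime):
        print(f"{task} -> {cloud} [{start}, {end}]")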

3.2.2 Web services

Web services are long-running services handling short-lived, latency-sensitive requests, where each request takes only a few milliseconds to a few hundred milliseconds. Such services are used for end-user-facing products such as web search, online video, and business transactions, and for internal infrastructure services (e.g., distributed databases). Web services should return results as soon as possible, as the request response time has a significant influence on the service provider's profit [82].

A web service usually consists of several tiers (or components). For example, the 3-tier web application architecture, consisting of presentation, application, and data tiers, is widely used. The tiers of a web service have different resource requirements due to their different functionalities, which motivates the use of virtualization to consolidate the instances of web service tiers and benefit from the complementarity of the tiers' resource requirements.

However, a few works consider a web service instance as a whole for simplicity (One-tier Web Service). Such works study methods for scaling web services vertically (reconfiguring the resources allocated to service instances) and/or horizontally (tuning the number of service instances) to reduce the service provider's cost while guaranteeing performance requirements.

Works focusing on Multi-tier Web Services must consider the vertical and horizontal scaling of instances for each tier. They should also improve the performance of the connections between instances of different tiers, e.g., by reducing the communication distance via instance deployment/migration, which has a significant influence on the request response time. These concerns make providing web services considerably more complex.

3.2.3 Literature review

From Table 2, we can see that about half of the literature focuses on independent tasks. This is mainly because such jobs are well suited to being outsourced to public clouds: with (almost) no communication between tasks, their performance is not degraded by the scarce network resources between the two clouds. A further 21.2% of the works optimize the execution of workflow applications on hybrid clouds, where the problems of resource heterogeneity, poor network resources between clouds, etc. can be addressed by carefully designed scheduling methods that guarantee performance requirements (Fig. 5).

Fig. 5 The taxonomy by hybrid resource characteristics

Table 2 Classifying the literature based on workload types

Web services have stringent performance requirements because they interact with users, and each of their components can be tuned in instance number [188], which makes providing them in hybrid clouds more challenging. Accordingly, relatively few studies focus on providing web services on hybrid clouds, only about a quarter as many as those focusing on batch jobs.

Only about 11.6% of the literature focuses on infrastructure (VM) service delivery from the perspective of an IaaS provider, regardless of workload features, since taking application characteristics into account is generally more useful and more efficient when executing applications on clouds, even though ignoring them is simpler. To apply these research results, a service provider must still decide the amounts of private resources to own and public resources to rent according to its load and service characteristics, which is a challenging question.

3.3 Hybrid resource characteristics

The resources of a hybrid cloud are composed of Local Resources and resources rented from public clouds (Public Resources). Local and public resources differ in many characteristics, as shown in Table 3.

Table 3 The comparison between local resources and public resources

In general, local resources have better performance and lower costs than public resources, so almost all related works use local resources whenever possible because of their high performance-cost ratio. However, the amount of local resources is limited, and public resources are rented when the workload is so high that the local resources cannot satisfy all QoS requirements. Even though a public cloud provisions "unlimited" resources, public cloud users still face some restrictions on the resource amount (e.g., the VM instance number).

Usually, local resources are heterogeneous (Heter), as servers are gradually provisioned and replaced over the operation of the private cloud (or cluster, grid) [55]. Nevertheless, plenty of related works (about 65.5%, as shown in Table 4) consider local resources homogeneous for simplicity. Two degrees of homogeneity are considered in the existing literature: (i) all local resources, physical machines (PMs), VMs, or CPU cores, are homogeneous (Homo); (ii) similar to the public cloud, local resources are provisioned as VMs that are homogeneous within each type (HomoType). For simplicity, 88.2% of related works, as shown in Table 4, treat local resources the same as public resources, seamlessly combining the private and public clouds (Seamless). In general, a public cloud provisions VM instances that are homogeneous within each type (HomoType), while some related works regard all public resources as homogeneous for simplicity (Homo) or as heterogeneous for generality (Heter).

Table 4 Classifying the literature based on resource characteristics

Compared to the public cloud, the local cloud provides more secure/private services as it serves only internal users. However, local resources are often regarded as less reliable because maintaining highly reliable resources is expensive; for example, traditional and desktop grids have yearly resource availability averages of 70% or less [105]. Public clouds, in contrast, have SLAs that guarantee resource availability averages of over 99%. Thus, some related works migrate services to public clouds to increase reliability, which can prove prohibitively expensive [18].

3.4 Cost model of local resources

For service providers, the investment costs of local infrastructures, which cannot be reduced by runtime resource management, are usually considered "sunk costs". Thus, the vast majority of resource management works focus on reducing the operational costs, which consist of electricity costs, hardware/software maintenance costs, and so on [171, 184]. As electricity makes up the largest part of operational costs, and over 90% of the energy is consumed by computing (\(Energy_{Com}\)), networking (\(Energy_{Net}\)) and cooling (\(Energy_{Coo}\)) [71], the electricity costs for powering PMs, network and cooling equipment can be taken approximately as the operating cost of local resources (private cost for short),

$$\begin{aligned} Cost_{Pri}\approx & {} price_{e} \cdot (Energy_{Com} + Energy_{Net} \\ &\quad + Energy_{Coo}) + C, \end{aligned}$$
(4)

where \(price_{e}\) is the unit price of electricity, and C is a constant representing relatively fixed private costs including software copyright costs, electricity costs for powering auxiliary equipment, salaries of staff, etc. (Fig. 6).

Fig. 6 The taxonomy by local resource cost model

For the past decade, reducing the electricity costs of a private cloud, a cluster, or a grid has been studied by many works [179, 200], which are worth borrowing for optimizing costs in hybrid cloud environments. However, most existing works on hybrid cloud resource management, about 67.1% as shown in Table 5, do not consider the costs of local resources, treating them as costless (-) or as a Fixed value, on the grounds that the operational costs of local resources are much lower than the rental cost of public resources. Most works that do consider the costs of using local resources, about 75.1% (\(=24.7\%/(1-67.1\%)\)) as shown in Table 5, use the same cost model as for public resources (Same), without considering their differences. Only 7.5% of related works consider the computing Energy cost of the local resources, using simple energy consumption models, e.g., a linear relationship between the energy consumed by the private cloud and the number of tasks/requests [194, 197, 198], or between the energy consumed by a data center and the number of active PMs/VMs [128]. A very few related works assume that the service provider pays for the used network bandwidth (Net) in the local cloud [118].
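As an illustration of the linear energy models mentioned above, the following Python sketch estimates the private (energy) cost from the number and utilization of the active PMs; the idle/peak power values, the PUE-style overhead factor for cooling and networking, and the electricity price are hypothetical.

def private_energy_cost(active_pm_utils, hours, price_kwh=0.10,
                        p_idle=100.0, p_peak=250.0, pue=1.5):
    """active_pm_utils: CPU utilization in [0, 1] of each powered-on PM.
    Power per PM is linear in utilization; cooling/network overhead is a PUE factor."""
    watts = sum(p_idle + (p_peak - p_idle) * u for u in active_pm_utils)
    kwh = watts * pue * hours / 1000.0
    return price_kwh * kwh

if __name__ == "__main__":
    # 3 active PMs at 20%, 60% and 90% utilization for 24 hours
    print(round(private_energy_cost([0.2, 0.6, 0.9], hours=24), 2), "USD")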

Table 5 Classifying the literature based on the local resource costs concerned

3.5 Cost of public resources

Service providers must pay for the various resources, e.g., computing (C), network (N), and storage (S), rented from public clouds. Usually, public resources are charged per time unit (\(Time_{unit}\)); for example, Amazon EC2 charges its VMs hourly. A lease time shorter than one unit is charged as a full unit; for example, renting a VM priced at $0.2/h for 2.2 h costs $0.6 ($0.2/h \(\times \lceil 2.2 \rceil \) h) instead of $0.44 (Fig. 7).

Fig. 7 The taxonomy by public resource costs

In general, the computing resources provisioned by public clouds are in the form of VMs. Thus, the cost of renting a VM (\(Cost_{VM}\)) is

$$ Cost_{VM} = Price_{VM}\cdot \left\lceil \frac{Time_{VM}}{Time_{unit}} \right\rceil , $$
(5)

where \(Time_{VM}\) is the lease time of the VM.

There are usually three price models for renting VMs: On-demand, Spot, and Reserved. With the on-demand model, the public cloud provisions VMs as soon as its users pay and withdraws them when their rental periods expire. The price of on-demand VMs is usually stable over relatively long periods. Spot VMs are bid on by multiple users: the public cloud provisions spot VMs to a user only when its bid is higher than the spot price, which varies with market supply and demand, and withdraws them either when their rental periods expire or when the spot price rises above the user's bid. Most of the time, spot VMs are cheaper than on-demand VMs, providing cost-saving opportunities with a proper bidding strategy, although they are sometimes more expensive. Reserved VMs are rented and paid for over long periods, e.g., weeks or months; they are cheaper than on-demand VMs in unit price, saving cost for service providers with relatively steady, high workloads. Thus, combining these three kinds of VMs helps service providers optimize their costs while satisfying various requirements; however, almost all related works, as shown in Table 6, consider only on-demand VMs. Only 3.2% and 0.7% of related works exploit spot VMs and reserved VMs, respectively, in hybrid cloud environments.
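In the simplest case, the choice between reserved and on-demand VMs reduces to a break-even comparison between a flat reservation fee and the metered on-demand cost over the expected busy hours. The following Python sketch illustrates this reasoning with hypothetical prices; real price lists, upfront/partial-upfront reservation options, and spot bidding are considerably more involved.

def cheapest_plan(busy_hours, on_demand_per_hour=0.20, reserved_per_month=100.0):
    """Pick the cheaper option for one VM given its expected busy hours per month."""
    on_demand_cost = on_demand_per_hour * busy_hours
    if reserved_per_month < on_demand_cost:
        return "reserved", reserved_per_month
    return "on-demand", round(on_demand_cost, 2)

if __name__ == "__main__":
    for hours in (200, 600, 730):   # expected busy hours in a ~730-hour month
        print(hours, "h ->", cheapest_plan(hours))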

Table 6 Classifying the literature based on the public VM price model

Even though, in the real world, most public clouds charge their provisioned VMs in discrete units of time (Discrete), e.g., Hourly, plenty of related works (67.8%, as shown in Table 7) simplify the VM price model for their hybrid environments. 13% of related works assume that public VMs are charged by the second or by continuous time (Continuous). Many others (21.2%) use a price model whose billing unit is a Workload (task or request) or a Resource unit (e.g., a VM), regardless of the rental duration. Very few works assume a constant cost saving for using a public VM [212], optimizing the benefit of migrating some workloads to a public cloud. Several works do not consider the cost of renting public resources at all (-), optimizing instead the amount of rented public resources.

Table 7 Classifying the literature based on the charging unit of public VMs

For network resources, public clouds charge their users on the basis of network bandwidth (BW) and time. The time for using public network resources is proportional to the amount of transferred data (\({Data_{Transfer}}/{BW}\)). Thus, the cost of renting network resources in a public cloud is

$$\begin{aligned} Cost_{Net}& = {} Price_{BW}\cdot BW\cdot \left\lceil \frac{Data_{Transfer}}{BW} \right\rceil \\ & \approx {} Price_{BW}\cdot Data_{Transfer}, \end{aligned}$$
(6)

where \(Price_{BW}\) is the price of network resources per bandwidth unit and time unit. Network charges apply only to uplink and downlink data transfers, as the internal bandwidth of a public cloud is free to use; thus, the data transfers between two clouds are charged. A broker is not recommended for service providers, because the data transmission between two public clouds is then transparent to them in both performance and cost, losing some opportunities for optimization in workload scheduling or resource provisioning (Fig. 8).

Fig. 8 The taxonomy by factors of processing time

Public clouds charge for storage resources according to the amount of stored data (\(Data_{STO}\)) and its storage duration (\(Time_{STO}\)),

$$ Cost_{STO} = Price_{STO}\cdot Data_{STO} \cdot \left\lceil \frac{Time_{STO}}{Time_{unit}} \right\rceil , $$
(7)

where \(Price_{STO}\) is the cost of storing a unit of data, e.g., a kilobyte, megabyte, or gigabyte, per time unit.

In total, a user pays the following for rented computing, network, and storage resources in a public cloud:

$$ Cost_{public} = \sum _{VM} Cost_{VM} + Cost_{Net} + Cost_{STO}. $$
(8)
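Putting Eqs. (5)–(8) together, the following Python sketch computes the public-resource bill of a service provider, assuming hourly-billed VMs, transfer-metered network charges, and monthly-billed storage; the unit prices are placeholders rather than any real provider's price list.

import math

TIME_UNIT_H = 1.0  # VM billing unit (hourly), cf. Eq. (5)

def vm_cost(price_per_hour, lease_hours):
    # Eq. (5): partial billing units are rounded up
    return price_per_hour * math.ceil(lease_hours / TIME_UNIT_H)

def network_cost(price_per_gb, transferred_gb):
    # Eq. (6): only data crossing the cloud boundary is charged
    return price_per_gb * transferred_gb

def storage_cost(price_per_gb_month, stored_gb, months):
    # Eq. (7): storage is billed per started month here
    return price_per_gb_month * stored_gb * math.ceil(months)

def public_cost(vm_leases, transferred_gb, stored_gb, months,
                price_per_gb=0.09, price_per_gb_month=0.02):
    # Eq. (8): all rented VMs plus network and storage charges
    return (sum(vm_cost(p, h) for p, h in vm_leases)
            + network_cost(price_per_gb, transferred_gb)
            + storage_cost(price_per_gb_month, stored_gb, months))

if __name__ == "__main__":
    print(round(vm_cost(0.2, 2.2), 2))   # 0.6, as in the hourly-billing example above
    bill = public_cost([(0.2, 2.2), (0.1, 5.0)], transferred_gb=50, stored_gb=100, months=1)
    print(round(bill, 2))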

Although public clouds generally charge for computing, network, and storage resources, only about 6.2% of related works, as shown in Table 8, consider the costs of all three resource types, and more than half of the related research considers only the cost of rented VMs for the sake of simplicity.

In hybrid cloud environments, service providers aim to minimize their total cost. Outsourcing workloads may reduce the private cost while increasing the public cost through additional rented resources. Thus, a service provider should consider the tradeoff between the private and public costs to achieve the optimal overall cost.

Table 8 Classifying the literature based on the public resources charged

3.6 Factors of processing time

In a cloud, the turnaround time of a workload (a task or a request) depends on various factors, e.g., the computing time (C), the transfer time of input data (T), the startup time of provisioned resources (S), the queue time (Q), the recovery time after a failure (F), etc.

The computing time of a workload is determined by its computing load, e.g., the number of instructions to be executed, and the computing power of the resource (PM or VM) it is assigned to. VM performance can be affected by heterogeneity in the underlying hardware [89]; for example, VMs of the same type (resource configuration) hosted on heterogeneous architectures, e.g., POWER and x86, can have different performance. The computing time can be estimated either with a linear model of the load and the resource capacity or with more complex models captured by data analysis tools, e.g., machine learning [55].

The data transfer time is determined by the bandwidth of the transmission link and the amount of data. When outsourcing workloads to public clouds to improve performance, service providers should take into account the heterogeneity and dynamics of the available network resources [73]: (i) the bandwidth between two clouds is much lower than that within a cloud; (ii) the bandwidths in public clouds may fluctuate considerably since the public resources are shared by many users. However, as shown in Table 9, fewer than half of the related studies consider the delay of data transfer, and, to the best of our knowledge, no related work considers the fluctuation of bandwidths.
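Combining the factors listed at the start of this subsection with the simple linear models above, a first-order estimate of a task's turnaround time can be sketched as follows in Python; the workload size, VM capacity, and delay values are illustrative, and real estimators would replace the linear terms with measured or learned models.

def turnaround_time(instructions_mi, vm_mips, input_gb, bandwidth_gbps,
                    startup_s=0.0, queue_s=0.0):
    """First-order turnaround estimate: queue (Q) + startup (S) + transfer (T) + compute (C)."""
    compute_s = instructions_mi / vm_mips          # C: linear load/capacity model
    transfer_s = 8 * input_gb / bandwidth_gbps     # T: GB -> Gbit, then divide by Gbit/s
    return queue_s + startup_s + transfer_s + compute_s

if __name__ == "__main__":
    # a 6e5-MI task on a 2000-MIPS VM, 10 GB of input over a 1 Gbit/s inter-cloud link,
    # with a 60 s VM startup delay and a 30 s queueing delay
    print(turnaround_time(6e5, 2000, 10, 1.0, startup_s=60, queue_s=30), "s")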

Table 9 Classifying the literature based on the factors of workload processing time

The startup time/delay, considered by only 8.2% of the related literature, is the time between requesting a VM and being able to schedule workloads on it. Also known as the bootstrapping time, service initiation time, or VM provisioning delay, it is made up of the time for loading the VM image, starting the operating system, installing software, configuring the network, and so on. The available network thus affects the startup time by influencing the image loading time. The types of services and VMs, as well as the cloud service provider, are also important factors influencing the startup time [135]. The startup time ranges from seconds to dozens of minutes [135, 161], which may have a significant impact on the performance of applications, especially latency-sensitive web services.

The queue time quantifies how long a task/request waits between its arrival and the start of its execution, and is an important factor in workload performance, e.g., the finish time of batch jobs and the response time of web services. The queue time fluctuates strongly over time, so using its average as the evaluated/predicted value, as most related works on web services do, may lead to many SLA violations. Using a percentile of the queue time, e.g., the 90th or 95th, or more sophisticated tools such as stochastic process analysis [5], may be more suitable.
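The following short Python sketch illustrates why the mean is a poor stand-in for the queue time under heavy-tailed delays, contrasting it with the 95th percentile on synthetic samples.

import random
import statistics

random.seed(0)
# heavy-tailed synthetic queue times (seconds): most are short, a few are long
samples = [random.expovariate(1 / 0.2) for _ in range(1000)]

mean_q = statistics.mean(samples)
p95_q = statistics.quantiles(samples, n=100)[94]  # 95th percentile (Python >= 3.8)

print(f"mean = {mean_q:.2f} s, 95th percentile = {p95_q:.2f} s")
# Provisioning against the mean would leave roughly 5% of requests waiting
# longer than p95_q, each a potential SLA violation.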

With the increasing scale and complexity of clouds, failures during workload processing are inevitable and also affect workload performance. Service providers should apply recovery approaches to handle these failures, which take time and consume resources.

All of the above factors contribute to the turnaround time of workloads; however, to the best of our knowledge, all existing related works either consider only a subset of them or, as a simplification, use an average turnaround time for homogeneous workloads (P).

4 Challenges and directions

In this section, we summarize the main issues in hybrid cloud resource management that still require research effort and put forward some suggestions for future research directions.

4.1 Potential of using distributed public clouds

To avoid vendor lock-in and to reduce cost, service providers rent resources from multiple public clouds rather than a single one, since, due to commercial competition, no public cloud always has the best cost-performance ratio. However, using multiple public clouds increases complexity, as it introduces additional resource heterogeneity. The service provider must dispatch workloads among clouds carefully, as using several public clouds may degrade performance due to the low network performance between any two public clouds, an issue considered by few works. Especially in the era of big data, plenty of data analysis applications have performance that is largely limited by the network resources. Thus, there is a tradeoff between the benefits of a greater diversity of public clouds and the overall workload performance.

4.2 Cost evaluation

It is necessary to establish a cost model for hybrid clouds, providing both a cost optimization objective function and a way to evaluate resource management strategies for service providers. Modelling the cost is difficult, however, because different resources and different resource amounts incur different costs, and because the price models of private and public resources are very different. In the private cloud, the cost of resources is influenced by many factors [112], e.g., the utilization of the computing and network infrastructures and the supply of cooling, each of which presents its own challenges [52]. In a public cloud, the service provider pays for rented resources according to the resource amount and the rental time, but the prices of public resources, especially spot VM instances, vary with time [129] and should be modelled, for example, as a time series. Intuitively, the total cost in each cloud is positively correlated with the workload allocated to it, so there is a tradeoff between the costs of the private and public clouds; this tradeoff has not been considered by related works, as most of them ignore the cost of operating the private cloud.

4.3 Performance evaluation

In general, the performance requirements of users/workloads are defined as QoS, e.g., the finish time of batch jobs and the response time of web services. However, almost all related works express the requirements of workloads/users as resource amounts, which simplifies the hybrid cloud resource management problem. Applying these works therefore requires the relationship between QoS values and resource amounts in hybrid clouds, which has scarcely been studied. It is thus necessary to establish models mapping QoS requirements to the various resources, answering the question: how many resources, and which hybrid resources, should be provided to satisfy the QoS requirements?

4.4 The VM provisioning delay

In a cloud, starting a VM instance takes seconds or minutes [161]. Ignoring the time consumed by VM provisioning, as almost all related works do, may lead to QoS violations, e.g., of the deadline constraints of batch workloads or the response time requirements of web services. Thus, evaluating and accounting for the VM provisioning delay, the duration between requesting a VM instance and it being up and running, is essential to service provisioning in clouds. Many variables should be considered when estimating VM provisioning delays [135, 146], e.g., the virtualization technology, the instance type, the VM image loading, the software installation, the network configuration, the time of day, the data center location, etc. The heterogeneity between private and public clouds should also be considered, both in the infrastructure and in the information available to the service provider. Therefore, the evaluation of provisioning delays is still a challenging open problem.

4.5 Workload prediction

To eliminate the negative impact of the VM provisioning delay on workload performance, predicting workload sizes is necessary so that VMs can be provisioned in advance. Thus, the time taken to predict workload sizes must be no longer than the VM provisioning delay. However, few forecasting models are fast enough under such highly dynamic hybrid cloud circumstances [206], which may result in prediction delays and insufficient provisioning to deal with traffic bursts.

4.6 Reliability

The diversity, frequency, and number of failures all increase with the hardware and software complexity of cloud platforms [105, 168]. Failures may increase the penalty costs of service providers by degrading workload performance and thereby violating SLAs, and many service providers have lost substantial revenue because of failures [168]. Yet failures are hard to diagnose, forecast, or repair, owing to the high dynamics of operating clouds, the complex relationships among failures, the different characteristics of various resource/workload reliabilities (e.g., hardware vs. software, data vs. process) [144, 168], etc. Existing works concerning reliability in hybrid clouds simplify the reliability analysis by assuming that the reliability of running a workload on a resource is known. Reliability models applicable to real, comprehensive hybrid cloud environments need to be researched to avoid additional penalty costs for service providers.

4.7 Security

Users, especially enterprise users, have requirements on the security and privacy of their data. Security issues are one of the main factors deterring enterprises from moving their data to a public cloud [155]. Existing related works concerning security or privacy consider two levels of data security, private and public: a workload at the private level cannot be outsourced to the public cloud, i.e., it has a location constraint, while a workload at the public level can be processed in either the public or the private cloud. However, data protection technologies, e.g., encryption algorithms, data integrity auditing, and access control, provide opportunities to outsource some private workloads or data to public clouds; to the best of our knowledge, no research work on hybrid cloud resource management has yet employed them, whether to overcome the lack of private resources for private workloads or data, or to reduce the overhead of moving running workloads from the private to the public cloud to make room for new private workloads. Data protection technologies do consume resources, however, so there is a tradeoff between the overhead of using protection technologies and that of moving running workloads to free some private resources.

4.8 Optimization for hybrid workloads

In many cases, the resource requirements of different workload types are complementary in time and/or amount, e.g., compute-intensive batch jobs vs. network-intensive web services. In production environments, many service providers run mixed services; for example, Google clusters concurrently run long-running services handling short-lived latency-sensitive requests and batch jobs that take from a few seconds to a few days to complete [180, 207]. Thus, resource efficiency can be improved by consolidating heterogeneous workloads with different characteristics, i.e., by concurrently executing hybrid workloads, in hybrid clouds. However, existing related works do not focus on executing hybrid workloads in hybrid clouds, which is one of the most promising directions for improving the profit of service providers.

5 Conclusion

In this paper, we presented a taxonomy to classify research works on resource provisioning and workload scheduling in hybrid clouds according to the factors they consider: the optimization objectives, the constraints, the workload types, the heterogeneity of hybrid resources, the cost models of local and public resources, and the factors of turnaround time considered, and we investigated the current research status based on this taxonomy. Then, we presented several open issues and research directions in hybrid cloud management: the potential of using multiple public clouds, the cost model of hybrid resources, performance evaluation in hybrid clouds, the VM provisioning delay, reliability guarantees, security guarantees, and optimization for hybrid workloads in hybrid clouds. We believe our survey is helpful for both industry and academia interested in hybrid clouds.