
1 Introduction

The introduction and adoption of workflow technology has gained considerable attention in recent years in several domains, such as business or eScience. Reasons that have contributed to this trend are the high level of abstraction it offers, its design and runtime flexibility, and the continuous development of the necessary middleware support for enabling its execution [1]. This technology has had to fulfill different domain-specific requirements in terms of enforced functionalities and expected behavior of the underlying infrastructure for different types of applications. Focusing on eScience applications as the foundation for the case study evaluation driving this work, simulation workflows are a well-known research area, as they provide scientists with the means to model, provision, and execute automated and flexible long-running simulation-based experiments [2]. Ordinary simulation-based experiments typically exhibit the following characteristics: (i) the gathering and processing of large amounts of data, (ii) the transfer of data to and the consumption of multiple distributed simulation services at irregular time intervals, over (iii) long periods of time. Due to the access and resource consumption behaviour exhibited by such services, previous works have targeted the migration and adaptation of such environments. These environments can be deployed, provisioned, and executed in Cloud infrastructures in order to optimize the provisioning and usage of computational resources, while minimizing the incurred monetary costs [3–6].

The introduction and adoption of Cloud computing in different domains has contributed to the creation and expansion of existing and new Cloud services and providers. Nowadays, the number of applications partially or completely running on different Everything-as-a-Service Cloud offerings has substantially increased. The existence of a wide variety of Cloud services offering different and frequently optimized Quality of Service (QoS) characteristics has broadened the landscape of alternatives for selecting, configuring, and provisioning Cloud resources. These offer the possibility to host application components with specific resource consumption patterns in a distributed manner, e.g. computationally intensive or memory intensive components on compute optimized or memory optimized virtualized resources, respectively. However, such a wide spectrum of possibilities has become a challenge for application developers when deciding among the different Cloud providers and their corresponding services.

Previous works targeted this challenge by assisting application developers in the tasks related to selecting, configuring, and adapting the distribution of their application among multiple services [4, 7]. Previous findings identify the existence of multiple decision points that can influence the distribution of an application, e.g. cost, performance, security concerns, etc. [8]. This work incorporates such findings towards the development of the necessary support for assisting application developers in the selection and configuration of Infrastructure-as-a-Service (IaaS) offerings for migrating scientific applications to the Cloud. More specifically, the focus of this research work is to provide an overview of, evaluate, and analyze the trade-off between performance and cost when migrating a Scientific Workflow-based Simulation Environment (SWfSE) to different Cloud providers and their corresponding IaaS offerings.

The contributions of this work build upon the research work presented in [9], and can be summarized as follows:

  • the selection of a set of viable and optimized IaaS offerings for migrating a previously developed simulation environment,

  • a price analysis of the previously selected IaaS offerings,

  • an empirical evaluation focusing on the performance and the incurred monetary costs, and

  • an analysis of the performance and cost trade-off when scaling the simulation environment workload.

The rest of this work is structured as follows: Sect. 2 motivates this work and frames the challenges that will be addressed. The case study simulation environment used for evaluating this work is introduced in Sect. 3. Section 4 presents the experiments evaluating the performance and incurred costs when migrating the simulation environment to different IaaS offerings, and discusses our findings. Finally, Sect. 5 summarizes related work, and Sect. 6 concludes and presents our plans for future work.

2 Motivation and Problem Statement

Simulation workflows, a well-known topic in the field of eScience, describe the automated and flexible execution of simulation-based experiments. Common characteristics of such simulation workflows are that they are long-running and executed in an irregular manner. Moreover, during their execution a large amount of resources is typically provisioned, consumed, and released. Considering these characteristics, previous works focused on migrating and executing simulation environments in the Cloud, as Cloud infrastructures significantly reduce infrastructure costs while coping with an irregular but heavy demand for resources for running such experiments [5].

Nowadays there exists a vast number of configurable Cloud offerings among multiple Cloud providers. However, such a wide landscape has become a challenge when deciding among (i) the different Cloud providers and (ii) the multiple Cloud offering configurations provided by them. We focus in this work on IaaS solutions, as there is a lack of Platform-as-a-Service (PaaS) offerings that enable the deployment and execution of scientific workflows in the Cloud. IaaS offerings describe the amount and type of allocated resources, e.g. CPUs, memory, or storage, and define different VM instance types within different categories. For example, the Amazon EC2 service does not only offer VM instances of different sizes, but also provides different VM categories which are optimized for different use cases, e.g. computation intensive, memory intensive, or I/O intensive. Similar offerings are also available from other providers, such as Windows Azure or Rackspace. The offered performance and incurred cost vary significantly among the different Cloud services, and depend on the simulation environment's resource usage requirements and workload. In this work, we aim to analyze the performance and cost trade-off when migrating a simulation environment, developed and used as case study, to different Cloud offerings, as discussed in the following section.

3 The OPAL Simulation Environment

The SimTech Scientific Workflow Management System (SimTech SWfMS) is being developed by the Cluster of Excellence in Simulation Technology (SimTech), enabling scientists to model and execute their simulation experiments using workflows [2, 10]. The SimTech SWfMS is based on conventional workflow technology, which offers several non-functional properties such as robustness, scalability, reusability, and sophisticated fault and exception handling [11]. The system has been adapted and extended to the special needs of scientists in the eScience domain [10]. During the execution of a workflow instance the system supports the modification of the corresponding workflow model, which is then propagated to the running instances. This allows running simulation experiments in a trial-and-error manner.

Fig. 1. System overview of the SimTech Scientific Workflow Management System (SWfMS).

Fig. 2. Simplified simulation workflows constituting the OPAL simulation environment [12].

The main components of the SimTech SWfMS shown in Fig. 1 are a modeling and monitoring tool, a workflow engine, a messaging system, several databases, an auditing system, and an application server running the simulation services. The workflow engine provides the execution environment for the workflows. The messaging system serves as the communication layer between the modeling and monitoring tool, the workflow engine, and the auditing system. The auditing system stores data related to the workflow execution for analytical and provenance purposes.

The SimTech SWfMS has been successfully applied in different scenarios in the eScience domain; one example is the automation of a Kinetic Monte-Carlo (KMC) simulation of solid bodies by orchestrating several Web services implemented by modules of the OPAL application [13]. The OPAL Simulation Environment consists of a set of services which are controlled and orchestrated through a main OPAL workflow (the Opal Main process depicted in Fig. 2). The simulation services are implemented as Web services and divided into two main categories: (i) resource management, e.g. distributing the workload among the different servers, and (ii) wrapped simulation packages, as described in [14, 15]. The main workflow can be divided into four phases, as shown in Fig. 2: preprocessing, simulation, postprocessing, and visualization. During the preprocessing phase all data needed for the simulation is prepared. In the simulation phase the workflow starts the Opal simulation by invoking the corresponding Web service. At regular intervals, the Opal simulation creates intermediate results (snapshots). For each of these snapshots the main workflow initiates the postprocessing, which is realized as a separate workflow (the Opal Snapshot process in Fig. 2). When the simulation is finished and all intermediate results are postprocessed, the results of the simulation are visualized.
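For illustration, the control flow described above can be sketched as follows. This is a minimal, hypothetical Python rendering of the four phases and the snapshot handling; the actual processes are BPEL workflows executed by the SimTech SWfMS, and all service names as well as the call_service() helper are placeholders.

```python
# Hypothetical sketch of the OPAL main workflow phases described above.
# The real processes are BPEL workflows orchestrated by the SimTech SWfMS;
# service names and the call_service() helper are placeholders only.

def call_service(name, payload):
    # Stand-in for a Web service invocation (Axis2 services in the real setup).
    print(f"invoking {name}")
    return {"service": name, "input": payload}

def opal_snapshot(snapshot):
    # Separate postprocessing workflow (the "Opal Snapshot" process in Fig. 2),
    # started once per intermediate result of the running simulation.
    return call_service("PostprocessingService", snapshot)

def opal_main(simulation_request):
    # Preprocessing phase: prepare all data needed for the simulation.
    prepared = call_service("PreprocessingService", simulation_request)

    # Simulation phase: start the KMC simulation via the corresponding Web service.
    simulation = call_service("OpalSimulationService", prepared)

    # The running simulation emits intermediate results (snapshots) at regular
    # intervals; three placeholder snapshots stand in for them here. Each
    # snapshot triggers the separate Opal Snapshot postprocessing workflow.
    snapshots = [f"snapshot-{i}" for i in range(3)]
    postprocessed = [opal_snapshot(s) for s in snapshots]

    # Visualization phase: once the simulation has finished and all snapshots
    # are postprocessed, visualize the overall results.
    return call_service("VisualizationService",
                        {"simulation": simulation, "snapshots": postprocessed})

if __name__ == "__main__":
    opal_main({"experiment": "solid-body KMC"})
```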

4 Experiments

4.1 Methodology

As shown in Fig. 2, the OPAL Simulation Environment comprises multiple services and workflows that compose the simulation and resource management services. The environment can be used by multiple users concurrently, as simulation data isolation is guaranteed through the creation of independent instances (workflows, services, and temporary storage units) for each user's simulation request. The experiments must therefore consider and emulate the concurrent usage of the environment by multiple users.

Table 1. IaaS Ubuntu Linux on-demand instance categories per provider (in January 2015) for the European (Germany - DE, and Ireland - IRL) and USA regions.

The migration of the simulation environment to the Cloud opens up a wide set of viable possibilities for selecting and configuring different Cloud services for the different components of the OPAL environment. However, in this first set of experiments we restrict the distribution of the simulation environment components by hosting the complete simulation application stack in one VM, which is made accessible to multiple users. Future investigations plan to distribute this environment across different Cloud offerings, e.g. Database-as-a-Service (DBaaS) for hosting the auditing databases. We therefore focus this work on a performance and cost analysis of executing the OPAL Simulation Environment in on- and off-premise infrastructures, using different IaaS offerings and optimized configurations.

Table 1 shows the different VM categories, based on their characteristics and the prices offered by three major Cloud providers: Amazon AWS, Windows Azure, and Rackspace. In addition to the off-premise VM instance types, multiple on-premise VM instance types were created in our virtualized environment, configured in a similar manner to the ones evaluated in the off-premise scenarios, and assigned to the same categories. The on-premise VM instance configurations are based on the closest equivalent to the off-premise VM configurations within each instance category. The encountered providers and offerings exhibit two levels of VM categorization, i.e. based on the optimization for particular use cases (Micro, General Purpose, Compute Optimized, and Memory Optimized), and based on a quantitative assignment of virtualized resources. This fact must be taken into consideration in our evaluation due to the variation in performance and its impact on the final incurred costs for running simulations in different Cloud offerings. The pricing model for the on-premise scenarios was adopted from [16], as discussed in the following section, while for the off-premise scenarios the publicly available information from the providers was used [17], taking into account on-demand pricing models only.

4.2 Setup

The scientific workflow simulation environment consists of two main systems: the SimTech SWfMS [2, 10], and a set of Web services grouping the resource management and KMC simulation tasks described in [14, 15]. The former comprises the following middleware stack:

  • an Apache Orchestration Director Engine (ODE) 1.3.5 (Axis2 distribution) deployed on

  • an Apache Tomcat 7.0.54 server with Axis2 support;

  • the scientific workflow engine (Apache ODE) utilizes a MySQL 5.5 server for workflow administration, management, and reliability purposes, and

  • provides monitoring and auditing information through an Apache ActiveMQ 5.3.2 messaging server.

The resource management and KMC simulation services are deployed as Axis2 services in an Apache Tomcat 7.0.54 server. The underlying on- and off-premise infrastructure configurations selected for the experiments are shown in Table 1. The on-premise infrastructure aggregates an IBM System x3755 M3 server with an AMD Opteron 6134 processor exposing 16 CPU cores at 2.30 GHz and 65 GB of RAM. In all scenarios the previously depicted middleware components are deployed on an Ubuntu Server 14.04 LTS with 60 % of the total OS memory dedicated to the SWfMS. Figure 3 depicts the topological representation of the OPAL Simulation Environment migrated to the Cloud. As previously introduced, the evaluation in this work is geared towards the analysis of performance and cost when using different instance categories among different providers. Consequently, we provisioned a total of 16 Ubuntu 14.04 virtual machines for the experiments, each one hosting an Apache servlet container, an ActiveMQ message broker, and a MySQL database server as the fundamental middleware components of the OPAL Simulation Environment. These middleware components host the different simulation Web services, the JMS-based message events, and the auditing and engine databases, respectively (see Fig. 3).
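To summarize the per-VM software stack, the following sketch renders the setup described above as a simple data structure. It is an illustrative simplification of the topology in Fig. 3 (which is specified following [8]), not the actual topology model.

```python
# Simplified, illustrative rendering of the per-VM middleware stack (cf. Fig. 3).
# The actual topology is specified following [8]; the values below only restate
# the setup described in the text.

VM_TEMPLATE = {
    "os": "Ubuntu Server 14.04 LTS",
    "swfms_memory_share": 0.60,  # fraction of OS memory dedicated to the SWfMS
    "middleware": {
        "servlet_container": "Apache Tomcat 7.0.54 (Axis2 support)",
        "workflow_engine": "Apache ODE 1.3.5 (Axis2 distribution)",
        "message_broker": "Apache ActiveMQ 5.3.2",  # JMS-based message events
        "database": "MySQL Server 5.5",             # engine and auditing data
    },
    "services": ["resource management", "KMC simulation (OPAL)"],
}

# 16 such VMs were provisioned in total for the evaluated instance categories.
topology = {f"vm-{i:02d}": dict(VM_TEMPLATE) for i in range(1, 17)}
print(len(topology), "virtual machines in the evaluated topology")
```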

Fig. 3. OPAL simulation infrastructure Cloud topology, specified as depicted in [8].

For all evaluation scenarios a system load of 10 concurrent users, each sequentially sending 10 random and uniformly distributed simulation requests, was created using Apache JMeter 2.9 as the load driver. This load aims at emulating a shared utilization of the simulation infrastructure. Due to the asynchronous nature of the OPAL simulation workflow, a custom JMeter plugin was realized for receiving and correlating the asynchronous simulation responses. The latency perceived by the user for each simulation was measured in milliseconds (ms). To minimize the network latency, in all scenarios the load driver was deployed in the same region as the simulation environment.
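The shape of this workload can be sketched as follows. The sketch only mirrors the described load (10 concurrent users, 10 sequential requests each, latency in ms); it does not reproduce the actual JMeter 2.9 test plan or the custom correlation plugin, and the endpoint URL and payload are placeholders.

```python
# Minimal sketch of the experimental workload: 10 concurrent users, each sending
# 10 sequential simulation requests while measuring the perceived latency in ms.
# This does not reproduce the actual JMeter 2.9 test plan or the custom plugin
# used to correlate asynchronous responses; URL and payload are placeholders.

import concurrent.futures
import time
import urllib.request

ENDPOINT = "http://simulation-host:8080/opal/simulate"  # placeholder URL
USERS = 10
REQUESTS_PER_USER = 10

def run_user(user_id):
    latencies_ms = []
    for _ in range(REQUESTS_PER_USER):
        start = time.time()
        try:
            # In the real setup the simulation response arrives asynchronously
            # and is correlated by a JMeter plugin; here we simply wait for it.
            urllib.request.urlopen(ENDPOINT, data=b"simulation-request").read()
        except OSError:
            pass  # the placeholder endpoint is not reachable in this sketch
        latencies_ms.append((time.time() - start) * 1000.0)
    return user_id, latencies_ms

with concurrent.futures.ThreadPoolExecutor(max_workers=USERS) as pool:
    for user, latencies in pool.map(run_user, range(USERS)):
        print(f"user {user}: mean latency {sum(latencies) / len(latencies):.1f} ms")
```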

On-Premise Cost Model. The incurred monetary costs for hosting the simulation environment on-premise are calculated by first considering the purchase, maintenance, and depreciation of the server cluster, and then deriving the price of a CPU hour. [16] proposes pricing models for analyzing the cost of purchasing CPU time on-premise vs. leasing it off-premise. The real cost of a CPU hour when purchasing a server cluster can be derived using the following equations:

\(C_{T} = \sum _{n=0}^{Y} \frac{C_{n}}{(1+k)^{n}}\)
(1)

where \(C_{T}\) aggregates the acquisition (\(C_{0}\)) and maintenance (\(C_{1..N}\)) costs over the \(Y\) years of operation of the server cluster and \(k\) is the cost of the invested capital, and

\(c_{CPU/hour} = \frac{C_{T}}{T_{CPU} \cdot H \cdot \mu }\)
(2)

where \(T_{CPU}\) denotes the total number of CPU cores in the server cluster, H is the expected number of operational hours, and \(\mu \) describes the expected utilization. The total cost of the utilized on-premise infrastructure breaks down into an initial cost (\(C_{0}\)) of approximately 8500$ in July 2012 and an annual maintenance cost (\(C_{1..N}\)) of 7500$, including personnel costs, power and cooling consumption, etc. The utilization rate of this cluster is approximately 80 %, and it offers a reliability of 99 %. Moreover, the server cluster runs six days per week, as one day is dedicated to maintenance operations. Such a configuration provides 960 K CPU hours annually. As discussed in [16], we also assumed in this work a cost of 5 % on the invested capital. The cost for the off-premise scenarios was gathered from the different Cloud providers' Web sites.
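For illustration, the cost per CPU hour can be approximated from the figures given above. The net-present-value form used in this sketch follows the pricing model of [16] as described above; the depreciation period Y is an assumed value, and the sketch is illustrative rather than the exact calculation behind Table 1.

```python
# Illustrative sketch of the on-premise cost model using the figures given above.
# The net-present-value (NPV) form follows our reading of the pricing model in
# [16]; the depreciation period is assumed, so the result is illustrative only.

ACQUISITION_COST = 8500.0    # C_0, initial cost in $ (July 2012)
ANNUAL_MAINTENANCE = 7500.0  # C_1..N, yearly maintenance cost in $
CAPITAL_COST_RATE = 0.05     # k, cost of the invested capital
YEARS = 3                    # Y, assumed depreciation period (not stated above)
ANNUAL_CPU_HOURS = 960_000   # CPU hours provided per year
UTILIZATION = 0.80           # mu, expected utilization

# Equation (1): total cost as the NPV of acquisition and maintenance costs.
total_cost = ACQUISITION_COST + sum(
    ANNUAL_MAINTENANCE / (1.0 + CAPITAL_COST_RATE) ** year
    for year in range(1, YEARS + 1)
)

# Equation (2): real cost of a CPU hour, spreading the total cost over the
# CPU hours expected to be actually consumed during the depreciation period.
cpu_hour_cost = total_cost / (ANNUAL_CPU_HOURS * YEARS * UTILIZATION)

print(f"total cost over {YEARS} years: ${total_cost:.2f}")
print(f"cost per CPU hour: ${cpu_hour_cost:.5f}")
```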

Table 1 depicts the hourly cost for the CPUs consumed in the different on-premise VM configurations. In order to get a better sense of the scope of the accrued costs, the total cost calculation performed as part of the experiments consisted of predicting the time necessary to run 1 K concurrent experiments. This estimate was then used to calculate the incurred costs of hosting the simulation environment in the previously evaluated on- and off-premise scenarios. The monetary cost calculation was performed by linearly extrapolating the results obtained for the 100 requests to a total of 1 K requests. The NumPy scientific library (Python 2.7.5) was used for performing the prediction for 1 K simulation requests. The results of this calculation, as well as the observed performance measurements, are discussed in the following section.
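The extrapolation step could look roughly like the following NumPy sketch. The latency values below are invented placeholders; only the linear-fit approach reflects the described methodology, and the hourly price is a dummy value.

```python
# Rough sketch of the linear extrapolation from the 100 measured requests to
# 1 K requests using NumPy, as in the described methodology. The sample data
# and the hourly price are invented placeholders, not measured values.

import numpy as np

# Cumulative execution time (ms) observed after each of the 100 requests
# (placeholder data following a roughly linear trend).
requests = np.arange(1, 101)
cumulative_time_ms = 3200.0 * requests + np.random.normal(0.0, 500.0, requests.size)

# Fit a first-degree polynomial (a line) and evaluate it at 1000 requests.
slope, intercept = np.polyfit(requests, cumulative_time_ms, deg=1)
time_for_1k_ms = slope * 1000 + intercept

# Convert the predicted runtime into billable VM hours for a given hourly price.
hourly_price = 0.10  # $ per VM hour, placeholder value
vm_hours = np.ceil(time_for_1k_ms / 3_600_000.0)
print(f"predicted runtime for 1 K requests: {time_for_1k_ms / 1000.0:.0f} s")
print(f"estimated cost: ${vm_hours * hourly_price:.2f}")
```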

Fig. 4. Performance analysis per provider and VM category.

4.3 Evaluation Results

Performance Evaluation. Figure 5 shows the average observed latency for the different VM categories depicted in Table 1 for the different Cloud providers. The latency perceived in the scenarios using Micro instances has been excluded from the comparison, as the execution of the experiments could not be completed. More specifically, the on-premise micro instance was capable of stably running approximately 80 requests (see Fig. 4(a)), while in the off-premise AWS EC2 and Windows Azure scenarios the load saturated the system after approximately 10 requests (see Fig. 4(b) and (c), respectively). In the Rackspace scenario, the micro VM instance was saturated immediately after sending the first set of 10 concurrent simulation requests.

With respect to the remaining instance categories (General Purpose, Compute Optimized, and Memory Optimized), the following performance variation behaviors can be observed:

  1. The on-premise scenario shows an average latency of 320 K ms over all categories, which is on average 40 % higher than the latency perceived in the off-premise scenarios.

  2. However, the performance does not consistently improve when migrating the simulation environment off-premise. For example, the General Purpose Windows Azure VM instance shows a performance degradation of 11 %, while the Windows Azure Compute Optimized VM instance shows only a slight performance improvement of 2 %, when compared with the on-premise scenario.

  3. The performance when migrating the simulation environment to the Cloud improves by approximately 56 % and 62 % for the AWS EC2 and Rackspace General Purpose VM instances, respectively,

  4. 54 %, 2 %, and 61 % for the AWS EC2, Windows Azure, and Rackspace Compute Optimized VM instances, respectively, and

  5. 52 %, 19 %, and 63 % for the AWS EC2, Windows Azure, and Rackspace Memory Optimized VM instances, respectively.

When comparing the average performance improvement among the different optimized VM instances, the Compute Optimized and Memory Optimized instances enhance the performance by 12 % and 6 %, respectively.

Fig. 5. Average simulation latency per provider and VM category.

Fig. 6. Cost comparison (in January 2015 prices).

Fig. 7. Cost comparison extrapolated to 1 K simulation requests (in January 2015 prices).

Figure 4 shows the perceived latency for the individual requests. During the execution of the simulation environment in the Rackspace infrastructure, the performance varies considerably as the number of requests increases (see Fig. 4(d)). This performance variation is smaller in the on-premise, AWS EC2, and Windows Azure infrastructures (see Fig. 4(a), (b) and (c), respectively). In all scenarios, the network latency does not have an impact on the performance due to the nature of our experimental setup described in the previous section.

When comparing the performance improvement among the different VM instance categories, the Windows Azure infrastructure shows the greatest improvement when selecting a Compute Optimized or Memory Optimized VM instance over a General Purpose VM instance (see Fig. 4(c)).

Cost Comparison. Figures 6 and 7 present an overview of the costs per hour of usage published by the Cloud providers (referring to Table 1), and of the expected costs for running 1 K experiments among 10 users. The following pricing variations can be observed:

  1. The provisioning of on-premise resources shows on average a price increase of 65 %, 55 %, and 69 % for the micro, general purpose, and compute optimized VM instances, respectively. However,

  2. the provisioning of on-premise memory optimized instances incurs on average 16 % lower monetary costs.

  3. Amazon EC2 instances are on average 36 % cheaper when compared to the on-premise costs and to the remaining public Cloud services considered in this work.

  4. The incurred cost of hosting the simulation environment on-premise is on average 25$.

  5. When migrating the simulation infrastructure off-premise, the cost decreases on average by 80 %, 12 %, and 94 % when utilizing the AWS EC2, Windows Azure, and Rackspace IaaS services, respectively.

  6. When comparing the incurred costs among the different VM categories, the Memory Optimized categories are on average 61 % and 47 % more expensive when compared to the Compute Optimized and General Purpose VM categories, respectively.

  7. Among the different off-premise providers, Windows Azure is on average 900 % more expensive for running the simulation environment.

4.4 Discussion

The experiments conducted as part of this work have contributed to derive and report a two-dimensional analysis focusing on the selection among multiple IaaS offerings to deploy and run the OPAL Simulation Environment. With respect to performance, it can be concluded that:

  1. The migration of the simulation environment to off-premise Cloud services has an impact on the system's performance, which is beneficial or detrimental depending on the VM provider and category.

  2. The selection of Micro VM instances did not offer adequate availability to the simulation environment in the off-premise scenarios. This negative impact was caused by the lack of automatically allocated swap space for the system's virtual memory.

  3. When individually observing the performance within each VM category, the majority of the selected off-premise IaaS services improved the performance of the simulation environment. However, the General Purpose Windows Azure VM instances showed a performance degradation when compared to the other IaaS services in the same category.

  4. The latency perceived by the user was on average reduced when utilizing Compute Optimized VM instances. This improvement is in line with the compute-intensive requirements of the simulation environment.

The cost analysis derived the following conclusions:

  1. There is a significant monetary cost reduction when migrating the simulation environment to off-premise IaaS Cloud services.

  2. Despite the improved performance observed when running the simulation environment in the Compute Optimized and Memory Optimized VM instances, scaling the experiments to 1 K simulation requests produces an average cost increase of 9 % and 61 %, respectively, with respect to the General Purpose VM instances.

  3. The incurred monetary costs due to the usage of Windows Azure services tend to increase when using optimized VM instances, i.e. Compute Optimized and Memory Optimized. This behavior is reversed for the remaining off-premise and the on-premise scenarios.

  4. Due to the low prices of the Rackspace IaaS services (on average nearly 40 % lower), the final price for running 1 K simulations is considerably lower than that of the other off-premise providers and of hosting the environment on-premise.

The previous observations showed that the IaaS services provided by Rackspace are the most suitable for migrating our OPAL Simulation Environment. However, additional requirements may conflict with the migration decision for further simulation environments, e.g. requirements related to data privacy and data transfer between the EU and USA regions, as Rackspace offers a limited set of optimized VMs in its European region.

5 Related Work

We consider our work related to the following major research areas: performance evaluation of workflow engines, workflow execution in the Cloud, and migration and execution of scientific workflows in the Cloud.

When it comes to evaluating the performance of general-purpose or scientific workflow engines, a standardized benchmark is not yet available. A first step towards this direction is discussed in [18], but the proposed approach is still premature and could not be used as the basis for this work. Beyond this, performance evaluations are usually custom to specific project needs. Specifically for BPEL engines, not much work is currently available. For example, [19] summarizes nine approaches that evaluate the performance of BPEL engines. In most of the cases, workflow engines are benchmarked with load tests whose workload consists of 1–4 workflows. Throughput and latency are the metrics used most frequently.

There are only few Cloud providers supporting the deployment and execution of workflows as a Platform-as-a-Service (PaaS) solution. Examples are the WSO2 Stratos Business Process Server [20] and Business Processes on the Cloud offered by the IBM Business Process Manager. These offer the necessary tools and abstraction levels for developing, deploying, and monitoring workflows in the Cloud. However, such services are optimized for business tasks, rather than for supporting simulation operations.

Scientific Workflow Management Systems exploit business workflow concepts and technologies to support scientists in the use of scientific applications [2, 21]. Zhao et al. [6] develop a service framework for integrating Scientific Workflow Management Systems in the Cloud in order to leverage its scalability and on-demand resource allocation capabilities. The evaluation of their approach mostly focuses on examining the efficiency of their proposed PaaS-based framework.

Simulation experiments are conducted in the scope of different works [14, 15]. Later research efforts focused on the migration of simulations to the Cloud. Due to the diverse benefits of Cloud environments, these approaches evaluate the migration with respect to different scopes. The approaches that study the impact of the migration on performance and incurred monetary costs are the most relevant to our work. In [4] the authors examine the performance of X-Ray Crystallography workflows executed on the SciCumulus middleware deployed in Amazon EC2. Such workflows are CPU-intensive and require highly parallel execution techniques. Likewise, in [3] the authors compare the performance of scientific workflows executed on Amazon EC2 against a typical High Performance Computing system (NCSA's Abe). In both approaches the authors conclude that the migration to the Cloud can be viable, but not as efficient as High Performance Computing environments. However, Cloud environments allow the provisioning of specific resource configurations irregularly during the execution of simulation experiments [22]. Moreover, the performance improvements observed in Cloud services provide the necessary flexibility for reserving and releasing resources on-demand while reducing capital expenditures [23]. Research in this direction is a fertile field. Juve et al. [24] execute nontrivial scientific workflow applications on grid, public, and private Cloud infrastructures to evaluate the deployment of workflows in the Cloud in terms of setup, usability, cost, resource availability, and performance. This work can be considered complementary to our approach, although we focused on investigating further public Cloud providers and took into account the different VM optimization categories.

Further Cloud application migration assessment frameworks, such as CloudSim [25] or CloudMIG [26], focus on estimating the benefit of using Cloud resources under different configurations. However, the vast majority rely on simulation techniques, which require the definition of a corresponding behavioral model for each Cloud. Moreover, such approaches solely target the application's QoS dimension, while in our work we aim at bridging and comparing the trade-off between the observed performance and the incurred monetary costs.

6 Conclusion and Future Work

Simulation workflows have been widely used in the eScience domain due to the ease with which they can be modeled and because of their flexible and automated runtime properties. The characteristics of such workflows, together with the usage patterns of simulation environments, make these types of systems well suited to profit from the advantages brought by the Cloud computing paradigm. The existence of a vast number of Cloud services, together with the complexity introduced by the different pricing models, makes it challenging to efficiently select the Cloud service on which to host the simulation environment. The main goal of this investigation is to report the performance and incurred monetary cost findings when migrating the previously realized OPAL simulation environment to different IaaS solutions.

A first step in this experimental work consisted of selecting a set of potential IaaS offerings suitable for our simulation environment. The result of this selection covered four major deployment scenarios: (i) our on-premise infrastructure, and (ii) three off-premise infrastructures (AWS EC2, Windows Azure, and Rackspace). The selection of the IaaS offerings consisted of evaluating the different providers and their corresponding optimized VM instances (Micro, General Purpose, Compute Optimized, and Memory Optimized). The simulation environment was migrated and its performance was evaluated using an artificial workload. A second step in our analysis consisted of extrapolating the obtained results towards estimating the incurred costs of running the simulation environment on- and off-premise. The analyses showed a beneficial impact on the performance and a significant reduction of monetary costs when migrating the simulation environment to the majority of the off-premise Cloud offerings.

The efforts in this work contribute towards the assessment of the migration of applications to the Cloud, as defined in [27]. More specifically, in this work we cover the subset of tasks relevant to the selection and configuration of Cloud resources for distributing the application, w.r.t. their performance and the incurred monetary costs. Despite our efforts towards analyzing and finding the most efficient Cloud provider and service to deploy and run our simulation environment, our experiments solely focused on IaaS offerings.

Future work will focus on analyzing further service models, i.e. Platform-as-a-Service (PaaS) and Database-as-a-Service (DBaaS), as well as on evaluating the distribution of the different components that constitute the simulation environment among multiple Cloud offerings. Investigating different autoscaling techniques and resource configuration possibilities is also part of our future work, e.g. feeding the application distribution system proposed in [28] with such empirical observations.