1 Introduction

The recent appearance, evolution and massive expansion of social media-based technologies, such as online social networks, blogs, microblogs and news media, in conjunction with what is currently known as the Internet of Things, results in data production on a huge scale [2]. Such data provide the opportunity for social scientists to conduct a wide variety of research analyses and have proved to be of great interest. However, performing a longitudinal analysis of these data becomes a Big-Data problem, since the data are produced continuously all around the world. This hampers data harvesting, storage and analysis using traditional tools or processing infrastructures.

To tackle this problem, new processing paradigms and computational environments have arisen and become of real interest within a wide range of research fields. One of the main contributions in this area has been the Map/Reduce paradigm and its open-source implementation, Hadoop [3]. Simultaneously, the growing adoption of the Cloud computing paradigm, with its benefits in terms of storage, computing power and flexibility, offers a way to handle these massive amounts of data at reasonable cost. Moreover, Cloud computing provides several features that are of interest in conjunction with Hadoop, such as high availability and distributed environment provisioning.

As an example, the Cardiff Online Social Media Observatory (COSMOS) [7] uses both Cloud and Hadoop technologies to conduct longitudinal analysis of social media data [6]. The COSMOS platform aims to provide mechanisms to capture, analyse and visualise data harvested from online repositories and feeds, in particular interactive and openly accessible sites such as Twitter.

On the other hand, since Hadoop requires a distributed environment (e.g., a cluster or virtual cluster) to execute any Hadoop application, the number of resources dedicated to this task (i.e., the number of virtual machines in a Hadoop virtual cluster) determines the application performance. Hence, the Cloud pay-per-use model must be taken into account to minimise cost, since the number of virtual machines (VMs) hired for a certain study translates directly into operational expenses. Nowadays, Cloud resources are hired without any kind of exhaustive prior study, which affects both the overall performance of the applications and the associated costs.

This work tackles the difficult challenge of analysing the trade-off between the number of workers, the processing time and the resource cost. More precisely, this paper provides a thorough study of the Map/Reduce framework founded on a rigorous theoretical framework. Specifically, we use the timed process algebra Bounded True Concurrency (BTC) [15], which has been developed by the authors of this paper. The main feature of BTC that makes it suitable for this work is that it takes into account that the resources in a system must be shared by all the processes. It should be pointed out that we extended the Communicating Sequential Processes (CSP) [9] syntax to consider the duration of actions by means of a timed prefix operator. The operational semantics was also extended to consider the context (resources) in which processes are executed.

Thus, the main aim of this paper is to conduct a formal study in which the parameters of the system can be easily modified to obtain new results, in contrast to the usual simulation methodology, where a huge number of experiments must be carried out to draw conclusions that may not extrapolate when the initial conditions change. Furthermore, our conclusions are obtained from the study of all possible executions, so their accuracy does not depend on the number of experiments.

Furthermore, this work goes beyond analysing relevant system properties such as correctness, non-determinism, absence of deadlocks or termination. Special attention is paid to performance analysis, with the ultimate goal of optimising the performance and cost parameters. To carry out this formal analysis, the BAL tool [16] (also developed by the authors of this paper) is used.

Summing up, the main motivation of this work is to develop a formal model that allows users and service managers to evaluate cost and performance in terms of the deployment strategy, or to choose the best deployment strategy in terms of the expected cost/performance, with the objective of better Cloud resource hiring under the user's time and cost restrictions.

In detail, our specific contributions in this work are:

  • A process-algebraic formalisation of the Map/Reduce paradigm.

  • A model of the behaviour of the application SentiStrength for the Hadoop module.

  • Performance evaluation from the process-algebraic formalisation.

  • Validation of the results against an actual implementation.

  • Use of the results to reduce the operational expenses on IT resources by helping to choose the optimal resource hiring.

This paper is structured as follows. In Sect. 2, related work is presented. Next, Sect. 3 provides a brief overview of the formalism (the process algebra BTC) and the tool (BAL) used in this paper. The formal modelling is carried out in Sect. 4, describing the Map/Reduce workflow and its specification by means of BTC, which is one of the main contributions of this paper. Subsequently, the performance evaluation carried out on the formal model is presented in Sect. 5. Section 6 validates the performance results obtained from the formal model against results obtained empirically on a real cluster environment. After that, the advantages that our work introduces in cost reduction and performance expectations are described in Sect. 7. Finally, the conclusions derived from the work and ideas for future work are presented in Sect. 8.

2 Related works

To the best of the authors' knowledge, this work is the first formal approach to modelling and analysing Hadoop performance. Moreover, no work has been found that formally studies any of the most well-known implementations of the Map/Reduce paradigm.

From a formal point of view, there exist some studies regarding the Map/Reduce paradigm. In [12], an abstract model of Map/Reduce computation is constructed with the proof assistant Coq [18] to verify actual running code of Map/Reduce applications (expressed in terms of invariants). Thus, its aim is to analyse the running code of Hadoop applications, not to address cost–performance trade-offs.

Closer to the aim of this work is the one presented in [20], where CSP is used to formalise the Map/Reduce framework. The goal is to build a formal model that reflects the interaction between the components, as well as the whole process of one task. The model is used mainly to verify the error-handling component, and the authors propose as future work to translate the model into the automatic model checking tool FDR [8] and focus on verifying data consistency in the error-handling process. In contrast, the process algebra BTC used in this work enables us to develop a more complete model that captures both the temporal characteristics of the system and the context in which it runs (resources). This fact has a direct impact on the studies that can be performed on the model. Therefore, in addition to finding errors in behaviour, the functionality of Hadoop under different resource reservation configurations can also be studied (with the improvement in quality of service (QoS) and the cost reduction that this entails). Furthermore, the complete study can be carried out with a single tool.

In [10], the architectural issues of Hadoop are analysed empirically, whereas the work presented here focuses on tuning system parameters. Nevertheless, it would be of great interest to address these architectural design issues using formal methods and to combine that analysis with the work presented in this paper, since it could have a significant impact on improving the performance of Hadoop.

In [4], an extensive experiment was performed to study how the job configuration parameters affect the observed performance of Hadoop. The main difference with respect to this work lies in the method used to perform the analysis. Here, formal methods are used, which provide a clear advantage: the ability to recreate real environments without the time and cost problems that conducting experiments in such environments entails, namely occupying a set of resources for long periods of time, experiments that last too long, delays in execution, difficulty in repeating experiments, and difficulty in analysing real cases of large volume.

To conclude this section, all the works mentioned above (except  [10]) are intended to be useful to designers and developers who seek to redress errors and optimise the performance of Hadoop. In contrast, this work is also useful for end users of applications developed to be executed on Hadoop, as they will know the number of resources that need to be allocated to get the desired results according to their requirements.

3 Process algebra BTC and BAL tool

3.1 Process algebra BTC

In this section, we do not attempt to explain the timed process algebra BTC in depth. As its most recent presentation can be found in [14], we limit our explanation to a brief overview.

We should highlight the two main reasons that led us to choose this process algebra to carry out this model. First, the specifications obtained using this algebra are quite easy to understand. Concurrency, parallelism, resource sharing, conflicts, mutual exclusion and non-determinism are represented in a straightforward manner.

Second, BTC is a timed algebra, i.e., it allows performance analysis of systems to be carried out.

Moreover, BTC is based on the notion of true concurrency, where concurrent behaviour can be distinguished from interleaving behaviour by considering that \( (a|b) \not \equiv (a.b + b.a)\). This means that more than one action may be executed at the same time, whereas under interleaving just one action can be executed at any one time. Actually, BTC supports both alternatives. BTC is able to deal with different types of resources at the same time (heterogeneous resources) and draws a clear distinction between preemptable and non-preemptable ones (a preemptable resource can be preempted from a process and reallocated without side effects, while with a non-preemptable one problems can arise). This last feature is not used in the model presented here because our study is limited to the use of processors, but it is useful in case we want to extend the study by adding resources of other types.
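The timing consequence of this distinction can be illustrated with a trivial sketch (the durations are invented for illustration): under true concurrency, two independent timed actions can execute simultaneously if enough resources are available, so the elapsed time is the maximum of their durations, whereas under interleaving it is their sum.

```python
# Two independent timed actions a and b with hypothetical durations.
ta, tb = 3, 5

# True concurrency (a|b): both actions execute at the same time,
# provided enough resources are available.
true_concurrency = max(ta, tb)

# Interleaving (a.b + b.a): only one action executes at any one time.
interleaving = ta + tb

print(true_concurrency, interleaving)  # → 5 8
```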

BTC deals with three kinds of actions: timed actions (\(\mathcal{A}\textit{ct}_T\)) which use time (and resources if needed); untimed actions (\(\mathcal{A}\textit{ct}_U\)) which use neither time nor resources (actions for synchronisation); and special actions (\(\mathcal{A}\textit{ct}_S\)), which use resources but no time. Let \(c \in \mathcal{A}\textit{ct}_S\) be a special action, which is used to deal with non-preemptable resources which must be requested and released. Thus, for each non-preemptable resource there is an action c to request the resource and the corresponding conjugate action \(\widehat{c} \in \mathcal{A}\textit{ct}_S\) which is executed when the resource is released.

The syntax of BTC is defined by the following BNF (Backus–Naur Form) expression:

$$\begin{aligned} P \, {::=} \, { stop} \; |\; a.P \; |\; \langle b,\alpha \rangle .P \; | \; P\, \,\oplus \,\, P \; | \; P\, \,+\,\, P \; | \; P\parallel _{A}P\; |\; recX.P \end{aligned}$$

where \(A\subseteq \mathcal{A}\textit{ct}_U \), \(a\in (\mathcal{A}\textit{ct}_U \cup \mathcal{A}\textit{ct}_S)\), \(b\in \mathcal{A}\textit{ct}_T\), and \(\alpha \in \mathbb {N}\), where \(\mathbb {N}\) denotes the set of natural numbers. Furthermore, we assume a set of process variables Id ranged over by X, \(X'\), a set of processes \(\mathcal{P}\) ranged over by P, Q, R, and a set of actions \(\mathcal{A}\textit{ct}=\mathcal{A}\textit{ct}_U \cup \mathcal{A}\textit{ct}_T \cup \mathcal{A}\textit{ct}_S\).

Furthermore, this syntax has been extended with a view to representing both the actions that use the shared resources and the number of resources the system has at its disposal:

$$\begin{aligned} \llbracket P \rrbracket _{Z,\mathcal{N}} \end{aligned}$$

where \(Z = \{Z_1,\ Z_2,\ \ldots ,\ Z_m\}\), i.e., Z contains one set for each different type of shared resource (preemptable or non-preemptable). Moreover, \(m \in \mathbb {N}\) is the number of different types of shared resources available in the system and \(x_i \in \mathbb {N}\) is the number of actions which need at least one resource of type i for their execution. Each of these sets is defined as follows:

$$\begin{aligned} Z_i = \{b_1,\ b_2,\ \ldots ,\ b_{x_i},\ c_1,\ c_2,\ \ldots ,\ c_{x_i}\} \end{aligned}$$

and represents the set of actions which require for their execution at least one of the shared resources of type i. Note that conjugate actions are not included in \(Z_i\) just for the sake of clarity: if action \(c \in Z_i\), then \(\widehat{c}\) is assumed. We consider

$$\begin{aligned} \mathcal{N}= \{n_1,\ n_2,\ \ldots ,\ n_m\} \end{aligned}$$

where \(n_i \in \mathbb {N}\) represents the number of shared resources of type i available in the system.

By incorporating these changes, BTC is able to model processes which need to use different types of resources in their execution (heterogeneous resources), dealing with preemptable and non-preemptable resources and taking into account the number of all available resources in the system at any given time. This is an important step since this algebra is able to model any kind of delay which can appear in a system, i.e., BTC is able to deal with delays related to the synchronisation of processes and delays related to the allocation of resources.

The operational semantics provides the language operators with a meaning and an accurate interpretation by describing how a process evolves into another. The BTC operational semantics can be found in [15].

3.2 Toy example: a dining hall

Although the complete definition of the algebra BTC can be found in [15], its mathematical basis may not be trivial to understand. Therefore, in this section a toy example is used to help the reader's understanding.

In this example, a dining hall is modelled. Every user must pick up a tray from a tray dispatcher (there are only 25 trays) and go and get his food, which is served by three waiters. Then, the user eats his food, and when he finishes he pays at the till (there is only one).

There are three types of resources: tray, waiter and till. For the first one, the system has 25 units (there are 25 trays), for the second one there are three units (three waiters) and for the last one there is just one resource (the till). Then the set \(\mathcal{N}\) is:

$$\begin{aligned} \mathcal{N}= \{25,3,1\} \end{aligned}$$

Now, we are going to identify the actions which need a resource for their execution (in this case, there is just one action per type of resource) and we are going to obtain the sets \(Z_1\), \(Z_2\) and \(Z_3\).

The tray resource is non-preemptable and the remainder are preemptable. The actions which need a preemptable resource are \(get\_food_i\) and \(pay\_food_i\): \(get\_food_i\) needs a resource of type waiter, while \(pay\_food_i\) needs one of type till. Both of them have a duration (we work with time characteristics) and their resources can be preempted and reallocated without side effects.

In contrast, the action \(re\_tray_i\) needs a non-preemptable resource of type tray. When a user gets a tray, it is not possible to know how long he will need it, because that depends on the time required for getting the other resources he needs, so no duration is set. Besides, a tray that has been picked up by a user cannot be preempted, as the user wants his own tray, not anyone else's. Therefore, he requests a tray (\(re\_tray_i\)), which reduces the number of trays at the users' disposal (in \(\mathcal{N}\)), and keeps it until he does not need it any more. At that moment, he releases the tray (\(\widehat{re\_tray_i}\), the conjugate action), which increases the number of trays available in the system. Therefore, the set Z is composed of

$$\begin{aligned} Z_1 = \{re\_tray_i\}\quad Z_2 = \{get\_food_i\}\quad Z_3 = \{pay\_food_i\} \end{aligned}$$

Thus, a system composed of m users trying to eat will be modelled by the following BTC specification:

$$\begin{aligned} \begin{array}{l} \llbracket \hbox {sys}\_\hbox {dining}\_\hbox {hall} \rrbracket _{Z,\ \mathcal{N}} \equiv \llbracket user_1 \parallel \ldots \parallel user_i \parallel \ldots \parallel user_m \rrbracket _{Z,\ \mathcal{N}}\\ user_i \equiv \langle re\_tray_i,\mathcal{N}\rangle .\langle (get\_food_i,5),\mathcal{N}\rangle .\langle (eat_i,15),\mathcal{N}\rangle .\\ \qquad \quad \,\,\langle \widehat{re\_tray_i},\mathcal{N}\rangle .\langle (pay\_food_i,3),\mathcal{N}\rangle \\ \end{array} \end{aligned}$$
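As an informal illustration (not part of the BTC formalism), the resource discipline of this specification can be mimicked with counting semaphores: one semaphore per resource type, initialised with the counts from \(\mathcal{N}\). A minimal Python sketch, with durations scaled down so it runs quickly:

```python
import threading
import time

# One counting semaphore per resource type, initialised from N = {25, 3, 1}.
TRAYS = threading.Semaphore(25)   # non-preemptable: held across several steps
WAITERS = threading.Semaphore(3)
TILL = threading.Semaphore(1)
SCALE = 0.01                      # scale durations down for a quick run

def user(i, done):
    TRAYS.acquire()               # re_tray_i: request a tray
    with WAITERS:                 # get_food_i, duration 5
        time.sleep(5 * SCALE)
    time.sleep(15 * SCALE)        # eat_i, duration 15 (no shared resource)
    TRAYS.release()               # conjugate action: release the tray
    with TILL:                    # pay_food_i, duration 3
        time.sleep(3 * SCALE)
    done.append(i)

done = []
threads = [threading.Thread(target=user, args=(i, done)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))  # → 10
```

Note how the tray, being non-preemptable, is explicitly acquired and released around the whole meal, whereas the waiter and till are held only for the duration of a single timed action, mirroring the BTC specification above.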

3.3 BAL tool

With the formal resource-aware model, the timed characteristics of the system are captured. The next step is performance evaluation: we want to be able to estimate the minimum time needed to reach a given state. By applying the rules of the operational semantics, a transition graph is built, and the problem to solve is finding the shortest path from the initial node. Usually, the number of states in the transition graph of a real system is huge, so the ability of a tool to perform this task automatically becomes relevant. For this purpose, we use the BAL tool, which, after checking that the system specification is syntactically correct, builds the relevant transition graph and calculates the minimum time needed to reach the final state from the initial one. Evidently, the most delicate part of the tool is the graph analysis, where it is necessary to combine different algorithms for pruning, dynamic load balancing and parallelisation. Interested readers can find more information about the tool in [16], where it is presented. A general view of the BAL tool can be found in Fig. 1.
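At its core, the computation just described amounts to a single-source shortest-path search over a weighted transition graph where edge weights are action durations. The following sketch illustrates the idea on a toy graph (the states and weights are invented; this is not BAL's actual implementation, which adds pruning and parallelisation):

```python
import heapq

def min_time(graph, start, goal):
    """Dijkstra's algorithm: minimum accumulated duration needed to
    reach the goal state from the start state."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, state = heapq.heappop(queue)
        if state == goal:
            return d
        if d > dist.get(state, float('inf')):
            continue  # stale queue entry
        for nxt, w in graph.get(state, []):
            nd = d + w
            if nd < dist.get(nxt, float('inf')):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return None  # goal unreachable

# Toy transition graph: nodes are states, weights are action durations.
g = {'s0': [('s1', 5), ('s2', 2)],
     's1': [('s3', 1)],
     's2': [('s1', 1), ('s3', 7)],
     's3': []}
print(min_time(g, 's0', 's3'))  # → 4 (via s0 -> s2 -> s1 -> s3)
```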

Fig. 1 Structure of the BAL tool

4 Formal modelling

The basis of the Map/Reduce paradigm is to split the input data into smaller chunks and distribute them to the worker nodes, where they are processed; afterwards, the results are combined and collected (Fig. 2).
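This dataflow can be sketched sequentially in a few lines. The following toy example (a classic word count over three input "blocks") illustrates the split, map, group and reduce steps; it is not Hadoop's implementation:

```python
from collections import defaultdict

def map_reduce(chunks, map_fn, reduce_fn):
    """Minimal sequential sketch of the Map/Reduce dataflow:
    map each chunk to (key, value) pairs, group values by key,
    then reduce each group to a final result."""
    groups = defaultdict(list)
    for chunk in chunks:                  # Map phase (one chunk per worker)
        for key, value in map_fn(chunk):
            groups[key].append(value)     # shuffle: group values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # Reduce phase

# Word count over three input "blocks".
chunks = ["to be or", "not to be", "to see"]
result = map_reduce(chunks,
                    map_fn=lambda c: [(w, 1) for w in c.split()],
                    reduce_fn=lambda k, vs: sum(vs))
print(result['to'])  # → 3
```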

Fig. 2 Map/Reduce paradigm

Hadoop implements the Map/Reduce paradigm, with its two main tasks: Map and Reduce. Moreover, Hadoop introduces two additional tasks, SetUp and CleanUp, which do not interfere with the main processing tasks. Besides, Hadoop implements various sub-tasks within the Map and Reduce tasks that need to be modelled (Fig. 3).

The first task, SetUp, translates and sets the configuration specified by the Hadoop application. For example, this configuration specifies the input and output data (block) paths. In addition, optional parameters can be specified, such as in-task result combination or intermediate/output partitioning parameters.

After the SetUp task, each block of information flows through the tasks inherent to the Map/Reduce paradigm: Map and Reduce.

The Map task performs the first computational processing of the input block. It consists of two sub-tasks, which are in charge of reading the information from the Hadoop Distributed File System (HDFS) (Record Reader) and, after that, performing the first processing (Map sub-task).

Once a block has finished its Map task, it is passed to the Reduce task, which can be performed by another resource, since it is independent of the Map task (as specified in the Map/Reduce paradigm).

Then, the Reduce task performs a second processing stage that is more complex than the one carried out by the Map task. In this task, the input is preprocessed by shuffling and sorting it in the Shuffle and Sort sub-tasks, respectively. Once preprocessed, the blocks are processed within the Reduce sub-task, and the results are then stored in the HDFS (Output sub-task).

Finally, the last task, called CleanUp, coordinates the clean-up of temporary files and updates the performance results obtained from the application execution. This behaviour is applied to every input information block.

Fig. 3 Hadoop Map/Reduce tasks per Block

On the other hand, understanding how this paradigm works in a distributed environment (e.g., virtual clusters) is not so trivial, since Hadoop tries to maximise resource utilisation. In more detail, the Hadoop architecture consists of a Master resource and a number of Worker resources. The Master resource coordinates all Worker resources, distributing the information blocks, allocating the tasks to perform and keeping track of their performance.

Some peculiarities relating to the way in which Hadoop works need to be highlighted. First, until all blocks have finished their own SetUp task (\(t_{sm}\)), none of them can begin the next task. Second, the Map and Reduce tasks (of different blocks) can overlap over time (Fig. 4), but with an important special feature: any block which must run its Map task has priority in resource allocation. This means that, while a Map task remains to be performed, no Reduce task can begin. Hence, as the Map tasks finish, resources are released and Reduce tasks can begin. Consequently, at some point, half of the resources are performing Map tasks and the other half Reduce tasks (\(t_{h}\)). Finally, all resources will be performing Reduce tasks (\(t_{em}\)) until all of them finish.

At this point, a new synchronisation occurs since no CleanUp task starts until all blocks have completed their Reduce task (\(t_{er}\)).

With this behaviour, Hadoop keeps all available resources busy, exploiting the possible concurrency that exists within the Map and Reduce tasks.
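This scheduling discipline (a free worker always prefers a pending Map task, and a block's Reduce task becomes eligible only once its own Map task has finished) can be sketched as a small discrete-event simulation. The durations below are hypothetical and the model deliberately ignores Hadoop details such as data locality and the SetUp/CleanUp phases:

```python
import heapq

def makespan(workers, blocks, t_map, t_reduce):
    """Discrete-event sketch of the overlap behaviour: a free worker
    always prefers a pending Map task; a block's Reduce task becomes
    eligible only once its own Map task has finished."""
    maps_left = blocks        # Map tasks not yet started
    reduces_ready = 0         # blocks whose Map is done, Reduce pending
    free = workers
    events = []               # (finish_time, task_kind) min-heap
    now, done = 0.0, 0
    while done < blocks:
        # Allocate free workers: pending Map tasks have strict priority.
        while free and maps_left:
            heapq.heappush(events, (now + t_map, 'map'))
            free -= 1
            maps_left -= 1
        while free and reduces_ready:
            heapq.heappush(events, (now + t_reduce, 'reduce'))
            free -= 1
            reduces_ready -= 1
        now, kind = heapq.heappop(events)   # advance to next completion
        free += 1
        if kind == 'map':
            reduces_ready += 1              # this block may now Reduce
        else:
            done += 1
    return now

# 30 equal blocks on 5 workers with hypothetical per-task durations.
print(makespan(5, 30, 4.0, 6.0))  # → 60.0
```

With uneven block counts (e.g., 7 blocks on 5 workers) the simulation exhibits the mixed interval \(t_{h}\) described above, with some workers on Map tasks and others on Reduce tasks at the same time.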

Fig. 4 Map/Reduce operation

All Hadoop applications follow this workflow and behaviour within a distributed environment.

At this point, we can start to develop the formal model of the system using the process algebra BTC. For a better understanding, this task has been split into five phases.

  1.

    Each single element in the system must be taken into consideration for the purpose of deciding how it is going to be modelled and which are its characteristics: what is a process, what is a resource, and what type of resource (preemptable or non-preemptable).

    By studying the system, it is found that there is only one type of shared resource: Worker, which models the behaviour of the worker VMs. Blocks will be modelled as processes.

  2.

    For each type of shared resource, we define a set consisting of the actions which need at least one resource of this type for their execution. In this case, there is only one type of shared resource, so the set Z consists of the single set \(Z_1\), which contains the actions that need the resource Worker for their execution. Therefore, without loss of generality, we can identify \(Z_1\) with Z. Since the resource Worker is non-preemptable, a process must make a request when an action needs a resource of this type for its execution and release it when it is no longer necessary, that is, when the whole action finishes. Therefore, just two actions can be found in its Z set: the action \(act\_worker\) requests the resource, and the conjugate action \(\widehat{act\_worker}\) releases it. Therefore, \( Z = \{act\_worker \}\) (in BTC syntax, conjugate actions belong by default to set Z).

  3.

    It will be necessary to establish the number of resources of each type available in the system and how long each action takes.

    Non-preemptable resources are handled through actions that use resources but no time, so we focus our attention on the number of resources. The number of workers available in the system is represented by the set \(\mathcal{N}\). In the performance evaluation, this value will be changed to study different scenarios, with the intention of comparing them and making the decision that best suits the requirements.

  4.

    The traces of the different processes still have to be incorporated to complete the specification of the system. The trace of the BLOCK processes is quite intuitive; they must go through all the phases that define the Map/Reduce paradigm:

    $$\begin{aligned} BLOCK \equiv SETUP.MAP.REDUCE.CLEAN \end{aligned}$$

    The SETUP process models the SetUp phase and is in charge of establishing the system parameters. The next phase (Map) cannot start until all the processes have completed their SetUp phase, so a synchronisation is necessary. This synchronisation between processes has been modelled by the untimed actions synR and synRR. The number of these actions depends on the amount of data to be processed; therefore, we must define as many untimed actions as there are blocks to be analysed. Hence, the specifications of these processes are:

    $$\begin{aligned}&SETUP \equiv \langle setup, tS \rangle .synR.synRR \\&SYN\_SETUP \equiv synR. \ldots .synR.synRR.synRR \ldots .synRR \end{aligned}$$

    Now, the MAP and REDUCE processes that model the Map and Reduce phases must be defined. The Map and Reduce phases of different blocks overlap. The Reduce phase of each block does not start until its own Map phase has finished, but, since every available resource is dedicated to the Map phase while any block remains in it (the Map phase has priority), the Reduce task cannot begin for any block while any block is still in the Map phase. This behaviour has been formalised by the process OVERLAP, and the overlap between processes has been modelled by the untimed actions synS and synSS:

    $$\begin{aligned} MAP\equiv & {} \langle act\_worker\rangle .synS.\langle recordReader, tRr\rangle . \\&\langle map, tM\rangle .\langle \widehat{act\_worker}\rangle .synSS\\ REDUCE\equiv & {} \langle act\_worker\rangle .\langle shuffle, tSh\rangle .\langle sort, tSrt\rangle .\\&\langle reduce, tR\rangle .\langle output, tOpt\rangle .\langle \widehat{act\_worker}\rangle \\ OVERLAP\equiv & {} synS. \ldots .synS.synSS.synSS. \ldots .synSS \end{aligned}$$

    Again, a synchronisation is necessary before the last phase (CleanUp) starts: all processes must have finished their Reduce phase before any CleanUp phase begins. This synchronisation has been modelled by the untimed actions synC and synCC. The specifications of these processes are:

    $$\begin{aligned}&CLEAN \equiv synC.synCC.\langle clean, tC\rangle \\&SYN\_CLEANUP \equiv synC. \ldots .synC.synCC.synCC \ldots .synCC \end{aligned}$$
  5.

    Finally, the whole system, where all these processes are running in parallel, is formalised as follows:

    $$\begin{aligned}&\llbracket \hbox {sys}\_\hbox {Hadoop} \rrbracket _{Z,\ \mathcal{N}} \equiv \llbracket BLOCK \parallel \ldots \parallel BLOCK \parallel OVERLAP \parallel \\&\qquad \qquad \qquad \qquad \qquad \qquad SYN\_CLEANUP \parallel SYN\_SETUP \rrbracket _{Z,\mathcal{N}} \end{aligned}$$

    Once the trace of each of these processes is added, the system specification is completed.

    The specification uses input parameters so that it can model the special characteristics of any application running on Hadoop. These parameters specify the duration of the different timed actions in the specification; Table 1 shows the relationship between the parameters and the tasks. As shown in the specification, the name of each action matches the name of the corresponding task/sub-task.

Table 1 Input parameters

Finally, Fig. 5 shows the complete specification of the whole Hadoop system modelled in BTC. This specification is reminiscent of the BSP programming model [19], where every superstep consists of three ordered phases: simultaneous local computation in each process, communication actions among the processes, and a barrier synchronisation which waits for all of the communication actions to complete. However, this is not the case in all the phases of the Hadoop specification. We find barrier synchronisations after the SetUp, Reduce and CleanUp tasks, but not between Map and Reduce, where the two phases overlap and some blocks can be in the Reduce phase while other blocks are still in the Map phase.
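The barrier synchronisations in the specification (e.g., the synR/synRR pair after SetUp) behave like a standard barrier primitive: no block proceeds until every block has arrived. An illustrative Python sketch of this behaviour (not the BTC semantics themselves):

```python
import threading

# No block may enter its Map phase until every block has finished SetUp,
# just as synR/synRR enforce in the BTC specification.
N_BLOCKS = 8
setup_done = threading.Barrier(N_BLOCKS)
order = []  # appended-to by all threads; list.append is atomic in CPython

def block(i):
    order.append(('setup', i))   # SetUp phase
    setup_done.wait()            # barrier: wait for all SetUps to finish
    order.append(('map', i))     # Map phase may now begin

threads = [threading.Thread(target=block, args=(i,)) for i in range(N_BLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The first N_BLOCKS events are guaranteed to be SetUp events.
print(all(kind == 'setup' for kind, _ in order[:N_BLOCKS]))  # → True
```

In contrast, the Map/Reduce hand-over modelled by OVERLAP is not a barrier of this kind, since Reduce work may start while other blocks' Map tasks are still running.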

Fig. 5 BTC specification of Hadoop

5 Performance evaluation

The specification developed considers both the timed characteristics of the system (duration of actions) and the context (resources) in which processes are executed. That is, it takes into account that the workers in the system must be shared by all the processes that are analysing blocks. Therefore, if more workers are needed than are available, not all actions can be executed simultaneously: a process has to wait until it is allocated the resources needed to continue its execution. This means that there exist two kinds of delays in the execution of a process: delays related to the synchronisation of processes, and delays related to the allocation of resources. The former is usual in a (theoretical) concurrent context, but the latter is only taken into account if we consider a limited number of available resources.

Thanks to this important characteristic, this specification can be used for performance evaluation; in particular, accurate conclusions can be drawn about the temporal behaviour of the system under different scenarios. Up to now, resource allocation in distributed environments and the contracting of cloud services have been carried out without any previous comprehensive study. Hadoop runs on a Cloud pay-per-use model, and the cluster easily grows in size to hundreds or thousands of machines. Hence, the allocation and contracting of resources can become an arduous task, since contracting an unsuitable number of resources may result in a loss of time and/or money: as the number of resources required to perform a task is not known in advance, unnecessary resources may be contracted (which has an economic cost) or, on the contrary, a reservation with too few resources might result in tasks taking longer to finish. One of the main challenges for the users of these technologies is to know how the systems behave according to the set of resources.

Therefore, the focus now lies on studying the Hadoop framework to obtain the utmost performance with the minimum number of resources or minimum cost. For this purpose, a temporal analysis of the Hadoop behaviour has been performed. The results show the performance in terms of the number of resources needed. This analysis allows users to know the configuration that best suits their requirements, and also helps Cloud providers to establish their service catalogues.
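For instance, once runtime estimates for several worker counts are available (in our case, from the formal model), picking the cheapest configuration that still meets a deadline is a simple selection problem. The runtimes and price below are hypothetical placeholders, not results from this paper:

```python
# Hypothetical pay-per-use rate and estimated runtimes (hours) per
# worker count; in practice these estimates come from the formal model.
PRICE_PER_VM_HOUR = 0.10
est_runtime_h = {2: 6.0, 4: 3.2, 8: 1.9, 16: 1.3, 32: 1.1}

def best_configuration(deadline_h):
    """Cheapest worker count whose estimated runtime meets the deadline.
    Cost = workers * runtime * hourly price (pay-per-use model)."""
    feasible = [(w * t * PRICE_PER_VM_HOUR, w)
                for w, t in est_runtime_h.items() if t <= deadline_h]
    return min(feasible)[1] if feasible else None  # None: deadline infeasible

print(best_configuration(2.0))  # → 8
```

Note the diminishing returns in the (made-up) table: doubling the workers beyond a certain point barely shortens the run while the bill keeps growing, which is exactly the trade-off the formal analysis is meant to expose.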

The number of resources needed depends mainly on the type of application and the volume of data to be processed. Therefore, to be able to carry out the performance evaluation, we had to choose a concrete application: the SentiStrength [17] tool for the Hadoop module, which conducts longitudinal analysis of social media data. This application is used by the COSMOS [7] project, whose objective is to translate the underlying social observation and analysis mechanisms into an embedded research tool that supports the development and execution of social media research analysis. Although this platform provides real-time data ingest, analysis and visualisation, some studies require longitudinal analysis over a wide time period, such as the sentiment analysis conducted by the SentiStrength [17] tool within a Hadoop-based application.

For this study, the application has performed the sentiment analysis of \(\approx 10\) million tweets (\(\approx \)1.5 GB of plain text), which have been split into 30 blocks of equal size. The number of tweets analysed was chosen following the advice of COSMOS project researchers, who consider it a significant amount of data for this study. The block size corresponds to the one automatically set by Hadoop for the case studied.

At this point, we have the specification that models the Hadoop behaviour, the application to be studied and the volume of data to be processed. Next, this information must be included in the specification using the BAL tool. This is an easy task, since the BAL tool is user-friendly and includes a graphical interface and assistant. The next steps are:

  • Replacing the input parameters in our specification to model the behaviour of the SentiStrength for Hadoop module (duration of actions). The variables tS, tC, tRr, tM, tSh, tSrt, tR and tOpt must be replaced with their respective durations (Table 2). This can be done directly in the system specification with the BAL editor, or by using an option that does it automatically. These values have been obtained from real experiments performed on our Cloud infrastructure (with a performance similar to that provided by Amazon EC2) with a 5-worker Hadoop virtual cluster.

  • Providing the number of data blocks.

  • Establishing the number of workers (variable n).

  • The BAL tool checks the syntax of the specification by means of its syntactic analyser.

  • If the system specification is syntactically correct, the tool builds the relevant transition graph by applying the rules of the operational semantics. This graph is a weighted directed graph, where weights are always positive numbers as we can abstract the information about actions and consider only the information about time (duration of actions).

  • BAL carries out the performance analysis, that is, the minimum time needed to reach the final state from the initial one is calculated. The result is the time that the SentiStrength application takes to analyse this number of tweets.
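The last two steps amount to a standard shortest-path computation: since the transition graph is a weighted directed graph with positive durations, the minimum time from the initial to the final state is the length of a shortest path. The sketch below illustrates this with a hypothetical toy graph; it is not the actual SentiStrength specification, whose graph the BAL tool builds automatically.

```python
import heapq

def min_time(graph, start, goal):
    """Dijkstra's algorithm: minimum total duration from start to goal
    in a weighted directed graph with positive edge weights."""
    dist = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        d, state = heapq.heappop(queue)
        if state == goal:
            return d
        if d > dist.get(state, float("inf")):
            continue
        for nxt, w in graph.get(state, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return float("inf")

# Hypothetical toy transition graph; weights are action durations in seconds.
toy = {
    "init":    [("map", 4.0)],
    "map":     [("shuffle", 2.5)],
    "shuffle": [("reduce", 3.0)],
    "reduce":  [("final", 1.0)],
}
total = min_time(toy, "init", "final")  # 4.0 + 2.5 + 3.0 + 1.0 = 10.5
```

Because all weights are positive, Dijkstra's algorithm is applicable and the result is exact, which matches the abstraction described above (actions discarded, durations kept).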

Table 2 Parameter values

Following this methodology, the system has been studied for several numbers of workers (from 2 to 9) and the results are shown in Table 3.

Table 3 30 blocks results

According to the results, the time the system takes to analyse the tweets is roughly inversely proportional to the number of workers, which is within expectations: time decreases as the number of resources increases. However, the improvement obtained by adding resources shows diminishing returns (as shown in Fig. 6), reflected in the flattening slope of the curve: the improvement in time when going from 2 to 3 workers is 14 m 30 s, while the improvement when moving from 7 to 8 workers is only 1 m 3 s.
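This diminishing-returns pattern is exactly what a simple fixed-overhead timing model predicts. The sketch below uses hypothetical durations (the real ones come from Table 2 via the BAL tool) merely to reproduce the shape of the curve:

```python
def runtime(serial_s, parallel_s, workers):
    """Toy timing model: a fixed serial part plus perfectly divisible parallel work."""
    return serial_s + parallel_s / workers

serial, parallel = 120.0, 3600.0  # hypothetical durations in seconds
times = {n: runtime(serial, parallel, n) for n in range(2, 10)}
# Adding the n-th worker saves parallel / (n * (n - 1)) seconds, so gains shrink:
gains = {n: times[n - 1] - times[n] for n in range(3, 10)}
```

Under this model the 2-to-3 gain is 600 s while the 7-to-8 gain is about 64 s, the same qualitative behaviour observed in Table 3.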

Using this information, users could decide on the number of workers that fulfils their timing requirements. But this is not enough because, as stated above, Hadoop runs on a Cloud pay-per-use model where services are usually hired in time slots, so an increase in the number of workers does not always entail an increase in price. This issue will be studied in Sect. 7, but first it is necessary to check that the data obtained from the model are consistent with those obtained by experimentation on a real system. The validation of the model is presented in the next section.

6 Validation

We used the Cloud infrastructure deployed at the University of Castilla-La Mancha (UCLM), known as Vesuvius, to compare and validate the real performance observed against the results obtained from the formal model presented in this work. A physical compute node from the UCLM local Cloud infrastructure has been used, composed of 2 Xeon E5462 CPUs (4 cores each), 32 GB of main memory and 60 GB of storage. The Cloud infrastructure headnode has the same configuration but with 1 TB of storage, shared among the compute nodes of the infrastructure via NFS over a Gigabit Ethernet network. The operating system (OS) deployed is CentOS 6.2 Linux [5].

Fig. 6 Performance comparison

The virtualisation software deployed across the Cloud infrastructure is KVM [11], and to manage all compute nodes and VMs, OpenNebula [13] is deployed within the Cluster headnode.

As a result, a number of VMs have been deployed to validate the proposal presented in this work. Specifically, we have deployed 3 virtual clusters (VCs) consisting of 5, 6 and 7 VMs, that is, 4, 5 and 6 workers, respectively. Although the formal model has provided performance figures for bigger VCs, the number of cores available within the physical compute node (8 cores) limits the number of VMs that can be deployed.

Since the Cloud infrastructure hardware is equivalent to that of the Amazon EC2 Cloud, and in order to perform a meaningful comparison, each VM is configured to match the specifications of an Amazon EC2 M1.small instance, that is: 1 EC2 compute unit, 1.7 GB RAM and 8 GB HDD. As a result, the performance achieved on the UCLM infrastructure can be considered equivalent to that provided by the Amazon service. Moreover, this allows us to apply the pricing rates established by Amazon for each VM hired.

Finally, Hadoop is deployed and configured over these VCs, and the COSMOS sentiment analyser application is executed multiple times. The input is a group of files with \(\sim \)10 million tweets (\(\sim \)1.5 GB of plain text) divided into 30 blocks.

The results obtained from the execution on the Cloud infrastructure, compared with the results achieved by the formal model (Fig. 6), show that, for the given parameters, the performance predicted by the formal model closely matches the effective performance on a real Cloud environment.

Although the performance obtained from the real Cloud environment shows some variability, due to Cloud computing resource management (conditioned by external processes and OS-related actions), the trend closely follows the average time for each VC. This fact is very important, since it can be used to determine the average performance and to measure the variability (percentage range), allowing the risk of accepting Service Level Agreements (SLAs) with time restrictions to be evaluated. This extra information can help service providers minimise the number of violated SLAs, but also benefits Cloud users, since they perceive an improvement in the QoS received in terms of confidence.
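The average and percentage-range variability mentioned above can be computed directly from a set of measured run times. In the sketch below the measurements are hypothetical placeholders, not actual Vesuvius results:

```python
def variability(run_times_s):
    """Mean run time and percentage range (max-min spread relative to the mean)."""
    mean = sum(run_times_s) / len(run_times_s)
    pct_range = 100.0 * (max(run_times_s) - min(run_times_s)) / mean
    return mean, pct_range

runs = [1850, 1910, 1795, 1880, 1865]  # hypothetical run times in seconds
mean, spread = variability(runs)
```

A provider negotiating an SLA with a deadline could then compare the deadline against the mean plus the observed spread rather than against a single measurement.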

Furthermore, this result reveals the limits on the effective performance that can be achieved by increasing the number of VMs within a Hadoop VC, letting us choose the best deployment strategy in terms of expected performance and the associated cost in terms of the number of VMs in each VC.

7 Performance–cost tradeoff

The formal model developed in this work can be used to reduce operational expenses on IT resources by helping to choose the optimal service provider and resource hiring. For example, various IaaS (Infrastructure as a Service) providers are available (e.g., Amazon EC2, IBM Blue Cloud, HP Flexible Computing). Although the formal model can determine the optimal infrastructure that suits the performance and QoS requirements, it is also necessary to know and evaluate the economic cost that those IaaS providers apply in order to determine the most suitable service provider. That is, the model can help in choosing the best cost–performance/QoS ratio agreement, which leads to achieving the highest performance at the lowest acceptable cost.

Once the formal model has been validated, to provide a better example of how it can help users achieve the best cost–performance ratio agreement, the performance evaluation has been extended to a higher number of information blocks (300 blocks). These blocks contain data from a real study covering the sentiment analysis of 100 million tweets over a \(\sim \)1-month observation period.

Since the virtual infrastructure used to validate the formal model follows the specifications of Amazon EC2 M1.small instances, the hiring costs stated by Amazon have been considered. To this end, we use the “Simple Monthly Calculator” [1] tool, which provides an automated hiring cost in terms of the number of resources, the time required and the location (for this example, the US East (Virginia) rates). Amazon EC2 bills in one-hour time slots.
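Because billing is per started hour per VM, the hiring cost of one run can be estimated as below. The hourly rate used here is a hypothetical placeholder, not Amazon's actual M1.small price:

```python
import math

def run_cost(runtime_minutes, n_vms, hourly_rate_usd):
    """Cost when each VM (master + workers) is billed per started one-hour slot."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * n_vms * hourly_rate_usd

# With hour slots, a larger cluster can even be cheaper if it finishes
# within fewer slots (hypothetical runtimes and rate):
slow = run_cost(65, 7, 0.06)  # 2 billed slots x 7 VMs
fast = run_cost(55, 8, 0.06)  # 1 billed slot  x 8 VMs
```

This slot effect is why, as shown next, the cost column of Table 4 is not monotone in the number of workers.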

Table 4 shows the time and cost required to perform a longitudinal sentiment analysis of 100 million tweets (divided into 300 blocks) within the Amazon EC2 Cloud. The time has been obtained for different configurations (1 master VM + # worker VMs) by the BAL tool using the model presented in this paper, and the cost has been calculated by means of the “Simple Monthly Calculator”. It is interesting to note that the cost is not directly related to the number of resources used, since it also depends on the number of time slots.

Table 4 300 blocks results

We now face a multi-objective optimisation problem where we try to optimise simultaneously two objectives: execution time and cost. For non-trivial multi-objective optimisation problems, there does not exist a single solution that simultaneously optimises each objective. Thus, the optimal decision needs to be taken in the presence of trade-offs between the two conflicting objectives.

According to multi-objective optimisation theory, “the Pareto optimum solution set is defined as a set consisting of feasible solutions in each of which there exist no other feasible solutions that will yield an improvement in one objective without causing degradation in at least one other objective” [21]. An individual solution is called non-dominated if none of the objective functions can be improved in value without degrading some of the other objective values, and the Pareto frontier is the set of non-dominated solutions.
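These definitions translate directly into code. The sketch below computes the non-dominated set for hypothetical (time, cost) pairs; the values are illustrative placeholders, not the actual Table 4 figures:

```python
def pareto_frontier(points):
    """Return the non-dominated (time, cost) points: a point is dominated if
    some other point is at least as good in both objectives and different."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# Hypothetical (minutes, cost-in-USD) pairs for 2..9 workers:
points = [(300, 12), (240, 12), (180, 10), (150, 9),
          (130, 8), (115, 7), (110, 6), (105, 7)]
frontier = pareto_frontier(points)  # [(110, 6), (105, 7)]
```

In this toy data the last two points play the same role as the 8- and 9-worker configurations discussed below: each improves one objective at the expense of the other, so both survive on the frontier.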

According to these definitions, the points representing 2 to 7 workers are not on the Pareto frontier because they are dominated by the 8-worker point, which improves both objectives, cost and time, simultaneously. In contrast, the 8-worker and 9-worker points are not strictly dominated by any other, and hence they belong to the Pareto frontier.

As a result, we conclude that (i) if we focus on time requirements, the 9-worker solution should be chosen because it gives better performance, whereas (ii) if we focus on cost requirements, the 8-worker solution should be taken. The user (human decision maker) makes the final decision on how many workers to hire, but the results shown in this work provide hints on the overall performance–cost relationship that can help in making the best decision. Moreover, this evaluation reveals the importance of performing a prior analysis using the formal model in conjunction with the service costs to achieve the best possible agreement.

8 Conclusions and future work

The main goal of this work has been accomplished: by applying formal methods, it is possible to determine, in advance, the performance of a Hadoop-based application within Cloud environments and the best performance–cost agreements.

To this end, a model of the system has been developed using the process algebra BTC. This model accepts input parameters that capture the specific characteristics of the Hadoop application to be evaluated. This task is done automatically and requires no prior background in formal methods. We then chose a particular application (the COSMOS Sentiment Analyser for Hadoop, as a case study) to conduct the performance evaluation with the BAL tool. The data obtained from the formal model have been validated against real experiments performed on a private Cloud at the University of Castilla-La Mancha.

Once it had been verified that the model fits the real system, a study was conducted with two objectives: first, to allow users and service managers to evaluate cost and performance in terms of the deployment strategy; second, to choose the best deployment strategy in terms of expected cost/performance (for better Cloud resource hiring in terms of QoS and costs).

The work presented has led to new ideas for future work. The first will be a formal study focused on architectural design issues and possible solutions to improve the performance of Hadoop. The results of that study can then be combined with the ones presented in this paper to improve the overall performance of Hadoop.

Moreover, from the perspective of performance, there are two kinds of Hadoop application with clearly different behaviours. The first are applications that behave linearly in terms of the amount of data to be analysed, i.e., the processing time is proportional to the amount of data to be processed. The second are applications whose behaviour depends on the data being processed, that is, the processing time depends not on the amount of data but on the data themselves. In this work we have studied an application of the first type, but we plan to extend our work by analysing applications of the second type.