1 Introduction

The domain of parallel and distributed simulation (PDS) has evolved over the last few decades to support large-scale complex simulations. Traditional simulation techniques are generally developed for closed environments such as an HPC cluster. With the inception of the cloud paradigm, new directions have opened up for the PDS community. Researchers can run complex simulations on a pay-as-you-go model, as cloud resources (computing, storage, and network) are accessible as a service to any IP-enabled device. To utilize these resources efficiently in a multi-tenant environment, cloud providers use different techniques, including virtual machine (VM) migration, VM consolidation, physical host sharing, dynamic bandwidth allocation, and resource scheduling schemes. However, cloud resource utilization remains an open area of research [10]. For the PDS community, a cloud provides a different execution platform than an HPC cluster, since it is a shared environment in which other applications run alongside PDS processes. A distributed simulation often comprises a large number of participating logical processes (LPs) laid out across the data center and communicating with one another using time-stamped messages. In a cloud environment, however, naive process placement affects simulation performance because the simulation may run alongside compute-intensive or data-intensive applications [23]. Moreover, there is a high probability that frequently communicating processes are placed on distinct physical nodes connected multiple hops apart, because processes are placed randomly without considering their communication patterns.

Traditional PDS frameworks fail to perform well in the cloud environment because PDS processes generate a large number of time-stamped messages destined for processes connected through multi-hop links inside the data center [17]. This inherent nature of such simulations can clog the underlying network and incur large delays, degrading PDS performance. Over the years, the PDS community has focused on the cluster environment. Recently, the community has started exploring PDS over the cloud, but no well-known work enhances distributed simulation performance based on data center network characteristics. In fact, many existing frameworks improve performance through message aggregation, scheduling techniques, and/or adoption of a conservative approach. In contrast, our focus is to reduce multi-hop traffic through process migration and thereby mitigate the communication delays incurred by long-haul communication.

In this paper, we propose an efficient process placement technique to reduce the communication delay between frequently communicating processes in PDS. The main contributions of the proposed work are as follows:

  • Provide a framework to simulate PDS over the cloud, where processes execute on different physical systems connected to one another through multi-hop links.

  • Propose a placement algorithm termed find-rack-mate (FRM). The algorithm intelligently places processes to reduce the overhead in terms of hop count, as PDS processes frequently communicate with one another to share their state information. In this work, we use clustering-based placement to reduce this overhead, in contrast to traditional random placement techniques that place processes without considering their communication pattern.

  • Build a data center simulation framework in OMNeT++, supporting PDS and its process migration, to evaluate PDS over the cloud paradigm.

The framework serves as a foundation for researchers to test new approaches improving PDS process placement within cloud data center environment.

The remainder of this paper is organized as follows. Related work is presented in Sect. 2. The system model is described in Sect. 3. Section 4 presents the PDS process placement framework, and Sect. 5 details the implementation, including the clustering-based techniques and the proposed placement algorithm. The results are discussed in Sect. 6. Section 7 covers the discussion and future directions. Section 8 concludes the paper.

2 Related work

Most research in the domain of PDS addresses energy efficiency, load balancing, and simulation performance in a cluster environment. However, communication over the underlying network also plays an important role in determining the overall efficiency and robustness of a PDS. In this section, we summarize and review the relevant works on PDS. The existing contributions are organized into cloud-based and cluster-based simulation frameworks.

2.1 PDS frameworks for cloud environment

Few works address PDS execution in the cloud paradigm. Executing PDS over the cloud introduces new challenges arising from differences with the classical cluster environment, such as process synchronization, varying workloads at different nodes, and significant network traffic. However, since no initial investment is required to use the underlying execution environment, running simulations on the cloud is becoming popular in academia and industry.

Running traditional PDS over a cloud environment results in many deadlock scenarios, which increase execution times. To avoid such scenarios, the authors in [28] propose a deadlock-free scheduling algorithm for executing PDS over a virtual environment. Additionally, some techniques exploit data locality by scheduling processes accordingly to reduce data transfers and thus improve PDS performance. Such data-aware techniques are proposed in [24], mainly focused on work stealing to improve the performance of PDS. However, these techniques are suitable where the simulation relies on large data chunks. In [17], the authors propose a barrier-based protocol to execute an optimistic simulation over the cloud environment. The objective of the protocol is to dynamically handle compute resource optimization in parallel and distributed simulation.

Over the years, different techniques have been proposed to improve PDS performance over the cloud, for instance, priority-based work consolidation [16], resource sharing among parallel jobs using a gang scheduling algorithm [25], process synchronization based on simulation instances [27], and the use of a shared event queue among cloud multi-core systems [23]. However, these techniques lead to bottlenecks when a large number of threads is involved in the simulation. A more recent approach in [14] proposes a federation-based design to improve the performance of large-scale simulation in a data center. Here, the design is based on a federation comprising federates grouped into multi-core systems. Each federate contains a number of VMs managed by a federate-level coordinator using hierarchical resource management. Most of the aforementioned attempts aim to improve PDS performance over a shared multi-tenant environment.

Although a multi-threaded design is useful, it poses new challenges due to the multi-tenant nature of the cloud environment; for instance, scheduling overhead caused by multiple processes sharing a single node can affect PDS performance. Moreover, fault tolerance is important for performance: inadequate fault handling mechanisms often lead to re-execution of the entire simulation [6]. A more comprehensive study in [3] analyzes existing PDS frameworks in terms of usability and adaptability and concludes that the existing techniques have many limitations for use in the cloud paradigm. Thus, new frameworks are required that can efficiently run PDS on the cloud.

2.2 PDS frameworks for cluster computing environment

In the PDS domain, most existing work covers execution in the cluster environment. Here, we briefly cover a few contributions. In [19], the authors present Aurora, a master–worker framework that supports PDS execution in a distributed environment. The framework provides computation as web services, where PDS processes communicate with one another using time-stamped messages. A master process distributes tasks among the workers; after job completion, the workers report back to the master process. The framework avoids violating the local causality constraint by using a conservative synchronization algorithm and achieves high performance and interoperability using the gSOAP toolkit. To further improve PDS performance, a multi-threaded framework, ROSS-MT [13], has been proposed. It is designed to reduce delays incurred during synchronization. In contrast to standard PDS frameworks, ROSS-MT uses threads for communication instead of processes; the advantage of using threads is access to a shared memory space, which improves performance. However, neither Aurora nor ROSS-MT supports cloud architectures.

Similarly, to improve the performance of PDS over a cluster environment, the authors in [8] analyze the overhead of global virtual time (GVT) computation. They use separate threads for communication and computation and consider only GVT calculation to enhance PDS performance over the cluster. Likewise, in [26], a framework is proposed that appropriately assigns virtual machines to reduce the execution time of PDS. The framework is evaluated in a cluster environment against first-in-first-out and max–min allocation algorithms. The results demonstrate improved PDS efficiency by reducing the execution time for large-scale system simulation.

The performance of PDS over cluster systems is still an open area of research that demands new process placement, GVT computation, and task distribution algorithms capable of supporting large-scale real-time simulations. With the rise of multi-disciplinary research, the role of distributed simulation in fields such as high-speed experimentation and big data analytics is an emerging area of research [21].


Discussion Generally, a huge amount of traffic is generated within the data center; its sources include maintenance, VM migration, and process replication. In this study, our proposed framework is designed to support PDS over the cloud with a focus on reducing the network cost among frequently communicating processes, thus improving the overall performance of PDS. Table 1 lists some contributions targeting PDS performance. It is worth mentioning that no previous work provides a cloud data center environment for PDS simulation with support for process migration based on network communication; reducing network communication translates into improved overall PDS performance. Other works address PDS performance either by load balancing or by reducing lock times for shared memory space. The former is difficult to manage in a cloud environment, while the latter is limited by the number of threads supported.

Table 1 Summary of recent works in parallel and distributed simulation

3 System model

In PDS, a significantly large number of processes take part in the simulation, and they are randomly placed on different physical nodes across the network. These processes communicate with one another by exchanging messages, each carrying an associated time-stamp value. The physical nodes inside the data center are arranged in racks. All nodes residing within the same rack are connected to a top-of-rack (ToR) switch. Further up the network, multiple racks are connected through an aggregate switch, while core switches connect the aggregate switches. This corresponds to a typical three-tier data center, as illustrated in Fig. 1. Note that the separate layer of aggregate switches is added to the two-tier data center architecture to provide scalability in terms of computing servers. To benchmark the proposed model, we used a similar three-tier data center architecture with \(\eta \) physical servers placed in every rack, where \(\nu \) represents the number of racks in the data center.

Fig. 1
figure 1

Traditional three-tier data center architecture

The number of hops during a message exchange between two processes depends on the number of switches the message passes through. Thus, one-hop communication means that the two processes reside on different nodes in the same rack. Similarly, three-hop communication means that the two processes reside on different racks under the same aggregate switch. Finally, a five-hop message exchange means that the communication between processes involves a core switch.

The cost of an n-hop communication is n times the cost of a single-hop communication. Although it may be negligible, there is also a cost associated with communication between processes residing on the same node. Note that irrespective of the placement of processes, the cost of a message exchange is invariant to their roles as sender or receiver. Thus, a matrix \(M = [m(i,j)]_{|P|\times |P|}\) representing the number of messages exchanged between processes i and j is strictly triangular, and without loss of generality we take it to be upper triangular. The order of the square matrix M is the total number |P| of processes. Let N denote the set of nodes; then the assignment of processes to nodes is described by the operation:

$$\begin{aligned} f : N \rightarrow 2^P \end{aligned}$$

such that f[N] is a partition of P. Similar to f, the assignments of nodes to racks R and of racks to aggregate switches A are, respectively, described as

$$\begin{aligned} g : R \rightarrow 2^N \quad \mathrm{and}\quad h : A \rightarrow 2^R \end{aligned}$$

Note that processes, nodes, and racks are referred to by their index. For instance, \(i\in P\) means the process \(P_i\), and \(k\in N\) is the kth node, etc.

To account for the cost of communication between processes, the number of hops a message encounters is modeled with the help of several binary predicates on P. The predicate for two processes residing on the same node is:

$$\begin{aligned} F(i,j) = 1 \Leftrightarrow \{P_i,P_j\}\subseteq f(k) \quad \mathrm{for\,some}\,\, k \in N \end{aligned}$$

The predicate for (one-hop) communication between processes running on different nodes in the same rack is:

$$\begin{aligned} G(i,j) = 1 \Leftrightarrow&F(i,j) = 0 \quad \mathrm{and}\quad \exists \, k\in R , a,b\in N\\&\mathrm{with\,} a\ne b\quad \mathrm{s.t.\,} \{N_a,N_b\}\subseteq g(k) \quad \mathrm{and}\\&P_i\in f(a) \quad \mathrm{and}\quad P_j\in f(b) \end{aligned}$$

When the processes are located on different racks under the same aggregate switch (three-hop communication), the predicate is defined as:

$$\begin{aligned} H(i,j) = 1 \Leftrightarrow\,&F(i,j) = 0 \quad \mathrm{and}\quad G(i,j) = 0 \quad \mathrm{and\,}\\&\exists \, k\in A,a_1,a_2\in R, b_1,b_2\in N\\&\mathrm{with}\quad a_1\ne a_2, b_1\ne b_2 \quad \mathrm{s.t.\,} \{R_{a1},R_{a2}\}\subseteq h(k)\\&\mathrm{and}\quad N_{b1}\in g(a_1) \quad \mathrm{and}\quad N_{b2}\in g(a_2)\\&\mathrm{and}\quad P_i \in f(b_1) \quad \mathrm{and}\quad P_j \in f(b_2) \end{aligned}$$

Lastly, the (five-hop) communication involving the core switch is captured by

$$\begin{aligned} K(i,j) = 1 \Leftrightarrow F(i,j) = G(i,j) = H(i,j) = 0 \end{aligned}$$

Here, only one core switch is assumed in the data center; the extension to multiple core switches follows the same pattern.

If \(\epsilon \) is the cost of communication between processes residing on the same node and c is the cost of a single-hop communication, then the total cost of communication, T, depends on the assignment operations f, g, and h:

$$\begin{aligned} T(f,g,h) = c\sum _{i=1}^{|P|-1}\sum _{j=i+1}^{|P|} m_{i,j} \Big ( \dfrac{\epsilon }{c} F_{i,j} + G_{i,j} + 3 H_{i,j} + 5 K_{i,j} \Big ) \end{aligned}$$

The optimization problem is to find the argument

$$\begin{aligned} (f,g,h) \in \big ({\mathcal {P}}(P)\big )^N \times \big ({\mathcal {P}}(N)\big )^R \times \big ({\mathcal {P}}(R)\big )^A \end{aligned}$$

that minimizes T(f,g,h). The constraints due to hardware in the proposed setup are

$$\begin{aligned} g(1) = \,\{1,2,\ldots , 5\} \\ g(2) =\,\{6,7,\ldots ,10\}\\ g(3) =\,\{11,12,\ldots ,15\}\\ g(4) = \,\{16,17,\ldots ,20\} \end{aligned}$$

Moreover, additional constraints are the assumption of only two aggregate switches, only one core switch, and only one process per node.
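To make the cost model concrete, the following minimal sketch (in Python, not part of the proposed framework) evaluates the total cost T(f, g, h) defined above for a small hypothetical instance; the assignments, message counts, and cost parameters are illustrative values only.

```python
# Minimal sketch (not the paper's implementation) of evaluating the total
# communication cost T(f, g, h) defined above for a small hypothetical setup.

# Hypothetical assignments: node -> processes, rack -> nodes, aggregate -> racks.
f = {1: {1}, 2: {2}, 3: {3}, 4: {4}, 5: {5}, 6: {6}}   # one process per node
g = {1: {1, 2}, 2: {3, 4}, 3: {5, 6}}                  # three racks, two nodes each
h = {1: {1, 2}, 2: {3}}                                # two aggregate switches

# Hypothetical upper-triangular message counts m(i, j) for i < j.
M = {(1, 2): 40, (1, 3): 5, (2, 5): 12, (3, 4): 30}

def node_of(p):  return next(n for n, ps in f.items() if p in ps)
def rack_of(n):  return next(r for r, ns in g.items() if n in ns)
def agg_of(r):   return next(a for a, rs in h.items() if r in rs)

def hop_weight(i, j):
    """0 for same node, 1 for same rack, 3 for same aggregate, 5 via the core."""
    ni, nj = node_of(i), node_of(j)
    if ni == nj:
        return 0
    ri, rj = rack_of(ni), rack_of(nj)
    if ri == rj:
        return 1
    return 3 if agg_of(ri) == agg_of(rj) else 5

def total_cost(c=1.0, eps=0.01):
    """T(f, g, h): same-node messages cost eps, others cost c times the hop weight."""
    return sum(m * (eps if hop_weight(i, j) == 0 else c * hop_weight(i, j))
               for (i, j), m in M.items())

print(total_cost())   # 40*1 + 5*3 + 12*5 + 30*1 = 145.0 under these toy values
```

Finding the argument (f, g, h) that minimizes this quantity under the stated hardware constraints is the optimization problem addressed by the placement techniques in the following sections.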

4 PDS process placement framework

Executing PDS without knowledge of process locations can lead to performance issues, mainly due to traffic generated by other applications hosted in the same data center. Moreover, the performance of optimistic simulations in a cloud environment may suffer when messages are stuck on their way to their destinations, which may lead to a large number of rollbacks. Interestingly, the literature lacks works that propose efficient PDS process placement inside the data center to reduce the aforementioned communication delay. Research articles do exist on VM migration techniques, but they focus on reducing overall data center energy consumption and hence on the benefits of cloud providers. Such energy-aware migration techniques are not well suited for PDS, where LPs are lightweight and frequently generate time-stamped messages. Therefore, in this study, we propose a framework that provides a PDS-oriented three-tier data center environment for process placement. Further, the framework supports PDS process state migration by swapping process state variables with other PDS processes instead of migrating an entire VM. In general, intra-rack node communication is more favorable than inter-rack node communication, the latter incurring more hops. Based on this, we propose a locality-aware criterion for process placement, so as to automatically restructure the placement of processes over the underlying network. Traditionally, in PDS, processes are placed sequentially without knowledge of their communication patterns, referred to as the random placement method; this is considered the baseline placement strategy. As an initial attempt, we used different clustering techniques to group processes based on their communication patterns. Later in the study, to overcome the shortcomings of these classical clustering techniques, we propose our locality-aware process placement algorithm.

Fig. 2
figure 2

Proposed framework architecture design

5 Implementation details

The PDS process placement framework for the data center was developed atop OMNeT++ and the INET framework. The framework comprises two main modules: the PDS controller and the PDS processes, as illustrated in the framework architecture (Fig. 2). The framework execution steps are listed in Algorithm 1, while the detailed functionality of the two modules is presented below.

Algorithm 1 Framework execution steps (figure)

5.1 PDS controller

The PDS controller acts as a manager for the entire simulation. It knows the PDS process placement and the underlying communication among the PDS processes. The network listener module is designed to sense all outgoing and incoming traffic at the PDS processes and uses this information to build a communication matrix for the processes. The simulation manager module keeps track of the PDS process placement in the data center. During the simulation initialization phase, the module builds a process table recording the distance, in terms of hop count, from every process to every other process. The simulation manager then activates the execution logic module at regular time intervals. The execution logic module takes the compiled communication matrix and applies a policy to generate new PDS process placements. The new placements are handed over to the simulation manager, which disseminates them to the PDS processes through an external interface.
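As an illustration only (the actual listener is implemented inside the OMNeT++-based controller), the sketch below shows how per-pair message counts could be accumulated into the upper-triangular communication matrix of Sect. 3 from observed messages; the class and method names are hypothetical.

```python
import numpy as np

class CommunicationMatrix:
    """Hypothetical sketch of the data the network listener accumulates."""

    def __init__(self, num_processes):
        self.m = np.zeros((num_processes, num_processes), dtype=int)

    def record(self, src, dst):
        # A message is counted once per pair, regardless of who sent it,
        # matching the upper-triangular matrix M of the system model.
        i, j = min(src, dst), max(src, dst)
        if i != j:
            self.m[i, j] += 1

listener = CommunicationMatrix(num_processes=20)
listener.record(3, 7)    # hypothetical observed message P3 -> P7
listener.record(7, 3)    # the reverse direction counts toward the same cell
print(listener.m[3, 7])  # -> 2
```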

Fig. 3
figure 3

Find-rack-mate execution model—initially, the pair with the most communication is selected. Thereafter, processes are selected based on their communication with already selected processes: at every stage, a process becomes part of the cluster based on its maximum communication with the already selected processes. Note that the process id is used to break any ties

5.2 Find-rack-mate (FRM) algorithm

The proposed find-rack-mate (FRM) algorithm is designed to find an efficient PDS process placement based on frequently communicating processes. The algorithm communicates with the execution logic module to acquire all the data required for placement recommendations. It takes the communication pattern from the network listener module and generates rack mates for every participating rack inside the data center. Initially, the processes are placed on random physical systems inside the data center. Based on this initial deployment, the simulation runs for a predefined time interval, during which the network listener module builds a communication matrix from the inter-process communication. After the interval ends, the matrix is forwarded to the FRM algorithm to update process placements based on the underlying communication pattern.

Algorithm 2 describes the overall functionality of the proposed FRM algorithm. The algorithm starts by determining candidates for populating the first rack. First, the \(\textit{MCP}(P)\) routine finds the pair of processes \((P_a,P_b)\) with the highest message exchange in the communication matrix. This pair is assigned to the current rack; that is, processes \(P_a\) and \(P_b\) are assigned to the first and second physical nodes in the rack. Once done, the pair is added to a comparison window W and the current rack pointer is updated. The algorithm then iterates from the updated position up to the maximum number of processes that can reside inside a rack. The routine \(\textit{MFCVP}(W(1))\) returns the process from the communication matrix that communicates most with the first process in the comparison window, i.e., W(1); the returned process is termed \(P_x\). Similarly, the process that communicates most with the second process in the comparison window is denoted \(P_y\). Thus, \(\textit{MCP}()\) finds the highest communicating pair in the entire matrix, whereas \(\textit{MFCVP}()\) finds the process that communicates most with a particular process and therefore searches only a single row of the communication matrix. Since processes \(P_a\) and \(P_b\) are already part of the rack, we compare the communication of \(P_a\) with \(P_x\) against that of \(P_b\) with \(P_y\); whichever candidate communicates more is added to the current position in the rack, and its counterpart is removed from the comparison window. The freed slot in the comparison window is then assigned to the process just added to the rack, for further comparisons. The next step is to remove all processes already added to the rack from the unassigned process list. A sample execution of the algorithm is illustrated in Fig. 3.

Algorithm 2 Find-rack-mate (FRM) algorithm (figure)
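The following is a simplified, standalone sketch of the FRM logic described above. The function names mcp and mfcvp mirror the routines MCP and MFCVP, but the data layout and some tie-breaking details are assumptions for illustration and do not reproduce the framework's actual OMNeT++ code.

```python
import numpy as np

def mcp(M, unassigned):
    """Most-communicating pair among the unassigned processes (ties broken by id)."""
    best, pair = -1, None
    cand = sorted(unassigned)
    for a in cand:
        for b in cand:
            if a < b and M[a, b] > best:
                best, pair = M[a, b], (a, b)
    return pair

def mfcvp(M, p, unassigned):
    """Unassigned process that communicates most with process p (ties broken by id)."""
    return max(sorted(unassigned), key=lambda q: M[min(p, q), max(p, q)])

def frm(M, num_racks, rack_capacity):
    """Greedy rack-by-rack placement following the FRM description above."""
    unassigned = set(range(M.shape[0]))
    racks = []
    for _ in range(num_racks):
        if len(unassigned) < 2:          # not enough processes to seed a new rack
            racks.append(sorted(unassigned))
            break
        a, b = mcp(M, unassigned)        # seed the rack with the top pair
        rack, window = [a, b], [a, b]    # comparison window W = (W(1), W(2))
        unassigned -= {a, b}
        while len(rack) < rack_capacity and unassigned:
            x = mfcvp(M, window[0], unassigned)
            y = mfcvp(M, window[1], unassigned)
            cx = M[min(window[0], x), max(window[0], x)]
            cy = M[min(window[1], y), max(window[1], y)]
            if cx >= cy:                 # the stronger candidate joins the rack and
                rack.append(x)           # replaces its counterpart in the window
                unassigned.discard(x)
                window[0] = x
            else:
                rack.append(y)
                unassigned.discard(y)
                window[1] = y
        racks.append(rack)
    return racks

# Hypothetical usage with a random upper-triangular matrix for 20 processes.
rng = np.random.default_rng(1)
M = np.triu(rng.integers(0, 60, size=(20, 20)), k=1)
print(frm(M, num_racks=4, rack_capacity=5))
```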

5.3 PDS process

The PDS process is the actual process that executes on the same or a different physical system while communicating with other processes and the controller. Here, the simulation logic module contains the actual simulation logic used to modify state variables. The placement controller module receives all new placements from the controller. In the case of a PDS process migration call, the controller communicates with the simulation manager and performs state swapping. An external interface module is used for communication between the processes and the controller.

5.4 Clustering module

The proposed framework supports PDS process placement based on different clustering techniques. Once new placements are computed, the framework's external interface reads the cluster information to place the PDS processes accordingly. In this study, we used different clustering techniques to determine similarity between LPs, taking the sensed communication patterns as the similarity criterion; that is, to identify frequently communicating LPs. Initial PDS process communication patterns are recorded for a number of iterations and later used as input for cluster formation. The clustering techniques used for our initial study are density-based spatial clustering of applications with noise (DBSCAN), hierarchical agglomerative clustering (HAC), and k-means clustering. These techniques are briefly discussed as follows:

  1. DBSCAN The DBSCAN algorithm starts by picking an initial unvisited data point. A neighborhood of data points is defined for the point based on a distance function (\(\epsilon \)). A criterion referred to as minPoints, the minimum number of points allowed in a neighborhood, is used to either reject the neighborhood as noise or accept it by creating a new cluster. This process is repeated for the data points included in the newly created cluster. For the next cluster, a new unvisited data point is selected and the same process is repeated.

  2. Hierarchical agglomerative clustering Hierarchical agglomerative clustering (HAC) is a bottom-up clustering approach. It starts by making every data point a cluster. At each iteration, the two clusters with the smallest distance are merged into a single cluster. Note that the merging is controlled by a predefined distance threshold. The merging process is repeated until all data points fall into one cluster or the specified number of clusters is reached.

  3. k-means clustering k-means is the most well-known clustering technique. It starts by initializing the number of required clusters. For each cluster, a central point is selected at random. The algorithm then assigns all data points to clusters based on a distance function. The mean of all data points assigned to a cluster is set as the centroid of that cluster. This process is repeated until there is no or only a very small alteration in the resultant clusters.

The proposed framework reads the clustering outcome through its cluster interface module. Note that the clustering can either be implemented inside the framework or performed with an external tool. In this work, we used an external tool, Weka, for cluster formation. The tool provides a flexible environment for creating clusters, which in turn are used for PDS process placement. The next sections cover the performance of the PDS processes using the different clustering techniques and the proposed FRM algorithm.
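For readers who prefer a programmatic workflow, the sketch below illustrates an equivalent clustering step using scikit-learn rather than Weka (the tool actually used in this study); the feature construction and the DBSCAN parameters are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

num_processes, num_racks = 20, 4

# Hypothetical feature vectors: each process is described by the number of
# messages it exchanged with every other process (rows of the symmetric
# communication matrix), mimicking the patterns sensed by the network listener.
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 50, size=(num_processes, num_processes)), k=1)
features = upper + upper.T

kmeans = KMeans(n_clusters=num_racks, n_init=10, random_state=0).fit(features)
hac = AgglomerativeClustering(n_clusters=num_racks).fit(features)
dbscan = DBSCAN(eps=60.0, min_samples=2).fit(features)   # eps/minPoints are guesses

print("k-means cluster per process:", kmeans.labels_)
print("HAC cluster per process:    ", hac.labels_)
print("DBSCAN cluster per process: ", dbscan.labels_)    # -1 marks noise points
```

Each resulting label can then be mapped to a rack, mirroring how the cluster interface module feeds placements back to the simulation manager.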

6 Performance evaluation

This section contains a detailed summary of our findings. We implement the proposed process placement algorithm presented in the previous sections and compare its communication cost against that of the different clustering techniques used to solve the placement problem in PDS.

6.1 Environment and parameters

The proposed PDS process placement framework is used to evaluate the FRM and clustering-based techniques. We used different clustering techniques, as well as our algorithm, to determine clusters within the simulated process communication data; these clusters are then used for efficient process placement. Initially, the processes are placed randomly in the simulated data center, where the simulation comprises 1.5 million events with a total of 1133 messages exchanged between twenty processes. As a baseline for comparison, we ran the entire simulation with random process placement and recorded the communication-related statistics. The statistics include the number of messages sent by a process, the number of messages received by a process, the total number of hops incurred by a process, the total number of messages (sent/received), and the mean hop count of a process. For the remaining process placement techniques, we ran the simulation for four hundred thousand events using random process placement and recorded the same statistics, which were then used to determine process clusters with the different clustering techniques. In this study, we kept the number of clusters—groups of processes—fixed at four, corresponding to the total number of racks in the simulated three-tier data center. Table 2 lists the parameters used for evaluation.

Table 2 Simulation configuration and system specification

In the simulated data center, a message transmitted from one node to another takes either one, three, or five hops. Note that a message consumes less network bandwidth when it requires fewer hops to reach its destination. In this work, the goal was to minimize the total hop count, as this can improve the performance of PDS over the cloud environment, given that the cloud charges its users based on the usage of computing, storage, and bandwidth. Thus, it is more suitable to place frequently communicating processes on the same rack, reducing the total network load inside the data center.

6.2 Efficiency of communication costs

For each technique, we measure the efficiency in terms of total hop count after the altered process placement. Table 3 summarizes the communication statistics per rack. The results show a breakdown of message communication costs in terms of one, three, and five hops, as we consider four racks in the simulated three-tier data center. The last column shows the total number of messages, i.e., the sum of all one-, three-, and five-hop messages. The results for the proposed FRM algorithm are compared against the baseline random placement and placement using the different clustering techniques. Similarly, Fig. 4 illustrates the per-rack performance, with the x-axis representing the four racks and the y-axis representing the number of messages sent with one hop, three hops, and five hops, respectively.

Table 3 Quantitative comparison of hop counts from five different clustering algorithms including our proposed algorithm
Fig. 4
figure 4

Per rack performance of the proposed algorithm against five different clustering techniques

Generally, in PDS, communicating processes are placed randomly across the simulated network. This strategy results in poor simulation performance because it lacks any locality awareness and hence generates more network traffic. Using this strategy, out of the total 1133 messages sent during the simulation, \(21.5\%\) were one-hop messages, \(26.8\%\) were three-hop messages, and \(51.6\%\) were five-hop messages. Each rack thus sends more five-hop messages than one-hop and three-hop messages, which clearly results in more traffic across the network.

The first clustering technique used to cluster processes was the DBSCAN algorithm. Our results show that each rack still sent a higher number of five-hop messages than one-hop and three-hop messages. Using this technique, out of the total 1133 messages sent during the simulation, \(23.3\%\) were one-hop, \(27.7\%\) were three-hop, and \(48.8\%\) were five-hop messages. This represents a minimal improvement, with one-hop messages increasing by only \(1.85\%\) and five-hop messages decreasing by just \(2.7\%\).

The next approach clustered processes into racks using the hierarchical agglomerative clustering algorithm. With this technique, the increase in one-hop messages was only \(1.2\%\); however, a decrease of \(4.67\%\) was observed in five-hop messages. This was a significant improvement over random process placement.

Of all the traditional clustering algorithms, the best performing was k-means. It surpassed the random process placement method, DBSCAN, and HAC in terms of performance. The number of messages taking one hop to reach their destinations increased by \(8.29\%\), messages taking three hops increased by \(3.26\%\), and those taking five hops decreased by \(11.56\%\). These results demonstrate the improvement of k-means over the clustering techniques presented earlier.

The proposed algorithm outperformed the baseline random process placement method and all other techniques discussed, including the best performing k-means-based placement. Out of the total messages sent during the simulation, \(34.95\%\) took just one hop, \(26.12\%\) took three hops, and \(38.92\%\) took five hops to reach their destination. Compared with the random method, one-hop messages increased by a significant \(13.4\%\), three-hop messages decreased slightly by \(0.7\%\), and five-hop messages decreased by a significant \(12.71\%\).

Figure 5 depicts the performance of all the aforementioned techniques in terms of hops incurred after process re-placement, at the data center level, for the entire run of the simulation. Here, the x-axis lists the techniques, whereas the y-axis presents the total number of messages sent.

Fig. 5
figure 5

Comparison of the proposed algorithm against five different clustering techniques

Table 4 Number of hops comparison for five clustering-based process placement techniques including our proposed approach
Fig. 6
figure 6

Comparison of the proposed algorithm against five different clustering techniques in terms of delay

Figure 6 shows the communication delay-based comparison between the proposed and clustering techniques. The values show the hop-wise total delay computed on the three-tier data center topology with the link and communication delays listed in Table 2. With the proposed approach, one-hop communication is the highest among all techniques, whereas maximum-hop (five-hop) communication is reduced significantly, and the delay is reduced accordingly. Overall, the proposed approach achieves at least a 3.32% reduction in total communication delay.

In summary, a total of 1133 messages were sent during the entire run of the simulation. Each message took either one, three, or five hops to reach its destination. Simulation performance clearly drops if the majority of messages take five hops; conversely, performance is much better if the majority of messages take just one hop. Note that it is also beneficial if the majority of five-hop messages are reduced to fewer hops, for instance, three hops. Table 4 shows a comparison of hop counts for the different process placement techniques including our proposed approach. The results show hop count reductions of 2.5%, 3.28%, 11.02%, and 14.50% over the baseline random process placement method using DBSCAN, hierarchical clustering, k-means, and our proposed method, respectively.

7 Discussion

In PDS, random process placement in a data center can degrade the performance of a distributed simulation. Messages can get delayed inside the network due to high network traffic generated by other applications. For instance, in optimistic simulations, network delays can deteriorate performance due to straggler messages, which initiate the rollback mechanism. Often this delay is caused by a shared multi-tenant environment in which other compute-intensive or communication-intensive applications are executing. Therefore, placing communicating processes close to one another reduces not only the rollbacks but also the overall cloud usage cost. In other words, long-distance communication between processes is an infeasible approach for PDS. Our experimental evaluations show that process placement aimed at reducing total communication hops can improve the overall performance.

  • Locality-aware clustering Organizing processes based on locality can improve performance and reduce network usage. In most cases, a process only communicates with a predefined set of processes, which largely depends on the simulation topology; for instance, in a tree-based topology, a root process directly communicates with its children. Here, the process can be placed—based on its locality—along with the directly connected nodes. In our proposed approach, we cluster processes based on their communication patterns to reduce communication costs.

  • Online extension of the algorithm The proposed model can be extended to work in an online scenario where a master process tracks all involved processes to identify communication patterns. Similar to the proposed approach, these patterns can be used to migrate the processes. It is worth mentioning that this can result in frequent migrations, an overhead for the system, which will require a mechanism to control the migration frequency. In the future, we are also interested in using deep learning techniques for PDS process placement. As all these techniques require a significant amount of data, our proposed framework can facilitate generating large-scale data sets.

  • Implementation on a real-world data center We are interested in extending our work to support any distributed framework designed for data centers. In the present study, the cluster size, i.e., the number of processes residing in a rack, is user-defined. In the future, an extended controller could automatically balance processes based on the underlying computing capacity. Moreover, process consolidation techniques could be adopted to reduce the number of racks used for PDS.

  • Management implications The proposed work introduces a new research direction to the PDS community, that is, PDS process placement based on communication patterns. Generally, in a data center, VMs are placed so as to reduce overall energy consumption; however, such placements are managed by the vendors for their own benefit, with users having no control over them. It is worth mentioning that existing PDS frameworks are not designed to work in such a multi-tenant environment, with uneven workloads at physical nodes and network switches due to the execution of different applications. The proposed PDS process placement provides a dynamic environment where PDS processes can be moved to other physical nodes to improve the efficiency of the simulation model. This dynamic movement of the PDS processes can be based on the data center's internal factors such as network congestion at various switches, available link capacity, and execution overhead on shared physical nodes. Furthermore, adopting the cloud paradigm for complex simulation models provides scalability and fault resilience when running large simulations; that is, the data center can dynamically manage failures through resource provisioning. However, successful execution of PDS after node failures would require middleware to maintain causality among the PDS processes to produce correct results.

  • Time complexity The time complexity of the procedure MCP is \({\mathcal {O}}(n^2 \log n)\), and that of the procedure MFCVP is \({\mathcal {O}}(n \log n)\). The time complexity of the inner for loop is \({\mathcal {O}}(n^2 \log n)\), and consequently the time complexity of the outer for loop, and hence that of the FRM algorithm, is \({\mathcal {O}}(n^3 \log n)\). The dynamic placement of processes is a search problem, i.e., an NP problem. Under the constraints already discussed, the proposed heuristic placement technique not only provides a polynomial-time solution, but also surpasses the existing methods in the literature in terms of communication cost minimization, as shown in Fig. 7.

  • Convergence analysis For the analysis, we used sigma (\(\sigma \)) convergence, a notion inspired by economics. In the context of process placement, it refers to the reduction in dispersion between hop counts at different racks. To describe this convergence tendency, we use a discrete time interval based on the messages communicated until time t with random placement and thereafter until \(t+T\) using the proposed FRM algorithm for process placement. The convergence state at a certain time is measured using an indicator of variation, computed as the coefficient of variation \(\sigma _t\) at time t:

    $$\begin{aligned} \sigma _t = \sqrt{ \frac{1}{N-1} \sum _{i=1}^{N} (\log (y_{\textit{it}}) - \log (\bar{y_t}))^2} \end{aligned}$$

    where \(y_{\textit{it}}\) is the hop count of rack i at time t and \(\bar{y_t}\) is the mean hop count at time t. In the present context, the degree of convergence increases when \(\sigma \) is increasing; the higher the \(\sigma \), the higher the degree of convergence (i.e., \(\sigma _t < \sigma _{t+T}\)). We observe that \(\sigma \) stood at 0.017 by the end of period t, suggesting a period of divergence (i.e., \(\sigma _t > \sigma _{t+T}\)). It increased to 0.111 by the end of period \(t+T\), suggesting evidence for \(\sigma \)-convergence. The trend in \(\sigma \) over the full time period analyzed (Fig. 7) was negative and statistically significant (\(\alpha _0 = 0.05\), \(F = 13.276\), \(p < 0.001\)). A minimal sketch of this computation is given below.
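The sketch below shows how the \(\sigma _t\) indicator defined above can be computed from per-rack hop counts; the hop-count values are hypothetical and do not correspond to the reported 0.017 and 0.111 figures.

```python
import numpy as np

def sigma(hop_counts):
    """Coefficient-of-variation-style indicator from the formula above."""
    y = np.asarray(hop_counts, dtype=float)
    dev = np.log(y) - np.log(y.mean())
    return np.sqrt(np.sum(dev ** 2) / (len(y) - 1))

print(sigma([310, 295, 305, 300]))   # period t, random placement (hypothetical)
print(sigma([240, 180, 310, 220]))   # period t+T, FRM placement (hypothetical)
```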

The domain of process placement to improve the efficiency of PDS is relatively unexplored; most previous works rely on load balancing, scheduling, and energy management to improve PDS performance. In this study, we introduced process placement that improves performance by reducing the total number of communication hops. In the future, we intend to extend the work by adding dynamic load balancing, process consolidation, and adaptive learning techniques, which will further enhance the performance of PDS in a real data center environment.

Fig. 7
figure 7

Trend in number of hop counts with discrete time interval till t using random placement and thereafter till \(t+T\) using the proposed FRM algorithm

8 Conclusion

Execution of PDS over the cloud, a multi-tenant computing environment, affects its performance: frequently communicating PDS processes get stuck inside the network due to the traffic generated by other applications hosted on the cloud, consequently slowing down the entire simulation. In this work, we demonstrate the generation of communication data from the framework for a three-tier cloud data center, which is used to perform PDS process placement using different unsupervised clustering-based techniques. We also proposed the FRM algorithm for process placement alongside the aforesaid techniques, showing a 14.5% improvement in hop count. The proposed PDS process placement in the cloud paradigm provides a dynamic environment for running complex simulation models with support for scalability and fault resilience.