1 Introduction

Cloud computing is one of the most widely adopted technologies owing to advanced features such as pay-per-use pricing, scalability, flexibility, and resiliency (Armbrust et al. 2010). It is a rapidly growing field, and nowadays, almost all IT companies use cloud computing environments. There are three main service delivery models in cloud computing: (1) Infrastructure as a Service (IaaS), (2) Platform as a Service (PaaS), and (3) Software as a Service (SaaS). IaaS is the backbone of any cloud environment, as it provides the entire infrastructure; PaaS provides a platform for developing and running applications; and SaaS allows cloud users to access a set of applications or software from the cloud environment.

A cloud environment consists of three entities: (1) users, (2) data owners, and (3) the Cloud Service Provider (CSP). Here, virtualization plays an important role, as it provides cloud clients or users with virtual storage and software. Various other computing services, like databases, networking, storage, and analytics, are also available over the internet through virtualization (Sangpetch et al. 2017). Virtualization technology, also called a virtual machine monitor, requires a certain number of resources to work effectively. VMs are executed using a software tool called a hypervisor, as depicted in Fig. 1. It creates a bridge between the hardware and the various guest Operating Systems (OSs), and effectively handles shared memory control (Zhang et al. 2018; Obasuyi and Sari 2015). Virtualization offers major benefits, such as flexible operations, cost efficiency, adaptability in transferring data, and a reduced risk of system failure (Bala and Chana 2012). To provide a service to cloud users on request, VMs are generally migrated from one physical node to another, which is known as live migration (Malleswari and Vadivu 2019; Choudhary et al. 2017). There are many issues in live migration, such as application performance, faults occurring during inter-VM migration, seamless network connectivity, and security concerns. These challenges arise routinely in systems that use cloud computing environments. As a result, steps must be taken to determine how migration may be performed or what alternative options are available to address these issues. These problems of live migration must be resolved to increase system performance and ensure better functioning among many VMs (Motaki et al. 2021).

Fig. 1 Multiple guest OSs on the hypervisor

Many techniques exist for resolving the problems of live migration (Wang et al. 2021; Garrido et al. 2021; Namasudra 2020; Yuan et al. 2020; Gao et al. 2020, 2021; Sharma et al. 2021; Gómez et al. 2020; Hamdy et al. 2020; Namasudra et al. 2020; Suruliandi et al. 2021; Rauf et al. 2021; Agrawal et al. 2021; Aljunid and Huchaiah 2020; Ali et al. 2021; Kumar et al. 2021). Pre-copy migration consists of two phases: (1) a warm-up phase, and (2) a stop-and-copy phase (Hu et al. 2013). The warm-up phase uses a push technique in which foul (dirty) memory pages are reconstructed on the host server. Foul memory refers to memory pages that have been modified but not yet written back to disk; the upgraded or changed memory pages are thus regenerated during the migration process. In the stop-and-copy phase, the VM is stopped at the source node server (Cui et al. 2017). In the post-copy migration method, the VM is first suspended at the host of the data center (Hines et al. 2009). Then, the VM is moved to the Target Server (TS) and starts execution. If a request arrives for a memory page that is unavailable at the target server, it generates a page fault. The hybrid-copy migration technique combines the pre-copy and post-copy migration techniques (Salfner et al. 2011): it first works as pre-copy live migration and then switches to the post-copy technique. The main limitation of the pre-copy and post-copy approaches is that the total migration time increases, even though live migration is meant to move a VM from one physical node to another without turning it off (Huang et al. 2011; Deshpande et al. 2013). To significantly improve the number of task allocations in the live migration process, a high-accuracy cloud classifier model is used in (Zhang et al. 2021); however, calculating the accuracy of cloud classification introduces considerable overhead. In the KVM strategy (Deshpande et al. 2012), all the memory pages are initially transferred from the source to the target server, which results in a large volume of transferred data and additional overhead in the migration process. In 2020, Dhule and Shrawankar (2020) proposed POF-SVLM to support a low data transfer rate and lower downtime for live VM migration. It uses a shared network between the source node and the destination node, which makes it easy to monitor the entire system for all input and output requests. POF-SVLM reduces the migration time by 55–60%, but achieves only a 20% reduction in CPU utilization time, which remains a major concern.
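
To make the pre-copy mechanics described above concrete, the following minimal Java sketch illustrates the warm-up rounds that push foul (dirty) pages and the final stop-and-copy round. It is not any of the cited implementations; all types and method names are hypothetical.

```java
import java.util.Set;

// Minimal sketch of pre-copy live migration (hypothetical types):
// warm-up rounds push foul (dirty) pages; a final stop-and-copy round
// transfers the remaining pages and the execution state.
public class PreCopyMigration {

    interface Host {
        Set<Long> dirtyPages();                   // pages modified since the last round
        byte[] readPage(long pageId);
        void writePage(long pageId, byte[] data);
        void pauseVm();                           // start of the stop-and-copy phase
        void resumeVmOn(Host target);             // hand over the execution state
    }

    static void migrate(Host source, Host target, int maxRounds, int threshold) {
        // Warm-up phase: iteratively push the pages dirtied during the previous round.
        for (int round = 0; round < maxRounds; round++) {
            Set<Long> dirty = source.dirtyPages();
            if (dirty.size() <= threshold) break;  // residue small enough to stop
            for (long page : dirty) {
                target.writePage(page, source.readPage(page));
            }
        }
        // Stop-and-copy phase: pause the VM, copy the last dirty pages,
        // then resume the VM at the target.
        source.pauseVm();
        for (long page : source.dirtyPages()) {
            target.writePage(page, source.readPage(page));
        }
        source.resumeVmOn(target);
    }
}
```

Post-copy inverts this order: the execution state moves first, and any page still missing at the TS raises a page fault and is fetched on demand.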

To address the concerns of the existing approaches, a novel scheme is proposed in this work. In the proposed algorithm, the Master Server (MS) tracks the entire process of transferring the load from the Faulty Node (FN) to the target server. The proposed technique is mainly based on three approaches, i.e., HSMT, VMRMT, and VMRBU, which minimize the migration time, downtime, data transfer rate, and the total number of CPU cores used during migration. By shifting the load towards the master server and quickly identifying the faulty node, the overall migration time is reduced. The proposed scheme is simulated using the CloudSim simulator, and many experiments are performed to evaluate its efficiency. The results and discussion show that the proposed scheme outperforms the existing schemes.

The following are the key contributions of this paper:

  1. In this paper, a novel approach is proposed to solve the issues faced by VMs during live migration. It reduces the downtime, migration time, total CPU cores used, and data transfer rate to a great extent, making the overall system efficient.

  2. The proposed approach divides the servers into two primary components: the master server and the Slave Server (SS). The master server always keeps track of the slave servers, which helps in identifying faults immediately.

  3. The performance of the proposed scheme is evaluated through many experiments. The results and discussion show the efficiency of the proposed scheme over the existing schemes.

The rest of the paper is structured as follows. Some related works and key findings from the existing approaches are described in Sect. 2. Section 3 describes the entire proposed scheme. The experimental results and their discussion are given in Sect. 4, and finally, Sect. 5 concludes the paper with some future research directions.

2 Related Works

Many academicians and researchers have proposed approaches for the live migration of VMs. Beloglazov and Buyya (2012) have presented three algorithms, namely the Maximum Correlation Policy (MCP), Random Choice Policy (RCP), and Minimum Migration Time Policy (MMTP), which are used to migrate a VM to another host, selecting the VM that requires minimal time to complete the migration process. In their scheme, the host gets shut down frequently. Xu et al. (2016) have proposed an approach for predicting application performance on IaaS. They consider hardware heterogeneity and interference-aware VM provisioning using online measured resource utilization. The performance of this scheme is low when the hardware system is heterogeneous. Tao et al. (2016) have introduced an energy-consumption methodology based on bucket code to generate a token, i.e., a unit of bytes. Their Binary Graph Matching-based Bucket-code Learning Algorithm (BGM-BLA) uses a triple-optimization model to reduce energy consumption, migration cost, and communication time between VMs. However, no open-source tool has been employed in this work for comparison and analysis. Wu et al. (2016) have proposed an approach for cloud infrastructure based on storage area networks. The system developed in this approach can be applied to private clouds. The model can predict the overhead during VM allocation using various variables, like CPU utilization, I/O utilization, and VM CPU utilization. The limitation of this approach is that it has not been tested on various cloud infrastructures and hypervisors. Ahmad and Andras (2019) have suggested an approach that uses different cloud platforms to improve system efficiency under fault tolerance. This model uses two commercial cloud platforms, namely Microsoft Azure and Amazon Elastic Compute Cloud (EC2), to measure the scalability of the entire system. Here, the response time varies even when the same experiments are executed in different clouds.

Table 1 Comparative study on existing approaches

Moghaddam et al. (2020) have proposed an intelligent VM migration algorithm, i.e., Cellular Learning Automata-based Evolutionary Computing (CLA-EC), which minimizes the frequency of migrations during the placement and replacement of VMs on physical servers, thereby reducing the power consumption of data centers. However, the replacement of physical servers across data centers consumes more resources, which creates issues in many practical cases. Rajabzadeh et al. (2020) have suggested a resource monitoring method that uses memory and bandwidth efficiently by following a particular sequence: it first tests the critical state of the physical machines, then selects the exact VM for migration, and finally applies the VM migration policy. This scheme continuously tracks the load across the node servers, which generates overhead. Here, the node server provides the interaction between users and applications.

Rajapackiyam et al. (2020) have suggested a novel scheme based on a mirroring approach, in which memory pages are continuously transmitted from the source to the target. Their scheme includes an algorithm consisting of two layers, namely the data mover and the mirror block. In this scheme, a master node and a table are maintained to record the list of migrated VMs. This technique avoids overhead by not using iterative data transmission; however, it suffers from high migration time. In (Zhang et al. 2013), the authors have shown that OpenStack makes it easy to maintain server consolidation and load balancing across different nodes during live VM migration. To remove duplicate data blocks, a Static Chunking (SC) methodology is used in this scheme, which also saves overall transmission time. The main limitation of this scheme is that it fails to optimize the deduplication time, i.e., the amount of time required to process the same message twice within a particular time limit. Sun et al. (2016) have proposed an enhanced serial migration strategy based on the post-copy migration strategy. They have developed queuing models for quantifying performance metrics, such as the average waiting time and the blocking ratio of all the migration requests. To manage large data sizes in video summarization, an algorithm based on deep bidirectional analysis has been proposed in (Hussain et al. 2020); however, this scheme is slow in its processing. Table 1 compares and briefly details the existing schemes.

Table 2 Description of notations

3 Proposed Scheme

In the proposed scheme, a network is developed that consists of two clusters linked by an interconnection network. Each cluster consists of an MS and SSs. If the required resources are available, at least one TS from the set of slave servers is selected as the process migration server. Here, the entire workload is transferred to the master server when a fault arises. The MS manages every request and starts maintaining replica copies of the memory pages. To transfer the entire load to the TS, the MS first identifies the available resources of the candidate TSs. Then, it transfers the execution state of the VM to the selected TS. After that, all memory pages are shifted to the TS. The TS now manages the entire workload of the FN and responds to the MS. Memory pages are shifted regularly from the FN to the TS, and the MS is responsible for maintaining every replica copy of the FN. All the memory pages can be transferred immediately because of parallel processing, which supports maintaining replica copies at the MS while searching for the next TS to continue processing. Figure 2 depicts the creation of an MS from all running servers based on the ability to process all user requests. The MS checks for the existence of an FN. Then, a TS is selected as the highest-capacity node among all SSs. After the selection of the TS, the execution state of the VM is transferred from the MS to the TS, and all memory pages are processed on the TS. After that, the user's request and memory pages are loaded on the TS, and the request is processed there. Then, the replica on the MS is generated, and finally, a response is sent to the user.

Fig. 2 Framework of the proposed approach to migrate the entire workload from the FN to the TS

The proposed scheme mainly comprises three algorithms, namely HSMT, VMRMT, and VMRBU, and an objective function. HSMT optimizes the data transfer rate and downtime, VMRMT optimizes the migration time, VMRBU optimizes the actual bandwidth consumption to reduce the overall CPU core consumption, and the objective function is responsible for the overall optimization of processing users' requests from the FN to the TS. The proposed work ensures a low data transfer rate, so it indirectly consumes fewer CPU cores. There is no traffic congestion because the load is shifted gradually, which supports quick processing and results in less downtime and less migration time.

3.1 Host Selection Migration Time

This algorithm starts by receiving all users' requests and forwards them for VM creation, which supports continuous tracking of all nodes. Here, the MS is selected among the Running Servers (RSs) on the basis of the maximum capacity of a particular node, so that it can accept many requests during any unexpected fault. When a fault occurs, the processing of the user request gets stuck on the SS, and the SS stops sending acknowledgments to the MS. This signals to the MS that a fault has occurred. Now, the MS tracks the node where the process is stuck and marks that particular SS as the FN. After that, every request of the FN is managed by the MS. The proposed approach helps in the quick identification of the FN by sending a broadcast message to all SSs; if all SSs acknowledge the MS, then there is no need for migration. Otherwise, the execution state of the FN is transferred to the MS along with all memory pages for further processing of the user request. The VMs are then stopped at the FN, and the memory pages are returned to the MS. This approach easily handles the load of the FN through the MS and consumes less time, which results in low downtime. Due to the gradual transfer of data, the overall data transfer rate goes down, resulting in low network congestion. Therefore, the downtime decreases, and memory pages are quickly transferred from the FN to the MS. Algorithm 1 presents the working of HSMT, and Table 2 lists the notations used in this paper.

Algorithm 1: HSMT
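
As a rough illustration of the fault-detection and takeover logic of Algorithm 1, the following Java sketch shows how the MS could probe all SSs, mark a non-responding SS as the FN, and pull its execution state and memory pages. The interfaces and method names are assumptions for illustration, not the authors' implementation.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of HSMT: the MS probes all slave servers, marks the
// non-responding one as the faulty node (FN), then pulls its execution
// state and memory pages so it can serve the FN's requests.
public class HostSelectionMigrationTime {

    interface SlaveServer {
        boolean acknowledge();                    // false => no ack within timeout
        ExecutionState transferExecutionState();
        List<byte[]> transferMemoryPages();
        void stopVms();                           // VMs stop at the FN
    }

    interface MasterServer {
        void takeOver(ExecutionState state);      // MS handles the FN's requests
        void storeReplica(List<byte[]> pages);    // replica copy kept at the MS
    }

    record ExecutionState(long vmId, byte[] cpuContext) {}

    // Broadcast to all SSs; if every SS acknowledges, no migration is needed.
    static Optional<SlaveServer> findFaultyNode(List<SlaveServer> slaves) {
        return slaves.stream().filter(ss -> !ss.acknowledge()).findFirst();
    }

    static void handleFault(List<SlaveServer> slaves, MasterServer ms) {
        findFaultyNode(slaves).ifPresent(fn -> {
            ms.takeOver(fn.transferExecutionState());
            ms.storeReplica(fn.transferMemoryPages());
            fn.stopVms();
        });
    }
}
```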

3.2 VM Reallocation Migration Time

After Algorithm 1 completes, Algorithm 2 searches for the maximum-capacity node among the existing SSs and marks it as the TS. The maximum-capacity node that also has sufficient resources is selected to host the load from the MS. After that, the execution state of the MS is transferred to the TS. If sufficient resources are not available, the current TS is discarded and the next candidate TS is chosen, as shown in Fig. 2. Then, the user's request is transferred from the MS to the TS along with the memory pages. The running VMs are suspended after all the memory pages have been shifted to the TS. This approach consumes less time for migrating the VMs because the load is quickly transferred from the FN to the appropriate TS with the help of the MS, which maintains all tracking information and replica copies of the memory pages. This reduces the overall time of the entire migration process. VMRMT thus optimizes the migration time by immediately shifting all workloads from the MS to the TS.

Algorithm 2: VMRMT
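
The target-selection step of Algorithm 2 can be sketched as follows, assuming hypothetical capacity() and hasSufficientResources() views of each SS: the highest-capacity SS with sufficient free resources becomes the TS, and otherwise the next candidate is tried.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of VMRMT: pick the highest-capacity slave server
// with sufficient free resources as the target server (TS), then shift
// the execution state and memory pages from the MS to the TS.
public class VmReallocationMigrationTime {

    interface SlaveServer {
        long capacity();                                  // e.g., free MIPS or RAM
        boolean hasSufficientResources(long demand);
        void receiveExecutionState(byte[] state);
        void receiveMemoryPages(List<byte[]> pages);
    }

    static Optional<SlaveServer> selectTarget(List<SlaveServer> slaves, long demand) {
        return slaves.stream()
                .sorted(Comparator.comparingLong(SlaveServer::capacity).reversed())
                .filter(ss -> ss.hasSufficientResources(demand))  // discard, try next
                .findFirst();
    }

    static boolean migrate(List<SlaveServer> slaves, long demand,
                           byte[] state, List<byte[]> pages) {
        Optional<SlaveServer> ts = selectTarget(slaves, demand);
        ts.ifPresent(t -> {
            t.receiveExecutionState(state);   // the execution state moves first,
            t.receiveMemoryPages(pages);      // then all the memory pages
        });
        return ts.isPresent();
    }
}
```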

3.3 VM Reallocation Bandwidth Usage

After the completion of Algorithm 2, the VM starts to migrate from the FN to the TS. At first, the user's request is loaded along with the memory pages on the TS. After that, the response of the processed user's request is stored in a variable \(k\) in step 6. The response must be displayed to the requesting user once the fault has been fixed and the memory pages and workload have been shifted from the FN to the TS. When a produced result fails before reaching the user, it can be recovered from the replica of the response maintained at the MS (step 7). The data transfer in VMRBU is consistent, since all memory pages operate on the MS, which prevents the formation of dirty pages. This approach consumes less CPU for migrating the VMs from the FN to the TS through the MS. Algorithm 3 presents all the steps by which the user's request is not lost during migration and is managed through the MS if any fault occurs on the FN.

Algorithm 3: VMRBU
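
A minimal sketch of Algorithm 3's request handling, with hypothetical interfaces: the response is computed at the TS and held in a variable k (step 6), a replica is stored at the MS (step 7), and the replica covers a failed delivery.

```java
// Hypothetical sketch of VMRBU: the request is processed at the TS, the
// response is held in k (step 6), and the replica kept at the MS (step 7)
// recovers any result that fails before reaching the user.
public class VmReallocationBandwidthUsage {

    interface TargetServer { String process(String request); }
    interface MasterServer { void storeReplica(String response); String replica(); }
    interface User { boolean deliver(String response); }  // false => delivery failed

    static void serve(String request, TargetServer ts, MasterServer ms, User user) {
        String k = ts.process(request);   // step 6: response stored in k
        ms.storeReplica(k);               // step 7: replica maintained at the MS
        if (!user.deliver(k)) {
            user.deliver(ms.replica());   // recover from the MS replica
        }
    }
}
```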

3.4 Objective Function

The objective function identifies the optimum utilization of CPU cores, and minimizes the downtime, migration time, and data transfer rate during the live migration process. In this work, the cumulative time to migrate a VM is the weighted sum of the HSMT, VMRMT, and VMRBU times, as given in Eq. (1) (Mousavi et al. 2017). This helps the algorithms predict the migration time realistically and guarantees the algorithm's extensibility and adaptability. The variables \(A\), \(B\), and \(C\) represent the times taken to execute Algorithms 1, 2, and 3, respectively; to adjust the effects of these three variables in the optimization objective function, the weight coefficients \(\alpha\), \(\beta\), and \(\gamma\) are used. The values of these coefficients must sum to 1 to keep the result normalized, as all the variables contribute to one entity, i.e., the total time consumed.

$$\text{Total time consumed} = \alpha \times A\,(\text{HSMT}) + \beta \times B\,(\text{VMRMT}) + \gamma \times C\,(\text{VMRBU})$$
(1)

These three coefficients can efficiently balance the optimization of HSMT, VMRMT, and VMRBU. Datacenter administrators can adjust the coefficients to weight HSMT, VMRMT, and VMRBU differently, as shown in Eq. (1). When \(\alpha = 0.4\), \(\beta = 0.4\), and \(\gamma = 0.2\), the proposed algorithms perform best, as this setting reduces the execution time of the entire migration process and of user-request processing. The implementation of these algorithms shows optimization of the total migration time, which improves the objective function.
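
As a direct rendering of Eq. (1) under the coefficient setting reported above, the following Java snippet computes the weighted total; the sample times A, B, and C are illustrative placeholders, not measured values.

```java
// Weighted objective function of Eq. (1): total migration time as a
// normalized combination of the times taken by HSMT (A), VMRMT (B),
// and VMRBU (C). The coefficients must sum to 1.
public class ObjectiveFunction {

    static double totalTimeConsumed(double a, double b, double c,
                                    double alpha, double beta, double gamma) {
        if (Math.abs(alpha + beta + gamma - 1.0) > 1e-9) {
            throw new IllegalArgumentException("coefficients must sum to 1");
        }
        return alpha * a + beta * b + gamma * c;
    }

    public static void main(String[] args) {
        // Reported setting: alpha = 0.4, beta = 0.4, gamma = 0.2.
        // The times 12.0, 30.0, and 8.0 s are illustrative sample inputs.
        double t = totalTimeConsumed(12.0, 30.0, 8.0, 0.4, 0.4, 0.2);
        System.out.println("Total time consumed: " + t + " s");
    }
}
```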

4 Performance Analysis

This section presents the performance analysis of the proposed scheme. The experiments regarding the above algorithms are performed on the hardware and software discussed in Sect. 4.1.

4.1 Experimental Setup

The experiments are executed on a Dell XPS 8940 desktop. The hardware specifications of this computer system are an Intel Core i7-11700 processor (11th-generation CPU, 4.9 GHz), Windows 10 Pro OS (64-bit), 64 GB RAM, a 1 TB SSD, and a 2 TB HDD. Java 17 and CloudSim 3.0.3 are installed on the computer system for executing the experiments (Calheiros et al. 2011). In CloudSim, CPU utilization is managed on the basis of a uniform distribution of load. As presented in Table 3, different VM configurations define specific attributes, such as instruction length, input/output file size, and the number of processors required, for responding to user requests and other tasks. A datacenter defines a set of hosts with their hardware configurations, such as memory, CPU cores, and capacity, and these resources are provisioned based on the availability of hosts and other resources (Zhang et al. 2015).
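
To make the simulator configuration concrete, the following minimal CloudSim 3.0.3 sketch shows how VM instances of the kind listed in Table 3 could be declared and submitted to a broker; all numeric parameters and the "Xen" VMM label are illustrative placeholders rather than the actual Table 3 values.

```java
import java.util.Calendar;
import java.util.List;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.core.CloudSim;

// Sketch of declaring the three VM setups in CloudSim 3.0.3.
// All numeric parameters are placeholders; the actual values come from Table 3.
public class VmSetupExample {

    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false);  // one cloud user, no trace
        DatacenterBroker broker = new DatacenterBroker("broker");
        int userId = broker.getId();

        // Vm(id, userId, mips, pesNumber, ram(MB), bw, size(MB), vmm, scheduler)
        Vm vm1 = new Vm(1, userId, 1000, 1, 1024, 1000, 10000, "Xen",
                new CloudletSchedulerTimeShared());
        Vm vm2 = new Vm(2, userId, 2000, 2, 2048, 1000, 20000, "Xen",
                new CloudletSchedulerTimeShared());
        Vm vm3 = new Vm(3, userId, 4000, 4, 4096, 1000, 40000, "Xen",
                new CloudletSchedulerTimeShared());

        broker.submitVmList(List.of(vm1, vm2, vm3));
        // A Datacenter with matching hosts must also be created before
        // CloudSim.startSimulation() is called (omitted here for brevity).
    }
}
```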

In the experiments, three different types of VMs are considered, i.e., VM-1, VM-2, and VM-3; their configurations are given in Table 3. Live migration is automated using a Bash script (LeCun et al. 2010). The script triggers live migration after running VM-1 for 5 min; similarly, live migration is triggered after running the VM-2 and VM-3 setups for 15 and 20 min, respectively. The start and end times of the live VM migration are recorded in a log file for the data transfer measurements, and downtime is measured by sending ICMP echo requests from the client computer to the network hosts.
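
As a rough companion to the downtime measurement described above, the sketch below approximates the ICMP probing in Java. Note that InetAddress.isReachable only uses ICMP echo when the process has the required privilege and otherwise falls back to a TCP probe, so this is an approximation of the described setup rather than its exact tooling.

```java
import java.net.InetAddress;

// Rough approximation of the downtime measurement: probe the migrating
// VM's address at a fixed interval and report how long it stayed
// unreachable. Runs until the process is terminated.
public class DowntimeProbe {

    public static void main(String[] args) throws Exception {
        InetAddress vm = InetAddress.getByName(args[0]);  // VM address under migration
        long downSince = -1;
        while (true) {
            boolean up = vm.isReachable(200);             // 200 ms probe timeout
            long now = System.currentTimeMillis();
            if (!up && downSince < 0) {
                downSince = now;                          // outage started
            } else if (up && downSince >= 0) {
                System.out.printf("Downtime: %d ms%n", now - downSince);
                downSince = -1;                           // outage ended
            }
            Thread.sleep(100);                            // probe interval
        }
    }
}
```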

Table 3 Configurations of VM setup

4.2 Results and Discussion

To evaluate the performance of the proposed scheme, a variety of workloads, namely (1) compress, (2) compiler, (3) derby, (4) crypto, (5) scimark.small, (6) mpegaudio, (7) scimark.large, (8) sunflow, (9) serial, (10) XML, (11) ApacheBench, (12) MySQL, (13) startup, and (14) TensorFlow, are considered across multiple iterations of the live migration setups, i.e., VM-1, VM-2, and VM-3. The data transfer rate, CPU utilization, VM downtime, and migration time are calculated for 14 migrations of each workload on the VM-1, VM-2, and VM-3 setups for the proposed scheme, as well as for the existing schemes, i.e., KVM and POF-SVLM.

Figure 3(a), 3(b), and 3(c) show the comparisons of data transfer rate among the existing approaches and the proposed HSMT for 14 different workloads on the three VM setups, i.e., VM-1, VM-2, and VM-3, during live migration. The proposed HSMT approach keeps the data transfer rate low; as a result, it outperforms the traditional KVM and POF-SVLM approaches. The proposed approach transfers data to the MS and then makes all memory pages available at the TS. In this way, the total data transfer rate is reduced by 40–50%. For the derby and TensorFlow workloads, the workload is large; to accommodate it, the VM memory size is increased, as the live migration phase maintains a copy of the memory pages from the original node (FN) to the destination node (TS). The higher the data transfer rate, the longer it takes to migrate the VM.

Figure 4(a), 4(b), and 4(c) display the comparisons of VM downtime between the proposed HSMT and the existing approaches, namely KVM and POF-SVLM. HSMT reduces the downtime of the migration process by quickly identifying the FN. Here, an automated Bash script initializes the workloads and executes the live migration process, and a log file records the start and end times of the script execution for the entire experimental procedure. In the existing approaches, the downtime is higher because all the memory pages are transferred to the target machine, and the VM takes a long time to resume there. The results show an improvement: the downtime of the proposed approach is in the range of 0–20 s, compared to up to 60 s for the existing POF-SVLM approach. Thus, it reduces the total downtime by 70–80%. Further, since the VM memory size is large for workloads like derby and TensorFlow, their downtime is relatively high.

Fig. 3 Comparison of data transfer among KVM, POF-SVLM, and HSMT: (a) VM-1 setup, (b) VM-2 setup, and (c) VM-3 setup

Fig. 4 Comparison of downtime among KVM, POF-SVLM, and HSMT: (a) VM-1 setup, (b) VM-2 setup, and (c) VM-3 setup

Fig. 5 Comparison of CPU utilization among KVM, POF-SVLM, and VMRBU: (a) VM-1 setup, (b) VM-2 setup, and (c) VM-3 setup

Fig. 6 Comparison of migration time among KVM, POF-SVLM, and VMRMT: (a) VM-1 setup, (b) VM-2 setup, and (c) VM-3 setup

Figure 5(a), 5(b), and 5(c) show the CPU utilization for various workloads with different VM setups during live VM migration. In this experimental procedure, the number of CPU cores utilized during live migration is considered as the parameter. As Fig. 5(a), 5(b), and 5(c) show, CPU consumption is reduced by 60–70% due to VMRBU. When the VM starts to migrate from the FN to the TS, the data transfer in VMRBU is consistent, because the production of dirty pages is avoided by keeping all the memory pages running on the MS. In contrast, KVM and POF-SVLM send all the dirty pages to the destination, so a large number of dirty pages are transferred. These existing schemes take additional time to resume the VM at the destination node; thus, their CPU core utilization is higher.

Figure 6(a), 6(b), and 6(c) show that the migration time is less than 50 s except for extremely heavy workloads. Here, cloud users cannot detect the migration, because the entire load of the FN is quickly managed through the TS. By using VMRMT, the total migration time is reduced by 40–50% in comparison to the existing approaches. KVM transfers all memory pages and records every modified page in the VM memory as foul memory. POF-SVLM uses a shared network to send all the information regarding input and output requests to the target machine, which adds a longer delay to the migration time. Among all the comparisons, there is a significant change when the derby, serial, and TensorFlow workloads are executed, as the workload is larger because of the large VM memory size. The final results show an improvement in the overall time consumption of the live migration process by reducing the data transfer rate, CPU utilization, migration time, and VM downtime.

5 Conclusions and Future Works

In a cloud computing environment, it is sometimes required to move a running VM across servers during live migration without disconnecting the client or application. Accelerating the live migration process can help data centers function correctly and efficiently. In this paper, the main emphasis is on the live migration challenges, such as data transfer, migration time, CPU utilization, and downtime. To solve these problems, this paper proposes a novel scheme using three main algorithms, namely HSMT, VMRMT, and VMRBU. Here, in the first phase, the master server is created, and all users' requests are handled through it. In the second phase, an appropriate target server is selected; finally, the users' requests are processed, and the results are shown to the users while a replica copy of the memory pages is maintained at the master server. The experimental results demonstrate that the proposed scheme is better than the existing schemes: it reduces the total CPU cores used by 60–70%, downtime by 70–80%, data transfer rate by 40–50%, and migration time by 40–50%. In the future, the proposed scheme can be extended by developing a novel scheme for deallocating virtual machines when users are removed from the system. In addition, there is a huge scope for identifying failures that occur during live migration in a cloud computing environment.