Abstract
Providing fault tolerance for client-to-server applications is critical in data center and cloud computing environments. Virtualization offers a direct way of achieving high availability: the protected application is encapsulated in a virtual machine (VM), and the entire VM state is periodically checkpointed to a backup replica. However, existing VM replication solutions suffer from either excessive checkpointing overhead and network latency or unnecessary CPU resource consumption in the backup replica. In this study, we exploit the composition of output packets and observe that the replication system maintains external consistency if pre-released packets originate from already synchronized states. Furthermore, we transform the active-active primary and slave VM combination into an active-semiactive one by reducing the number of active virtual CPUs (vCPUs) in the slave VM. The former optimization improves performance in read-mostly client-to-server networked applications, whereas the latter relieves the problem of double scheduling on the slave host. We therefore propose COLO++, a non-stop service solution built on COLO that uses coarse-grained lock-stepping VMs for client-to-server systems. The two plus signs represent the two optimizations. Experimental results with COLO++ implemented on KVM and Linux show that it achieves nearly native VM performance under read-mostly workloads, as well as lower scheduling overhead in the backup replica.
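The pre-release condition described in the abstract can be illustrated with a small sketch. It is not the COLO++ implementation: the `Packet` and `OutputGate` classes, the page-granularity tracking, and the `source_pages` field are all hypothetical names and assumptions introduced here for illustration. The idea shown is only that a packet derived entirely from state already synchronized to the backup may be released immediately, while a packet that depends on unsynchronized state waits for the next checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Hypothetical: identifiers of the guest pages the payload was derived from.
    source_pages: frozenset


@dataclass
class OutputGate:
    """Toy model of an output gate on the primary VM (an assumption, not COLO++ code)."""
    dirty_pages: set = field(default_factory=set)   # modified since last checkpoint
    buffered: list = field(default_factory=list)    # packets held until next checkpoint
    released: list = field(default_factory=list)    # packets already sent to the client

    def mark_dirty(self, page: int) -> None:
        # Record that a page changed after the last synchronization.
        self.dirty_pages.add(page)

    def send(self, packet: Packet) -> bool:
        # Release early only if no source page is unsynchronized.
        if packet.source_pages & self.dirty_pages:
            self.buffered.append(packet)
            return False
        self.released.append(packet)
        return True

    def checkpoint(self) -> None:
        # After synchronization every page is clean, so held packets can go out.
        self.dirty_pages.clear()
        self.released.extend(self.buffered)
        self.buffered.clear()


gate = OutputGate()
gate.mark_dirty(7)                                   # a write touched page 7
read_reply = Packet(b"cached value", frozenset({1, 2}))
write_reply = Packet(b"ack", frozenset({7}))
early_read = gate.send(read_reply)                   # derived from clean pages only
early_write = gate.send(write_reply)                 # depends on dirty page 7
gate.checkpoint()                                    # flushes the held packet
```

Under this simplified model, a read-mostly workload rarely touches dirty pages, so most replies bypass the buffer, which is consistent with the abstract's claim about read-mostly performance. The sketch ignores per-connection packet ordering, which a real system would need to preserve.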
Cite this article
Chen, R., Chen, H. Asymmetric virtual machine replication for low latency and high available service. Sci. China Inf. Sci. 61, 092110 (2018). https://doi.org/10.1007/s11432-017-9292-9
DOI: https://doi.org/10.1007/s11432-017-9292-9