1 Introduction

Virtual machines (VMs) have become popular over the past two decades, and the annual growth rate of VM applications has increased significantly [15]. In addition to the many VM-related products offered by various vendors, emerging VM applications are found in fields such as green energy saving, cluster management, and behavior detection. Virtualization technology now plays not a secondary but a key role in many fields. Along with the expansion of virtualization technology, VM guest operating systems (Guest OSes) continue to improve their operational efficiency.

Hadoop [6–13] was inspired by Google's MapReduce and the Google File System (GFS) [14, 15]. A Hadoop cluster includes multiple worker nodes and a single master that runs the JobTracker, TaskTracker, NameNode, and DataNode. The Hadoop Distributed File System (HDFS) uses rack-awareness information when replicating data and tries to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure; even when such events occur, the data may still be readable. However, it takes a long time to restart the system after a failure.

For most people, embracing a new technology is a big challenge; the learning curve is daunting, and issues of reliability and stability are even worse. Hadoop, like other distributed systems, allows users to carry out complex computations on back-end resources while metadata links and resource allocation are handled in the front-end. Developers can use these features to achieve their service aims. In this paper, we run the Hadoop NameNode on virtual machines and develop a high-availability mechanism for it. An HDFS instance requires one unique server, the name node; thus, an HDFS installation has a single point of failure. If the name node goes down, the file system goes offline. When it comes back, the name node must replay all outstanding operations, which can take over half an hour on a big cluster. The file system includes a secondary NameNode, which regularly connects to the primary NameNode and takes snapshots of the primary NameNode's directory information, which are then saved to local or remote directories. These checkpoint images can be used to restart a failed primary NameNode without replaying the entire journal of file system actions; together with the edit log, they reconstruct an up-to-date directory structure.

Various challenges are faced when developing a distributed application [3, 16–21]. The first problem is hardware failure: the more pieces of hardware are used, the higher the chance that one of them fails. The second problem is that most analysis tasks need to combine data in some way, i.e., data read from one disk may need to be combined with data read from other disks. HDFS keeps redundant copies of the data in the system, so that in the event of a failure another copy is available; this is similar to the way RAID works. MapReduce offers a programming model that abstracts the problem away from disk reads and writes into a computation over sets of keys and values.

However, Hadoop currently does not support automatic recovery from NameNode failure, a well-known and recognized single point of failure in Hadoop. As mentioned on the official Hadoop site [6], if the NameNode machine fails, manual intervention is necessary; automatic restart and failover of the NameNode software to another machine are not supported. Hadoop infrastructure has become a critical part of day-to-day business operations. As such, it is important to find a way to resolve the single-point-of-failure issue surrounding the master node processes, namely the NameNode and JobTracker. Moreover, we follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that processes are constantly available for job execution and data retrieval. We leverage existing, well-tested components that are commonly used in Linux systems today. Our solution, called Virtualization Fault Tolerance (VFT), primarily makes use of the Distributed Replicated Block Device (DRBD) [1] from LINBIT and Heartbeat from the Linux-HA (High Availability) project. The combination of these projects provides a reliable and highly available solution that addresses the current limitations.

Virtualization is used not only to improve service flexibility but also to consolidate workloads and enhance server utilization. A virtualized system can be dynamically adapted to clients' demands by deploying new virtual nodes when demand increases and by powering off and consolidating virtual nodes during periods of low demand. In this paper, we employ the virtual machine management tool OpenNebula [22–24] to manage virtual machines and combine it with other open-source resources to achieve high availability for the Hadoop NameNode.

This paper is organized as follows. Section 2 presents background reviews and related work. Section 3 describes the system implementation, shows how the VFT mechanism is designed, and presents the interface of our virtual machine management tool. In Sect. 4, we design several scenarios to validate our system and mechanism. Finally, Sect. 5 outlines the main conclusions and future work.

2 Background review and related work

2.1 Apache project: HADOOP

Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop has its origin in Apache Nutch, an open-source web search engine that was part of the Lucene project. Hadoop is best known for MapReduce and its distributed file system (HDFS, renamed from NDFS); the term is also used for a family of related projects that provide infrastructure for distributed computing and large-scale data processing.

2.2 High availability

High availability [25] means "a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period." We focus on cloud configurations that remove as many single points of failure as possible and that are inherently designed with a specific emphasis on operational continuity, redundancy, and failover capability.

Floyd Piedad et al. [26] presented availability levels and their measurement in the HA field. They indicated that IT must understand the levels of availability required by users, and users must understand the costs of meeting those targets. Of all availability levels, continuous availability is the most challenging and expensive to provide; in our work, we take this topic forward and try to make HA feasible.

2.3 Fault tolerance technology

In this paper, we consider DRBD with Heartbeat to be a good fault tolerance solution. DRBD is a software-based, shared-nothing, replicated storage solution that mirrors the content of block devices (hard disks, partitions, logical volumes, etc.) between servers. DRBD is designed as a building block for forming an HA cluster: a whole block device is mirrored over a dedicated network, so the technology can be understood as network RAID-1. Figure 1 displays the DRBD architecture. The service, including its IP address, can be migrated to other nodes at any time, either due to a failure of the active node or as an administrative action. In HA terminology, the migration of a service is called failover, the reverse process is called failback, and a migration triggered by an administrator is called switchover [27].

Fig. 1 DRBD architecture

DRBD's core functionality is implemented by a Linux kernel module. In addition, DRBD constitutes a driver for a virtual block device, so DRBD is situated "right near the bottom" of a system's I/O stack. Because DRBD is extremely flexible and versatile, it is a replication solution suitable for adding high availability to almost any application. Heartbeat [28] is daemon software that provides cluster infrastructure (communication and membership) services to its clients. It allows clients to be aware of the presence of peer processes on other machines and to effortlessly exchange messages with them [30]. As shown in Fig. 2, DRBD with Heartbeat, which plays a very important role in our system, is a fault tolerance solution on Linux-based OSes.

Fig. 2 DRBD with Heartbeat

2.4 Virtualization technologies

Virtualization technology [1, 5, 18, 29–32] is an attractive way to implement cluster-based servers and to overcome cluster-related problems. Cluster nodes can be virtualized through virtualization platforms (Xen, KVM, VMware, etc.) and managed by an efficient virtual machine manager. A provisioning model is incorporated for dynamically deploying new virtual cluster nodes when user demand increases and consolidating virtual nodes when it decreases. Virtualization runs multiple virtual machines on a single physical machine, with each virtual machine sharing the resources of that physical computer across multiple environments.

Virtualization is simply the logical separation of the request for a service from the physical resources that actually provide it. In practical terms, virtualization offers the ability to run applications, operating systems, or system services in a logically distinct system environment that is independent of any specific physical computer system. Obviously, all of these have to run on some computer system at any given time, but virtualization provides a level of logical abstraction that liberates applications, system services, and even the operating system that supports them from being tied to a specific piece of hardware. By focusing on logical operating environments rather than physical ones, virtualization makes applications, services, and instances of an operating system portable across different physical computer systems. Through virtualization, one can execute applications under many operating systems, manage IT more efficiently, and share computing resources with other computers.

2.5 Virtual machine management

A key component in the virtualization scenario is the virtual machine management system. The VM manager provides a centralized platform for efficient and automatic deployment, control, and monitoring of VMs in a distributed pool of physical resources. Usually, the VM manager also offers high availability capabilities and scheduling policies [33]. Eucalyptus, OpenNebula, and Nimbus [22–24, 34] are three major open-source cloud-computing software platforms. The overall function of these systems is to manage the provisioning of virtual machines for clouds providing infrastructure as a service. These open-source projects provide important alternatives for those who do not wish to use a commercially provided cloud. In this paper, we employ OpenNebula to implement the research platform.

OpenNebula is a virtual infrastructure engine that enables the dynamic deployment and reallocation of virtual machines in a pool of physical resources. The OpenNebula system extends the benefits of virtualization platforms from a single physical resource to a pool of resources, decoupling the server from the physical infrastructure as well as from the physical location. OpenNebula consists of one front-end and multiple back-ends. The front-end provides users with access interfaces and management functions. The back-ends are installed on Xen servers, where the Xen hypervisor is started and virtual machines can be backed up. Communication between the front-end and the back-ends uses Secure Shell (SSH). OpenNebula gives users a single access point to deploy virtual machines on a locally distributed infrastructure.

2.6 Dynamic resource allocation

In our previous paper, we proposed a Dynamic Resource Allocation (DRA) algorithm [25], which is one of the key components of this work. Since this paper focuses on enhancing the Hadoop HA architecture, DRA is not described in detail here. The purpose of DRA is to achieve the best balance of resource allocation among physical machines. As shown in Fig. 3, to achieve maximum efficiency the resources must be evenly distributed; avoiding the concentration of computing resources on a few specific physical machines is therefore the most important issue.

Fig. 3 Dynamic resource allocation

2.7 Related works

Another way to achieve fault tolerance is to use OpenVZ [35], a container-based virtualization technology for Linux. OpenVZ creates multiple secure and isolated containers on a single physical server, enabling better server utilization and preventing applications from conflicting. J. Walters et al. [18] proposed using both checkpointing and replication in order to ensure the lowest possible checkpointing overhead. However, there are still open issues about how to integrate checkpointing and fault-tolerance systems into common cluster batch schedulers. Nevertheless, their work provides a good practical reference for handling fault tolerance for virtualization on a single site.

G. Vallee et al. [20] proposed a framework for addressing the fault tolerance issue. The framework enables the implementation of various fault tolerance policies, including policies presented in the literature that have not been validated by experimentation; coupled with their fault tolerance simulator, it forms a complete solution for the study of proactive fault tolerance policies. Their framework prototype provides a single policy based on Xen VM migration, while new policies are still under development. This is why such a framework needs to be managed via VM management tools, such as OpenNebula [36]. As shown in this study, the Xen VM migration issue has been solved under our framework.

Regarding fault tolerance mechanisms for Hadoop, a good solution was presented by Cloudera [11], a company focused on providing various Hadoop solutions. In 2009, Christophe Bisciglia presented an article, "Hadoop HA Configuration," which used Heartbeat and DRBD to enhance Hadoop HA and showed how to extend the approach to virtualization.

H. Zhong et al. [37] proposed an optimized scheduling algorithm to achieve optimal or suboptimal solutions for cloud scheduling problems. In the same research, the authors investigated the possibility of allocating VMs in a flexible way to allow maximum usage of physical resources. They used an Improved Genetic Algorithm (IGA) as the automated scheduling policy. IGA uses the shortest genes and introduces the idea of the dividend policy in economics to select an optimal or suboptimal allocation for the VM requests. This work inspired us to find an optimized algorithm to reach our goal.

Q. Chen et al. [38] proposed a Self-Adaptive MapReduce scheduling algorithm (SAMR) that dynamically computes the progress of tasks and automatically adapts to continuously changing environments. SAMR tunes the time weight of each stage of map and reduce tasks based on historical information, in order to trace task progress and identify tasks that need backup tasks.

3 System implementation

In this section, we introduce the system architecture and its components. OpenNebula plays a key role in the entire system; its main advantage is the live migration function, which is lacking in other virtualization management tools. In addition to OpenNebula's live migration function, we combine DRBD with Heartbeat to enhance the high availability of the system.

3.1 System overview

The system was mainly constructed based on the official OpenNebula manual. The OpenNebula core orchestrates three different management areas: image and storage technologies (i.e., virtual appliance tools or distributed file systems) to prepare disk images for VMs, the network fabric (such as Dynamic Host Configuration Protocol servers, firewalls, or switches) to provide VMs with a virtual network environment, and the underlying hypervisors to create and control VMs. The core performs specific storage, networking, or virtualization operations through pluggable drivers. Thus, OpenNebula is not tied to any specific environment but provides a uniform management layer regardless of the underlying infrastructure.

Figure 4 depicts an overview of our system architecture. We built a cluster system with OpenNebula and provided users a web interface to manage virtual and physical machines. The cluster consists of four computers with the same specifications: each is equipped with an Intel i7 2.8 GHz CPU, 4 GB of memory, a 500 GB disk, and the Debian operating system, and a gigabit switch connects them to the network.

Fig. 4 System overview

As depicted in the figure, from the bottom to the top of the infrastructure: the hosts are the physical machines Debian 1–3; Xen hypervisors are used since they suit Linux-based OSes; above them are two VMs, where VM 2 is the primary node and VM 1 is the secondary node (if the Hadoop NameNode were instead built on VM 1 as the primary node, VM 2 would become its slave node). Under the Heartbeat + DRBD mechanism, we use five IPs to deploy the system: two for the crossover link, two for identifying the primary and secondary nodes, and one for the service. Finally, on the top layer and playing the key role in the entire design, OpenNebula provides a centralized platform for efficient and automatic deployment, control, and monitoring of VMs on a distributed pool of physical hosts. We also composed a web-based management tool from DRA and OpenNebula's components to manage VMs.

Due to the limited number of physical IP addresses, we built a private network environment in our laboratory. To enable the HA mechanism, some preliminary work is needed. First, we set the IPs on both virtual machines. IP 192.168.123.210 is the Service IP controlled by Heartbeat and is used to provide services to users. In this configuration, VM 2 is the primary node (see Table 1 for its settings) and VM 1 is the secondary node (see Table 2), as shown in Fig. 5.

Fig. 5 Networking configuration of primary and secondary nodes

Table 1 VM2—primary node network setting
Table 2 VM1—secondary node network setting
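For concreteness, the sketch below shows what the primary node's Debian network configuration (/etc/network/interfaces) might look like under this IP plan. Only the Service IP 192.168.123.210 is given in the paper; the interface names and the other addresses are illustrative placeholders and may differ from the values in Tables 1 and 2. Note that the Service IP itself is not configured statically here, because Heartbeat raises it on whichever node currently holds the primary role.

# /etc/network/interfaces on the primary node (illustrative values only)
auto eth0
iface eth0 inet static
    address 192.168.123.201      # identifies the primary node on the service network
    netmask 255.255.255.0
    gateway 192.168.123.1

auto eth1
iface eth1 inet static
    address 10.0.0.1             # crossover link carrying DRBD/Heartbeat traffic
    netmask 255.255.255.252

# The Service IP 192.168.123.210 is managed by Heartbeat and is brought up
# only on the node that is currently primary.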

After downloading and installing the DRBD package [30], we set up the DRBD configuration file on both nodes as shown in Fig. 6, which presents part of the content of /etc/drbd.conf. As a reminder, the settings must be consistent on both the primary and the secondary node.

Fig. 6 Part of drbd.conf content
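Since the figure reproduces only part of the file, the sketch below shows the general shape of an r0 resource section in /etc/drbd.conf for a two-node setup such as ours. The node names debian-ha1 and debian-ha2 are those used later in our failover test; the backing disk and replication addresses are placeholders, not values copied from the original figure, and must match the actual machines.

resource r0 {
    protocol C;                      # synchronous replication
    on debian-ha1 {                  # must match `uname -n` on the primary
        device    /dev/drbd0;
        disk      /dev/sda3;         # backing partition (placeholder)
        address   10.0.0.1:7788;     # crossover-link IP (placeholder)
        meta-disk internal;
    }
    on debian-ha2 {                  # must match `uname -n` on the secondary
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}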

The following commands can be used to check the DRBD state:

#cat /proc/drbd

or

#drbdadm state r0
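As a rough guide to interpreting the output on a healthy, connected pair (the exact formatting varies with the DRBD version):

# Typical state on the primary node:
#   drbdadm state r0   ->  Primary/Secondary
#   /proc/drbd         ->  ... cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate ...
# On the secondary node the roles are reversed (Secondary/Primary);
# ds:UpToDate/UpToDate indicates that both replicas are in sync.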

Many options are available for the Heartbeat configuration; in the following, we show our method. Three main files were edited to configure the Heartbeat package:

  • /etc/ha.d/authkeys

  • /etc/ha.d/ha.cf

  • /etc/ha.d/haresources

First, the authkeys file should be identical on both servers. Remember to change its permissions with the following command:

#chmod 0600 /etc/ha.d/authkeys
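For reference, a typical authkeys file contains only an authentication method and a shared secret; the key shown below is a placeholder and must be replaced with your own value, identical on both nodes:

# /etc/ha.d/authkeys (identical on both nodes, mode 0600)
auth 1
1 sha1 ReplaceWithYourSharedSecret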

Second, the ha.cf file is used to define the general settings of the cluster. Our example is shown in Fig. 7.

Fig. 7 Part of ha.cf content
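Because the figure shows only part of our ha.cf, a minimal sketch of the kind of directives it contains is given below. The timing values, interface name, and peer address are illustrative placeholders; the node names must match the output of `uname -n` on each machine.

# /etc/ha.d/ha.cf (illustrative values)
logfacility  local0
keepalive    2                      # heartbeat interval in seconds
deadtime     30                     # declare a node dead after 30 s of silence
warntime     10
initdead     120                    # extra grace period at boot time
udpport      694
ucast eth1 10.0.0.2                 # heartbeat over the crossover link to the peer
auto_failback off                   # do not move services back automatically
node debian-ha1
node debian-ha2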

Finally, as shown in Fig. 8, the last file, haresources, defines all cluster resources that fail over from one node to the next. The resources include the Service IP address of the cluster, the DRBD resource "r0" (from /etc/drbd.conf), the file system mount, and the three Hadoop master node init scripts that are invoked with the "start" parameter upon failover.

Fig. 8 Part of haresources content
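A single haresources line of the kind described above might look like the sketch below (one logical line, identical on both nodes). The node name debian-ha1, the Service IP, and the resource name r0 come from our configuration; the DRBD device, mount point, file-system type, and Hadoop init-script names are placeholders for illustration.

# /etc/ha.d/haresources (one logical line; identical on both nodes)
debian-ha1 \
    IPaddr::192.168.123.210/24/eth0 \
    drbddisk::r0 \
    Filesystem::/dev/drbd0::/hadoop::ext3 \
    hadoop-namenode hadoop-jobtracker hadoop-secondarynamenode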

3.2 Virtualization fault tolerance methodology

Our approach to managing VMs is based on an efficient mechanism for reaching high availability under limited resources. Beyond this, studying fault tolerance on VMs and increasing reliability are the other topics we address in this paper. In order to provide continuous availability for applications in case of server failure, a detection method is needed.

The virtualization fault tolerance (VFT) mechanism has three main phases: virtual machine migration policy, information gathering, and keeping services always available.

Virtual Machine Migration Policy: DRA is applied to ensure the best performance in distributing load across the virtualization cluster.

Information Gathering: a detection mechanism retrieves all hosts and checks whether each host is alive. We probe the state of the hosts with a ping command every five minutes through a Linux cron job scheduled via "crontab" (a sketch of this job is given after the three phases).

Keeping Services Always Available: assume that VM m runs under the Heartbeat + DRBD mechanism and that Host n is about to become unavailable. Once Host n is shut down, if VM m is the secondary node, it is moved to an online host and booted automatically. If VM m is the primary node, the secondary node replaces it as the primary immediately; the previous primary node is then booted on an available host and becomes the secondary. In OpenNebula, the onevm command is used to submit, control, and monitor VMs; it helps us handle dead VMs and deploy them on other available physical hosts.
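To make the detection and recovery step more concrete, the sketch below outlines a cron-driven check of the kind described above. The script path, host names, and the way failed VMs are redeployed are our own illustrative assumptions; in particular, the exact onevm subcommands for redeploying a VM on another host depend on the OpenNebula version in use.

# /etc/cron.d/vft-check -- run the detection job on the front-end every five minutes
*/5 * * * * root /usr/local/sbin/vft_check.sh

# /usr/local/sbin/vft_check.sh (hypothetical sketch)
#!/bin/bash
HOSTS="debian1 debian2 debian3"            # physical hosts managed by OpenNebula
for h in $HOSTS; do
    if ! ping -c 3 -W 2 "$h" > /dev/null 2>&1; then
        logger "VFT: host $h appears to be down"
        # Redeploy the VMs that were running on $h onto a surviving host,
        # e.g. with `onevm deploy <vm_id> <host_id>` (version-dependent syntax).
    fi
done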

The workflow is shown in Fig. 9. Note that there is a constraint: the number of physical hosts must be at least three. This is the basic requirement of the VFT methodology and is explained later.

Fig. 9 The flow of virtualization fault tolerance

This flow is implemented as one of the scheduled programs and is deployed on the front-end. It is reasonable to place this function on the OpenNebula front-end, because OpenNebula controls all VM operations. Figure 10 depicts an example that explains how our VFT approach is triggered under single-failure events.

Fig. 10 The concept for how to trigger VFT

First, if Host A is shut down by an unexpected event, the front-end detects this a few minutes later and triggers VFT. Next, the secondary node VM 2 becomes the primary and takes over all services from the previous primary; this step is called FAILOVER. Finally, VM 1 is booted on Host C automatically and becomes the secondary node; this step is called FAILBACK.

4 Experimental environment and results

4.1 Experimental environment

In our experimental environment, each server has the same specifications. Table 3 lists the CPU, memory, and storage capabilities of the servers. We measured their basic performance with well-known benchmarks and collected the experimental data with Apache JMeter, designing experiments to measure server performance as well as throughput. Apache JMeter [39], one of the Apache projects, is a well-known open-source, 100 % pure Java desktop application designed to test functional behavior and measure performance. It was originally designed for testing web applications but has since been extended with other test functions.

Table 3 Hardware specification of lab servers

4.2 Networking capability

In this section, we evaluate the proposed architecture by studying the effect of virtualizing the worker nodes compared with running on the physical hosts. In order to quantify the different network throughputs of local and remote nodes, we compare the transfer times of files of different sizes between the physical host and the virtual machine over the HTTP protocol. Under the same conditions, Table 5 compares throughputs over HTTP for various file sizes and thread counts. Tables 4 and 5 confirm a clear result: the virtual machine performance is slightly lower than that of the physical machine, which matches our expectation. Figures 11 and 12 compare the network performance and the throughput of the physical host and the virtual machine, respectively.

Fig. 11 Networking performance of physical host and virtual machine

Fig. 12 Throughputs between physical host and virtual machine

Table 4 Comparison of physical host and virtual machine networking performance
Table 5 Throughputs between physical host and virtual machine
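For reproducibility, measurements of this kind can be driven from the JMeter command line. The plan and output file names below are placeholders rather than the actual files used in our experiments; the test plan itself (HTTP samplers, thread counts, file sizes) is built in the JMeter GUI beforehand.

# Run a prepared test plan in non-GUI mode and record the samples
#jmeter -n -t http-throughput.jmx -l results.jtl
#   -n  non-GUI mode
#   -t  test plan created in the JMeter GUI (HTTP requests of various file sizes)
#   -l  log file containing per-sample latency and throughput data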

4.3 Live migration of virtual machine

We performed migration tests on an identical pair of server machines, each with eight 2.8 GHz Intel i7 CPU cores and 4 GB of memory, connected via a switched Gigabit Ethernet. Before migration, the daemons required 1 GB of memory on each host; thus, the maximum available memory was 3 GB per host. There was only one VM, on Host-A, using 1 GB of memory, and no VM on Host-B. We migrated the VM from Host-A to Host-B. Figures 13 and 14 show the variations in memory usage of Host-A and Host-B, respectively.
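In OpenNebula, the live migration used in this test is issued from the front-end with the onevm command. The VM and host identifiers below are placeholders, and the exact subcommand name differs between OpenNebula releases (e.g. onevm livemigrate in older versions versus onevm migrate --live in newer ones):

# List VMs and hosts to obtain their numeric IDs, then live-migrate VM 0 to host 1
#onevm list
#onehost list
#onevm livemigrate 0 1      # or: onevm migrate --live 0 1, depending on the version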

Fig. 13 Migration memory state of Host-A

Fig. 14 Migration memory state of Host-B

4.4 Hadoop NameNode failover

In this experiment, we used the settings listed in Tables 6 and 7 and built HDFS on a VM with one live node and 28.61 GB of space. In this scenario, we monitored the HDFS failover while downloading. Another tool used in this test is FUSE [40], chosen because it allows users to operate on HDFS as if it were a local disk.

Table 6 Planned hosts
Table 7 Part of properties of hdfs-site.xml
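As an illustration of how such a test can be driven, HDFS can be mounted through FUSE and accessed like a local directory, for example with the hadoop-fuse-dfs wrapper shipped with some Hadoop distributions. The NameNode port and mount point below are placeholders, and the mount targets the Heartbeat-managed Service IP so that it survives a failover; the dfs.name.dir note is likewise an assumed path on the DRBD-backed file system, not a value copied from Table 7.

# Mount HDFS through FUSE against the Service IP (placeholder port and mount point)
#mkdir -p /mnt/hdfs
#hadoop-fuse-dfs dfs://192.168.123.210:8020 /mnt/hdfs
# A large file copied out of /mnt/hdfs serves as the download under test.
# Note: dfs.name.dir in hdfs-site.xml should point at a directory on the
# DRBD-backed mount so that the NameNode metadata follows the failover.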

When HDFS began downloading, the primary node (debian-ha1) was terminated as planned. The download was interrupted for about 10–20 seconds; the NameNode then resumed automatically on debian-ha2. This result shows that our design works in the Active/Standby configuration. Moreover, because the metadata are managed by DRBD, the entire HDFS does not crash under an unexpected system outage. This is a real enhancement for the Hadoop NameNode, because many reported issues relate to NameNode failures after unexpected system shutdowns.

4.5 VFT experiment

In this scenario, we designed an experiment to validate whether VMs migrate automatically when a host goes offline, as shown in Fig. 15. The Service IP provides a service channel to external users: users access services through this IP, which is also called the VIP in DRBD terminology. Node 2 is the primary node and Node 1 is the secondary; the difference between them is that only the secondary node is allowed to take over the service if the primary node goes down. Debian 1, Debian 2, and Debian 3 are the physical hosts. Node 2 runs on Debian 1, and the secondary node, Node 1, runs on Debian 2.

Fig. 15 VFT experiment environments

Figure 16 shows that after Debian 1 is disconnected, the Service IP does not stop responding; only one packet is lost while the failover takes place. The reason is that the entire system runs under the VFT mechanism, in which the secondary node can immediately replace the primary node if it goes down. Finally, Node 2 is rebooted automatically in about 5 to 7 minutes and becomes the secondary. The benefit of the VFT mechanism is a minimal downtime interval: although we cannot guarantee the integrity of data during the downtime period, we can reduce the downtime interval and provide a low-cost solution for enterprises.

Fig. 16 Measurements of ping loss
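The ping-loss measurement shown in Fig. 16 can be reproduced with a plain ping against the Service IP while the primary host is powered off; the log file name below is a placeholder:

# Ping the Service IP once per second and keep a log for later packet-loss counting
#ping -i 1 192.168.123.210 | tee failover-ping.log
# After the test, interrupt ping with Ctrl-C; its summary line reports the packet loss.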

5 Conclusions

High availability is achieved with a Hadoop NameNode Active/Standby architecture. Under this architecture, the service fails over when the primary node fails. The most valuable improvement in this paper is that, by keeping at least three physical hosts available, a primary and a secondary node will always exist. Four key components make up this work: the Xen hypervisor, OpenNebula, DRBD with Heartbeat, and the VFT mechanism; each is important and indispensable in our architecture. Systems with continuous availability are comparatively expensive, and most are carefully implemented with special designs that eliminate any single point of failure and allow online upgrades, patches, and replacements of hardware, network, operating system, middleware, and applications. In future work, we plan to extend our fault tolerance work beyond failure management in order to enhance the resource utilization efficiency of virtualization clusters.