1 Introduction

Cloud computing is based on leveraging the Internet to consume software or other IT services on demand. End users share processing power, storage space, bandwidth, memory and software. In cloud computing, resources are utilized efficiently by sharing them among different groups of users, and users pay as they consume, using only what they need at any given time. A survey of IT decision makers reported that 68% of respondents believed that, by 2014, more than 50% of their company's IT services would be migrated to cloud platforms. In the contemporary era of Big Data, data volume is growing faster than storage capacity (Gantz and Reinsel 2012). Each week, Facebook requires an extra 60 TB of storage just for new photos (Beaver et al. 2010). YouTube users upload over 400 h of video every minute, which requires 1 petabyte of new storage every day (Baesens 2014). According to the International Data Corporation (IDC)'s sixth annual study, digital data will double every two years until 2020 (Gantz and Reinsel 2012), and by 2020 approximately 40% of the data in the digital universe will be stored or processed by cloud computing providers (Gantz and Reinsel 2012). Such scale also magnifies the impact of failures: Amazon Simple Storage Service (S3) encountered a data corruption problem caused by a load balancer bug (Wang et al. 2015), and Amazon Web Services (AWS) suffered major disruptions due to a DynamoDB failure (AWS 2016).

Large volumes of data are now generated due to advances in technology and the increasing number of Internet users. Cloud computing has become a viable solution for data processing, distribution and storage. It provides storage as a service, allowing users to store their data remotely and access it whenever and wherever they need, through the Internet. Cloud storage is a data storage model based on a virtualized infrastructure in which data is stored in a logical pool whose physical storage spans multiple servers. The logical pool contains data that can be accessed by several users simultaneously. To improve access efficiency, data replication is considered the most viable solution. Replication is the process of creating an exact copy of the original data, either in the same data center or in a remote location. It improves access efficiency, response time and data availability, and reduces bandwidth consumption, migration cost and user waiting time. There are two types of replication: (1) static replication and (2) dynamic replication. In static replication, the number of replicas is fixed in advance (usually three) based on history; this is not effective, since there is no need to replicate every data chunk. Dynamic replication, in contrast, replicates data according to the varying needs of cloud consumers and to changes in the environment (Milani and Navimipour 2016). However, several major concerns arise in the replication process: which data should be replicated? When should replication be performed? Where should the replica be placed? How many new replicas should be created? Which replica should be provided to the user? Is the load balanced across the storage? And, finally, how should the replicas and the storage itself be managed?

Intelligent water drop (IWD) is a swarm-based, nature-inspired optimization algorithm that captures a few essential elements of natural water drops and the actions and reactions that occur between a river's bed and the water drops flowing in it (Shah-Hosseini 2009). Each IWD has two main attributes: (1) its velocity and (2) the amount of soil it carries. When an IWD moves from one location to another, the increase in its velocity is inversely proportional to the soil on the path, and the soil carried by the IWD increases as it moves between locations. The IWD selects a path based on the amount of soil on each candidate path. The IWD algorithm can be applied to optimization problems, and it builds solutions incrementally; it is therefore a population-based constructive optimization algorithm. The IWD algorithm has been used for the travelling salesman problem (TSP) and the multiple (or multidimensional) knapsack problem (MKP) with promising results; both the TSP and the MKP are NP-hard combinatorial optimization problems.
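
For concreteness, the sketch below shows one IWD move using the standard transition and update rules of the original formulation (Shah-Hosseini 2009). It is a minimal illustration only: the parameter values (aV, bV, cV, aS, bS, cS, rhoN, eps) and all class and method names are assumptions introduced here, and soil values are assumed non-negative.

```java
// Minimal sketch of one IWD move over a graph with per-edge soil values.
// Parameter values and names are illustrative assumptions, not taken from the paper.
import java.util.Map;
import java.util.Random;

class IwdStep {
    static final double aV = 1, bV = 0.01, cV = 1;   // velocity update parameters
    static final double aS = 1, bS = 0.01, cS = 1;   // soil update parameters
    static final double rhoN = 0.9, eps = 0.0001;    // soil-update weight and smoothing term

    double velocity = 100;       // current velocity of this water drop
    double carriedSoil = 0;      // soil accumulated along the traversed path

    /** Probabilistic next-node choice: edges with less soil are more likely. */
    int chooseNext(Map<Integer, Double> soilToNeighbours, Random rnd) {
        double sum = soilToNeighbours.values().stream()
                .mapToDouble(s -> 1.0 / (eps + s)).sum();
        double r = rnd.nextDouble() * sum, acc = 0;
        int last = -1;
        for (Map.Entry<Integer, Double> e : soilToNeighbours.entrySet()) {
            acc += 1.0 / (eps + e.getValue());
            last = e.getKey();
            if (acc >= r) return last;
        }
        return last;  // fallback for floating-point rounding
    }

    /** Update the drop's velocity and carried soil; return the new soil on the edge. */
    double traverse(double edgeSoil, double edgeLength) {
        velocity += aV / (bV + cV * edgeSoil * edgeSoil);   // less soil -> larger speed-up
        double time = edgeLength / velocity;                // faster drop -> shorter travel time
        double dSoil = aS / (bS + cS * time * time);        // soil removed from the river bed
        carriedSoil += dSoil;
        return (1 - rhoN) * edgeSoil - rhoN * dSoil;        // updated soil on the edge
    }
}
```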

Data is not in demand at all times. Data that is required may lose its validity because of many factors; accuracy and timing constraints play a major role in determining its importance. Data received at the proper time is considered the most useful, whereas data retrieved too late loses its value. Some data segments are important only for a certain amount of time, after which they lose their credibility and popularity; other data become obsolete only after a long period. Nevertheless, such data are stored and preserved because they may be required at some point, so a very large amount of storage holds data of low or no priority in the near future. In addition, the replication process in the cloud stores duplicate copies of the same data at different locations to keep the data available at all times, which wastes a large amount of storage. The process therefore costs additional storage space for data of little importance, so there is a need to properly manage the replication mechanism and simultaneously optimize the storage itself (Mansouri and Javidi 2018; Milani and Navimipour 2016). Hence, to match the changing access patterns of consumers, the replication strategy itself should be dynamic in order to maintain high availability with improved performance. To address these issues, we design a novel dynamic data replication strategy that improves the access efficiency of cloud storage using a combinatorial optimization algorithm.

In this paper, we propose a novel replication strategy to improve the access efficiency of cloud storage while preserving QoS attributes such as high availability, performance and load balancing. The rest of this paper is organized as follows: Sect. 2 reviews related work on the problems and techniques used for dynamic data replication. Section 3 describes the proposed architecture and methodology in detail. Section 4 presents experimental results and discussions on the test cases. Finally, Sect. 5 concludes the paper and outlines future work.

2 Related work

Wu et al. (2014) proposed the index name server (INS) to manage file storage, de-duplication and load balancing. As the amount of generated data continues to increase, redundant copies of the same data appear, which affects network bandwidth and server workload. To reduce the workload caused by duplicate files, the INS is used. Duplicate copies of data are eliminated by a de-duplication technique that divides files into chunks and calculates a unique hash code, or fingerprint, for each chunk. After checking the fingerprint of the requested file, the INS determines whether the same fingerprint already exists in storage; if so, the file chunk is not uploaded and only a reference to the existing data is created. This improves the performance and bandwidth associated with data transmission, but the response time of the cloud storage and the availability of data to multiple users are not considered.

Sun et al. (2012) used a dynamic data replication strategy (D2RS) that replicates only the popular data files and places them in a balanced way, since replicating all data would overload the storage nodes and increase bandwidth usage due to data transfer. Popular data files are identified from access histories by assigning different weights to accessed data: more recently accessed data is given a larger weight, and if this popularity (the replication factor of the data) exceeds a threshold, namely the replication factor of the system, the replication process is triggered. The number of replicas to be created is determined from the availability of the existing file; a file with a lower replication factor can be removed and the new replica placed in its stead. The system architecture is hierarchical, comprising super data centers, main data centers and ordinary data centers. This architecture minimizes access time and network load by distributing replicas from super data centers to main or ordinary data centers.

Jeon et al. (2012) proposed a data replication strategy (DRS) to improve data access in the cloud environment. The performance of the DRS depends on the user access patterns, which are flexible and unpredictable. The algorithm dynamically changes the replication strategy based on changes in the user access pattern: it analyzes past data access patterns, perceives the changes and identifies the optimal strategy. The system contains a job broker (JB) and multiple sites, each containing a work node, a replica manager (RM) and a storage node (SN). The user submits requests to the JB, which in turn dispatches them to the various sites. The work node performs the job using data from the SN. The RM contains a pattern recognizer (PR) and a replication optimizer (RO): the PR recognizes changes in the data access pattern and the RO dynamically chooses the algorithm that optimizes the performance of the system. This dynamic switching algorithm perceives changes based on changes in performance and switches the strategy accordingly.

According to Li et al. (2011) and Wei et al. (2010), cloud storage should provide data with high efficiency and reliability in a cost-effective manner, and data replication satisfies the reliability requirement of the storage. In static replication the number of replicas is fixed, usually at three; this provides reliability but requires more storage space for the three copies of the data, which in turn increases the storage cost. A novel cost-effective dynamic data replication strategy called cost-effective incremental replication (CIR) was therefore proposed. CIR reduces the storage cost and fully utilizes the storage resources by creating replicas incrementally while ensuring reliability. The approach uses a reliability model that considers the relationship between reliability and the number of replicas: the minimum number of replicas is initially one, and more replicas are created incrementally over time to meet the reliability requirement. This approach is suitable for temporarily used data with low reliability requirements.

Grace and Manimegalai (2014) reviewed dynamic data replication techniques that enhance data availability and increase reliability, which is needed for compute-intensive applications. The two important challenges in creating duplicates of data are replica placement and replica selection. Replica placement can be logically divided into three stages: replication decision, replica selection and file replacement. The replication decision stage decides when and where to create the replica; the replica selection stage decides which file needs to be replicated; and the file replacement stage is activated if the best site for the new replica does not have sufficient storage space. Replica selection techniques decide which replica location is the best place for users to access the data. If several replicas of a file are available, the optimization algorithm selects the optimal replica based on parameters such as access cost, access latency, bandwidth consumption, balanced workload, maintenance cost, job execution time, response time, fault tolerance and quality of assurance. The duplicate copies can then be placed in appropriate nodes and selected for job execution. The replica placement and selection techniques also depend on the dynamic architecture of the grid, in which users can join and leave at any time. Various replica placement and selection strategies were discussed; these algorithms measure and analyze parameters such as bandwidth consumption, access cost, scalability, execution time, makespan and storage consumption.

Niu et al. (2013) proposed an intelligent water drop (IWD) algorithm to produce a schedule considering multiple objectives in the multi-objective job shop scheduling problem, which is the process of allocating resources to a set of tasks based on criteria such as makespan, tardiness and mean flow time of the schedule. The IWD algorithm is one of the recent bio-inspired, swarm-based optimization algorithms and follows the natural movement of water drops flowing in rivers and lakes. It considers two main attributes: (i) the velocity of the IWD and (ii) the amount of soil on the path. When an IWD moves from one location to another, the increase in velocity is inversely proportional to the soil on the path, and the soil of the IWD increases as it moves between locations. The IWD selects a path based on the amount of soil on it. The path followed by the IWD gives the schedule, and the locations traversed from the source to the target are the consecutive operations needed to complete a job. For each path, the velocity and soil of the IWD are updated and used to select the optimal schedule.

According to Long et al. (2014), large data sets are becoming an important part of shared resources, with data volumes growing to petabytes that are stored in cloud storage. Such data can be managed in a distributed manner through data replication, which raises two questions: (i) how many replicas of each data item should be created, and (ii) where should these replicas be placed? To address this replica management problem, a multi-objective off-line optimization approach with an improved artificial immune algorithm is used. The strategy addresses five objectives: file unavailability, mean service time, load variance, energy consumption and mean access latency, and formulates a mathematical model describing these objectives along with their trade-offs. The MORM strategy selects individuals (data nodes) from the population, based on the available storage space, to place the replicas. The individuals with the highest fitness are cloned and then mutated, and the cloning, mutation and selection processes are executed repeatedly. Since it is a static approach, full knowledge of the service time and access rate of each node is required.

Wang et al. (2013) describe that data plays an important role in data-intensive services; hence the cost and access time of the data greatly influence the quality of service. Many challenges are involved in data-intensive services, including the cost of transferring data, the dynamic nature of the cloud, data replication and storage cost. To handle these challenges, a data replica selection based on the ant colony algorithm is used, in which the ants select the data replica such that the response time and the cost of transferring data from the server to the endpoint are minimized. There are various pricing models, such as usage-based, package-based, flat-fee subscription-based and combination-based pricing, and service providers can switch among them to charge users. The cost is calculated from the price of the replica, the network bandwidth between the endpoint and the replica site, and the size of the data. The data replica is selected by a data replica (DR) agent. Once the DR agent is invoked, the ants are placed at the service endpoint and move from the endpoint to each replica site. Each replica has an initial pheromone density based on which the ants choose a replica. The ants perform a local pheromone update after finding a set of data replicas, using the pheromone intensity of the visited replicas; after all ants return to their service endpoint, a global pheromone update is performed. This helps in selecting the appropriate replica, which reduces the response time and the transfer cost.

3 Proposed architecture and methodology

The primary objectives of the proposed D2R-IWD architecture are as follows:

  (a) To identify the problems that exist in the process of dynamic data replication in cloud storage.

  (b) To design a novel dynamic data replication strategy to improve the access efficiency of cloud storage while preserving QoS attributes of the cloud such as high availability, performance and load balancing.

  (c) To propose an optimization algorithm to effectively manage the replicas and thereby improve the performance of cloud storage.

The major concerns of replication are the number of replicas to be created, the location of each new replica, the selection of data for replication, the time at which to trigger the replication process, the management of replicated data and the provision of the optimal replica to the user. Hence, our aim is to design a novel dynamic data replication strategy that improves the access efficiency of cloud storage while preserving QoS attributes of the cloud such as high availability, performance and load balancing. D2R replicates data dynamically based on the number of accesses to the files and places replicas considering the migration cost. The management of the replicated data and the provision of the optimal replica to the user are done using the intelligent water drop (IWD) algorithm. The strategy also periodically improves the capacity of the storage by removing unwanted replicas while keeping the load on the storage balanced.

The data service system of the cloud contains a scheduling broker, a cloud broker, data center brokers and data centers. The scheduling broker is responsible for interacting with the end users: users submit tasks to the scheduling broker through the user interface, and the tasks are loaded into the task queue. All tasks to be served are processed in first-come, first-served (FCFS) fashion. The cloud broker decides which data center should provide the service for each task, and each task in the queue is forwarded to the appropriate data center. Each data center contains a replica catalog, in which all the files stored in that data center are indexed together with their replicas. A replica log keeps track of all replica access activities and of the users that access each replica, along with timestamps. Each data center also has a data center broker, which maintains information about the data stored at any point in time, such as the location of the data, the number of accesses to it and its size. The data center broker is involved in the replication process and in the management of replicated data. The proposed D2R-IWD in a generic cloud architecture is represented in Fig. 1.

Fig. 1
figure 1

D2R-IWD in generic cloud architecture
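
The sketch below illustrates the kind of bookkeeping a data center broker could keep for the replica catalog and replica log described above. All class and field names (ReplicaEntry, AccessRecord, and so on) are assumptions introduced for illustration, not identifiers from the paper or from CloudSim.

```java
// Illustrative sketch of the per-data-center bookkeeping described above.
// Names and structure are assumptions, not taken from the paper.
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicaEntry {                 // one row of the replica catalog
    String fileId;
    List<String> storageNodes = new ArrayList<>();  // where the replicas live
    long sizeBytes;
    long accessCount;                // later feeds the popularity degree
}

class AccessRecord {                 // one row of the replica log
    String fileId;
    String userId;
    Instant timestamp;
}

class DataCenterBrokerState {
    Map<String, ReplicaEntry> replicaCatalog = new HashMap<>();
    List<AccessRecord> replicaLog = new ArrayList<>();

    void recordAccess(String fileId, String userId) {
        AccessRecord r = new AccessRecord();
        r.fileId = fileId; r.userId = userId; r.timestamp = Instant.now();
        replicaLog.add(r);
        replicaCatalog.get(fileId).accessCount++;   // file assumed already registered
    }
}
```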

First, we monitor the storage, considering the set of files, their replicas and their locations in the data center. The storage is created with a specific name, bandwidth and capacity. The files are divided into fixed-size chunks and distributed across the storage. The storage is monitored by calculating the bandwidth utilized, the total number of accesses to the storage and the available storage space. Second, the dynamic data replication process calculates a replication factor for each file based on its popularity degree. A dynamic threshold is also calculated from the replication factor of the entire system and an adjustable system performance parameter; this threshold decides when to initiate the replication process in the dynamic environment. When the replication factor of a file exceeds the dynamic threshold, the replica manager is triggered to initiate the replication process automatically. Finally, the optimization and management of the replicas, which improve the access efficiency of the cloud storage and of the storage itself, are done using the swarm-based intelligent water drop algorithm.

3.1 Dynamic data replication

The data is replicated considering the dynamic changes in the cloud environment. Data with higher popularity, identified from the number of access requests to it, is chosen for replication. Figure 2 presents the dynamic data replication strategy, which takes the list of files and the storages as input. The entire workflow of the replication process is described stepwise as follows.

Fig. 2
figure 2

Algorithm 1—dynamic data replication strategy

  • Step 1: The popularity of the data specified by the user as popularity degree (PDi) is calculated based on the number of access to the corresponding data.

  • Step 2: The replication factor (RFi) is calculated based on the popularity degree, number of replicas already available and the size of the data file.

  • Step 3: Dynamic threshold value (TH) is determined from the replication factor of each data file.

  • Step 4: The constraint (RFi > TH) is checked for each file and if it is satisfied, then it will trigger the process of replication. Otherwise, the process is terminated.

  • Step 5: The number of replicas for each file satisfying the condition in step 4 is calculated.

  • Step 6: The final step of the replication strategy is to identify the appropriate storage to place the replica for user access.

The popularity degree (PDi) of file fi is calculated from the number of accesses (ani) and the time-based forgetting factor (wi), as in Eq. 1.

$$PD_{i} = an_{i} \times w_{i}$$
(1)

Based on the popularity degree the replication factor (RFi) for each file is calculated as in Eq. 2.

$$RF_{i} = \frac{{PD_{i} }}{{RN_{i} \times FS_{i} }}$$
(2)

where RNi is the number of replicas and FSi is the size of the data file.

The time-based forgetting function represents the loss of popularity of data as it ages and is derived from the number of accesses to the data over a period of time.
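
The paper does not fix a closed form for wi; one plausible choice, given here purely as an illustrative assumption, is an exponential decay in the time elapsed since the last access to the file:

$$w_{i} = e^{ - \lambda \left( t_{now} - t_{i}^{last} \right)} ,\quad \lambda > 0$$

where \(t_{i}^{last}\) is the time of the most recent access to file fi and λ controls how quickly the popularity fades; any monotonically decreasing function of file age would serve the same purpose.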

The dynamic threshold (TH) is calculated from the replication factor of the entire system (RFsys) and the adjustable system performance parameter (α), as in Eq. 3.

$$TH = \min \left( \left( 1 + \alpha \right) \times RF_{sys} ,\; \max_{k \in \left[ 1,2, \ldots ,l \right]} RF_{k} \right),\quad \alpha \in \left[ 0,1 \right]$$
(3)

The replication process is triggered for the files whose replication factor is greater than the dynamic threshold (RFi > TH). The number of replicas to be created is determined from the availability of the existing file. A suitable location for replica placement is determined from the access information of the storage nodes, so as to balance the load on the system.
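
As a concrete illustration of Eqs. 1–3 and the trigger condition, the following sketch computes the popularity degree, the replication factor and the dynamic threshold, and returns the files to replicate. Class, field and method names are assumptions, and RFsys is taken as an input value rather than computed here.

```java
// Minimal sketch of the replication decision in Eqs. (1)-(3); names and the
// treatment of RF_sys as a plain input are illustrative assumptions.
import java.util.List;
import java.util.stream.Collectors;

class FileStats {
    long accessCount;        // an_i: number of accesses to the file
    double forgettingFactor; // w_i: time-based forgetting factor
    int replicaCount;        // RN_i: replicas currently available
    double sizeMB;           // FS_i: size of the data file

    double popularityDegree() {                       // Eq. (1): PD_i = an_i * w_i
        return accessCount * forgettingFactor;
    }

    double replicationFactor() {                      // Eq. (2): RF_i = PD_i / (RN_i * FS_i)
        return popularityDegree() / (replicaCount * sizeMB);
    }
}

class ReplicationTrigger {
    /** Eq. (3): TH = min((1 + alpha) * RF_sys, max_k RF_k). */
    static double threshold(List<FileStats> files, double rfSystem, double alpha) {
        double maxRf = files.stream()
                .mapToDouble(FileStats::replicationFactor).max().orElse(0);
        return Math.min((1 + alpha) * rfSystem, maxRf);
    }

    /** Replication is triggered for every file with RF_i > TH. */
    static List<FileStats> filesToReplicate(List<FileStats> files, double rfSystem, double alpha) {
        double th = threshold(files, rfSystem, alpha);
        return files.stream()
                .filter(f -> f.replicationFactor() > th)
                .collect(Collectors.toList());
    }
}
```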

3.2 Optimization of replicated data using IWD

IWD is a bio-inspired, multi-objective algorithm that mimics the movement of water drops over a river bed. Here the IWD algorithm is used for two operations: (i) selecting the optimal replica for a user request and (ii) removing replicas to manage the storage space. Figure 3 shows the data flow diagram of the IWD algorithm for the optimization of the replicas and of the storage space.

Fig. 3
figure 3

Data flow diagram of IWD for replica provision and removal process

Here we generate a graph containing the adjacency list of files for each storage. The IWD visits each file in the storage and provides the suitable one in order to complete the task. The underlying chunking mechanism by which files are split and stored can be of any type, using either fixed- or variable-size chunks. Each IWD is given the list of files, the number of replicas and their locations in the storage. Several factors are considered before selecting an appropriate replica. The steps in replica selection are described as follows.
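
A minimal sketch of the per-storage graph just described is given below; the representation (a map from chunk id to neighbouring chunk ids) and all names are assumptions made for illustration.

```java
// Sketch of the per-storage graph: each storage node keeps an adjacency list
// over its file chunks, which the IWDs traverse. Names are illustrative assumptions.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StorageGraph {
    // chunk id -> neighbouring chunk ids held on the same storage node
    private final Map<String, List<String>> adjacency = new HashMap<>();

    void addChunk(String chunkId) {
        adjacency.putIfAbsent(chunkId, new ArrayList<>());
    }

    void connect(String a, String b) {          // undirected edge between two chunks
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    List<String> neighbours(String chunkId) {   // path candidates for an IWD
        return adjacency.getOrDefault(chunkId, List.of());
    }
}
```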

3.2.1 Selection of replica

The IWD is used for the selection of the optimal replica according to the user request. The factors considered in the selection are the bandwidth, the number of user requests to a particular replica, the total number of accesses to the storage and the available storage space. Figure 4 shows how the IWD is used for replica selection.

Fig. 4
figure 4

Selection of replica using IWD Algorithm

  (a) Initialize the static parameters: the maximum number of iterations (itermax), the iteration count itercount (initially 0), the number of IWDs NIWD = Nc (the number of storage nodes), the network cost updating parameters (as, bs, cs), the initial network cost netcost(i) of each storage node and the initial bandwidth of the storage. The network cost includes the utilized bandwidth, the number of accesses to the storage and the available space.

  (b) Initialize the dynamic parameters: the visited list of replicas (Vc(IWD)), which is initially empty.

  (c) Creation of IWDs: create the IWDs and distribute them to the storage nodes.

  (d) Selection of path: update the visited list of the IWD (Vc(IWD)). The IWD chooses the storages that hold the files requested by the user and returns those storages in each iteration, using the graph list containing the adjacency list of files for each storage.

  (e) Selection of the total best: from the set of storages containing the requested files, the IWD chooses the storage with the minimum network cost, i.e. minimum utilized bandwidth, minimum number of accesses and maximum available space. The network cost is calculated as in Eq. 4.

    $$netcost(i) = \frac{bw}{{a_{s} }} + \frac{access}{{b_{s} }} + \frac{space}{{c_{s} }}$$
    (4)

    where netcost(i) is the network cost of storage i; bw, access and space are the utilized bandwidth, number of accesses and available storage space of storage i, respectively; and as, bs, cs are the network cost updating parameters. The network cost is updated for each user request, and the location of the user is also taken into consideration before the algorithm is applied. Here, the bandwidth is given higher precedence than the number of accesses and the available space, so the network cost is dominated by the consumed bandwidth; if the priority changes, the corresponding parameters are assigned higher weights accordingly. A code sketch of this cost computation and of the best-replica selection in Eqs. 5 and 6 appears after this list.

  (f) The iteration best and total best: the storages containing the requested replica are found in each iteration and returned as the iteration best. From these, the IWD selects the optimal replica as the total best, as in Eqs. 5 and 6.

    $$T^{IB} = \text{storage containing the requested file}$$
    (5)
    $$T^{TB} = \begin{cases} \min \left( netcost(i) \right) & \text{if multiple storages contain the file} \\ T^{IB} & \text{otherwise} \end{cases}$$
    (6)

    where TTB is the total best solution and TIB is the iteration best.
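
The following sketch, referred to in step (e), puts Eqs. 4–6 together: it computes the network cost of each candidate storage and picks the total-best storage for a requested file. Class and method names, and the assumption that the candidate storages are already known from the graph traversal, are illustrative only.

```java
// Sketch of Eq. (4) and the total-best selection of Eqs. (5)-(6); names are
// illustrative assumptions.
import java.util.Comparator;
import java.util.List;

class StorageNode {
    String id;
    double utilizedBandwidth;   // bw
    double accessCount;         // access
    double availableSpace;      // space (taken literally from Eq. (4))

    /** Eq. (4): netcost(i) = bw/a_s + access/b_s + space/c_s. */
    double netCost(double aS, double bS, double cS) {
        return utilizedBandwidth / aS + accessCount / bS + availableSpace / cS;
    }
}

class ReplicaSelector {
    /**
     * Eqs. (5)-(6): if several storages hold the requested file, the total best
     * is the one with the minimum network cost; otherwise the single iteration
     * best is returned directly.
     */
    static StorageNode totalBest(List<StorageNode> storagesWithFile,
                                 double aS, double bS, double cS) {
        if (storagesWithFile.size() == 1) {
            return storagesWithFile.get(0);          // T^TB = T^IB
        }
        return storagesWithFile.stream()
                .min(Comparator.comparingDouble(s -> s.netCost(aS, bS, cS)))
                .orElseThrow();                       // T^TB = argmin netcost(i)
    }
}
```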

3.2.2 Removal of replica

If more replicas are created, the available storage space shrinks. In order to utilize the available space efficiently, the IWD is used to identify replicas with low popularity or few accesses, which can then be deleted. User requests can be redirected to the other replicas, and the process is carried out periodically. The working mechanism of the IWD for the removal of replicas is shown in Fig. 5.

Fig. 5
figure 5

Removal of replica using IWD Algorithm

  (g) The static and dynamic parameters are initialized, the IWDs are created and distributed, and the visited list of each IWD is updated as in the replica selection process.

  (h) The IWD chooses the storage with the maximum network cost, since it is the storage with the highest bandwidth utilization, the most accesses and the least available space. Here the network cost is calculated giving priority to the storage space, followed by the available bandwidth and then the number of accesses; this ordering can be altered depending on the requirement and the objective of the problem.

  (i) Based upon the result of the IWD, the storage is selected and each file in that storage is visited by the IWD with the help of the graph list.

  (j) The IWD returns the replica to be deleted considering the following conditions:

    (i) the replica with no or very few accesses;

    (ii) the replica occupying more space.

  (k) Here the iteration best is the storage with the maximum network cost, and the total best is the replica returned for deletion. A minimum of two replicas is maintained, and user requests for the replica to be deleted are redirected to the other replicas to balance the load on the cloud storage. The deletion process is initiated whenever the available storage is low, followed in priority by high bandwidth utilization and then the total number of accesses to the storage. The proposed strategy automatically deletes unwanted replicas across the whole storage and frees storage space, thereby improving the capacity of the storage to accommodate more data. A minimal sketch of this removal decision follows.
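
As mentioned in step (k), the sketch below illustrates the removal decision: the storage with the maximum network cost is selected, and within it the replica with the fewest accesses (ties broken by larger size) is returned for deletion, provided at least two replicas of the file remain. The ReplicaInfo type, the map-based inputs and all names are assumptions introduced for illustration.

```java
// Sketch of the replica-removal decision; all names are illustrative assumptions.
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ReplicaInfo {
    String fileId;
    long accessCount;
    double sizeMB;
    int totalReplicasOfFile;   // replicas of this file across all storages
}

class ReplicaRemover {
    /**
     * storagesByCost maps each storage id to its network cost (Eq. 4, with the
     * priority shifted towards space for removal); replicasByStorage lists the
     * replicas each storage holds. The most costly storage is cleaned up first,
     * and within it the least-accessed replica is chosen, keeping at least two
     * replicas of every file.
     */
    static Optional<ReplicaInfo> replicaToDelete(Map<String, Double> storagesByCost,
                                                 Map<String, List<ReplicaInfo>> replicasByStorage) {
        String target = storagesByCost.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();                               // most loaded storage
        return replicasByStorage.getOrDefault(target, List.of()).stream()
                .filter(r -> r.totalReplicasOfFile > 2)       // never drop below two replicas
                .min(Comparator.comparingLong((ReplicaInfo r) -> r.accessCount)
                        .thenComparing(r -> -r.sizeMB));      // prefer larger replicas
    }
}
```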

4 Experimental results and discussions

CloudSim has been used for simulating and evaluating the proposed system. CloudSim is a Java-based simulator that provides many classes for modeling and simulating cloud systems; new classes can easily be developed or inherited from the existing ones. The simulation runs on a machine with an Intel Xeon E5-2630L (3.1 GHz) processor, 16 GB RAM and a 1 TB hard disk as the master node, as in Hadoop distributed storage. Several machines with Intel Core i5 380 (3.1 GHz) processors, 4 GB RAM and 500 GB hard disks each are used as slave nodes. Searching and retrieval of data in Hadoop is compared with our model with respect to access time (Lee et al. 2015). We compared our environment with the Hadoop distributed cloud storage and tabulated the results. We also compared our method with several optimization techniques, such as particle swarm optimization (PSO) and the genetic algorithm (GA) (Lizhen et al. 2018). In doing so, we allocated six storage spaces with capacities varying in the range [50, 100] GB. Different types of files, such as text, audio, image and video, are divided into fixed-size chunks and distributed among the storages. Initially there is one replica for each set of files, as opposed to the default replica count of three, which increases redundancy and occupies three times the storage space. The environment was tested with 1000 different inputs requesting various files to complete the task.

For each file, the number of accesses, the number of replicas already present, the popularity degree and the replication factor are calculated using Eqs. 1 and 2. Figure 6 shows the number of accesses, the number of replicas initially present, the popularity of each file based on its number of accesses and the replication factor.

Fig. 6
figure 6

Replication factor of each file

Based on the threshold calculated in Eq. 3, the files to be replicated are chosen. Figure 7 shows these files and the number of new replicas to be created.

Fig. 7
figure 7

Determine the number of new replica

After the replication of the files is done, the user request is satisfied by selecting the optimal replica with the help of the IWD. To do so, an adjacency list is created for each file, as shown in Fig. 8.

Fig. 8
figure 8

Adjacency list representation of random graph for sample storage

The vertices represent the fixed-size chunks of each file. The path to reach a particular chunk is found in the random graph generated for each storage. This helps to identify the replica and execute the IWD algorithm to properly manage the replicas in the storage. Table 1 shows the storage list containing the requested files. The optimal replica is chosen based on the network cost of the storage and the number of accesses. Consider a user request for the files [g, b]. File g has two replicas, present in storage_0 and storage_2; the request is satisfied from storage_2, since its network cost is lower than that of storage_0. The network cost is calculated as in Algorithm 2.

Table 1 Sample storage list and network cost of each file

After the selection of the optimal replica, the management of the storage is performed. The replica with the minimum number of accesses is selected and deleted from the storage with the least available space using Algorithm 5. Table 2 shows the file to be deleted and the corresponding storage from which it is deleted.

Table 2 Illustration to select optimal storage for deletion

Upon calculating the network cost of the files, our D2R-IWD algorithm selects the appropriate replica, initiates the deletion process and redirects the existing users to other replicas to balance the load on the storage. In the above case, file b is selected for deletion from storage_2 and the replica catalog is updated. The proposed IWD algorithm runs in the background on all the nodes and optimally removes unnecessary replicas from the storage.

4.1 Test case 1: comparison of replication strategy: D2R-IWD versus Hadoop distributed storage, PSO and GA

The performance of D2R-IWD is compared with the traditional replication strategy adopted in Hadoop, a distributed environment, for provisioning the requested file to the user. We have also compared our results with popular optimization techniques such as PSO and GA. Here, we refer to the storage and retrieval of files across Hadoop as the task. The number of tasks is varied from 50 to 1000 and the comparison is made in terms of access time (Bai et al. 2016). Figure 9 shows the performance comparison of the replication strategy of D2R-IWD with the other techniques. As the number of tasks accessing the same file increases, the popularity of the file increases. D2R-IWD works well at peak loads, although we note that the improvement is smaller under minimum-load conditions with accesses to different files. With the proposed algorithm, the access time of the storage decreases as the number of tasks increases. In the initial periods, the proposed D2R-IWD algorithm has a slightly higher access time than the traditional replication strategy; for 50 tasks the access time is around 1750 ms.

Fig. 9
figure 9

Comparison of Replication strategy w.r.t. storage and optimization techniques

When the number of tasks is increased to 100, the access time decreases steeply and falls below 500 ms for larger numbers of tasks. The access time is drastically reduced irrespective of the number of tasks. Hence, while the access time is inversely related to the number of user requests initially, there is a large improvement in the performance of the replication process for a large number of file accesses: a 16% improvement over PSO, a 20% improvement over GA and nearly a twofold improvement over Hadoop. From Fig. 9, it is inferred that our D2R-IWD provides the minimum access time as the number of tasks increases and converges faster than PSO and GA.

4.2 Test case 2: multiple users requesting single file

Here the response time in accessing a file from storage was tested. The number of users is increased, with multiple users requesting the same file. There is a large increase in the load of accessing the file, which places a burden on the servers on which the file is replicated. There is also a possibility of higher response times in accessing the file, thereby affecting the performance of the cloud. Figure 10 compares the proposed D2R-IWD algorithm with Hadoop in terms of storage, and with PSO and GA in terms of algorithm efficiency. As the number of users increases, the response time for accessing the file from storage remains relatively lower than with the traditional cloud storage. There is a large improvement in the performance of the system when our algorithm is used for replication. To illustrate this improvement, for every increase of 50 users accessing the same file, the algorithm behaves close to the existing distributed storage model until it reaches 150 users. From 200 users onwards, every increase in the number of users brings a proportional decrease in response time, up to the maximum of 550 users. However, at around 400 users accessing the single file from storage, there is an increase in response time for both methods.

Fig. 10
figure 10

Response time w.r.t. multiple users requesting single file

Even though at some points both behave similarly, our methodology proves better than PSO and GA in minimizing the response time for accessing the file from storage, irrespective of the increased number of users. Overall, we observe a 22% improvement in response time over PSO, a 27% improvement over GA and nearly a twofold improvement over Hadoop.

4.3 Test case 3: multiple users requesting different files with increased number of users

A total of six storage spaces, each of 70 GB capacity, were created on the master and slave nodes. Each storage contains different chunks of various files of different sizes. Initially the chunks are stored randomly across the storages, and we compared our replication algorithm with Hadoop distributed storage, PSO and GA. In D2R-IWD, after the number of replicas is decided, the files are stored based on the network cost, whereas in the traditional methodology the number of replicas is fixed and the storage is selected randomly. The graph in Fig. 11 shows the comparison of access time plotted against the increasing number of users requesting different files. Here the number of users is increased in steps of 50, and the access time varies according to the user requests for different files of varying size from the storage.

Fig. 11
figure 11

Access time w.r.t. increased number of users requesting different files

As the graph shows, a steady state is maintained when the number of users is increased to 50 in both cases; from there on, the access time decreases markedly as more users access different files from storage. The average access time of the cloud storage without an optimization algorithm is 342 ms as user accesses increase, whereas the average access time with our D2R-IWD algorithm is 147 ms. Hence, irrespective of the chunking mechanism used, our algorithm reduces the access time of the data, with an 18% improvement over PSO, a 21% improvement over GA and nearly a twofold improvement over Hadoop. Both of these optimization algorithms eventually align with our proposed approach but take longer to converge to comparable values.

4.4 Test case 4: performance comparison: optimizing the cloud storage space

We also examined how to reduce the storage space consumed by the same set of files and optimize the storage while providing the same or better performance to the end users. The proposed algorithm keeps executing in the background of the storage, checks the usage of files and their properties, and decides on maintaining the replicas and optimizing the storage.

In general, once the replicas are defined, they do not change unless the administrator intervenes. Our D2R-IWD algorithm runs continuously in the background; files are traced according to how frequently they are accessed and maintained in lists that are repeatedly updated over a specific time frame. Three lists of file accesses are maintained: frequently accessed, less frequently accessed and not accessed for a long time. This helps reduce the number of replicas and optimally frees storage space.

Figure 12 shows the comparison of D2R-IWD with the distributed storage in terms of available free space. We also compared our algorithm with standard optimization techniques such as PSO and GA. Here, the number of files to be stored is increased over time, relative to their size, and the consumed storage space is compared. The total space consumed in cloud storage without our proposed strategy averages 3000 MB as the number of stored files increases, whereas the average consumed space with our D2R-IWD algorithm is 1650 MB, an improvement of almost 40% in free storage space. PSO and GA manage the storage space moderately well but take longer to reach an optimal position, unlike the proposed approach. The graph therefore clearly shows that D2R-IWD manages the storage space more efficiently than Hadoop, whose consumed space is almost double. Furthermore, in terms of the optimization algorithms, D2R-IWD yields 24% more free storage space than PSO and 28% more than GA.

Fig. 12
figure 12

Consumed storage space w.r.t. number of files

5 Conclusion and future work

In a cloud computing environment, failures are the norm rather than the exception. Hence data availability, fault tolerance and access efficiency play a major role in the performance of the cloud. However, data redundancy predominantly affects the storage space, and the existing replication strategy arbitrarily fixes the number of replicas for each and every file. In this paper, a novel dynamic data replication strategy with an intelligent water drop algorithm is therefore presented, which considers parameters such as bandwidth, number of user accesses, available storage space and traffic. This paper makes two main contributions. First, the dynamic data replication strategy replicates data based on the number of accesses, i.e. the popularity of the data. Second, the intelligent water drop algorithm helps in the selection of the optimal replica based on the bandwidth, traffic and number of user requests; it is also used for the management of the cloud storage by selecting replicas for removal, considering the number of accesses and the storage space each replica occupies. We tested our algorithm on a storage setup similar to Hadoop and analyzed the access time for storage and retrieval of data. We observed an increase of almost 40% in the free storage space. We also compared our approach with standard optimization algorithms such as PSO and GA on the replication strategy and found a 16% improvement over PSO and a 20% improvement over GA. Furthermore, a detailed comparison for accessing the same data under variable loads showed a 22% overall improvement in response time over PSO and 27% over GA, and for accessing different data there is a substantial improvement in access time: 18% over PSO and 21% over GA. We evaluated the approach with four different test cases to improve access efficiency and obtained promising results while preserving the high availability, performance and load balancing of the cloud storage.

For future work, the energy consumption of the storage nodes can be considered, and several other QoS attributes can be optimized with respect to cloud storage. Grouping of files with respect to storage can be explored to minimize the access time, and optimizing the queries used to retrieve data from storage remains a challenging issue. Several other prediction mechanisms can be applied and verified to increase the accuracy of the selection process, and the proposed model can be tested and incorporated in a real cloud environment.