1 Introduction

In some scientific application areas, such as earth observation, huge amounts of data have become a significant part of the shared resources. Such large-scale datasets are usually distributed across different data centers. The data replication technique is commonly applied to manage large data in a distributed manner: a replication algorithm stores multiple replicas to achieve efficient and fault-tolerant data access in the system (Mansouri et al. 2019; Dinesh Reddy et al. 2019). Although data management has been investigated previously (Mansouri 2014; Alghamdi et al. 2017; Pitchai et al. 2019; Mansouri and Javidi 2018a), very few of the available algorithms take a holistic view of the different costs and benefits of replication. Many of them adopt replication merely to enhance data availability and efficiency, metrics that improve as the number of copies in the system increases. The critical fact they ignore is that data replication also incurs energy consumption and financial costs for the provider. Consequently, a data replication technique that balances a variety of trade-offs is necessary. In recent years, optimization has become a booming field for determining optimal solutions to complex problems. Meta-heuristics are among the most economical techniques for solving optimization problems because of their (1) simplicity of implementation, (2) ability to avoid local optima, and (3) problem independence. Consequently, researchers have focused on meta-heuristic techniques to solve replication questions (Nanda and Panda 2014; El-Henawy and Abdelmegeed 2018).

The main contributions of this review are:

  • Presenting the main challenges of data replication algorithms in the distributed environments.

  • Comparing the meta-heuristic-based data replication techniques according to attributes such as security, energy consumption, bandwidth consumption, performance, storage usage, and response time.

  • Implementing a data replication algorithm based on several meta-heuristic techniques (i.e., GA, ACO, FA, BA, PSO, WOA, GWO, and ABC) to investigate the impact of the meta-heuristic technique on the simulation results.

  • Identifying some open research problems in the field of data replication and management in cloud and grid environment.

The rest of the survey is structured as follows. In Sect. 2, a background of cloud computing is presented. Section 3 covers the classification of data replication and the main challenges in data management. Section 4 introduces the main concepts of meta-heuristic techniques. Section 5 presents some challenges motivating the use of meta-heuristics. Section 6 discusses meta-heuristic-based replication algorithms in grid and cloud environments. Section 7 summarizes the reviewing results. Section 8 evaluates the performance of data replication algorithms based on different meta-heuristic techniques. Section 9 presents some of the main research challenges and future works. Finally, the conclusion is given in Sect. 10.

2 Background of cloud computing

In this section, we first introduce the historical evolution of cloud computing and then present the main technologies of the cloud environment. Finally, the service models of the cloud are compared.

2.1 History and emergence of cloud computing

In cloud computing, the term “cloud” is used as a metaphor for “the Internet,” so the phrase cloud computing means a type of Internet-based computing. Another reason for representing network infrastructures by an iconized “cloud” is to hide the complexity of the facilities from users (Shojaiemehr et al. 2018). The additional terms that accompany “cloud” indicate the scope of that “cloud,” for instance, mobile computing and networking. The historical evolution of cloud computing from the 1960s to 2011 is presented in Table 1 (Moura and Hutchison 2016; Nadh Singh and Raja Srinivasa Reddy 2017).

Table 1 Historical evolution of cloud environment

It is obvious that the evolution of cloud computing is currently tied to the increasing popularity of big data. Related research areas such as grids (Mansouri et al. 2011), the Internet of Things (IoT), and network functions virtualization (NFV) for future networks also have a strong relation to cloud computing.

Various review works on data replication can be found in the literature, as presented in Table 2. The novelty of the work presented here, in relation to other surveys, is its focus on meta-heuristic techniques for data replication in data grid and cloud computing environments.

Table 2 Data replication-surveyed contributions

2.2 Foundations of cloud computing

The main foundation elements of cloud computing are virtualization, distributed computing (e.g., grids), the Internet, and network management (Mansouri and Javidi 2019). Figure 1 indicates the common elements of the cloud environment. Virtualization plays a central role in cloud computing and offers significant benefits such as maximized resource usage, flexibility, availability, scalability, hardware utilization, load balancing, and security. In addition, a cloud computing system needs reliable access, management automation, and self-service provisioning. The major complexity of a cloud system lies in management automation, which automatically optimizes resource usage and adapts to the dynamic demands of users and the system status (Michael et al. 2010).

Fig. 1
figure 1

Main features of cloud computing

2.3 Cloud computing service models

Cloud services have three specific features that differentiate them from traditional hosting: (1) on-demand service, which means resources are made available to the user on an “as-needed” basis; (2) rapid elasticity, which refers to the ability of a system to adapt to workload changes by increasing or decreasing the amount of system capacity (e.g., CPU, storage, and input/output bandwidth); and (3) self-service, which means customers are able to provision cloud computing resources without interaction with service providers (Aznoli and Jafari Navimipour 2017). Generally, there are three service models that describe cloud services (see Fig. 2): Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

Fig. 2
figure 2

Service models in cloud

2.4 Cloud architecture

Cloud computing is not a completely new concept; it is based on several other computing areas such as utility computing and cluster computing. Table 3 gives an end-to-end comparison in various perspectives (Hashemi and Khatibi Bardsiri 2012).

Table 3 Comparisons of grid and cloud

Foster et al. (2008) presented the four-layer structure of cloud computing in comparison with the traditional five-layer grid architecture (see Fig. 3).

Fig. 3
figure 3

Cloud architecture

  • The main responsibility of fabric layer is managing low-level resources such as computing units, storage, and network bandwidths.

  • The platform layer presents a development and running platform for cloud workflows and controls middleware services, scheduling service, programming languages, and so on.

  • The unified resource layer is composed of both software services and hardware services that are necessary for executing cloud applications.

  • The final layer consists of cloud workflows like social networking tools that operate within virtual environment.

All resources, such as storage, are shared among the tenants in cloud environments, so an elastic management of these resources satisfies both the tenant requests and the provider's profit goals. In addition, quality of service (QoS) requirements are very critical given the dynamic nature of the cloud environment, and in this regard a service-level agreement (SLA), a legal contract between the tenant and the provider, is defined (Limam et al. 2019). A general overview of the SLA is presented in Fig. 4 (Aljoumah et al. 2015).

Fig. 4
figure 4

SLA main concepts

3 Data replication

Nowadays, large amounts of data are being created and processed in many scientific and engineering fields such as bioinformatics, earth science, and astronomy. In dynamic and distributed environments such as cloud and grid, many challenges revolve around data management and data transfer. Data replication is a general technique to overcome these challenges and enhance system performance and reliability (Mansouri 2016; Tos et al. 2018; Mansouri and Dastghaibyfard 2013; Boru et al. 2015). Mansouri and Buyya (2018) discussed the vital role of obtaining the cheapest network and storage resources at a suitable time, considering the various prices of storage types and time-varying workloads. To optimize the costs of replica creation, storage, and potential migration, they investigated the following main questions: (1) which storage type should host the replica (i.e., placement), (2) which replica should be selected for task execution, and (3) when a replica should be migrated from one storage type to another.

Generally, data replication algorithms are categorized into two types, static replication and dynamic replication, as shown in Fig. 5. Static replication algorithms predetermine the location of replicas; therefore, no further copies are generated or adjusted at run time. One of the main benefits of the static replication strategy is its low overhead during execution (Mansouri et al. 2017). A dynamic replication algorithm, in contrast, generates and removes replicas at run time based on the access pattern and system status. Dynamic replication algorithms are suitable for dynamic environments like the cloud. Nevertheless, the frequent transfer of huge data that dynamic methods entail puts a strain on network resources. Hence, a well-defined data replication algorithm that creates replicas at the appropriate time and location is essential to avoid unnecessary replication.

Fig. 5
figure 5

Types of replication process

Static and dynamic replication techniques can be further categorized as distributed or centralized strategies (Sun et al. 2012; Mansouri and Javidi 2018b). In centralized replication techniques, a central element manages all aspects of data replication; all necessary information is either collected or distributed by this central element. The main problem of the centralized method is that if the nodes of the distributed system frequently join and leave, the load on the central decision element grows significantly. In the decentralized method, on the other hand, no central element keeps the system information, so the nodes themselves decide about data replication. The main issue of the decentralized manner is the synchronization process. Hence, designing a data replication algorithm that adapts to the environment and to user requests becomes a critical issue, especially as the datasets of interest tend to be large.

3.1 Main challenges in data replication

Data replication process reveals the following challenges:

  • Replica decision Which data should be replicated to meet the requirements of users for reducing waiting time and data access time? It is not necessary to create replicas for all files, particularly for nonpopular files.

  • Number of replicas How many suitable new replicas should be generated to provide the reasonable system availability? When the number of new replicas is increased, then cost of the system maintenance will significantly increase. Moreover, too many replicas may not enhance system availability, but lead to the waste of resources in the network.

  • Replica placement Where should the new replicas be stored to reduce job execution time and network latency? It is hard to select optimal locations such that the system workload is balanced.

  • Replica selection Which replicas should be selected to access the required data for job execution? One of the vital parameters in replica selection is response time.

  • Replica replacement Which replicas should be deleted to provide sufficient storage space for a new replica? Usually, the algorithm must weigh the benefit of storing the new replica against the cost of deleting existing files (see the sketch after this list).

  • Replica consistency When a replica changes, how can the other copies and the original file quickly be made consistent? It is necessary to ensure a synchronized update whenever any replica is modified.
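As a simple illustration of the replacement decision above, the following Python sketch weighs the estimated value of a candidate replica against the value of the least valuable files that would have to be evicted. It is a minimal sketch with hypothetical helper names (candidate.size, value(f)); it is not the replacement rule of any surveyed algorithm.

```python
def should_replace(candidate, stored_files, free_space, value):
    """Decide whether to evict files to make room for `candidate`.

    `value(f)` is an assumed scoring function, e.g., expected future
    accesses weighted by the transfer cost saved by keeping f locally.
    """
    need = candidate.size - free_space
    if need <= 0:
        return True, []                 # enough free space, nothing to evict
    victims, reclaimed = [], 0
    for f in sorted(stored_files, key=value):   # cheapest files first
        victims.append(f)
        reclaimed += f.size
        if reclaimed >= need:
            break
    if reclaimed < need:
        return False, []                # cannot free enough space
    # Replicate only if the newcomer is worth more than everything evicted.
    return value(candidate) > sum(value(f) for f in victims), victims
```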

One of the main factors that plays a major role in the replication process is the architecture of the environment. Generally, there are four architectures: (1) multitier, (2) hybrid, (3) P2P, and (4) general graph, as shown in Fig. 6 (Tos et al. 2015). In Tos et al. (2015), analysis of simulation results based on response times showed that data replication algorithms performed best in the general graph architecture.

Fig. 6
figure 6

Architectures considered in evaluations. a Multitier, b hybrid, c P2P, d graph (Tos et al. 2015)

4 Meta-heuristic techniques

Over the last 30 years, approximate techniques have been introduced that combine standard heuristic strategies with higher-level frameworks to explore a search space effectively. These techniques are nowadays named meta-heuristics. Meta-heuristic techniques have been widely covered in the literature (i.e., algorithms, applications, comparisons, and analyses) due to their simplicity and flexibility (Singh Kushwah et al. 2018; Ebrahimzade et al. 2020; Mirzai et al. 2017; Farzampour et al. 2019; Ebrahimzade et al. 2018). Four advantages of meta-heuristic algorithms are as follows:

  1. (1)

    Broad applicability Meta-heuristic algorithms are not problem specific and can be applied to any problem that can be formulated as function optimization. Hussain et al. (2019) ranked the fields in which meta-heuristic techniques are used by the number of publications during the period 1983–2016. Table 4 indicates that meta-heuristic algorithms are mostly applied to numerical problems (e.g., continuous and discrete, constrained and unconstrained) and data mining (e.g., optimization in classification, prediction, and clustering). Other applications among the top ten fields are communications (e.g., networking and telecommunication), transportation (e.g., routing), engineering (e.g., mechanical designs), civil engineering (e.g., bridge design), and information and communications technology—ICT (e.g., cloud and grid computing).

    Table 4 Applications of meta-heuristic algorithms
  2. (2)

    Hybridization Meta-heuristic algorithms can be combined with traditional approaches. Raidl and Roli (Dokeroglu et al. 2019) presented a review on the strategies of hybrid meta-heuristics and indicated that combining various algorithms is one of the most successful techniques for optimization problem. However, the process of designing hybrid meta-heuristics is rather complex and needs knowledge about a broad spectrum of algorithmic concepts, programming, and data structures.

  3. (3)

    Ease of implementation They are simple to develop and understand. Meta-heuristic algorithms can solve problems with different objective functions with minor changes in codes.

  4. (4)

    Efficiency They can solve large problems in acceptable time. In other words, they lead to a significant reduction in software development time.

Figure 7 indicates an outline of how a conceptual meta-heuristic algorithm works (Tsai and Rodrigues 2014). It consists of initialization, transition, evaluation, and determination operators that are performed over a number of iterations. Firstly, it creates initial solutions randomly or with greedy and deterministic strategies. Secondly, it performs a transition operator to modify the current solution while keeping good partial subsolutions. Thirdly, it uses the evaluation operator to score the solution according to the value of the objective function of the optimization problem. Fourthly, it applies the determination operator to guide the search direction toward better solutions; the performance of a meta-heuristic algorithm is controlled by this operator. For example, an intensification-oriented search converges very quickly but can easily become trapped in a local optimum in the initial iterations. On the other hand, a diversification-oriented search will not converge to a particular solution very quickly, but the quality of the final solution will be good.
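These four operators can be summarized in a short, generic Python sketch. It is a minimal illustration only: the objective, random_solution, and transition arguments are assumed problem-specific callables and are not tied to any particular algorithm discussed in this survey.

```python
def metaheuristic(objective, random_solution, transition,
                  pop_size=30, iterations=100):
    """Generic skeleton: initialization -> transition -> evaluation -> determination."""
    # Initialization: random (or greedy/deterministic) starting solutions.
    population = [random_solution() for _ in range(pop_size)]
    best = min(population, key=objective)
    for _ in range(iterations):
        # Transition: perturb current solutions, keeping good subsolutions.
        candidates = [transition(s) for s in population]
        # Evaluation: score every solution with the objective function.
        ranked = sorted(population + candidates, key=objective)
        # Determination: keep the most promising solutions for the next round.
        population = ranked[:pop_size]
        best = min(best, population[0], key=objective)
    return best
```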

Fig. 7
figure 7

Overall framework of meta-heuristic algorithm

Figure 8 shows an example that illustrates how meta-heuristic algorithms work in general (Tsai et al. 2015). Suppose the population size is two and each solution has four subsolutions. Firstly, two solutions are randomly generated. Then, the solutions are transformed by the transition process. After that, the evaluation operator evaluates the fitness of the solutions, and the determination operator selects suitable solutions for the next iteration. Finally, after the termination criterion is satisfied, the final solution obtained is taken as an approximate or optimal solution.

Fig. 8
figure 8

Example of meta-heuristic procedure

Most of the state-of-the-art classical meta-heuristic algorithms, such as genetic algorithms (GA) (Goldberg and Holland 1988), particle swarm optimization (PSO) (Eberhart and Kennedy 1995), and ant colony optimization (ACO) (Dorigo et al. 2006), were introduced before the year 2000. Nevertheless, new evolutionary algorithms such as artificial bee colony (ABC) (Basturk and Karaboga 2006), the firefly algorithm (FA) (Yang 2009), the whale optimization algorithm (WOA) (Mafarja and Mirjalili 2018), the bat algorithm (BA) (Yang 2010), and gray wolf optimization (GWO) (Mirjalili et al. 2014) have also been developed successfully in the last two decades. In many cases, these new meta-heuristic algorithms achieve the best solutions for some benchmark problems. Figure 9 represents the number of publications found on the Google Scholar web site (in May 2019) for some classical and new-generation meta-heuristic algorithms. We can see that GA and ABC have the largest numbers of papers.

Fig. 9
figure 9

The number of papers for the classical and new meta-heuristics

Cloud computing has become a buzzword in the field of high-performance distributed systems because of its unique features such as on-demand self-service, compatibility, and elasticity. To realize its full advantages, much research is necessary across broad areas (Mansouri et al. 2019). One of the main challenges that must be addressed for efficient performance is data replication. Meta-heuristic-based algorithms can determine near-optimal solutions in reasonable time for such hard problems.

5 Motivation

Distributed computing environments have undergone several changes. The emergence of clusters was the earliest change, connecting sets of loosely or tightly coupled systems to work together. Grid computing was then developed to overcome the limitation of cluster systems being able to use only local resources; a grid therefore combines all the available heterogeneous resources of geographically distributed systems. Cloud computing is now used, leveraging the strengths of cluster and grid systems (Foster et al. 2008). Generally, the visions for cloud and grid are similar, and they struggle with many of the same issues; nevertheless, the details and technologies used may differ. Data replication is one of the common techniques applied in distributed systems such as P2P and grid systems. In such systems, a data replication algorithm must specify what to replicate, when to create or remove replicas, where to store them, and how many replicas to store (Ranganathan and Foster 2001). However, most of the data replication algorithms developed for the above environments are difficult to adapt to clouds, since they focus on improving system performance without considering cloud goals such as economic cost. In other words, data replication algorithms for grid environments usually try to deliver low response time, and the most significant point they neglect is the profit of cloud providers (Mokadem and Hameurlain 2020). In fact, the performance of these systems improves as the number of replicas increases. However, creating as many replicas as possible in cloud environments is not economically feasible, since data replication does not come free: it can waste resources (i.e., storage and energy) and thus reduce the provider's profit (Long et al. 2014). Therefore, designing data replication techniques for the cloud should also take into account other points, such as (1) ensuring a reliable quality of service (QoS) according to the service-level agreement (SLA) (Mokadem and Hameurlain 2020) and (2) adjusting cloud resources based on the 'pay-as-you-go' pricing model (Pitchai et al. 2019). It is obvious that proposing a dynamic data replication algorithm for the cloud involves balancing a variety of trade-offs. For example, the response time decreases as the number of replicas increases; nevertheless, the number of replicas should be as small as possible to minimize energy consumption.

While the fundamental aspects of data replication remain unchanged, various optimization objectives have ballooned the replication literature in the past decade. Because of the new dynamic properties of the cloud (e.g., diversity of user requests, dynamic load on data centers, and changing resource capacity), traditional data replication cannot work effectively in cloud computing. Users use the cloud in a dynamic manner, so resources must be automatically allocated and de-allocated to maximize agility. With the number of cloud users and the size of files increasing every day, data replication becomes an imperative issue to manage.

Performance guarantees (e.g., response time) are often not considered part of the SLA by cloud providers because of the heterogeneous workloads in the cloud environment. For instance, Google Cloud SQL only offers error guarantees without a response time guarantee (Mansouri and Javidi 2018a). So the “tenant performance/provider profitability” trade-off should be solved by data replication algorithms in the cloud, especially when they are proposed for OLAP (online analytical processing) applications. As a consequence, a data replication algorithm seeks near-optimal solutions by balancing the trade-offs among several optimization objectives [e.g., availability (Wei et al. 2010), reliability (Li et al. 2017), low latency (Ma and Yang 2017), data durability (Liu et al. 2018), security (Ali et al. 2018a), energy efficiency (Séguéla et al. 2019), load balancing (Long et al. 2014), and cost (Limam et al. 2019)].

Meta-heuristic techniques are preferred to deal effectively with the complexity of data replication in grid and cloud environments, and accordingly several algorithms have been introduced. Meta-heuristics apply iterative processes to find solutions in a reasonable time and can provide better results than traditional methods (Shojaiemehr et al. 2018). Consequently, meta-heuristic techniques play a vital role in the data replication problem in distributed environments. To the best of our knowledge, despite the importance of meta-heuristic-based data replication algorithms, there is no detailed and comprehensive review of these techniques. The aim of this study is to review the available strategies and compare the differences among the described algorithms.

6 Meta-heuristic-based data replication algorithms

In this section, we classify the dynamic replication algorithms into different categories with respect to the meta-heuristic techniques they use. There are many performance criteria for cloud computing, so it is infeasible to cover all aspects in one paper; in this review, the scope is narrowed to the main parameters related to data-intensive systems. Table 5 summarizes meta-heuristic-based replication algorithms (from 2006 to 2019) for grid and cloud environments. The sign (+) shows that the replication algorithm considers the concerned option, the sign (−) indicates that it does not focus on that option, and the sign (NM) indicates that the option is not mentioned in the paper.

Table 5 Comparison of data replication algorithms

6.1 Data replication strategies based on genetic algorithm

The genetic algorithm (GA) is one of the most popular evolutionary techniques; it was introduced in the 1970s by John Henry Holland (1992). The standard GA is very generic, and many parts, such as the solution representation, selection process, crossover, and mutation operators, can be designed differently depending on the problem. Algorithm 1 indicates the pseudo-code of the genetic algorithm.

figure a
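As a concrete illustration of the loop in Algorithm 1, the following Python sketch shows a standard binary-encoded GA with tournament selection, one-point crossover, and bit-flip mutation. It is a minimal, generic version assuming a user-supplied fitness function to maximize; it is not the exact procedure of any surveyed strategy.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=40, generations=200,
                      crossover_rate=0.9, mutation_rate=0.01):
    """Standard GA: selection -> crossover -> mutation -> replacement."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament(k=3):
        # Pick the fittest of k randomly chosen individuals.
        return max(random.sample(pop, k), key=fitness)

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # One-point crossover.
            if random.random() < crossover_rate:
                cut = random.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Bit-flip mutation.
            for child in (c1, c2):
                for i in range(n_bits):
                    if random.random() < mutation_rate:
                        child[i] ^= 1
                children.append(child)
        pop = children[:pop_size]
    return max(pop, key=fitness)

# Example: maximize the number of ones in a 20-bit string.
best = genetic_algorithm(fitness=sum, n_bits=20)
```

For a replica placement problem, the chromosome bits would typically encode which sites hold a replica, as in the 0–1 encoding of Zhang et al. (2014) discussed later in this subsection.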

Cui et al. (2018) proposed a new replica placement strategy based on GA for cloud environment. The authors designed a tripartite graph with three kinds of vertices that represent jobs, replicas, and nodes. The mappings from jobs to replicas are indicated by the edges between job and replica vertices. This mapping is used in scheduling step. The mappings among replicas and nodes are represented by the edges between replica and node vertices. This mapping determines the locations of data replicas. The proposed strategy can determine the mappings with the best performance as a near-optimal solution by using genetic algorithm. The performance evaluation indicated that the proposed replica placement strategy has lower data transmission time than random data placement algorithm applied in Hadoop distributed file system (HDFS) (Shvachko et al. 2010).

Junfeng and Weiping (2016) presented a pheromone-based genetic algorithm for the replica selection strategy to enhance the performance of the cloud system. The proposed strategy determines the probabilistic information of a replica in the cloud by virtue of constant group overlapping, the realization of mutation operators, and feedback information. The genetic algorithm offers two main benefits in the replica selection step. The first advantage is that the genetic algorithm eventually finds the optimal solution. The second is that it is a distributed optimization technique and so can be adapted to the distributed environment. For fitness calculation, the proposed algorithm considers the network condition and transmission time. The simulation results demonstrated that the presented algorithm leads to good performance in terms of mean job execution time.

Al Jadaan et al. (2010) enhanced the replica selection process in the data grid to ensure user satisfaction. The proposed strategy considers the availability of downloading the necessary data from a node even if there are difficulties such as malfunctioning and network degradation. The authors combined various parameters (i.e., response time, availability, and security) that cannot easily be aggregated with each other in replica selection. To overcome this problem, they employed a GA-clustering algorithm. The best replica is then the one that provides good response time, availability, and a reasonable level of security at the same time. If more than one site shows the best possible combination, the proposed strategy randomly chooses one of them by default. The experimental results demonstrated that it is more secure than previous strategies such as the round-robin (RR) strategy, and at the same time more reliable and efficient.

Almomani and Madi (2014) presented a genetic-based replica placement algorithm to determine the best location for new replicas. The proposed algorithm with optimization technique has two main achievements. The first improvement is that it reduces data access time by considering the read cost of files. The second improvement is that it avoids network congestion by considering the load of sites. The fitness function is defined based on the read cost and storage cost. Therefore, it can determine the appropriate site for replica placement such that it satisfies the user requirement and resource provider.

Grace et al. (2014) used two meta-heuristic strategies (i.e., genetic and ant colony) to select appropriate location from many available sites. The main parameters in ant colony-based algorithm are response time and the size of file. In genetic-based algorithm, response time, data availability, security, and load balancing are considered. The simulation results with GridSim indicated that the efficiency of genetic algorithm is 30% more than the ant colony strategy.

Xu et al. (2015) introduced a novel data placement algorithm based on GA to improve data access in cloud environment. Firstly, the authors defined a mathematical model for data scheduling in cloud system and then used the fitness function based on the number of data access to compute the fitness of each individual in a population. They employed roulette-wheel selection technique for choosing the individuals with high fitness value. The experimental results indicated that genetic algorithm could find a near-optimal data placement matrix in an acceptable time and performance result is better than Monte Carlo strategy (Shijie et al. 2010).

Liu et al. (2017) presented a new replica placement strategy using GA for cloud computing. The proposed strategy considers two main facts. The first is that the dependency between initial files is important in reducing the data transmission time; it stores files with high dependency together, and so data movement among data centers is decreased. The second fact is that the transmission cost is related to the file size. In addition, it is quite probable that a task requires small input files but creates large output files; therefore, if the created files are needed by other tasks in different data centers, a large amount of data must be moved. In summary, after constructing the data interdependency matrix, the proposed strategy uses the genetic algorithm to determine the best replica location, using the total transmission time as the fitness function. The experimental results proved that the proposed algorithm could reduce the size of data movement compared to the K-means algorithm (Yuan et al. 2010).

Wu (2017) proposed a replica placement strategy from a cost-effective view of the cloud system. The author discussed the trade-off between a dataset's real replicas and pseudo-replicas (i.e., a replica that is not kept permanently on the nodes but is created by its predecessors immediately when necessary). The proposed strategy includes two main processes. First, it introduces a cost model for deciding the number of replicas and their places. The presented cost model comprises important factors such as storage cost, data transfer cost, and data computation cost to make an acceptable cost estimation. Then, it analyzes a cost–benefit concept for data management. Finally, it provides a real and pseudo-replica technique that minimizes the data management cost using the genetic algorithm. The evaluation results indicated that the proposed strategy is very cost-effective compared with the no replication (NREP) scheme, the random (RAND) strategy, and the minimum cost replica placement strategy (MCRP) (Wu 2016).

Chunlin et al. (2019) proposed a dynamic multiobjective optimized replication algorithm to improve system performance. The proposed algorithm takes into account file unavailability, the load of the data center, and the cost of network transmission. The load of a data center is determined based on CPU capability, memory, disk space, and network bandwidth. The algorithm then uses the fast nondominated sorting genetic algorithm to solve the multiobjective optimized replica placement problem. It adopts a binary coding scheme and determines the locations where replicas are stored and the number of replicas. Moreover, it performs replica consistency based on a reliability record table to guarantee data availability. The proposed algorithm applies a replica migration strategy for the file access hot spot problem that arises from burst requests. The experimental results showed that the proposed strategy could improve the usage rate of network resources in comparison with the dynamic replica adjustment strategy (DRAS) (Qu et al. 2016) and the ARDS algorithm.

Zhang et al. (2014) proposed a replica placement algorithm to improve user access time. The proposed scheme includes two phases. In the first phase, it determines the data center based on the P-center model (Rahman et al. 2008). In the second phase, it selects the specific storage server and disk array of the chosen data center based on the creation and deletion costs in the hash function. Finally, the optimal replica placement solution is determined by the genetic algorithm. It defines a 0–1 vector \( \langle x_{1}, x_{2}, \ldots, x_{n} \rangle \) to represent a solution of the genetic algorithm, where the element \( x_{j} \) indicates whether data center j is selected for replica placement or not. The experimental results indicated that the proposed algorithm could reduce the average access time.
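A small illustration of this 0–1 placement encoding is sketched below. The cost terms are hypothetical placeholders (an assumed per-center access time and creation cost), not the actual objective of Zhang et al. (2014); they only show how such a chromosome can be decoded and scored inside a GA like the one sketched earlier.

```python
def decode_placement(chromosome):
    """Map a 0-1 vector <x1, ..., xn> to the indices of selected data centers."""
    return [j for j, bit in enumerate(chromosome) if bit == 1]

def placement_cost(chromosome, access_time, creation_cost):
    """Toy objective: mean access time of selected centers plus creation cost."""
    selected = decode_placement(chromosome)
    if not selected:
        return float("inf")         # a valid placement selects at least one center
    avg_access = sum(access_time[j] for j in selected) / len(selected)
    return avg_access + sum(creation_cost[j] for j in selected)

# Example with three candidate data centers.
print(placement_cost([1, 0, 1],
                     access_time=[5.0, 9.0, 4.0],
                     creation_cost=[1.0, 2.0, 1.5]))
```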

Huang and Wu (2018) proposed a cost-effective replica placement algorithm for read-intensive and write-intensive transactional workload. The proposed algorithm takes into account data storage, updating cost, transmission time, and processing costs in cloud system and tries to determine the number of replicas. Then, it finds appropriate locations for new replicas. Moreover, it considers the user access paths to replicas and the number of servers in the cloud system for definition of objective function. The authors introduced a hybrid genetic algorithm (HGA) and presented a heuristic rule according to the data support degree to define the initial population for optimization. The data support degree is defined as the percentage of the number of requests managed by data center i, and then, the number of data replicas is determined based on the data support degree before creating the initial population. The simulation results indicated that the solution quality of HGA is close to the optimal solution under large-scale instances.

6.2 Data replication strategies based on ant colony algorithm

The ant colony optimization (ACO) algorithm is a meta-heuristic approach introduced by Dorigo in 1992 to solve hard problems like the traveling salesman problem (Dorigo 1992). Ant colony algorithms were inspired by observing how real ant colonies forage for food. Algorithm 2 indicates the pseudo-code of the ant colony algorithm.

Wang et al. (2013) developed a cost- and time-aware model for data replication that is appropriate for data-intensive service composition in cloud. For cost calculation, the proposed algorithm considers the usage-based model, the package-based, the flat fee subscription-based, and the combination-based pricing models. Then, it uses ant colony optimization to reduce the cost of data-intensive service solution based on the cost of replication and response time during replica selection process. The results of simulation proved that the presented strategy could solve replica selection problem efficiently.

figure b
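To make the construct/evaporate/deposit cycle of Algorithm 2 concrete, the following Python sketch applies ACO to a toy assignment of files to candidate sites. The cost matrix and parameter values are illustrative assumptions only (costs are assumed to be positive) and are not taken from any surveyed strategy.

```python
import random

def aco_assign(cost, ants=20, iterations=100, alpha=1.0, beta=2.0, rho=0.1):
    """cost[f][s]: transfer cost of serving file f from site s (lower is better)."""
    n_files, n_sites = len(cost), len(cost[0])
    pheromone = [[1.0] * n_sites for _ in range(n_files)]
    best, best_cost = None, float("inf")

    def pick_site(f):
        # Probability ~ pheromone^alpha * (1/cost)^beta.
        weights = [pheromone[f][s] ** alpha * (1.0 / cost[f][s]) ** beta
                   for s in range(n_sites)]
        return random.choices(range(n_sites), weights=weights)[0]

    for _ in range(iterations):
        solutions = []
        for _ in range(ants):
            assignment = [pick_site(f) for f in range(n_files)]
            total = sum(cost[f][s] for f, s in enumerate(assignment))
            solutions.append((total, assignment))
            if total < best_cost:
                best_cost, best = total, assignment
        # Evaporation, then pheromone deposit proportional to solution quality.
        for f in range(n_files):
            for s in range(n_sites):
                pheromone[f][s] *= (1.0 - rho)
        for total, assignment in solutions:
            for f, s in enumerate(assignment):
                pheromone[f][s] += 1.0 / total
    return best, best_cost
```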

Sun et al. (2005) introduced an ant colony-based replica selection algorithm for the data grid environment. The proposed strategy considers important criteria in the replica selection process: (1) disk I/O transfer, which is related to disk seek times (obviously, a lower seek time decreases the data retrieval time); (2) the network condition, which refers to the available bandwidth for transferring the replica; and (3) the load of the site that contains the requested replica. The evaluation results proved that the proposed algorithm could reduce the average access time, especially in data-intensive environments.

Yang et al. (2013) described a new replica selection strategy based on ant colony optimization to reduce access latency. The authors modeled task nodes as foraging ants and the files required by tasks as food; therefore, the replicas stored at various sites have different paths to the food. They assigned an eigenvalue to each replica, whose magnitude indicates the probability of the replica being chosen, and the eigenvalue is modified according to varying circumstances. The proposed strategy considers important factors such as the replica host load and available bandwidth in the selection process. The simulation results demonstrated that the proposed algorithm could improve performance in terms of effective network usage and mean execution time.

Jafari Navimipour and Alami Milani (2016) proposed replica selection using the ant colony technique to enhance the performance of the cloud system. The authors considered a graph in which each node represents a data center and edges connect them. At the beginning, all ants randomly choose a cloud center since there are no pheromones. After the optimal data file is found by the first ant, the other ants are attracted to the new data center that holds the target file. For choosing a replica, the ants use pheromone information defined based on the historical read accessibility of the replica and its size. The simulation results indicated that the proposed strategy could reduce access time compared to the RTRM strategy (Bai et al. 2013).

Shojaatmand et al. (2011) used the ant colony algorithm for selecting the most suitable replica in a data grid environment. They defined two references for each file: a logical file name (LFN) and a physical file name (PFN). The LFN is independent of where the file is placed, and the PFN indicates a particular replica of the file. The proposed strategy uses pheromone information to show how good a replica is. Ants gather statistical information about the nodes that hold the needed replica while moving in the network. Each ant leaves the collected information in nodes as a trail, and other ants use this information to find the path with better pheromone. The pheromone is defined based on the available bandwidth and the file size. The simulation results with OptorSim demonstrated that the proposed strategy could reduce response time compared to the no-replication strategy.

6.3 Data replication strategies based on artificial bee colony algorithm

The artificial bee colony (ABC) technique was proposed by Karaboga in 2005 for optimizing numerical problems (Karaboga 2005). This strategy is inspired by the intelligent foraging behavior of honey bees. The model consists of three components: employed foraging bees, unemployed foraging bees, and food sources. The first two components (i.e., employed and unemployed foragers) search for rich food sources near their hive. The model contains two leading modes of behavior: (1) recruitment of forager bees to rich food sources (positive feedback) and (2) abandonment of poor sources by foragers (negative feedback). Algorithm 3 indicates the pseudo-code of the bee colony algorithm.

figure c
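A compact Python sketch of the ABC loop in Algorithm 3 for a continuous minimization problem is shown below. It is a generic illustration with assumed parameter values and the standard fitness transform; it is not the exact procedure of any surveyed strategy.

```python
import random

def abc_optimize(objective, dim, lower, upper,
                 n_sources=20, iterations=200, limit=30):
    """Minimal artificial bee colony for continuous minimization."""
    def rand_source():
        return [random.uniform(lower, upper) for _ in range(dim)]

    def fit(src):
        # Standard ABC fitness transform (higher is better).
        f = objective(src)
        return 1.0 / (1.0 + f) if f >= 0 else 1.0 + abs(f)

    def neighbor(i):
        # Perturb one dimension toward/away from another random source.
        k, d = random.randrange(n_sources), random.randrange(dim)
        cand = sources[i][:]
        cand[d] += random.uniform(-1, 1) * (sources[i][d] - sources[k][d])
        cand[d] = min(max(cand[d], lower), upper)
        return cand

    sources = [rand_source() for _ in range(n_sources)]
    trials = [0] * n_sources
    best = min(sources, key=objective)

    for _ in range(iterations):
        # Employed bee phase: greedy local search around every food source.
        for i in range(n_sources):
            cand = neighbor(i)
            if objective(cand) < objective(sources[i]):
                sources[i], trials[i] = cand, 0
            else:
                trials[i] += 1
        # Onlooker bee phase: better sources attract more searches.
        weights = [fit(s) for s in sources]
        for _ in range(n_sources):
            i = random.choices(range(n_sources), weights=weights)[0]
            cand = neighbor(i)
            if objective(cand) < objective(sources[i]):
                sources[i], trials[i] = cand, 0
            else:
                trials[i] += 1
        # Scout bee phase: abandon exhausted sources.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i], trials[i] = rand_source(), 0
        best = min(best, min(sources, key=objective), key=objective)
    return best
```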

Khojand et al. (2018) proposed a dynamic data replication strategy based on a fuzzy system for the data grid environment, called predictive fuzzy replication (PFR). The PFR algorithm predicts future needs based on the usage history of files and prereplicates them in appropriate locations. PFR redefines the balanced ant colony optimization (BACO) strategy (Tharani 2016), which was developed for task scheduling in computing grids. In the BACO strategy, each job is considered an ant that tries to get resources in the system, whereas the PFR algorithm defines a file as an ant and a node as a resource. The authors designed the fuzzy system to define some general rules that are always true in the system. Since the grid is a dynamic system, accurate information is not available at every moment; therefore, fuzzy inference is a good choice for predicting the behavior of the system. The proposed fuzzy system captures the direct relation between usage ratio and node size, as well as the inverse relation between file size and node level. The experimental results proved that the PFR strategy could increase the percentage of replica usage compared with the predictive hierarchical fast spread (PHFS) (Mohammad Khanli et al. 2011), fast spread (FS) (Ranganathan and Foster 2001), and cascading algorithms (Ranganathan and Foster 2001).

Azimi et al. (Khalili Azimi 2019) introduced a bee colony-based replication strategy for cloud computing. The authors considered a three-level hierarchical topology. The first level consists of regions connected by low bandwidth; the second level consists of the LANs (local area networks) of each region, connected by higher bandwidth than the first level; and the third level includes the sites of each LAN, which are connected to each other through high bandwidth. In the proposed bee colony-based algorithm, if a new location or a new food region shows better quality or more nectar, the bee remains in the new location and its trial index is incremented by one. The quality is defined as the probability of files existing at sites. After the searching stage is finished by the worker bees, the best site is selected based on the number of requests for a file. The experiments showed that the introduced replication strategy could successfully reduce the mean job time compared to the LFU, LRU, and BHR (Park et al. 2003) algorithms.

Taheri et al. (2013) used the bee colony optimization algorithm for simultaneous job scheduling and data replication in grid systems. The proposed scheduling algorithm orders jobs by length, so the longest job has higher priority than the others. A specific number of positions are assigned to bees for advertising each node; if the benefit of a newly allocated job/bee is higher than that of an older bee, the older bee is replaced. If the similarity of bees to the dancing bees is lower than 80%, they are replaced, so biasing all bees toward a specific bee type is avoided and various types remain available on the dance floor to advertise a node for more job types. The proposed strategy orders all files by size, so the largest file has the highest priority. To prevent unnecessary replication, the proposed replication strategy uses the condition “ArrUpTimes (k)/MinUpTime (Dx) < 2,” where ArrUpTime is the array of the total upload times of Dx to all its dependent jobs and MinUpTime (Dx) indicates the minimum uploading time of Dx stored on any node. The experimental results demonstrated that the proposed strategies could improve transfer time and resource utilization compared to DIANA (Anjum et al. 2006), Chameleon/FLOP (Sang-Min and Jair-Hoom 2003), MinTrans (Abdi and Mohamadi 2010), and MinExe (Ranganathan and Foster 2002).

Salem et al. (2019) introduced a novel replication strategy based on ABC algorithm. In the first phase, the proposed algorithm tries to handle the least-cost path issue for determining the optimal replica placement and obtaining low cost based on the knapsack problem. The main goal is finding a solution so that the consumer can get and store replicas through the shortest path with a lower cost and provide load balancing in the system. In the second phase, ABC strategy is executed by data centers to find an optimal sequence of data replication and support the best least-cost path. The experimental results presented that the introduced strategy could reduce the data transmission in comparison with dynamic cost-aware re-replication and rebalancing strategy (DCR2S) (Gill and Singh 2016), enhanced fast spread (EFS) (Bsoul et al. 2011), and genetic algorithm (GA).

6.4 Data replication strategies based on firefly algorithm

The firefly algorithm was proposed by Yang (2013) at Cambridge University to deal with multimodal, global optimization problems. The flashing light of fireflies is the main characteristic and serves two major purposes: attracting mating partners and warning potential predators. The flashing light follows some physical rules: the brightness of a firefly encodes the landscape of the objective function; a mutual coupling is triggered once a firefly comes within the vicinity of another firefly; attractiveness is proportional to brightness; and both attractiveness and brightness decrease as distance increases. In other words, the less bright firefly is attracted to the brighter one. Algorithm 4 indicates the pseudo-code of the firefly algorithm.

figure d
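The attraction rule described above can be written down compactly. The following Python sketch is a minimal firefly algorithm for continuous minimization; the parameter values (alpha, beta0, gamma) are common illustrative defaults, not those of any surveyed strategy.

```python
import math
import random

def firefly_optimize(objective, dim, lower, upper,
                     n_fireflies=25, iterations=100,
                     alpha=0.2, beta0=1.0, gamma=1.0):
    """Minimal firefly algorithm (brightness = -objective, so brighter is better)."""
    flies = [[random.uniform(lower, upper) for _ in range(dim)]
             for _ in range(n_fireflies)]
    for _ in range(iterations):
        intensity = [-objective(f) for f in flies]
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if intensity[j] > intensity[i]:
                    # Attractiveness decays with the squared distance.
                    r2 = sum((a - b) ** 2 for a, b in zip(flies[i], flies[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    # Move the dimmer firefly toward the brighter one plus a random step.
                    flies[i] = [
                        min(max(xi + beta * (xj - xi)
                                + alpha * (random.random() - 0.5), lower), upper)
                        for xi, xj in zip(flies[i], flies[j])
                    ]
                    intensity[i] = -objective(flies[i])
    return min(flies, key=objective)
```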

Sadeghzadeh and Navaezadeh (2014) improved the replica placement step using the firefly technique in a data grid environment. The proposed strategy models the grid as an M × N matrix, where M is the number of sites and N is the number of files. The authors used the total read cost and write cost introduced in Tu et al. (2010). In each generation, the superiority of elements is determined to find the best locations based on their personal experience and collective intelligence. The simulation results indicated that the presented strategy could reduce storage usage.

6.5 Data replication strategies based on particle swarm optimization algorithm

Particle swarm optimization (PSO) is a population-based approach proposed by Eberhart and Kennedy (1995). It is inspired by social behavior in bird flocking and fish schooling. Each particle is defined by a position and a velocity with which it moves in the search space. In each iteration, the velocity of each particle is updated according to its personal best-known position and the global best position. Algorithm 5 indicates the pseudo-code of the PSO algorithm.

Jayasree and Saravanan (2018) presented an adaptive particle swarm division and replication of data optimization (APSDRDO) algorithm that considers the security of cloud data and automatic data updating. The APSDRDO algorithm divides a replica into several fragments and stores them based on the T-coloring concept; therefore, a successful attack on a single data center should not reveal the positions of the other fragments inside the cloud system. Replica placement is done in two phases. In the first phase, candidate data centers are selected according to a centrality measure to improve retrieval time. In the second phase, the data center is selected by the PSO algorithm and, at the same time, the updating process is performed. The objective function is defined based on the read time and write time. The comparison results with Greedy, Division and Replication of Data in the Cloud for Optimal Performance and Security (DROPS) (Ali et al. 2018b), and the Genetic Replication Algorithm (GRA) indicated that APSDRDO has better average response time.

figure e
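The velocity-update rule described at the beginning of this subsection (Algorithm 5) is shown in the minimal Python sketch below. It is a generic continuous-optimization version with commonly used parameter values; it is not the APSDRDO or HPSOTS formulation.

```python
import random

def pso_optimize(objective, dim, lower, upper,
                 n_particles=30, iterations=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO for continuous minimization."""
    pos = [[random.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=objective)

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Inertia + cognitive pull (pbest) + social pull (gbest).
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lower), upper)
            if objective(pos[i]) < objective(pbest[i]):
                pbest[i] = pos[i][:]
                if objective(pbest[i]) < objective(gbest):
                    gbest = pbest[i][:]
    return gbest

# Example: minimize the sphere function in 5 dimensions.
best = pso_optimize(lambda x: sum(v * v for v in x), dim=5, lower=-10, upper=10)
```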

Ebadi and Jafari Navimipour (2018) developed an energy-aware data replication algorithm for the cloud. The authors used PSO for its strong global search ability and tabu search (TS) for its powerful local search capability. The fitness values of particles are obtained based on the total cost (i.e., reads and writes) and energy. Then, all of the particles repeatedly move until the maximum number of iterations is met. Therefore, the proposed algorithm (HPSOTS) treats the replication problem as a 0–1 decision problem and tries to minimize the total energy and cost (TEC). The experimental results demonstrated the strength of HPSOTS, especially for low storage capacities.

Muñoz and Carballeira (2006) introduced a replica selection strategy based on a PSO-LRU approach for the data grid environment. The authors modeled a file location request as a bird searching for food: the bird decides where to search according to the flock's food chirp, which weakens with distance. The proposed strategy considers performance metrics such as hit ratio and network cost, where the network cost is obtained from three factors: latency, bandwidth, and file size. If the selected node does not have enough space, the proposed strategy deletes files based on the least recently used approach. The simulation results indicated that the proposed algorithm could reduce response time in large distributed systems.

7 Summary and discussion

This section summarizes the results of the literature review. The distribution of articles by meta-heuristic algorithm is shown in Fig. 10. We can see that 46% of the articles used GA and 12% of the literature is related to the PSO algorithm. Many meta-heuristic techniques have been introduced in the last two decades to achieve better designs and solve hard problems in the cloud environment, such as the lion optimization algorithm, which has been used for task scheduling to enhance different QoS factors (e.g., resource utilization, degree of imbalance, and makespan time) (Ahmed Almezeini and Hafez 2017), and the gray wolf optimization algorithm, used to reduce makespan time and energy consumption in cloud systems (Natesan and Chokkalingam 2019). There is no single nature-inspired optimization algorithm that optimally solves all problems (i.e., the “no free lunch” (NFL) theorem) (Wolpert and Macready 1997). In other words, an optimization algorithm may solve a certain set of problems well but be unsuitable for other types of problems (Mirjalili 2015). Consequently, recent and effective optimization algorithms should be tested for solving the objective-based data replication problem. In addition, various modifications of the basic versions of existing meta-heuristic algorithms have been presented to address their weaknesses, such as slow convergence rates and parameter adaptation. Owing to the successful results of fuzzy logic in various scientific fields, researchers have focused on using the positive features of fuzzy logic principles in meta-heuristic design (Kumar et al. 2019; Peraza et al. 2016). Hence, it is concluded that enhanced versions of algorithms or hybrid meta-heuristics may offer better results than the other presented optimizers for the complex replication problem.

Fig. 10
figure 10

Distribution of papers based on meta-heuristic algorithms

To simulate cloud and grid environments, there are some prominent simulation tools. As can be seen in Fig. 11, many algorithms have been implemented by their authors in MATLAB or Java. The most popular cloud and grid simulators are CloudSim and OptorSim, in which experiments are set up by extending existing classes according to the requirements of the algorithm. CloudSim is a discrete-event tool introduced by the University of Melbourne (Goyal et al. 2012). With CloudSim, large cloud data centers, hosts, and virtual components can be simulated, and users can design resource allocation, scheduling, replication, and provisioning strategies, virtualization techniques, and energy management techniques (Mansouri and Javidi 2018c).

Fig. 11
figure 11

Comparisons among testing environments

Also, OptorSim was proposed by University of Glasgow in Scotland to simulate the structure of a real data grid and investigate the effectiveness of replication algorithms (Bell et al. 2003). Consequently, it is expected that data replication algorithms are tested based on popular toolkits to configure a real cloud or grid scenario. Interestingly, common simulators (e.g., CloudSim and OptorSim) are also based on the most popular programming language, Java, and may be a reason for their popularity.

From Fig. 12, we can observe that the general focus of data replication research is placing replicas at different locations. Nevertheless, the timing of data replication and the selection of files for replication are crucial decisions and may affect computing time, storage space, data traffic, and so forth.

Fig. 12
figure 12

Distribution of papers based on main challenges of data replication

Moreover, most works assumed only the scenario of read-only files and did not consider the consistency concept. However, there are always some read–write files in the system, and some of them may be important. Generally, replica consistency management can be performed in two ways: (1) synchronously, using so-called pessimistic techniques, and (2) asynchronously, using optimistic approaches. It is necessary to manage data consistency among the various replicas distributed in different data centers. Since the cost of replication depends on exactly when and how those changes need to be carried out, Terry et al. (2013) introduced consistency-based SLAs to express a broad set of acceptable consistency/latency trade-offs.

Figure 13 describes the number of research papers that consider different QoS parameters (i.e., response time, load balancing, availability, cost, security, and energy). Figure 13 depicts that response time is used as a QoS parameter in all research papers, while only one paper considered energy consumption and two used the security parameter. This distinctly outlines the importance of the security and energy factors and the necessity of novel and improved data replication algorithms as the utilization of cloud computing rises.

Fig. 13
figure 13

QoS parameters of data replication algorithms

In multiobjective data replication optimization, several important parameters are expected to be considered to meet the requirements of users as well as service providers. Generally, traditional algorithms drive data replication with performance objectives alone; the use of multiple objectives such as QoS, cost, and load within one strategy has rarely been proposed. Such a combination can support optimal services with minimal expense for the users.

8 Case study

In this section, we investigate the impact of meta-heuristic techniques on the performance of data replication algorithms. We evaluate all meta-heuristic-based data replication algorithms with CloudSim, the most popular framework for cloud simulation (Calheiros et al. 2011). We use the same fitness function and configuration for all algorithms. We formulate the multiobjective data replication problem in mathematical form and define the fitness function by Eq. (1).

$$ {\text{Fitness}} = w_{1} \times {\text{RT}} + w_{2} \times {\text{SU}} + w_{3} \times {\text{RN}} $$
(1)

where RT, SU, and RN represent response time, storage usage, and number of replicas, respectively, and w1, w2, and w3 are weights corresponding to the importance of the fitness factors. In this paper, equal importance is assumed for all factors, so all weights are set to one. Table 6 indicates the commonly used parameter settings of all strategies.
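For reference, the snippet below evaluates Eq. (1) for one candidate solution. It is a direct transcription of the weighted sum; the example metric values are arbitrary and assume the three metrics have been normalized to comparable scales.

```python
def fitness(response_time, storage_usage, replica_count, w1=1.0, w2=1.0, w3=1.0):
    """Weighted-sum fitness of Eq. (1); lower is better."""
    return w1 * response_time + w2 * storage_usage + w3 * replica_count

# Example: arbitrary normalized metric values for one candidate configuration.
print(fitness(response_time=0.42, storage_usage=0.31, replica_count=0.20))
```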

Table 6 Values for parameters of the meta-heuristics algorithms

8.1 Based on cloud structure

In this section, we try to evaluate different algorithms based on various numbers of tasks and virtual machines. Table 7 indicates the values of configuration for CloudSim (Mansouri et al. 2019).

Table 7 Detailed characteristics of CloudSim

Figure 14 indicates the fitness values for the different meta-heuristic-based data replication algorithms (i.e., GA, ACO, FA, BA, PSO, WOA, GWO, and ABC). In Fig. 14a, it is obvious that as the number of tasks increases, the fitness value increases for all replication methods. In addition, GWO and WOA show better behavior than the others under high-load conditions. For example, WOA on average improves the fitness by 28% and 7% (over different numbers of tasks and VMs, respectively) compared to FA, and by 36% and 12% compared to PSO. This is because WOA has a good exploration capability due to the position-updating mechanism of the whales. During the initial steps of the algorithm, this mechanism moves the whales randomly around each other. In the subsequent steps, the positions of the whales are updated rapidly and the whales move in the best direction along a spiral-shaped route. In WOA, these two phases are performed independently, each in half of the iterations, so local optima are avoided and at the same time the convergence rate is improved over the iterations.

Fig. 14
figure 14

Fitness values based on a number of tasks and b number of VMs

In contrast, most of the other optimization techniques, such as PSO and FA, do not use operators that dedicate specific iterations to exploration or exploitation, since they rely on only one updating method; hence, the probability of becoming trapped in a local solution increases. With 20 VMs, we can observe that GWO is the best and GA is the worst, since GWO can provide a suitable balance between exploration and exploitation.

For complex problems, the standard randomization method of WOA may increase computational time, since randomization plays a key role in the exploration and exploitation process, whereas randomization has less effect on ABC than on WOA. In the ABC method, probability is used only to update the employed bees and onlooker bees. The employed bees are responsible for diversification and perform a local search around each nectar source. The onlooker bees are responsible for intensification and for updating the better food sources. The ABC algorithm can escape from local solutions by considering these groups of bees and thus enhances the search capability. The result of this advantage of ABC can be seen in Fig. 14, where ABC improves the fitness compared to WOA, ACO, and GA by 5%, 14%, and 22%, respectively, over different numbers of VMs.

Figure 15 shows the performance of the meta-heuristics in terms of fitness value in boxplot format, which depicts the empirical distribution of the results. The boxplot in Fig. 15a is generated for 4000 tasks and different numbers of virtual machines, while Fig. 15b is generated for 50 VMs and different numbers of tasks. The outcomes are obtained after repeating each experiment 30 times. The quality of an algorithm can be judged from the location of its box: in our case (a minimization problem), a lower box indicates better solution quality. Figure 15 shows that the interquartile ranges and medians of ABC and GWO are comparatively low, which means that these two methods are better at achieving low fitness values. It can also be observed that the medians of the PSO and BA boxplots lie in the middle of their boxes and the whiskers are of roughly equal length. FA exhibits a number of outliers, which is a point of concern when adopting it as the optimization algorithm for the replication problem. The ABC plot shows less variation and spread than the other plots, and its median lies approximately in the middle of the box. The smaller boxes of ABC and GWO indicate smaller variance in the fitness value and hence a more stable performance.

Fig. 15 Boxplot of fitness based on different (a) numbers of VMs and (b) numbers of tasks

8.2 Based on heuristic structure

In this section, we compare replication algorithms in terms of population size and number of iterations. Table 8 shows the CloudSim configuration.

Table 8 CloudSim configuration

Another parameter that must be considered when evaluating meta-heuristic algorithms is the number of iterations. In Fig. 16, GWO and ABC show the best results among the evaluated strategies. GWO quickly converges to the global solution within 100 iterations, whereas the other strategies find the promising region only with a higher number of iterations. In addition, the ABC algorithm demonstrates its superior efficiency in Fig. 16, since its search process avoids becoming trapped in local optima.

Fig. 16 Fitness value versus number of iterations

GWO reaches the global solution (0.2) with a very smooth convergence curve, and ABC is the second-best algorithm (0.25). We can observe that BA and FA exhibit lower convergence rates than WOA and ABC, together with comparatively low efficiency; in other words, BA and FA need a higher number of iterations to reach the global solution. This is, perhaps, because of the poor exploration process of BA and the constant parameters of FA during the search, which may cause them to become trapped in local optima.

BA has a better convergence rate than PSO and GA. The main advantage of BA over PSO is that BA solves the problem using echolocation and frequency tuning; as a result, it offers capabilities similar to the key features of particle swarm optimization. In addition, BA can automatically zoom into a promising area by switching from explorative moves to intensive local exploitation, whereas PSO suffers from local optima and a low convergence speed toward the global solution, especially in high-dimensional spaces.
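The frequency-tuning behaviour of BA mentioned above can be sketched as follows. This is a simplified illustration of the standard bat update; the parameter values (frequency range, loudness, pulse rate) are placeholders, not tuned settings from the experiments.

```python
import random

# Hedged sketch of BA's frequency-tuning update for one bat with position x
# and velocity v, guided by the best bat found so far.
def bat_update(x, v, best, f_min=0.0, f_max=2.0, loudness=0.5, pulse_rate=0.5):
    beta = random.random()
    freq = f_min + (f_max - f_min) * beta                  # echolocation frequency
    v = [vi + (xi - bi) * freq for vi, xi, bi in zip(v, x, best)]
    x_new = [xi + vi for xi, vi in zip(x, v)]
    if random.random() > pulse_rate:
        # Local random walk around the best bat: the "zoom-in" exploitation step.
        x_new = [bi + 0.01 * random.gauss(0, 1) * loudness for bi in best]
    return x_new, v
```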

Figure 17 shows the fitness value for different population sizes. It is clear from the results that the lowest fitness value (about 2.23) is obtained by GWO with a population size of 35. The main reason is that the GWO algorithm uses several candidate solutions to avoid locally optimal solutions; it can also jump quickly toward a promising region of the search space by sharing the information of multiple candidate solutions.
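This information-sharing mechanism can be illustrated with the following sketch of the standard GWO position update, in which each wolf is guided jointly by the three best candidates (alpha, beta, delta). The code is a hedged illustration assuming list-based solution vectors, not the implementation used in the experiments.

```python
import random

# Hedged sketch of the GWO position update: wolf x moves toward a point
# averaged from the guidance of the three leading solutions; the coefficient
# a decreases from 2 to 0 over the iterations.
def gwo_update(x, alpha, beta, delta, a):
    def guided(leader):
        out = []
        for xi, li in zip(x, leader):
            A = 2 * a * random.random() - a
            C = 2 * random.random()
            out.append(li - A * abs(C * li - xi))
        return out
    x1, x2, x3 = guided(alpha), guided(beta), guided(delta)
    return [(i + j + k) / 3.0 for i, j, k in zip(x1, x2, x3)]
```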

Fig. 17 Fitness value versus population size

Moreover, BA performs better in terms of fitness than PSO (improving the fitness by 23% on average). It can also be seen from Fig. 17 that increasing the number of particles does not help PSO achieve better fitness, because its capability of producing new solutions is weaker than that of newer techniques such as GWO and WOA.

The comparison of response time, storage usage, and number of replicas is shown in Fig. 18, revealing that GWO replicates files in the best manner. In summary, the results show that GWO performs well in solving data replication problems, mainly because its operators evade local optima and converge faster toward the optimum.

Fig. 18 Response time, storage usage, and number of replicas for different algorithms

We know that an optimization strategy may be well suited to a certain set of problems yet inappropriate for others. Therefore, this domain of research remains open and allows researchers to enhance the existing algorithms or introduce new meta-heuristic algorithms for obtaining consistent and accurate results.

9 Open research challenges

We list some open research topics that need attention.

9.1 Modeling the provider profit

Taking the technical side (i.e., delay, portability, etc.) and the business side (release clauses, notice time constraints, price differences) together makes the data replication problem complex. Reducing operating expenses and increasing the profit of the provided services are critical because of the substantial infrastructure investments made by cloud providers. One solution is to maximize the efficiency of resource utilization. Nevertheless, increasing the provider's profit does not necessarily coincide with improving the user's service quality. The objective of a data replication algorithm must be to analyze this multidimensional trade-off. However, few of the introduced algorithms consider the economic impact of data replication on the cloud provider. Therefore, determining the optimal number of replicas and placing them strategically so that the response-time objective for job execution is satisfied while the provider still makes a profit remains an open issue.

Different types of costs contribute to the total expenditure of the provider, for example, the computational investment cost that keeps the hardware functioning properly. Network usage cost is another expense, arising from data transmission and remote data access. Storing new replicas requires storage space and increases the storage cost. On the other hand, it is difficult to determine how a change in one unit of expenditure leads to a change in another, and the relationship among the various kinds of expenditure can differ from one provider to another. Consequently, a data replication algorithm should try to estimate the expenditures that affect the profitability of the service provider when making its decisions. More work is required toward designing data replication algorithms under specific budget constraints.
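As a rough illustration of such an expenditure model, the sketch below combines the three cost components mentioned above into a single provider-side profit estimate. The unit prices, function names, and the additive breakdown are assumptions for illustration only, not figures from any surveyed work.

```python
# Hedged sketch of a provider-side expenditure model for a replication decision.
def replication_cost(replicas_gb, transfer_gb, cpu_hours,
                     storage_price=0.02, network_price=0.01, compute_price=0.05):
    storage_cost = replicas_gb * storage_price     # keeping the extra copies
    network_cost = transfer_gb * network_price     # creating/refreshing replicas
    compute_cost = cpu_hours * compute_price       # machines serving the replicas
    return storage_cost + network_cost + compute_cost

def provider_profit(revenue, replicas_gb, transfer_gb, cpu_hours):
    """Profit = service revenue minus the estimated cost of the replica set."""
    return revenue - replication_cost(replicas_gb, transfer_gb, cpu_hours)
```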

9.2 Considering replica consistency

When a data file is changed in the system, the problem of maintaining consistency among the existing replicas appears. Therefore, proposing a dynamic replica consistency model supporting different levels of consistency is critical. In addition, a consistency problem exists for metadata, i.e., additional information about data such as directories and catalogs. This metadata is used by the replica management service to determine the characteristics of the replicas of each data file. Such directories can also be copied, and their consistency is important to the correct operation of the system. Most of the work done on replication has assumed that data are read-only and has not focused on consistency. It is necessary that consistency algorithms dynamically adapt the consistency level at runtime to balance consistency against quality of service.

9.3 Modeling the energy consumption

Nowadays, one of the critical issues in cloud computing is high energy consumption. When the number of replicas increases significantly, energy consumption increases as well. Many researchers believe that, rather than considering performance alone, it may be more important to treat energy as a critical parameter during data replication. Nevertheless, energy consumption modeling is another key question that is ignored by most optimization functions. If the number of replicas is decreased, system performance may degrade. Hence, modeling the energy consumption characteristics of the data center and balancing the variety of trade-offs within the proposed replication management techniques will give better results. Moreover, it is necessary to consider shutdown and power-on methods with the objective of optimizing the energy consumption of the data centers.
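One way to make such a model concrete is the widely used linear server power model, sketched below; the idle and peak power values and the per-byte transfer energy are illustrative assumptions, and a real study would calibrate them against the target data center.

```python
# Hedged sketch of a linear power model plus a replica-transfer term.
def server_energy(utilization, hours, p_idle=100.0, p_max=250.0):
    """Energy (Wh) of one server at a given CPU utilization in [0, 1]."""
    power = p_idle + (p_max - p_idle) * utilization
    return power * hours

def replication_energy(active_servers_util, hours, replica_bytes, j_per_byte=1e-7):
    """Total energy: hosting servers plus the cost of transferring new replicas."""
    hosting = sum(server_energy(u, hours) for u in active_servers_util)
    transfer_wh = replica_bytes * j_per_byte / 3600.0   # joules -> watt-hours
    return hosting + transfer_wh
```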

9.4 Deeper look at reliability and security

Data reliability and security are of wide concern in current dynamic cloud storage, since public clouds are vulnerable to different kinds of Denial of Service (DoS) attacks such as Distributed Denial of Service (DDoS) and Economic Denial of Service (EDoS) (Masdari et al. 2016). Therefore, considering a data protection model in which a user's sensitive data are replicated at various locations is essential. When there are several replicas in various locations, an attacker must delete or modify all of them to make the data unavailable to the user, so data replication can improve data survivability. On the other hand, data security may be reduced by creating replicas of a data file, since more copies make the data easier to steal. Proposing a function of time to model data survivability and security in the cloud under the protection of data replication is also a field that continuously draws the attention of researchers. Traditional data replication algorithms can obtain high data reliability through multiple replicas, but at the cost of large storage consumption; if they instead reduce storage consumption by deleting redundant replicas, data reliability can no longer be guaranteed. Consequently, a dynamic approach that creates the minimal number of replicas satisfying the data reliability and availability requirements, based on the file's storage expectation, should be considered.

9.5 Investigate the scalability problem

It is obvious that databases scale up in many dimensions like data structures, number of users, size, and complexity. Hence, all of the aforementioned aspects must be considered as a whole, so that the provider can allocate resources in a manner that can satisfy the requirements of data-intensive tasks. Consequently, another concern is the scalability of strategies as the system grows geographically.

9.6 Exploiting new meta-heuristic algorithms

It is also observed from this survey that most of the strategies apply very basic meta-heuristic algorithms such as standard PSO and GA, while various modifications of the basic versions of existing meta-heuristics have recently been presented to overcome their drawbacks. In addition, there are novel meta-heuristic techniques, such as the grasshopper optimization algorithm (Saremi et al. 2017), that provide superior results compared to conventional and recent algorithms in the literature. Hence, these search optimization algorithms, with their high robustness and effectiveness, may also be applied to solve the data replication problem and provide more accurate solutions. Consequently, recent and effective optimization approaches need to be investigated for solving the multi-objective data replication problem, because most existing works consider only traditional strategies.

9.7 Implementing with suitable simulators

Simulators play a key role in modeling the system and evaluating algorithms, since experimentation in a real environment is very difficult and expensive. Consequently, there has been a significant increase in the development of cloud simulation frameworks with various features. For example, the GreenCloud simulation tool (Kliazovich et al. 2012) focuses on energy awareness in every network device, and the secCloudSim simulator (Rehman et al. 2014) provides the basic security features of authentication and authorization. Therefore, researchers must choose the most appropriate simulator based on their set of requirements.

We can see that most of the algorithms included in this survey used custom code to test the strategies. As a next step, these replication algorithms should be implemented in validated simulators based on real-world scenarios. In addition, most works compared their simulation results only with very simple approaches, so it would be interesting to compare the proposed strategies with recent algorithms that already far outperform the basic ones.

9.8 Considering the real patterns of user application

Most works test their algorithms with arbitrary settings, but such evaluations do not seem to be enough. More study is necessary to design realistic user application models. For instance, the four features of big data, namely volume, velocity, variety, and veracity (Mansouri et al. 2019), can shape the final form of the user application. Therefore, replication algorithm developers will have to take into consideration more complex scenarios in which the degree of interdependency reaches a new level.

9.9 Discussing the impact of parameters

Most of the algorithms reviewed in this paper do not investigate the impact of the various considered parameters on the performance results. For instance, history length and network topology have a great impact on the obtained experimental results. Therefore, many experiments are still necessary to thoroughly evaluate the performance of the algorithms.

Furthermore, since the execution of a replication algorithm based on a meta-heuristic technique depends closely on the performance of the underlying meta-heuristic, the overhead that the meta-heuristic imposes on the algorithm's performance should be assessed.

Another problem is the ambiguity in the explanation of some replication algorithms and the lack of necessary and sometimes critical details. For instance, the pseudo-code of the algorithms and the values of the used parameters are often not reported, which makes reimplementation by others very hard.

9.10 Optimizing the number of replicas

One of the main problems in data replication is achieving the desired performance factors while a particular degree of replication (i.e., a certain number of replicas) is maintained. Therefore, an optimal number of replicas should be determined with both economic and performance considerations, since the threshold on the number of replicas plays a key role in achieving optimality for both users and service providers.

9.11 Using learning techniques

In the real systems available today, it is hard to gather complete information and discover knowledge about the system in advance. Data mining techniques can efficiently extract new, meaningful knowledge about tasks, files, and resources to enhance system management. With supervised, unsupervised, and semi-supervised learning techniques, we can find file access patterns, file correlations, and user access behavior. This information can then be applied to predict the future behavior of the system and enhance data replication. In addition, clustering techniques could be used to find a balanced solution based on QoS requirements such as security, load balancing, response time, and cost.

9.12 Widen your horizons: game theory

Another original future direction that we present is applying game theory to the data replication problem. Game theory is a formal study of decision-making and a powerful mathematical tool for analyzing multiple levels of optimality (Yang et al. 2020). It is suitable when two or more agents with various preferences must make mutual decisions; this interdependence forces each agent to consider the other players' possible decisions in its own strategy. The agents seek to maximize their individual utility, and a data replication strategy should be designed in a manner that benefits both resource providers and users. For example, if a solution reduces response delay and cost for the users, the broker's profit increases. Moreover, predicting player behavior based on prior knowledge would be beneficial. To the best of our knowledge, there is no data replication algorithm based on game theory; thus, we encourage researchers to apply game theory to replication. Game theory-based strategies have low complexity and high computational speed (Bielik and Ahmad 2012).

10 Conclusions

In this paper, we have provided a review of data replication algorithms for cloud computing as well as data grid systems. Only a few works have classified data replication by taking meta-heuristic approaches as one of the classification criteria. Therefore, we have studied the recent dynamic replication strategies specifically from the perspective of meta-heuristic techniques and discussed their contributions.

According to the findings of the literature review covering 2006 to 2019, no single meta-heuristic replication algorithm fulfills all the required factors; rather, each performs better when particular factors among response time, resource utilization, latency, etc. are considered at a time. From the above study, it can be observed that there is still considerable room for new approaches in the field of data replication for distributed environments.

Security is one of the main factors neglected by most replication algorithms; to thwart the effects of security attacks, future studies on the data replication problem should consider various security-related parameters in the replication decision. For example, data replication strategies can be combined with trust management approaches, and the trust level of each user should be applied alongside the other conventional replication factors in the cloud. Furthermore, important objectives such as load balancing across data centers should receive more attention to achieve green cloud computing.

Meta-heuristic algorithms have experienced a noteworthy shift toward hybridization with other techniques for combinatorial optimization problems. The hybridization of various algorithms tries to exploit the complementary character of several optimization techniques and benefit from their synergy. Hence, recent hybrid meta-heuristics would be beneficial for solving the multi-objective data replication problem, because most existing works investigate only traditional algorithms.

Finally, we are convinced that research on game theory-based replication algorithms is still in its early days, and researchers interested in this topic can make useful contributions. We hope that this review provides some guidance for this interesting line of research.