1 Introduction

The massive increase in data generation in recent years has led to the need for cost-effective solutions for data storage systems [1]. An overriding requirement of such systems is that timely access to data is guaranteed [2]. Lack of access to data can have severe consequences. For example, in medical emergency applications, patients' vital information is to be available to physicians whenever required [3]. In other activities such as research collaboration, it is desired that researchers share and use their findings wherever needed [4].

Cloud computing offers on-demand virtualized resources, such as computing nodes and storage spaces, to users who are billed on a fee-for-service model with negotiated service quality. In general, service quality criteria are expressed in Service Level Agreements (SLAs) between cloud providers and users. A typical highest priority item in such SLAs is data availability. This item deals with data access assurance, and is measured as the probability of data availability [5]. The second most common item in typical SLAs is related to system response time.

Cloud providers may have to pay penalties if they fail to provide users with the negotiated data availability. They may also face penalties for poor network performance such as high response times to users’ requests. As a result, it is important to improve the quality of cloud services to promote both user satisfaction and cloud profitability.

In distributed storage systems, each node can handle at most a limited number of requests simultaneously. If a request is sent to a node with a high workload, the request may be delayed until the previous requests are completed [6]. Well-designed strategies distribute users’ requests among storage nodes to decrease users’ response times. In this paper, we consider the response time to be the interval between the request initiation and the time the requesting node receives the response to that request [8]. One such strategy is to use well-balanced placement of replicated data on the storage nodes. This usually results in balanced network node workloads, which in turn improves network performance [7].

Data replication strategies aim to improve data availability by creating identical copies of data, and placing them in distributed geographical locations. In this way, requested data can be accessed from any data replica [1, 2].

Data replication strategies have to address four main challenges [9]:

  1. When is the appropriate time to replicate data? If the number of requests for data access increases prior to the creation of data replicas, this can result in increased response times. Thus, the replication process must be completed before network performance degrades.

  2. Which data need to be replicated? Replicating data that are not frequently accessed has no positive impact on network performance and wastes storage space. Although it is recommended that all data have at least one replica as a backup to avoid data loss due to storage-node failures, data with high access rates should be targeted for replication.

     Selecting data for replication is mostly based on prediction and popularity techniques. In prediction-based methods, the data that are forecast (using locality properties) to be accessed by a majority of users in the near future are replicated. In popularity-based methods, data that had the highest access rates in the past are replicated [10].

  3. How many replicas should be created? The number of copies used to reduce response time should be balanced against the need to minimize the required storage space.

  4. Which nodes should be selected to store the replicated data? The nodes used to store replicated data are mostly chosen based on characteristics such as available storage space and distance from users.

Depending on how they address the challenges listed, data replication strategies can be grouped into two main categories: (1) Static replication strategies are executed the same way regardless of changes in network conditions. Despite their ease of implementation, static strategies are rarely used in dynamic architectures such as cloud storage systems; (2) Dynamic replication strategies consider network conditions such as the status of storage nodes. Despite their higher computational costs, dynamic strategies increase network performance more than static ones [11].

Note that replication strategies can replicate data fully or partially. While all the data are replicated in full-replication strategies, partial-replication strategies only focus on high demand data to reduce related storage costs [10].

Recently, Peer-to-Peer (P2P) architecture has been considered an attractive alternative for cloud storage systems. In this type of architecture, the system is made up of independent nodes connected via the Internet. P2P cloud storage systems can offer advantages over traditional client–server architectures. For example, if a node fails to respond to a user’s request, other nodes holding the same data can respond to that request. However, a well-positioned node in a P2P structure may fail to serve any requests due to such features as low processing capacity. On the other hand, a node may be hard to access if it is not located in a proper structural position in the network, even if it has optimal feature characteristics. As a result, a replication strategy in a P2P storage system must consider both the features and the structural status of nodes.

In this paper, we propose a full dynamic popularity-based replication strategy for P2P cloud storage systems that increases data availability and improves network performance in terms of load balancing and response time. Our strategy aims to select the most popular data for replication and to determine the proper nodes for storing the popular data. Additionally, the proposed strategy employs both the feature and structural status of nodes to select where the popular data are stored.

The outline of this paper is as follows: A literature survey is given in Sect. 2, followed by our proposed methodology in Sect. 3. Simulation results are presented in Sect. 4, and conclusions and future work are discussed in Sect. 5.

2 Literature Review

In a P2P architecture, the nodes, i.e., peers, are built from users’ own computer systems and act independently, in such a way that each node can act as a server or a client [12, 13].

To model the P2P architecture, undirected connected graphs are considered. We denote the P2P graph as \(G(P, E)\), where \(P\) is a non-empty set of graph nodes and \(E\) is a set of links connecting the nodes [14].

In contrast to the client–server architecture, there is no single point of failure in P2P structures, because all nodes can act as servers. A P2P network can also be easily scaled up by adding new peer nodes at low cost [15, 16]. Additionally, the small-world phenomenon applies to P2P networks, which means that data can be transferred across the network in a reasonable time. In general, in networks exhibiting small-world behavior, the average distance between two nodes, \(\langle d\rangle \), can be estimated using the following equation:

$$\langle d\rangle \approx \frac{\ln N}{\ln \langle k\rangle }$$
(1)

where \(N\) is the number of nodes in the network and \(\langle k\rangle \) is the average node degree.
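As a rough illustration of Eq. (1), the following Python sketch estimates the average distance for a hypothetical network; the example values of \(N\) and \(\langle k\rangle \) are ours and not taken from any measurement in this paper (the degree 6.34 is the Internet-graph average cited in Sect. 4 [15]).

```python
import math

def estimate_avg_distance(num_nodes: int, avg_degree: float) -> float:
    """Estimate the average distance between two nodes in a small-world
    network using Eq. (1): <d> ~ ln(N) / ln(<k>)."""
    return math.log(num_nodes) / math.log(avg_degree)

# Hypothetical example: 10,000 peers with an average degree of 6.34.
print(round(estimate_avg_distance(10_000, 6.34), 2))  # ~4.99 hops
```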

Despite the advantages mentioned, P2P architecture suffers from some drawbacks. It mostly requires a complex management system, since the network is made up of different computer systems. Furthermore, each node can freely and randomly join or leave the network at any time, which negatively impacts data availability [17].

P2P structures can be classified into four categories [18]: unstructured, structured, super-peer, and hybrid P2P networks. In the following, the replication strategies of the different P2P architectures are discussed in detail.

2.1 Replication strategies in unstructured P2P networks

In unstructured P2P networks, the nodes have no information about the locations of other nodes. Consequently, searching for a piece of data is done by continuously forwarding search messages from one node to its neighbors. As a result, successful hits for specific data are not guaranteed, since predefined mappings between data and their locations are not provided [3, 11].

In [19], a zone-based data replication strategy is introduced in which the whole network is divided into sub-zones based on the number of nodes. It is assumed that each node has information about the number of access requests for its local data. In this way, the popularity of the local data is calculated and broadcast to the network. Nevertheless, the access time and the size of the data are not considered. Furthermore, as the nodes have no global information about the network, data may be replicated redundantly in different sub-zones. Finally, neither the feature nor the structural status of nodes is considered for storing the replicated data.

In [20], a data replication strategy for utilizing unused storage space is discussed. When the replication process starts, every node gathers the status of the other nodes in the network to find the history of data transfers. In this way, data are not replicated only on a group of well-configured nodes; instead, nodes with enough space to store the replicated data are selected. However, the structural score of each node is not taken into account in this strategy.

2.2 Replication strategies in structured P2P networks

In structured P2P networks, the mappings between data and their locations are maintained as key-value pairs kept in Distributed Hash Tables (DHTs), which can then be used to locate data faster.

In [21], a replication strategy called CORP is suggested to decrease the load on busy nodes. CORP replicates the data that have high access rates. On the other hand, data that are frequently updated are replicated as little as possible to minimize the cost of ensuring consistency among replicas. In the creation phase, CORP considers both the data popularity and the update frequency of the data. In the placement phase, the created replicas are placed on the nodes with the highest load capacities that are closest to existing replicas of the same data. As a result, the CORP strategy reduces the cost of frequent updates. Nevertheless, storing the replicated data in the same geographical region makes the data vulnerable to failures such as natural disasters. This shortcoming reduces data availability and degrades network performance, especially for requests originating from different geographical areas.

In [22], a distributed data replication model is presented. The replicated data are placed on the nodes where the lowest data-access delay is obtained. Additionally, this model uses a balanced-tree data structure to search for replicated data. Nevertheless, it is assumed that no changes happen in the network, which is unrealistic given the churn behavior of cloud networks.

2.3 Replication strategies in super-peer P2P networks

In the super-peer P2P architecture, the nodes are categorized into groups which are managed by super-peers. Super-peers are nodes having more processing power, bandwidth, and storage space than the others. Each super-peer keeps the group information, such as the nodes’ characteristics and the data stored on them. In this way, the super-peers are able to manage search requests initiated by the nodes within their groups [3].

In [23], an interest-based data replication strategy named SWARM is introduced. In SWARM, the nodes are grouped into colonies using users’ common interests so that data can be shared among them. Each group of nodes is assigned to a super-node that handles the replication process. However, due to its low dynamicity, SWARM does not account for the dynamic characteristics of cloud storage systems.

In [24], a data replication strategy is introduced to provide a predefined level of data availability and minimize the response time. The network nodes are grouped into virtual clusters. Stronger nodes, in terms of processing power and storage capacity, called Cluster Heads (CHs), are selected to manage each cluster. Initially, each CH computes the local data availability, defined as the ratio of the number of replicas in the cluster to the cluster size. Afterwards, global data availability is calculated through communication among the CHs. Then, the data with lower availability are considered rare data and selected for replication. However, this model assumes the nodes have no limitations on their storage space. Moreover, the structural status of the nodes is not considered in this model.

2.4 Replication strategies in hybrid P2P networks

Unstructured P2P networks are rarely implemented in a fully distributed manner. Instead, a combination of the peer-to-peer and client–server models, called hybrid P2P networks, is used. Hybrid models employ centralized servers to monitor data access. For example, in the P2P structure of Skype, when a node searches for data, it sends a message to the centralized servers [12]. The servers then identify the nodes storing the requested data. Finally, the initiating node receives the requested data through direct communication with the storage nodes [3, 18].

In [25], a model named DROPS is studied. DROPS aims to increase network performance and improve the security of users’ data. In DROPS, the users’ data are divided into blocks to increase data security. Nevertheless, fragmentation reduces data availability, since all storage nodes holding any blocks of a data item must be available when it is needed. To select the proper nodes for storing popular data, a centrality metric related to the structural status of the nodes is applied. However, only one replica is created for each fragment, which puts data availability at risk.

In [26], a distributed strategy named DARS is suggested for data replication. Each node keeps the access rate of its stored data. Before a node’s workload surpasses a threshold, the data with the highest number of requests are replicated using fuzzy logic. The replicated data are then placed on neighbor nodes with lower workloads and higher degrees. However, aggregating replicas in one region decreases data availability, because the localization strategy of DARS increases the access time for users in other geographical areas.

Table 1 illustrates a comparison of the discussed data replication strategies in the P2P architecture based on the main criteria.

Table 1 Comparison of data replication strategies in P2P architecture, indicating for each strategy which parameters are and are not considered

3 The proposed replication strategy

In this section, we propose a dynamic data replication strategy using a hybrid P2P architecture for cloud storage systems. In the proposed strategy, any request for accessing a piece of data is handled by super-peer nodes, which send the user the specifications of the replicas holding that data. The super-peers are aware of the information about each data item, such as its size, number of replicas, and access rate in each period of time. As a result, the super-peers are able to calculate the popularity changes for each data item. Note that any updates to data popularity and to the scores of candidate nodes are performed at the start of each fixed period of network activity.

Our proposed strategy consists of two main parts:

  1. Determining the most popular data for replication: The most popular data are replicated to reduce the response time for accessing the data and to distribute the workload among storage nodes. Additionally, any replicated data that are no longer of interest to users are removed to free storage space and use it efficiently.

  2. Determining the proper nodes for storing the most popular data: The best nodes are selected according to the feature specifications and the positional status of all storage nodes in the network.

In the following, each part of the proposed strategy is discussed in detail.

3.1 Determining the most popular data for replication

Generally, different data are not of equal interest to users. In practice, only a small subset of data is of interest to a large subset of users. Thus, it is important to calculate the popularity of each data item and replicate only the data that are of more interest to the users. On the other hand, the pattern of access requests for a particular data item does not remain constant over time. For example, analyzing Google searches on two different programming languages, Perl and Python, from 2004 to 2020 shows that these two programming languages have different access ratios. Perl was the dominant search trend in 2004–2006, but this trend lost its attraction afterwards. In this way, we can consider the data related to the Perl programming language as “cool” data: these data were attractive in the past but are no longer so. Given that most of the recent searches are now focused on Python (compared to Perl), we can consider the data related to the Python programming language as “hot” data.

It is important to identify both the hot and cool data in the replication strategy, because we can replicate the hot data to increase network performance and remove the replicas of cool data to free storage space without degrading performance.

The number of access requests for a piece of data affects its popularity directly. However, the access rate is also related to the total number of requests sent in the network; using the access rate, a more accurate perspective can be obtained for identifying the popularity of data. On the other hand, the amount of storage space used is vital in P2P networks. Thus, we take the size of each data file into account before replication by normalizing it against the smallest data file stored in the network.

Figure 1 shows the algorithm of the proposed strategy for determining the most popular data for replication. The popularity change of the \({i}^{th}\) data file in the \({n}^{th}\) period is calculated in Eq. (2).

Fig. 1
figure 1

Determining the most popular data for replication

$$\Delta {P}^{n}\left({f}_{i}\right)=\begin{cases}+\left(\dfrac{N{R}^{n}\left({f}_{i}\right)}{TN{R}^{n}}\times \dfrac{\mathit{min}\left(Size\left(f\right)\right)}{Size\left({f}_{i}\right)}\right), & N{R}^{n}\left({f}_{i}\right)>0\\[2ex] -\left(\alpha \times \dfrac{TN{R}^{n}-1}{TN{R}^{n}}\right), & \text{otherwise}\end{cases}$$
(2)

where,

  • \(N{R}^{n}\left({f}_{i}\right)\) is the number of access requests for the data file \(i\) in the \({n}^{th}\) period;

  • \(TN{R}^{n}\) is the total number of access requests in the \({n}^{th}\) period for all the data in the network;

  • \(Size({f}_{i})\) is the size of the data file \(i\);

  • \(min\left(Size\left(f\right)\right)\) is the size of the smallest data file in the network;

  • \(\alpha \) is a parameter ranging from 0 to 1; it acts as the file removal rate (see Sect. 4).

In our proposed strategy, if the popularity change of a data file exceeds 1, the data file is considered “hot”. Note that the number of new replicas for a piece of data is determined by the integer part of its popularity change. On the other hand, when the popularity of a data file falls below −1, the data file is considered “cool”. In this case, one of its replicas is eliminated randomly, provided that the availability of the cool data file does not fall below the desired threshold for data availability.
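A minimal Python sketch of this popularity update is given below; the function and variable names are ours, and accumulating the per-period changes into a running popularity value before classifying a file as hot or cool is our reading of this section.

```python
def popularity_change(num_requests: int, total_requests: int,
                      file_size: float, min_file_size: float,
                      alpha: float) -> float:
    """Popularity change of one data file in the current period, Eq. (2)."""
    if num_requests > 0:
        # Requested files gain popularity in proportion to their share of
        # requests, scaled down for larger files.
        return (num_requests / total_requests) * (min_file_size / file_size)
    # Unrequested files lose popularity at a rate controlled by alpha.
    return -alpha * (total_requests - 1) / total_requests

def update_and_classify(popularity: float, delta: float) -> tuple[float, str]:
    """Accumulate the change and classify the file."""
    popularity += delta
    if popularity > 1:
        return popularity, "hot"    # replicate int(popularity) times
    if popularity < -1:
        return popularity, "cool"   # drop one replica if Eq. (3) still holds
    return popularity, "neutral"
```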

A minimum number of replicas for each data file must be determined and kept unchanged. This number is directly related to data availability and to the users’ desired availability for the data files. The minimum required number of replicas for each data file is calculated in Eq. (3):

$${R}_{min}\left({f}_{i}\right)= \lceil\frac{1}{1-A{v}_{th}}*\left(P\left(F\right)+{H}_{off}\right)\rceil$$
(3)

where,

  • \(A{v}_{th}\) is the desired threshold for data availability, expressed as a decimal number;

  • \({H}_{off}\) is the average fraction of time that storage nodes are offline, which is related to the churn behavior of peers in the P2P structure;

  • \(P(F)\) is the average failure probability of the storage nodes, calculated in Eq. (4):

$$P(F)= \frac{\sum_{i=1}^{N}({P(F}_{i}))}{N}$$
(4)

where,

  • \(P({F}_{i})\) is the failure probability of the storage node \(i\);

  • \(N\) is the number of nodes in the network.

Note that the average lifetime of storage nodes, like that of any hardware, is specified by their manufacturers. Thus, the failure probability of the \({i}^{th}\) storage node in a year is calculated in Eq. (5):

$$P({F}_{i})=AF{R}_{i}= 1-\mathrm{exp}\left(\frac{-8760}{MTT{F}_{i}}\right)$$
(5)

where,

  • \(AF{R}_{i}\) is the annual failure rate of node \(i\);

  • \(MTT{F}_{i}\) is the mean time to failure of node \(i\);

  • 8760 is the number of hours in a year [2].

The offline hours, expressed as \({H}_{off}\) in Eq. (3), represent the average fraction of a 24-hour day that storage nodes are offline. \({H}_{off}\) is calculated over all nodes in Eq. (6), where \({H}_{off}^{i}\) is the number of offline hours of node \(i\):

$${H}_{off}= \frac{\sum_{i=1}^{N}(\frac{{H}_{off}^{i}}{24})}{N}$$
(6)

If deleting a replica would reduce the number of replicas of a cool data file below the number calculated in Eq. (3), the replica is not deleted. In this case, the popularity of the cool data file is reset to 0.
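The minimum replica count of Eqs. (3)–(6) can be sketched as follows; this is a hedged illustration in Python with names of our own choosing, assuming per-node MTTF values and daily offline hours are known.

```python
import math

def annual_failure_probability(mttf_hours: float) -> float:
    """Failure probability of one node over a year (8760 h), Eq. (5)."""
    return 1.0 - math.exp(-8760.0 / mttf_hours)

def min_replicas(availability_threshold: float,
                 mttf_hours: list[float],
                 offline_hours_per_day: list[float]) -> int:
    """Minimum number of replicas per data file, Eqs. (3), (4) and (6)."""
    n = len(mttf_hours)
    p_f = sum(annual_failure_probability(m) for m in mttf_hours) / n  # Eq. (4)
    h_off = sum(h / 24.0 for h in offline_hours_per_day) / n          # Eq. (6)
    return math.ceil((p_f + h_off) / (1.0 - availability_threshold))  # Eq. (3)

# Hypothetical example: 100 nodes with an MTTF of 1e5 hours, each offline
# 2 hours per day, and a desired availability threshold of 0.9.
print(min_replicas(0.9, [1e5] * 100, [2.0] * 100))  # -> 2
```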

3.2 Determining the proper nodes for storing the most popular data

It is important to place the created replicas of the most popular data on proper nodes. In our proposed strategy, the proper nodes are determined by calculating two different scores for each node, as follows:

  1. The first score, called the feature score, is calculated from the node’s feature specifications, namely its workload and available storage space.

  2. The second score, called the structural score, considers the position of the storage node in the network. The structural score also takes the other storage nodes in the network and the links between them into account; in other words, the positional status of all storage nodes affects the structural score of each node.

The total score of storage nodes is computed based on the feature and structural scores. This calculation is demonstrated in Eq. (7):

$$ScTotal\left(n\right)=\left[\theta *ScStructural(n)\right]+ [\left(1-\theta \right)*ScFeature\left(n\right)]$$
(7)

where,

  • Parameter θ, called the impact factor, determines the relative significance of the feature and structural scores. A value close to 0 prioritizes the feature score over the structural score, while a value close to 1 prioritizes the structural score over the feature score.
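Equation (7) is a simple convex combination of the two scores; a one-function sketch in Python (names ours) makes the role of θ explicit.

```python
def total_score(theta: float, structural: float, feature: float) -> float:
    """Total score of a storage node, Eq. (7). A theta close to 1 favours
    the structural score; a theta close to 0 favours the feature score."""
    return theta * structural + (1.0 - theta) * feature

# With theta = 0 only the feature score matters; with theta = 1 only the
# structural score matters (hypothetical score values).
print(total_score(0.0, -0.9, 0.8), total_score(1.0, -0.9, 0.8))  # 0.8 -0.9
```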

Figure 2 shows the algorithm of the proposed strategy for determining the proper nodes for storing the most popular data. In every time interval of the network activity, all storage nodes calculate their scores. Each node has all the information needed to compute its feature score, while the structural score of each storage node is computed with the support of the super-peer(s), which hold the global information of the P2P network.

Fig. 2
figure 2

Determining the proper nodes for storing the most popular data

Finally, when the feature and structural scores of the storage nodes are calculated, the total scores can be obtained. In the following, each score is elaborated in detail.

3.2.1 Feature score

In every time interval, the feature score of each storage node is calculated from its feature specifications, as given in Eq. (8):

$$ScFeature(n)=\beta *\left(\frac{{c}_{n}*{a}_{n}}{{\lambda }_{n}*{\tau }_{n}}\right)$$
(8)

where,

  • \(\beta \) indicates the importance of the feature score relative to the structural score;

  • \({\lambda }_{n}\) is the access rate received by the node;

  • \({\tau }_{n}\) is the average response time of the node for the received access requests;

  • \({c}_{n}\) is the number of requests that the node can serve simultaneously; in general, each node can respond to only a limited number of simultaneous requests;

  • \({a}_{n}\) is the ratio of the node’s activity in the current time interval. \({a}_{n}\) is important in the P2P architecture since a storage node can go offline and become inaccessible at times.
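A minimal Python sketch of the feature score in Eq. (8), with parameter names of our own choosing: nodes with more spare capacity and higher activity score higher, while heavily loaded or slow nodes score lower.

```python
def feature_score(beta: float, capacity: float, activity_ratio: float,
                  access_rate: float, avg_response_time: float) -> float:
    """Feature score of a storage node, Eq. (8)."""
    # capacity (c_n) and activity_ratio (a_n) reward capable, available nodes;
    # access_rate (lambda_n) and avg_response_time (tau_n) penalize busy or
    # slow nodes.
    return beta * (capacity * activity_ratio) / (access_rate * avg_response_time)
```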

3.2.2 Structural score

The position of the storage nodes and how they connect with other nodes determine the structural scores. Note that storage nodes are connected via undirected links in an Internet-based network, indicating that the nodes communicate in both directions. Moreover, the distance between two nodes affects their communication quality. Since a shorter connection between two nodes improves network performance, we take the reciprocal of the link length as the weight of each link.

Centrality metrics express the importance of nodes according to their structural status in the network graph [27, 28]: the higher the centrality of a storage node, the more prominent its impact on the network. The choice of centrality metric depends on the network topology. Evidential centrality is a centrality metric that can be applied to weighted undirected graphs [29], which matches our P2P graph. Evidential centrality is based on Dempster–Shafer theory, in which each problem consists of a set of hypotheses, as shown in Eq. (9):

$$U=\{{H}_{1}, {H}_{2}\}$$
(9)

The power set of the problem set stated in Eq. (9) includes all possible states of the hypotheses and is named the frame of discernment [29]. Its members are as follows:

  1. No hypothesis occurs;

  2. \({H}_{1}\) occurs;

  3. \({H}_{2}\) occurs;

  4. Both \({H}_{1}\) and \({H}_{2}\) occur.

The possible hypotheses are demonstrated in Eq. (10):

$${2}^{U}=\left\{\varnothing , \left\{{H}_{1}\right\}, \left\{{H}_{2}\right\}, \left\{{H}_{1}, {H}_{2}\right\}\right\}$$
(10)

Generally, the probability mass function assigns a value between 0 and 1 to each member of the power set. The assigned value indicates the occurrence probability of that member [29]. Furthermore, the mass value of the null member is zero, and the sum of the assigned mass values is equal to 1. The characteristics of the mass function are presented in Eq. (11):

$$m:{2}^{U}\to \left[\mathrm{0,1}\right], m\left(\varnothing \right)=0, \sum_{A\in {2}^{U}}m\left(A\right)=1$$
(11)

In the proposed strategy, we employ the evidential centrality metric to calculate the structural score. Based on Dempster–Shafer theory, the structural score in our strategy is defined using two problems, related to degree and weight. The degree problem uses the number of links connected to each node, while the weight problem uses the sum of the weights of the links connected to each node.

Similar to Eq. (9), the hypotheses \(l\) and \(h\) for the degree problem are defined in Eq. (12):

$${d}_{i}=\{l,h\}$$
(12)

where,

  • The hypothesis \(l\) indicates the probability that the node \(i\) is low-effective in the network based on its degree;

  • The hypothesis \(h\) shows the probability that node \(i\) is high-effective in the network based on its degree.

Similar to Eq. (10), the power set of the degree problem set for the node \(i\) is shown in Eq. (13):

$${2}^{{d}_{i}}=\left\{\varnothing , \left\{l\right\}, \left\{h\right\}, \left\{l,h\right\}\right\}$$
(13)

The probability mass function assigns a value between 0 and 1 (inclusive) to each member of the power set. To this end, max–min normalization is used, as displayed in Eq. (14) [30]:

$${A}_{normal}=\frac{|A-\mathit{min}\left(S\right)|}{\mathit{max}\left(S\right)-\mathit{min}\left(S\right)}*\left(U-L\right)+L$$
(14)

where,

  • \(S\) is the set of numbers;

  • A is a member of the set \(S\);

  • \(U\) is the upper bound of the range;

  • \(L\) is the lower bound of the range;

  • max(S) is the maximum value in the set S;

  • min(S) is the minimum value in the set S.

The lower and upper bounds of the mass function are 0 and 1, respectively. When all members have equal values, the denominator of the max–min normalization becomes 0. To avoid this condition, a small positive value \(\varepsilon \) is added to the denominator. Hence, the modified form of the normalization function is given in Eq. (15):

$${A}_{normal}=\frac{|A-min(S)|}{\mathit{max}\left(S\right)-\mathit{min}\left(S\right)+\varepsilon }$$
(15)

To determine the probability that member \(A\) is a high-effective member in the set, the value of member \(A\) is compared to the minimum value of the set members. In this way, the probability that storage node \(i\) with the hypothesis \(h\) is a high-effective node according to its degree can be calculated using Eq. (16):

$${m}_{di}(h)=\frac{|{k}_{i}-{k}_{m}|}{{k}_{M}-{k}_{m}+\varepsilon }$$
(16)

where,

  • \({k}_{i}\) is the degree of node \(i\);

  • \({k}_{m}\) and \({k}_{M}\) are the lowest and the highest degrees of the nodes in the network respectively.

This is also shown in Eq. (17) and Eq. (18):

$${k}_{m}=min\{{k}_{1},{k}_{2}, \dots , {k}_{N}\}$$
(17)
$${k}_{M}=max\{{k}_{1},{k}_{2},\dots , {k}_{N}\}$$
(18)

To determine the probability of a member’s low-effectiveness, its value is compared to the maximum value in the set, as illustrated in Eq. (19):

$${A}_{normal}=\frac{|A-max(S)|}{\mathit{max}\left(S\right)-\mathit{min}\left(S\right)+\varepsilon }$$
(19)

The probability that a storage node is a low-effective node using its degree is calculated similarly. The value of the hypothesis \(l\) in the degree problem is obtained in Eq. (20):

$${m}_{di}(l)=\frac{|{k}_{i}-{k}_{M}|}{{k}_{M}-{k}_{m}+\varepsilon }$$
(20)

Since the total value of the mass function for a power set should be equal to 1, the value of \({m}_{di}(l,h)\) can be calculated using Eq. (21):

$${m}_{di}\left(l,h\right)=1-[{m}_{di}\left(l\right)+{m}_{di}\left(h\right)]$$
(21)

Additionally, the weight problem of the node \(i\) can be defined using Eq. (22):

$${w}_{i}=\{l,h\}$$
(22)

where,

  • The hypothesis \(l\) shows the probability that node \(i\) is low-effective in the network based on its weight;

  • The hypothesis \(h\) indicates the probability that the node \(i\) is high-effective in the network based on its weight.

Moreover, the problem power set for the weight problem is presented in Eq. (23):

$${2}^{{w}_{i}}=\left\{\varnothing , \left\{l\right\}, \left\{h\right\}, \left\{l,h\right\}\right\}$$
(23)

The probability that node \(i\) is a high-effective node according to its weight is calculated according to Eq. (15). Specifically, the value of hypothesis \(h\) for the weight problem is obtained in Eq. (24):

$${m}_{wi}(h)=\frac{|{w}_{i}-{w}_{m}|}{{w}_{M}-{w}_{m}+\varepsilon }$$
(24)

where,

  • \({w}_{i}\) is the weight of the node \(i\);

  • \({w}_{m}\) and \({w}_{M}\) are the minimum and maximum weights of the nodes in the network, as shown in Eq. (25) and Eq. (26):

$${w}_{m}=min\{{w}_{1},{w}_{2}, \dots , {w}_{N}\}$$
(25)
$${w}_{M}=max\{{w}_{1},{w}_{2},\dots , {w}_{N}\}$$
(26)

The probability that a storage node such as node \(i\) is low-effective in the network based on its weight is calculated in Eq. (27):

$${m}_{wi}(l)=\frac{|{w}_{i}-{w}_{M}|}{{w}_{M}-{w}_{m}+\varepsilon }$$
(27)

Using Eq. (11), the total value of the mass function in each power set problem should be equal to 1. Hence, the value of \({m}_{wi}(l, h)\) is calculated in Eq. (28):

$${m}_{wi}\left(l,h\right)=1-[{m}_{wi}\left(l\right)+{m}_{wi}\left(h\right)]$$
(28)

The probability that a node is high-effective in the network is obtained by combining the hypotheses \(h\) of the degree and weight problems. This is done by applying the Dempster rule of combination given in Eq. (29). The Dempster rule of combination for our hypotheses is also shown in Table 2.

Table 2 The probability mass functions based on the Dempster rule of combination
$${m}_{i}\left(h\right)=\frac{1}{1-k}\sum_{A\cap B=\left\{h\right\}}{m}_{di}\left(A\right)\cdot {m}_{wi}\left(B\right),\quad k=\sum_{A\cap B=\varnothing }{m}_{di}\left(A\right)\cdot {m}_{wi}\left(B\right)$$
(29)

The probability that the node \(i\) is low-effective in the network is calculated by combining the hypotheses \(l\) of the degree and weight problems, as shown in Eq. (30):

$${m}_{i}\left(l\right)=\frac{1}{1-k}\sum_{A\cap B=\left\{l\right\}}{m}_{di}\left(A\right)\cdot {m}_{wi}\left(B\right),\quad k=\sum_{A\cap B=\varnothing }{m}_{di}\left(A\right)\cdot {m}_{wi}\left(B\right)$$
(30)

Finally, the structural score of the storage node \(i\) is calculated based on the evidential centrality given in Eq. (31):

$$ScStructural\left(i\right)={m}_{i}\left(h\right)-{m}_{i}(l)$$
(31)

In the following, the structural score is calculated using the proposed strategy for a sample network, illustrated in Fig. 3.

Fig. 3
figure 3

A sample network

As discussed, the minimum and maximum degrees of this sample network are obtained using Eq. (17) and Eq. (18), respectively:

$${k}_{m}=min\left\{1, 2, 3\right\}=1$$
$${k}_{M}=max\left\{1, 2, 3\right\}=3$$

The probability that the storage node A is a high-effective node in the network based on its degree is calculated using Eq. (16).

$${m}_{dA}\left(h\right)=\frac{\left|1-1\right|}{3-1+0.01}=0$$

Using Eq. (20), the probability that node A is a low-effective storage node according to its degree is shown below:

$${m}_{dA}\left(l\right)=\frac{\left|1-3\right|}{3-1+0.01}=\frac{2}{2.01}$$

Hence, the value of \({m}_{dA}(l,h)\) according to Eq. (21) is illustrated below:

$${m}_{dA}\left(l,h\right)=1-\left[0+\frac{2}{2.01}\right]=\frac{0.01}{2.01}$$

In the proposed strategy, the weights of the links are specified as: \(AB=\frac{1}{10}\), \(BC=\frac{1}{40}\), \(BD=\frac{1}{10}\), and \(CD=\frac{1}{25}\).

Furthermore, the normalized link weights are \(AB=4\), \(BC=1\), \(BD=4\), and \(CD=1.6\), so the node weights (the sums of the normalized weights of the links connected to each node) are A = 4, B = 9, C = 2.6, and D = 5.6. Using Eq. (25) and Eq. (26), the minimum and maximum node weights are obtained as follows:

$${w}_{m}=min\left\{2.6, 4, 5.6, 9\right\}=2.6$$
$${w}_{M}=max\left\{2.6, 4, 5.6, 9\right\}=9$$

Using Eq. (24), the probability that node A is a high-effective node in the network based on its weight is computed as follows:

$${m}_{wA}\left(h\right)=\frac{\left|4-2.6\right|}{9-2.6+0.01}=\frac{1.4}{6.41}$$

Using Eq. (27), the probability that storage node A is low-effective according to its weight is calculated as follows:

$${m}_{wA}\left(l\right)=\frac{\left|4-9\right|}{9-2.6+0.01}=\frac{5}{6.41}$$
$${m}_{wA}\left(l,h\right)=1-\left[\frac{1.4}{6.41}+\frac{5}{6.41}\right]=\frac{0.01}{6.41}$$

Using the Dempster rule of combination given in Eq. (29), Eq. (30), and Table 2, we obtain \({m}_{A}\left(h\right)=0.0014\) and \({m}_{A}\left(l\right)=0.9983\). Consequently, the structural score of node A is obtained as follows:

$$ScStructural\left(A\right)=0.0014-0.9983=-0.9969$$

This negative value shows that the structural status of storage node A is unsuitable in the network; in other words, accessing the data stored in node A is not cost-effective for users. Similarly, the structural score of node B is 1.0015. Thus, node B is an important node in the network, which aligns with Fig. 3, where node B acts as a hub. Using Eq. (7) and Eq. (8), the total scores of nodes A and B are shown in Table 3.

Table 3 The scores of nodes A and B related to the network shown in Fig. 3

As shown in Table 3, if the parameter \(\theta \) gets close to 0, the total score of node A rises above that of node B, because the feature score of node A is higher than that of node B. Conversely, as the parameter \(\theta \) increases, node B is prioritized over node A.

In our proposed strategy, the parameter \(\theta \) controls the relative impact of the feature and structural scores of storage nodes. Hence, the value of the parameter \(\theta \) can be adjusted flexibly according to the network conditions.
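Before moving to the evaluation, a hedged Python sketch of the structural-score computation of Sect. 3.2.2 (Eqs. (15)–(31)) is given below, applied to the sample network of Fig. 3. The constant EPS = 0.01 and the data structures are ours; the small numerical difference from the paper’s −0.9969 for node A is due to rounding.

```python
EPS = 0.01  # small constant used in the worked example of Sect. 3.2.2

def masses(value: float, lo: float, hi: float) -> dict:
    """Mass function of one problem (degree or weight), Eqs. (15)-(21):
    m(h) compares the value to the minimum, m(l) to the maximum, and
    m(l,h) carries the remaining mass."""
    m_h = abs(value - lo) / (hi - lo + EPS)
    m_l = abs(value - hi) / (hi - lo + EPS)
    return {"h": m_h, "l": m_l, "lh": 1.0 - (m_h + m_l)}

def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination, Eqs. (29)-(30)."""
    conflict = m1["h"] * m2["l"] + m1["l"] * m2["h"]
    norm = 1.0 - conflict
    m_h = (m1["h"] * m2["h"] + m1["h"] * m2["lh"] + m1["lh"] * m2["h"]) / norm
    m_l = (m1["l"] * m2["l"] + m1["l"] * m2["lh"] + m1["lh"] * m2["l"]) / norm
    return {"h": m_h, "l": m_l}

def structural_score(degree: float, weight: float, degrees, weights) -> float:
    """Evidential-centrality structural score of one node, Eq. (31)."""
    m_d = masses(degree, min(degrees), max(degrees))
    m_w = masses(weight, min(weights), max(weights))
    m = combine(m_d, m_w)
    return m["h"] - m["l"]

# Sample network of Fig. 3: node degrees and node weights (the sums of the
# normalized link weights AB=4, BC=1, BD=4, CD=1.6).
degrees = {"A": 1, "B": 3, "C": 2, "D": 2}
weights = {"A": 4.0, "B": 9.0, "C": 2.6, "D": 5.6}
score_a = structural_score(degrees["A"], weights["A"],
                           degrees.values(), weights.values())
print(round(score_a, 4))  # ~ -0.9972, close to the paper's -0.9969
```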

4 Model evaluation

To evaluate the proposed strategy, we have used the CloudSim simulation toolkit. Table 4 shows the configuration specifications of the virtual machines used in the simulation.

Table 4 The specifications of virtual machines

The configuration specifications of the storage nodes are given in Table 5.

Table 5 The specifications of storage nodes

In CloudSim, users’ access requests are sent as cloudlets in the network. The configuration specifications of cloudlets are shown in Table 6.

Table 6 The specifications of cloudlets

The number of data files is chosen randomly in this simulation, with file sizes ranging from 1 to 1024 MB. To simulate the users’ access rate for each data file, a normal distribution is used.

We compared our proposed strategy with DROPS [25], since DROPS is based on a hybrid P2P architecture. Additionally, a random placement strategy, which places the created replicas on randomly chosen storage nodes [22], is used for comparison.

Figure 4 compares the average number of replicas versus the time interval for the proposed strategy, DROPS [25] and random placement strategy. Note that, the unavailability probability of the storage nodes in the proposed strategy is set to 0.3.

Fig. 4
figure 4

The average number of replicas versus the network activity in the proposed strategy compared to DROPS [25] and random placement strategy

As shown in Fig. 4, the storage space consumed by the proposed strategy is higher than in the other models. This is because the proposed strategy considers the users’ offline hours and the nodes’ failure probability, which matches real conditions given the churn behavior of P2P network users. Additionally, DROPS [25] and the random placement strategy generate a fixed number of replicas for all data, which is not appropriate for the dynamic nature of a cloud network.

For the proposed strategy, the average number of replicas versus the file removal rate, i.e., the parameter α in Eq. (2), is illustrated in Fig. 5.

Fig. 5
figure 5

The average number of replicas versus the file removal rate (α in Eq. (2)) in the proposed strategy

Note that \(PF\) denotes the failure probability of storage nodes in Eq. (4) and \({H}_{off}\) is the average ratio of node offline hours per day, as described in Eq. (6). The sum of these two parameters gives the unavailability probability of the storage nodes. In other words, Fig. 5 also shows how the number of replicas varies with the unavailability probability of the nodes.

As shown in Fig. 5, for a constant value of the parameter α, the average number of replicas increases as the unavailability of the storage nodes increases.

For example, if the parameter α is 0.2 and the sum of \(PF\) and \({H}_{off}\) is 0.1 or 0.5, the average number of replicas of the data files is 4.4 or 11.9, respectively. Moreover, increasing the parameter α while keeping the nodes’ unavailability probability constant reduces the number of replicas. For instance, if the parameter α is 0.1 or 0.9 and the sum of \(PF\) and \({H}_{off}\) is 0.1, the average number of replicas is 4.3 or 2.9, respectively.

The load balancing of storage nodes versus the parameter θ (in Eq. (7)) using the proposed strategy is shown in Fig. 6. Parameter θ is the impact factor related to the impact of feature and structural scores. When the parameter θ gets close to 1, the structural score of nodes is prioritized over the feature score, and vice versa.

Fig. 6
figure 6

The load balancing of nodes versus the impact factor (θ in Eq. (7)) in the proposed strategy

Note that Std indicates the standard deviation of data access in the normal distribution. Lower values of Std mean that many access requests target a small group of data, whereas a more distributed access pattern results in higher values of Std.

As presented in Fig. 6, the load balancing of nodes is improved by considering both the feature and structural scores of storage nodes in the proposed strategy: regardless of the value of Std, setting the parameter θ strictly between 0 and 1 achieves a more balanced workload than setting it to 0 or 1. For instance, when the access pattern for the data is more distributed (higher Std), load balancing improves as the parameter θ gets close to 0. On the other hand, when a small group of data is popular among many users (lower Std), values of θ close to 1 perform better.

As illustrated in Fig. 6, the feature score has more influence than the structural score for higher values of Std. Alternatively, for lower values of Std, the structural scores of storage nodes have greater impact than the feature score. For example, with the parameter θ set to 0.3 and Std set to 10, the nodes’ workload is at its most balanced state.

If we decrease Std while keeping the parameter θ constant, the workload generally becomes less balanced. For example, if the parameter θ is set to 0.4, the lowest load balancing of the network occurs for Std \(=0.1\).

Note that if the parameter θ is set to 0, only the feature score of the storage nodes is considered, whereas assigning the value 1 to the parameter θ means the feature score is not used at all. Clearly, considering both the feature and structural scores when selecting the nodes to store the replicated data has advantages over using only one of the scores.

Our proposed strategy is flexible in balancing the nodes’ workload: according to the data access rates and the network workload, we can change the parameter θ to reach the desired goals.

For the proposed strategy, the response time versus the parameter θ in Eq. (7) is shown in Fig. 7 for a variety of Std values.

Fig. 7
figure 7

The response time versus the impact factor (parameter θ in Eq. (7)) in the proposed strategy

As presented in Fig. 7, when Std is equal to 10, the lowest response time occurs when the parameter θ is equal to 0.3. Similarly, for other values of Std, the lowest response time is achieved when the parameter θ is strictly between 0 and 1.

A more distributed access pattern for the data results in a lower response time, whereas concentrating accesses on a small group of data increases the response time. As a result, considering both the feature and the structural scores in the proposed strategy improves the network performance in terms of response time. This is attained using the parameter θ, which the proposed strategy can flexibly adjust according to the data access rate.

In Fig. 8, the nodes’ load balancing versus the number of cloudlets is depicted and compared for the proposed strategy, DROPS [25] and the random placement strategy.

Fig. 8
figure 8

The nodes load balancing versus the number of cloudlets in the proposed strategy compared to DROPS [25] and random placement strategy

As shown in Fig. 8, when the number of cloudlets sent to the network is 1500, the percentage of load balancing in the proposed strategy, DROPS [25] and the random placement strategy is 87.4%, 81.8% and 78.4%, respectively.

The proposed strategy improves the network performance by increasing the nodes load balancing, compared to DROPS [25] and the random placement strategy. This is because, only the structural score is considered in DROPS [25]. On the other hand, the random placement strategy does not use any feature or structural status of nodes for selecting the storage nodes.

In Fig. 9, the response time versus the number of cloudlets is illustrated for the proposed strategy, DROPS [25] and the random placement strategy.

Fig. 9
figure 9

The response time versus the number of cloudlets in the proposed strategy compared to DROPS [25] and random placement strategy

The proposed strategy benefits from a lower response time in comparison with DROPS [25] and the random placement strategy. For instance, when the number of cloudlets is 3000, the average response time in the proposed strategy, DROPS [25] and the random placement strategy is 24.050, 39.360 and 84.550 s, respectively.

In the following, the time complexity of the proposed strategy is evaluated. The first part of the proposed strategy is related to determining the data popularity. If \(m\) denotes the number of data files in the network, the smallest data file is initially found in linear time, i.e., \(O(m)\).

Afterwards, the popularity of each data file is calculated according to Eq. (2), which is done in linear time, i.e., \(O(m)\). Finally, all the data files are sorted based on their popularity, which is accomplished in \(O(m\log m)\) time. Consequently, the total time complexity for calculating the data popularity is \(O(m\log m)\).

The second part of the proposed strategy is related to determining the proper nodes for storing the most popular data. In this process, the lowest and highest degrees of the storage nodes are found using a divide-and-conquer approach. This is performed in \(\left(\frac{3}{2}n-2\right)\) comparisons and has a linear time complexity of \(O(n)\), where \(n\) is the number of storage nodes in the network.
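For illustration, the following Python sketch finds the minimum and maximum of a list with roughly \(\frac{3}{2}n-2\) comparisons by processing elements in pairs; it is a standard technique matching the comparison bound quoted above, though the variable names and the iterative (rather than recursive divide-and-conquer) formulation are ours.

```python
def min_max(values: list) -> tuple:
    """Return (min, max) using about 3n/2 - 2 comparisons."""
    it = iter(values)
    lo = hi = next(it)
    for a in it:
        b = next(it, None)
        if b is None:                       # odd number of remaining elements
            lo, hi = min(lo, a), max(hi, a)
            break
        small, large = (a, b) if a <= b else (b, a)  # 1 comparison per pair
        if small < lo:                      # 1 comparison
            lo = small
        if large > hi:                      # 1 comparison
            hi = large
    return lo, hi

print(min_max([3, 1, 4, 1, 5, 9, 2, 6]))  # (1, 9)
```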

As stated in [15], the average node degree in the Internet network graph is 6.34. Hence, the total number of network links is equal to \(\left(\frac{6.34}{2}n\right)\). In this way, the lowest- and highest-weight links are detected in \(\left(\frac{3}{2}\left(\frac{6.34}{2}n\right)-2\right)\) comparisons.

The feature and structural scores of each storage node are computed in constant time \(C\). Additionally, the storage nodes are sorted according to their total score in \(O(n\log n)\) time. Consequently, the total time complexity for determining the proper nodes for storing the most popular data is \(O(n\log n)\).

Figure 10 illustrates the number of execution cycles versus the number of nodes in the network when determining the proper nodes for storing the most popular data in the proposed strategy.

Fig. 10
figure 10

The number of execution cycles versus the number of nodes in the network when determining the proper nodes for storing the most popular data in the proposed strategy

When only the feature score is considered, i.e., \(\theta =0\), the time complexity decreases significantly. Thus, calculating the structural score of storage nodes takes a considerable amount of time in the proposed strategy.

Figure 11 depicts the number of execution cycles versus the number of storage nodes and data files in the network for the proposed strategy.

Fig. 11
figure 11

The number of execution cycles versus the number of nodes and data files in the network of the proposed strategy

As depicted in Fig. 11, when the number of storage nodes is close to 500, the execution time is independent of the parameter \(\theta \). However, as the number of data files increases, the execution time of the proposed strategy increases linearly and is affected more by the parameter \(\theta \). This is because a high portion of the time cost is related to the computation of the structural scores.

5 Conclusions and Future Work

In this paper, we have proposed a novel dynamic data replication strategy to improve data availability and network performance. In our proposed strategy, data files with the highest access rates are selected for replication. These popular data are then placed on the storage nodes with higher structural and feature scores. The feature score considers the configuration specifications of the storage nodes, while the structural score uses the position of the storage nodes in the network.

Our simulation results show that the introduced scores have significant impacts on network performance. This is because, when both the feature and structural characteristics of the storage nodes are considered, data access is distributed more uniformly in the network. As a result, a more balanced workload among the storage nodes and a lower response time are achieved. Moreover, the proposed strategy has the flexibility to adjust the impact of the feature and structural scores according to the network conditions.

As future work, one could consider node proximity when determining the proper nodes for storing the most popular data, which would capture the structural status of the nodes more precisely. A further improvement to the proposed strategy would be to balance the communication and computation overhead in the storage nodes.