1 Introduction

The first blockchain was conceptualized by Satoshi Nakamoto in 2008 [17]. The goal of blockchain technology is to create a decentralized, open environment for storing information, executing transactions and performing functions [26]. With the increasing popularity of cryptocurrencies such as Bitcoin [7], Ethereum [3] and DCEP (Digital Currency Electronic Payment), proposed by the People’s Bank of China, blockchain technology is attracting growing attention and development.

In blockchain systems, there are two kinds of nodes [27]. One is a full node, which stores a complete copy of the blockchain; the other is a light node, which is merely a client that accesses data from full nodes. One bottleneck of current blockchains is that many nodes that want to participate in the system do not have enough storage space to be full nodes. When there are only a small number of full nodes in the system, its decentralization and security are insufficient.

ElasticChain [12] was proposed to address this bottleneck. In [12], the entire blockchain is fragmented and the fragments are stored in relatively stable nodes. However, the node reliability evaluation method in [12] is very simple: it only checks the integrity of the data in storage nodes and the number of responses of the nodes. These two features cannot fully describe the reliability of nodes. Many other factors affect reliability, such as the performance of a node, the length of time it has stayed in the system, and the amount of data it stores.

Challenges:

Some methods have been proposed to improve the node reliability evaluation in ElasticChain, such as [11]. However, the work in [11] has two shortcomings. First, the calculation of some evaluation features is relatively simple and considers only a small number of factors, so the resulting feature values are unrepresentative. Second, the set of features used to evaluate node reliability is still incomplete; some important features are not taken into account. These two shortcomings greatly reduce the accuracy of classifying reliable nodes. If blockchain data is stored in unreliable nodes, which may suffer from problems such as downtime and excessive latency, data security will be seriously impacted.

Our contributions:

This paper first proposes an optimized data distribution model for ElasticChain. The Extreme Learning Machine (ELM) method [8] is used as the classifier in this model. ELM [8, 28] is a machine learning model. By learning from historical relationships and trends in the data, machine learning methods can produce reliable, repeatable decisions and uncover hidden insights [18]. We use ELM instead of other machine learning methods because ELM classifiers perform well in both training and classification [9] (the details are in Section 3.3).

Second, we design a comprehensive evaluation method for node reliability. In this method, we define five evaluation indicators: the security, trustworthiness, activeness, stability and communication costs of storage nodes. According to this evaluation standard, we can accurately identify truly reliable nodes and save blockchain data in them.

Finally, we conduct extensive experiments on synthetic data to demonstrate the efficiency and effectiveness of the optimized data distribution model.

This paper extends a preliminary work [11] in the following aspects. First, we analyze the basic theory and the advantages of the ELM method in detail, and then propose the optimized data distribution model to further improve the accuracy of reliable node classification. Second, we add new node reliability evaluation features (such as the communication costs between nodes) and redefine the incomplete ones (such as node security and node activeness) in the new optimized model. Moreover, compared with [11], we add two sets of experiments to verify the efficiency of the new model and double the number of experiments demonstrating its effectiveness.

Paper organization

The remainder of the paper is organized as follows. Section 2 reviews the related work on the technique and application background of blockchain. Section 3 introduces the ElasticChain model and ELM. Section 4 introduces the architecture of the optimized data distribution model and the strategies of feature selection. Section 5 reports experimental evaluation. Finally, conclusions are presented in Section 6.

2 Related work

In this section, we review some related work on the technique and application background of blockchain.

In [21], the research directions in blockchain data management and analytics are detailed. Four topics are mentioned: leveraging the existing capabilities of mature data and information systems, enhancing data security and privacy assurances, enabling analytics services on blockchain as well as across off-chain data, and making blockchain-based systems active-oriented and intelligent. In [4], BlockBench, the first evaluation framework for analyzing private blockchains, is proposed. It serves as a fair means of comparison for different platforms and enables a deeper understanding of different system design choices. The results on Ethereum, Parity and Hyperledger Fabric demonstrate that these systems are still far from displacing current database systems in traditional data processing workloads.

Wang et al. [22], Li et al. [15] and many other models improve the query speed of data in the blockchain. Wang et al. [22] present ForkBase, a storage engine specifically designed to provide efficient support for blockchain and forkable applications. By integrating the core application properties into the storage, ForkBase not only delivers high performance but also reduces development effort. Li et al. [15] propose an effective model for analyzing Ethereum data, called EtherQL. EtherQL provides highly efficient query primitives, such as range queries and top-k queries.

Xu et al. [25], Kokoris-Kogias et al. [13] and many other systems increase the scalability of the blockchain. Xu et al. [25] describe a consensus-unit-based storage scheme for blockchain systems, called CUB. CUB organizes nodes into units, and each unit stores at least one copy of the blockchain data. It addresses the high storage requirement arising from the wide usage of blockchain on devices such as mobile phones and low-end PCs. Kokoris-Kogias et al. [13] present OmniLedger, a novel scale-out distributed ledger that preserves long-term security under permissionless operation. It ensures security and correctness by using a bias-resistant public-randomness protocol to choose large, statistically representative shards that process transactions.

In terms of node reliability, there is currently no research that fully describes the reliability of blockchain nodes. Therefore, we give an evaluation method combined with machine learning in this paper.

3 Preliminaries

In this section, we introduce preliminary knowledge of ElasticChain and the extreme learning machine, and then give the problem definition.

3.1 Storage node reliability verification in ElasticChain

According to the node reliability verification method [12], nodes in ElasticChain play three roles: the user node, the storage node and the verification node, as shown in Figure 1. A node in the network may take one, two or all three roles at the same time. User nodes are the participants in the blockchain system; blockchain operations, such as transactions, are completed between user nodes. The fragmented blockchain data is stored in storage nodes. The role of the verification nodes is to provide reliable storage nodes for the user nodes.

Fig. 1
figure 1

The nodes in ElasticChain

The process of storage node reliability verification [12] is shown in Figure 2. First, ElasticChain assigns the same initial reliability value to every storage node. Then, the verification nodes check the data in the storage nodes at fixed time intervals. If the data in a storage node is complete, its reliability value remains unchanged; if the data has been modified or lost, the verification nodes reduce the value and store it in the POR (Proofs of Reliability) chain. ElasticChain uses the reliability values recorded in the POR chain as the standard for selecting highly reliable storage nodes. When user nodes apply to store data, the verification nodes provide them with the latest reliability values of the storage nodes, and the user nodes then select the most stable storage nodes to store the block data.

Fig. 2
figure 2

Node reliability verification
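To make this bookkeeping concrete, the following is a minimal sketch of the reliability-value update, assuming a plain dictionary stands in for the POR chain; the initial value and the penalty are arbitrary illustrative numbers and are not specified by [12].

```python
INITIAL_RELIABILITY = 100   # same starting value assigned to every storage node (assumed)

def update_reliability(por_chain, node_id, data_intact, penalty=1):
    """One periodic check by a verification node in ElasticChain [12]:
    keep the reliability value if the stored data is complete,
    otherwise reduce it; the latest value is recorded in the POR chain."""
    value = por_chain.get(node_id, INITIAL_RELIABILITY)
    if not data_intact:
        value -= penalty
    por_chain[node_id] = value
    return value

# User nodes then pick the storage nodes with the highest recorded values.
por_chain = {}
update_reliability(por_chain, "node-1", data_intact=False)
best_nodes = sorted(por_chain, key=por_chain.get, reverse=True)
```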

3.2 Problem definition

In ElasticChain, the reliability verification method for storage nodes U (U = {u1, u2, ..., ui}) is relatively simple. It only considers the integrity of the data in the storage node and the number of responses. These two features cannot fully describe the reliability of nodes.

Threat model

Data in storage nodes can be tampered with, and a malicious node may transmit false information to user nodes. Furthermore, unreliable storage nodes are often offline or drop out of the network frequently. In addition, the network where a storage node is located may have high latency, so the node cannot deliver data to users in time.

Therefore, the features checked in ElasticChain are not sufficient. In this paper, we propose an optimized data distribution model that comprehensively describes node reliability characteristics and accurately classifies nodes based on their reliability.

3.3 Extreme Learning Machine (ELM)

ELM was originally developed for single-hidden-layer feedforward neural networks (SLFNs) and was then extended to “generalized” SLFNs whose hidden layer need not be neuron-like [5, 14]. ELM first randomly assigns the input weights and the hidden layer biases, and then analytically determines the output weights of the SLFN. It can achieve better generalization performance than other conventional learning algorithms at an extremely fast learning speed. Besides, ELM is less sensitive to user-specified parameters and can be deployed faster and more conveniently [9]. In recent years, a lot of research has been done on ELM. In [23], in order to effectively use the information from multiple attributes, an upper integral network is used as a classification system by combining multiple upper-integral classifiers with a single-layer neural network, and the learning mechanism of ELM is used to train the single-layer neural network. In [24], a novel Distributed Extreme Learning Machine (ELM*) based on a distributed MapReduce framework is proposed, which can learn from massive data efficiently in parallel. In [2], an image classification method based on extreme k-means and EELM is proposed; experimental results show that it achieves a superior classification rate compared with other traditional methods. Compared with the naive implementation, the ELM-based implementation achieves much better performance.

The basic steps of ELM, as introduced in [8], are as follows. For N arbitrary distinct samples (xj, tj), where xj = [xj1, xj2, ..., xjn]T ∈ Rn and tj = [tj1, tj2, ..., tjm]T ∈ Rm, standard SLFNs with L hidden nodes and activation function g(x) are mathematically modeled as

$$ \sum\limits_{i=1}^{L}{\beta_{i}}{g_{i}}({\mathbf{x}_{j}})=\sum\limits_{i=1}^{L}{\beta_{i}}g({\mathbf{w}_{i}}\cdot{\mathbf{x}_{j}} + {b_{i}})={\mathbf{o}_{j}} \quad\ \quad\ (j = 1, 2, . . . , N) $$
(1)

where wi = [wi1, wi2, ..., win]T is the weight vector connecting the i th hidden node and the input nodes, βi = [βi1, βi2, ..., βim]T is the weight vector connecting the i th hidden node and the output nodes, bi is the threshold of the i th hidden node, and oj = [oj1, oj2, ..., ojm]T is the j th output vector of the SLFN.

The standard SLFNs with L hidden nodes and activation function g(x) can approximate these N samples with zero error. This means \(\sum \nolimits _{j=1}^{N} || \mathbf {o}_{j} - \mathbf {t}_{j} ||= 0\), i.e., there exist βi, wi and bi such that

$$ \sum\limits_{i=1}^{L}{\beta_{i}}g({\mathbf{w}_{i}}\cdot{\mathbf{x}_{j}} + {b_{i}})={\mathbf{t}_{j}} \quad\ \quad\ (j = 1, 2, . . . , N) $$
(2)

The equation above can be expressed compactly as follows.

$$ {\mathbf{H}}\beta = {\mathbf{T}} $$
(3)

where

$$ {\mathbf{H}}({\mathbf{w}_{1}}, ..., {\mathbf{w}_{L}}, b_{1}, ..., b_{L}, {\mathbf{x}_{1}}, ..., {\mathbf{x}_{N}}) = [h_{ij}]=\left[ \begin{array}{cccc} g({\mathbf{w}_{1}}\cdot{\mathbf{x}_{1}} + {b_{1}}) & g({\mathbf{w}_{2}}\cdot{\mathbf{x}_{1}} + {b_{2}}) & \cdots & g({\mathbf{w}_{L}}\cdot{\mathbf{x}_{1}} + {b_{L}})\\ g({\mathbf{w}_{1}}\cdot{\mathbf{x}_{2}} + {b_{1}}) & g({\mathbf{w}_{2}}\cdot{\mathbf{x}_{2}} + {b_{2}}) & \cdots & g({\mathbf{w}_{L}}\cdot{\mathbf{x}_{2}} + {b_{L}})\\ \vdots & \vdots & \ddots & \vdots \\ g({\mathbf{w}_{1}}\cdot{\mathbf{x}_{N}} + {b_{1}}) & g({\mathbf{w}_{2}}\cdot{\mathbf{x}_{N}} + {b_{2}}) & \cdots & g({\mathbf{w}_{L}}\cdot{\mathbf{x}_{N}} + {b_{L}}) \end{array} \right]_{N \times L} $$
(4)
$$ \beta=\left[ \begin{array}{cccc} \beta_{11} & \beta_{12} & \cdots & \beta_{1m}\\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ \beta_{L1} & \beta_{L2} & \cdots & \beta_{Lm} \end{array} \right]_{L \times m} \quad \text{and} \quad {\mathbf{T}}=\left[ \begin{array}{cccc} t_{11} & t_{12} & \cdots & t_{1m}\\ t_{21} & t_{22} & \cdots & t_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ t_{N1} & t_{N2} & \cdots & t_{Nm} \end{array} \right]_{N \times m} $$
(5)

H is called the hidden layer output matrix of the neural network and the i th column of H is the i th hidden node output with respect to inputs x1, x2, ..., xN. The smallest norm least-squares solution of the above linear system is computed by

$$ \beta = {\mathbf{H}}^{\dagger} {\mathbf{T}} $$
(6)

where H† is the Moore-Penrose generalized inverse of the matrix H. The output function of ELM can then be modeled as follows.

$$ f({\mathbf{x}}) = {\mathbf{h}}({\mathbf{x}})\beta = {\mathbf{h}}({\mathbf{x}}){\mathbf{H}}^{\dagger} {\mathbf{T}} $$
(7)

Given a training set \(\mathcal {N}\) = {(xj, tj) | xj ∈ Rn, tj ∈ Rm, j = 1, 2, ..., N }, an activation function g(wi, bi, xj) and hidden node number L, the pseudo code of ELM [9] is given in Algorithm 1.

figure a
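As a concrete illustration of these steps, the following is a minimal NumPy sketch of ELM training and prediction: random input weights and biases, the hidden layer output matrix H, the output weights β = H†T as in (6), and the output function (7). The sigmoid activation and the function names are our own illustrative choices, not part of Algorithm 1.

```python
import numpy as np

def elm_train(X, T, L, rng=None):
    """Train a basic ELM: X is an N x n input matrix, T is an N x m target matrix,
    L is the number of hidden nodes."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(L, n))    # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)         # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # hidden layer output matrix, g = sigmoid
    beta = np.linalg.pinv(H) @ T               # smallest-norm least-squares solution (6)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Output function f(x) = h(x) * beta, as in (7)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```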

We use the ELM method instead of other machine learning methods in this paper because, compared to other machine learning methods, ELM classifiers achieve higher training and classification performance [9]. For example [6], the testing accuracy of ELM is 99.14% on the MNIST OCR dataset, which is 0.27%, 0.09%, 0.54% and 0.42% higher than Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), Stacked Auto Encoders (SAE) and Stacked Denoising Auto Encoders (SDAE), respectively. The training time of ELM is 281.37 s, while the training time of the other methods is more than 17 hours. On the 3D Shape Classification dataset, the testing accuracy of ELM is 81.39%, which is 4.07% higher than the Convolutional Deep Belief Network (CDBN) method; the training time of ELM is 306.4 s, while CDBN needs more than two days.

Moreover, unlike deep learning, which requires intensive tuning of multiple hidden layers and hidden neurons, ELM theory shows that hidden neurons are important but need not be tuned (for both single-hidden-layer and multi-hidden-layer feedforward networks) [20]. Therefore, learning in ELM can be performed without iteratively tuning the hidden neurons, which is one of the reasons ELM is efficient.

In addition, unlike other traditional efficient models, ELM supports online sequential learning [16]. When new data is generated, traditional models need to put the new and old data together and retrain, whereas ELM can retain the previous training experience and train on the new data based on the current experience. Fast training ensures that the model's training data stays complete and up to date, so the results predicted by ELM can be more accurate.

In ElasticChain, fast classification reduces the time for nodes to distribute duplicates when each block is generated and reduces the impact on blockchain system throughput. Thus, ELM is chosen as the classifier in our method.

4 The optimized data distribution model

In this section, we first describe the architecture of the optimized data distribution model. Then we introduce the features used to classify reliable nodes. After that, we propose an algorithm to describe the data distribution process.

4.1 Architecture

Figure 3 shows the architecture of the optimized model. It consists of three modules: the ElasticChain system module, the node feature extraction module and the ELM classifier module.

Fig. 3
figure 3

Architecture of the optimized data distribution model

In the ElasticChain module, the verification nodes check the reliability of the storage nodes at fixed time intervals. There are six inspection results in total, and they are saved in the verification nodes. The verification nodes check the data volume and data integrity in the storage nodes, the number of disconnections and the online time of the storage nodes, the number of times the storage nodes have completed verification work, and the network environment of the system.

In the feature extraction module, we compute five main features of a storage node from the inspection results: security, trustworthiness, activeness, stability and communication costs. These features describe the reliability of a node comprehensively.

Finally, in the classifier module, the reliable storage nodes are identified based on these five features by using ELM, and user nodes then save blockchain data in these reliable storage nodes. In the ELM classifier, some of the nodes are sampled as training data. The sample nodes, which form the input of the classifier module, consist of two kinds of nodes: reliable storage nodes and unreliable storage nodes. The way to create and sample the training data is introduced in Section 4.3.

Next, we describe the feature selection process in detail.

4.2 Feature selection

In the ElasticChain system, multiple features affect the reliability of storage nodes, and we choose five important ones in this paper: the security of storage nodes (S), the trustworthiness of storage nodes (TR), node activeness (NA), node stability (NS) and communication costs (CC). Using these five features, the system can correctly evaluate the reliability of nodes in most cases. Admittedly, other features may also have an impact on node reliability in some special environments. This is easy to address, because new features that matter in such environments can be added to the evaluation criteria without changing the structure of the model.

Each of the five features is calculated from two to four related factors. Summarizing several related factors into one feature reduces the input dimension of the classifier, which reduces the computation of the classifier and increases the classification speed.

The storage nodes U (U = {u1, u2, ..., ui}) are inspected at a fixed interval, and the five features are updated after each inspection. When there are i storage nodes in the system, i sets of feature data are generated. Each set has five features, so the data set (DS) of the storage nodes can be expressed as an i × 5 matrix:

$$ {DS}=\left[ \begin{array}{ccccc} S_{1} & TR_{1} & NA_{1} & NS_{1} & CC_{1}\\ S_{2} & TR_{2} & NA_{2} & NS_{2} & CC_{2}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ S_{i} & TR_{i} & NA_{i} & NS_{i} & CC_{i} \end{array} \right]_{i \times 5} $$
(8)

Then, we define the five node reliability features.

Definition 1

(Security of storage nodes (S)) The storage nodes U = {u1, u2, ..., ui}, where i is the number of storage nodes in the blockchain system, keep the duplicates of the blockchain. When some storage nodes attempt to modify the data in the blockchain, these malicious nodes tamper with the data stored locally, which has a great impact on the security of the blockchain data. Another situation is that the blockchain data in a storage node is lost, which causes great difficulty in data recovery. Therefore, the verification nodes V = {v1, v2, ..., vj}, where j is the number of verification nodes in the blockchain system, check the integrity of the blockchain data stored in the storage nodes at fixed time intervals. The inspection results form the security of storage nodes, represented by S, which can be expressed as follows:

$$ S = \sum\limits_{k=1}^{K}({\omega_{k}} - {\mu_{k}}{a^{-(N-n)}}) $$
(9)

where k denotes the k th inspection and K is the total number of inspections. ω is the weight when the data is complete, while μ is the weight when the data has been modified. If the blockchain data is intact in the k th inspection, μk = 0 and ωk = ω; if the data has been modified, μk = μ and ωk = 0. Security is one of the most important evaluation indicators of a blockchain, and we do not tolerate the emergence of malicious nodes; therefore the value of μ is much greater than ω.

Next, we explain the coefficient of μk. Assume a complete blockchain B = {b1, b2, ..., bN}, where N is the total number of blocks in the system, and let n (1 ≤ n ≤ N) be the index of the block in which the modified data resides. The security of a block has an exponential relation with its distance from the latest block [17], so we attach the coefficient a−(N−n) to μk (here, a > 1). When the modified data is near the latest block, it is easier to modify; thus, if newer data is modified in a storage node, we give μk more weight as a punishment and significantly reduce the security of this node.
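For illustration, the security score in (9) could be computed as sketched below; the inspection-record layout is a hypothetical one in which each failed inspection carries the index n of the modified block, and the default parameter values are arbitrary.

```python
def security_score(inspections, omega=1.0, mu=10.0, a=1.1, N=1000):
    """Security S of a storage node, following (9).

    inspections: list of (is_complete, n) pairs over K inspections, where n is
    the index of the block holding the modified data when is_complete is False.
    """
    S = 0.0
    for is_complete, n in inspections:
        if is_complete:
            S += omega                    # omega_k = omega, mu_k = 0
        else:
            S -= mu * a ** (-(N - n))     # mu_k = mu, penalized more for newer blocks
    return S
```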

Definition 2

(Trustworthiness of storage nodes (TR)) In ElasticChain, the user nodes are like light nodes that do not store the blockchain data. Operations of a user node, such as querying and validating data, must be completed by visiting the storage nodes. Each time a user node generates a transaction, the transaction needs to be verified. The querying process is as follows. First, the user node sends the transaction information to the storage nodes. Then, the result and the validation path of this transaction, obtained by queries among the storage nodes, are returned to the user node. Since the user node saves all block header data, it can verify the authenticity of this result by performing a Merkle check on the path information. The validation results can be regarded as the trustworthiness of storage nodes, represented by TR. It is stored in the verification nodes and can be expressed as follows:

$$ TR = \sum\limits_{p=1}^{P}({\psi_{p}} - {\phi_{p}}) $$
(10)

where P is the total number of times the storage node completes verification requests initiated by the user nodes. ψ is the weight when the user node verifies the returned result successfully, while ϕ is the weight when the result is verified to be false. If the verification succeeds in the p th verification, ϕp = 0 and ψp = ψ; if it fails, ψp = 0 and ϕp = ϕ.

Security (S) and trustworthiness (TR) are two different evaluation indicators: security (S) evaluates whether the data stored locally by a storage node is safe, whereas trustworthiness (TR) evaluates whether the storage node honestly returns the correct data to the user nodes.

Definition 3

(Node activeness (NA)) A storage node ui cannot perform its job well if it spends little time in the network and stores only a small amount of data. The system prefers storage nodes that can store large amounts of data and stay online for a long time. Therefore, we add the node activeness (NA) indicator, which can be expressed as follows:

$$ NA_{i} = \xi_{i}A_{i} \times T_{i} $$
(11)

where i denotes the i th storage node and ξi is the activeness weight of the i th node; we set this weight to 1 in this paper. Ai is the amount of blockchain data in the i th storage node, and Ti is the online time that the i th storage node has stayed in the network. The value of Ai × Ti reflects the total amount of data that the i th storage node makes retrievable in the network over time.

Definition 4 (Node stability (NS))

In ElasticChain, a storage node ui cannot be online at all times. If a storage node is frequently offline, the user nodes cannot get the blockchain data in time, and the security of the blockchain system is reduced. Moreover, even an active node with a long total online time cannot do its job well if its online time is intermittent. Therefore, the number of times a node disconnects from the network affects its stability. The stability (NS) of a storage node ui is given as follows:

$$ NS = \sum\limits_{q=1}^{Q}{\lambda_{q}}f(t_{q}) $$
(12)

where q denotes the q th disconnection from the network and Q is the total number of disconnections. λq is the weight of the q th disconnection, and f(tq) is its time weight, \(f(t_{q}) = \dfrac {k} {t_{q}}\) with k > 0, where tq is the time from the q th disconnection to now. The function f(tq) decreases as tq increases, which means that a disconnection that happened long ago has less effect on the current evaluation.

Definition 5

(Communication costs (CC)) Blockchain technology is based on a P2P network, and communication cost is one of the important indicators for selecting a node to communicate with. The lower the communication costs, the less network load is generated. Many factors affect communication costs; we consider three important ones: the distance between nodes, the link capacity and the network conditions. The communication costs (CCi) for the i th storage node can be expressed as follows:

$$ CC_{i} =\sum\limits_{r=1}^{R}{\rho_{r}} \frac{L_{r}} {R} \times C_{r} \times \sigma_{r} $$
(13)

where R means the storage node ui connects to R user nodes, and ρr is the weight of the r th link; we set this weight to 1 in this paper. Lr is the distance between ui and the r th user node, and the cost increases with distance. Cr is the weight of the network link capacity, which can be evaluated by the Shannon formula [19]; keeping the cost small requires sufficient capacity. σr is the evaluation index of the network conditions. Because there is a large number of such indicators, the evaluation process is not described in detail in this paper; in practice, σr can be given by professional organizations.
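Analogously, the remaining four features defined in (10)-(13) could be computed as sketched below (illustrative data layouts and unit weights; none of these function names comes from ElasticChain). Together with S, they yield one row [S, TR, NA, NS, CC] of the DS matrix in (8).

```python
def trustworthiness(results, psi=1.0, phi=10.0):
    """TR per (10): results holds one boolean per verification, True on success."""
    return sum(psi if ok else -phi for ok in results)

def activeness(A, T, xi=1.0):
    """NA per (11): data volume A times online time T, weighted by xi."""
    return xi * A * T

def stability(disconnect_ages, lam=1.0, k=1.0):
    """NS per (12): disconnect_ages holds t_q, the time since each disconnection."""
    return sum(lam * (k / t_q) for t_q in disconnect_ages)

def communication_cost(links, rho=1.0):
    """CC per (13): links holds one (L_r, C_r, sigma_r) tuple per connected user node."""
    R = len(links)
    return sum(rho * (L_r / R) * C_r * sigma_r for (L_r, C_r, sigma_r) in links)
```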

4.3 Training ELM

After feature extraction, ELM is selected as the classifier to learn the security, trustworthiness, node activeness, node stability and communication cost features. Each storage node responds to different user node requests, and an array of features is generated. This array is used as the input to train the ELM. With the ELM-based classifier, each storage node can be classified into the "reliable" class or the "unreliable" class.
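A possible way to feed the feature array into the classifier, reusing the elm_train/elm_predict sketch from Section 3.3, is shown below; the +1/-1 label encoding, the random stand-in data and the decision threshold are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DS = rng.normal(size=(50, 5))                 # stand-in for real [S, TR, NA, NS, CC] rows
labels = np.sign(rng.normal(size=(50, 1)))    # stand-in labels: +1 reliable, -1 unreliable

W, b, beta = elm_train(DS, labels, L=10)      # 10 hidden nodes, as in Section 5.1
is_reliable = elm_predict(DS, W, b, beta)[:, 0] > 0
```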

4.4 The optimized data distribution model

The pseudo-code of the main program of the optimized data distribution model for ElasticChain based on ELM is shown in Algorithm 2.

First, the weight values (ω, μ, ψ, ϕ, ξ, λ, ρ) are set according to system requirements. Second, the verification nodes V (V = {v1, v2, ..., vd}) visit the storage nodes {u1, u2, ..., ui} at fixed time intervals and record the evaluation data (each ωk and μk; each ψp and ϕp; each Ai and Ti; each tq; each Lr, Cr and σr). Then, the optimized model calculates the feature values of these storage nodes from the evaluation data; the features are the security (S), trustworthiness (TR), activeness (NA), stability (NS) and communication costs (CC). Next, these feature values are fed into the trained ELM classifier, which outputs the node reliability classification. The verification nodes update and record the classification results (reliable node or unreliable node) in the ledger of the POR chain. Finally, when the user nodes H (H = {h1, h2, ..., hg}) generate new data, the verification nodes provide reliable storage nodes for the user nodes to store the new data.

figure b
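To complement the pseudo-code in Algorithm 2, the sketch below restates one verification period in Python form, reusing the feature functions from Section 4.2; the dictionary layout of the recorded evaluation data and the classify callback are illustrative assumptions, not ElasticChain interfaces.

```python
def distribution_round(storage_nodes, classify):
    """One verification period of the optimized data distribution model.

    storage_nodes: list of dicts holding the evaluation data recorded by the
    verification nodes; classify: maps a 5-feature vector to True (reliable)
    or False (unreliable), e.g. a trained ELM classifier.
    Returns the entries written to the POR chain and the reliable node ids.
    """
    por_entries, reliable = [], []
    for node in storage_nodes:
        features = [security_score(node["inspections"]),
                    trustworthiness(node["verifications"]),
                    activeness(node["A"], node["T"]),
                    stability(node["disconnect_ages"]),
                    communication_cost(node["links"])]
        ok = classify(features)
        por_entries.append((node["id"], ok))   # classification recorded in the POR chain
        if ok:
            reliable.append(node["id"])        # candidates for storing newly generated data
    return por_entries, reliable
```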

In the optimized data distribution model, the process by which the user nodes obtain the reliable storage nodes from the verification nodes and the process by which the verification nodes classify the reliable storage nodes through the ELM method are asynchronous. In other words, the verification nodes evaluate the reliability of the storage nodes after each fixed period of time and save the evaluation result locally. When a user node initiates a storage request, the verification nodes provide it with the reliable storage nodes from the evaluation result of the current period. In this way, although the node feature extraction module and the ELM classifier module take some time, they do not affect the working efficiency of the ElasticChain system.

Regarding the efficiency of the node feature extraction module, its time complexity is proportional to the number of storage nodes: when a new storage node is added, six kinds of inspection results of this node need to be extracted. Therefore, the time complexity of the node feature extraction module is O(n), where n is the number of storage nodes.

Regarding the efficiency of the ELM classifier module, the training complexity of the model is the same as that of the ELM model, and the main computational cost comes from calculating the Lagrange multipliers [8]. ELM can obtain the calculation result based on Equation (37) in [8], where HᵀH (size: L × L) is used, and the number of hidden nodes L can be much smaller than the number of training samples.

5 Evaluation

In this section, a series of experiments is conducted to verify the accuracy and efficiency of the optimized data distribution model on synthetic data. Experiments are carried out on a machine with Microsoft Windows 7, an Intel Core i5 CPU at 3.20 GHz and 16 GB memory, using Java JDK 1.6. In the following, the experiment settings are described in Section 5.1, and the experimental results are presented and discussed in Section 5.2.

5.1 Experiment settings

In our experiments, all experimental nodes are created using VMware Workstation 12.5.2. Each node runs Ubuntu 16.04 with 300 MB of memory and 1 GB of hard disk space. We built ElasticChain and the POR chain on top of the open-source Hyperledger Fabric v0.6. The experiments established 10, 20, 30, 40 and 50 nodes, respectively. Every node acts as a storage node, a user node and a verification node at the same time. Therefore, the feature values for a storage node comprise 9, 19, 29, 39 and 49 groups, respectively, and each group has five features.

We assign a value to each weight: ωk = 1, μk = 10, a = 1, ψp = 1, ϕp = 10, ξi = 1, λq = 1, ρr = 1. We set multiple groups of parameters for the storage nodes through the control variable method; the parameters include ωk, μk, ψp, ϕp, Ai, Ti, tq, Lr, Cr and σr. We then calculate the feature values (S, TR, NA, NS, CC) of each storage node by (9), (10), (11), (12) and (13), respectively.

For the dataset used in the experiment, a real dataset would be the best choice. However, there is no real dataset suitable for the experiment. For example, the data in popular public blockchain systems, such as Bitcoin and Ethereum, contains a large amount of block and transaction information, but reliability information about the nodes (e.g. the number of disconnections of a node, the online time of a node, etc.) is hard to find. Therefore, the node features (S, TR, NA, NS, CC) cannot be calculated from these real datasets, and synthetic data is used in this experiment. In the future, if the node features can be obtained from public blockchain systems, the optimized model will be applicable to most blockchain systems.

The synthetic data is obtained by repeatedly testing the running results of the storage nodes under different groups of parameters in ElasticChain. In Fabric v0.6, the system continuously initiates PBFT-based consensus [1] requests. The running results of our test record whether the storage nodes obtain the correct verification information and broadcast it on time in each round of consensus; the broadcast information includes the pre-prepare, prepare and commit messages [1]. If a storage node broadcasts the three messages authentically and promptly, this node is considered reliable in this round of consensus. In the dataset, after multiple rounds of consensus, a storage node that is reliable in more than 90% of the rounds is defined as a reliable node. In other applications, a higher ratio (perhaps 99%, 99.9%, or more) can be used to define reliable nodes according to requirements.
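The labeling rule above can be stated compactly; the per-round boolean representation of the test results is our assumption.

```python
def label_node(round_results, threshold=0.90):
    """round_results: one boolean per PBFT consensus round, True if the node broadcast
    the pre-prepare, prepare and commit messages authentically and on time."""
    return sum(round_results) / len(round_results) > threshold
```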

Next, we divide the data into four groups. Each group contains 50 sets of data and has its own characteristics: 60% of the nodes are reliable in Dataset1 and Dataset2, Dataset3 has more reliable nodes (80%), and Dataset4 has more unreliable nodes (40% reliable nodes). We use Dataset1 to train the ELM classifier, and Dataset2, Dataset3 and Dataset4 to test the performance. We set the number of hidden layer nodes to 10 when using the ELM classifier.

Furthermore, an SVM-based classifier is added to compare performance with the ELM classifier. The SVM-based classifier replaces the ELM classifier in the classifier module of the optimized model and distinguishes reliable storage nodes. We choose a sigmoid kernel function and set the penalty parameter to 10 for the SVM-based classifier.

5.2 Experimental results

We evaluate the accuracy, precision, recall and F1-measure of node reliability evaluation on different datasets (Dataset2, Dataset3 and Dataset4) using the optimized data distribution model (ELM-OM), the SVM-based optimized model (SVM-OM) and the ElasticChain model, with 10, 20, 30, 40 and 50 nodes in the models.

The accuracy of a model can be expressed as follows:

$$ Accuracy = (TP+TN)/(TP+FN+FP+TN) $$
(14)

where TP is True Positive, FP is False Positive, TN is True Negative, FN is False Negative. And the precision of a model can be expressed as follows:

$$ Precision = TP/(TP+FP) $$
(15)

The recall of a model can be expressed as follows:

$$ Recall = TP/(TP+FN) $$
(16)

The F1-measure of a model can be expressed as follows:

$$ \text{F1-measure} = 2 \times \frac{Precision \cdot Recall} {Precision + Recall} $$
(17)
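For completeness, the four indexes in (14)-(17) follow directly from the confusion-matrix counts, as sketched below.

```python
def evaluation_indexes(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-measure from TP, FP, TN, FN counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```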

Experimental results on Dataset2, Dataset3 and Dataset4 are shown in Figures 4, 5 and 6, respectively. We can draw the following conclusions from these figures.

Fig. 4
figure 4

Experimental results on Dataset2

Fig. 5
figure 5

Experimental results on Dataset3

Fig. 6
figure 6

Experimental results on Dataset4

(1) Overall, the first conclusion is that the accuracy, precision, recall and F1-measure of node reliability evaluation in ELM-OM are higher than those in SVM-OM when 10, 20, 30, 40 and 50 nodes exist in the models. The reason is that the ELM method has higher training and classification performance than the SVM method.

The second conclusion is that the accuracy, precision, recall and F1-measure of node reliability evaluation in the ElasticChain model are the lowest, far below those of ELM-OM and SVM-OM. This is because ELM-OM and SVM-OM use the new evaluation strategy, whose five features (security, trustworthiness, activeness, stability and communication costs) describe the reliability characteristics of a storage node in a comprehensive way.

(2) Under the same dataset, except for the 10-node case, the values of the four evaluation indexes of the classification result are relatively low when the number of nodes is small. The accuracy, precision, recall and F1-measure of node reliability evaluation in ELM-OM, SVM-OM and the ElasticChain model become higher as the number of nodes in the models increases. This is because there are few reliability evaluations for a node when the number of nodes is small; as the number of nodes increases, the number of reliability evaluations increases and the classification of reliable nodes becomes more accurate.

(3) When the number of nodes in the models is the same, the accuracy, precision, recall and F1-measure of the three models under Dataset4 are slightly higher than the values under Dataset3 and Dataset2. This is because the proportion of unreliable nodes is large in Dataset4, so the characteristics of the nodes are more obvious and the classification results on Dataset4 are better.

Next, we test the efficiency of the optimized model by measuring the processing time of the optimized data distribution model and ElasticChain when there are 4, 8, 12 and 16 nodes in the systems. Both systems are implemented on Fabric v0.6, and Fabric v0.6 operates normally with at most 16 nodes [4]; therefore, we set the numbers of nodes as above.

Meanwhile, the nodes in both systems are set as reliable nodes. We did not choose unreliable nodes for this test because the job of the optimized model is to select reliable nodes to store the fragmented data; reliable nodes can respond to query requests in a timely manner and ensure that there are enough copies of the data in the entire system.

The job of the optimized model does not affect the consensus process of ElasticChain. The reason is that when the optimized model and ElasticChain produce new blocks, the blocks are confirmed by the PBFT consensus, whose condition for confirmation is that more than two-thirds of the participating nodes are honest. The confirmation of new blocks and the long-term maintenance of data by the storage nodes are two different stages.

In the experiment, we executed the chaincode_example02.go [10] transaction code 930 times and 1860 times, generating 5.00 MB and 10.00 MB of data, respectively. Every 500 KB of data is stored as a shard, and each shard is stored in at least 2 copies. The running times of the optimized model and ElasticChain are shown in Figure 7; timing stops after the new data is verified and written into a block.

Fig. 7
figure 7

The running times of the optimized model and ElasticChain

From Figure 7, we can see that as the number of nodes increases, the running time of the two systems increases linearly. The reason is that a new block needs to be confirmed by more than 2/3 of the nodes in the PBFT consensus, so systems containing more nodes require a longer confirmation time.

Meanwhile, the running times of the optimized model and ElasticChain are almost the same, because there is not much difference in the block generation process between the two systems. However, in practical applications, if the verification nodes classify the storage nodes too frequently and reach a performance bottleneck, block validation may be delayed. Therefore, we should set an appropriate interval for classifying storage nodes rather than blindly setting the interval too small.

6 Conclusion

In our study, we present an optimized data distribution method for ElasticChain, which combines blockchain technology with a machine learning method. This method classifies nodes according to their reliability by using the Extreme Learning Machine (ELM) and distributes the blockchain data to reliable nodes to increase data security. Moreover, we propose a new strategy to extract node reliability in order to evaluate the reliability of nodes fully. It includes five features: the security, trustworthiness, activeness, stability and communication costs of storage nodes. Finally, the experimental results on synthetic data demonstrate the accuracy and efficiency of the optimized data distribution model.

In the future, we will focus on optimized methods for achieving the safe distribution of data and on extracting further features of the storage nodes in ElasticChain to improve the comprehensiveness of node evaluation. Furthermore, we will experiment with multiple machine learning algorithms and nature-inspired algorithms as classifiers; by analyzing and comparing the classification results, we aim to find a more accurate classification method and save the blockchain data in highly reliable nodes.