1 Introduction

The traceability and integrity of data are critical for many data storage systems. Traditional centralized and distributed storage systems face the problem that the data are prone to be tampered by outside or internal attackers. Two techniques, digital signature and digital watermarking are currently widely used to protect data integrity. Using the digital signature, the sender signs the protected data with the tag of the data, and the tag is the hash value of the data with the private key of the sender using the public key cryptographic mechanism. The sender transmits both the data and the tag to the receiver who verifies the integrity of the received data. That is, the receiver calculates the hash value of the received data using the public key of the sender. If the hash value calculated is the same as that received, the verification is successful and the data are not modified; otherwise, the verification failed and the data are modified. Digital watermarking embeds digital markup information in a carrier signal, such as digital multimedia data, by means of signal processing. Such information is usually invisible and can be extracted only through a dedicated detector or a reader algorithm. Digital watermarks may be used to verify the authenticity or integrity of the carrier signal.

However, there are some known attacks to these two techniques of digital signature and digital watermarking, such as the attack from the internal attackers inside the system [1,2,3,4,5]. Blockchain has the characteristics of immutability, decentralization, distributed ledgers, consensus, etc. Compared to the traditional data storage systems, blockchain greatly increases the difficulty of tampering the stored data, and hence blockchain has been applied to data storage to improve the data security [6,7,8,9,10,11].

Blockchain platforms use decentralized consensus to maintain consistency in a distrubted environment, and hence consensus is the sine qua non of blockchain. The state-of-art consensus algorithms, e.g. Proof-of-Work (PoW) [12], Proof-of-Stake (PoS) [13], Delegated PoS (DPoS) [14], Practical Byzantine Fault Tolerance (PBFT) [15], Algorand [16], YAC [17], etc., either require coins (stake) and a large amount of computation or use a centralized point which is in contradiction with the decentralization rationale of blockchain.

A wrong block in the blockchain causes data inconsistency [18]. In addition, a wrong block may potentially cause all the successive blocks to be wrong, which makes the node useless. Studies have been conducted to detect such inconsistency [19,20,21]. The majority rule is often adopted to obtain the correct data. For example, the Gossip data dissemination protocol maintains data consistency among the nodes by transmitting information about consistency to other nodes [22]. Nevertheless, the transmission cost is high due to the large and complex structure of Hyperledger Fabric [23].

In this paper, we achieve data integrity in data storage applications based on blockchain. The data, the hash value of the data and other information are stored as blocks in the blockchain, and the clients can search the data without storing the data. The contributions of this paper are as follows:

  1. (1)

    We investigate the error-correcting and tamper-proofing data storage problem and propose a Rotating multiple random Masters and Error-Correcting data storage framework based on blockchain (RMEC), where the framework consists of Application Layer, Processing Layer, and Infrastructure Layer. We also propose a two-dimension chain data structure to store the data. The two-dimension chain data structure divides the blockchain into two parts: horizontal chain and vertical chain. The horizontal chain is generated to store the basic information of data, and a block is created in the horizontal chain for the data. We maintain a vertical chain for the data. When data are updated, a new block is generated to store the updated data in the vertical chain. All the blocks containing the updated information of the data form a vertical blockchain.

  2. (2)

    We propose a Rotating Multiple Random Sampling (RMRS) consensus mechanism to balance decentralization and communication costs. A portal in the infrastructure layer receives the client’s requests and forwards the requests to the servers for data storage and query. When storing data, the portal randomly selects K servers as temporary master nodes that own the right to generate and broadcast the block copy. The other nodes receive the block copy sent by each temporary master node, compare the K block copies, and keep the copy with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \). In the data query process, K servers are also randomly selected by the portal as the temporary master nodes. After verifying the queried block, the K temporary master nodes send the block data copies to the client. The client accepts the data copy with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \). Each node transmits the hash value of the data instead of the data itself during the consensus.

  3. (3)

    We propose two data tamper-proofing mechanisms (single-block validation mechanism and periodic blockchain verification mechanism) to correct the wrong blocks and ensure data integrity. The single-block validation mechanism has two main functions: verifying the integrity of a block in a local server and correcting the wrong block. With the periodic verification mechanism, each server periodically verifies the correctness and integrity of all blocks in the blockchain and invokes the single-block validation to correct the wrong block.

  4. (4)

    We implement a prototype and conduct analyses on the system performance. The experiments demonstrate that the framework can achieve data tamper-proof and effectively reduce the communication cost.

The rest of the paper is organized as follows. The related work is introduced in Section 2. The proposed framework RMEC and the modules are presented in Sections 3 and 4, respectively. The system performance is analyzed and evaluated in Sections 5. The paper concludes in Section 6.

2 Related work

Two techniques, digital signature and digital watermarking are currently widely used to achieve data integrity. Using the digital signature, the sender signs the protected data with the tag of the data, and the tag is the data hash value calculated with the private key of the sender using the public key cryptographic mechanism. The sender transmits both the data and the tag to the receiver who verifies the received data integrity. That is, the receiver calculates the hash value of the received data using the public key of the sender. If the hash value calculated is the same as that received, the verification is successful and the data are not modified; otherwise, the verification fails and the data are modified. However, the following risks exist using the digital signature: (1) there exist attack methods to some of the public key cryptography; for example, it is possible to use a brute-force attack to find the private key in practice; (2) it is difficult to prevent the attack from the internal attackers inside the system [1].

Digital watermarking embeds digital markup information in a carrier signal, such as digital multimedia data, by means of signal processing. Such information is usually invisible and can be extracted only through a dedicated detector or a reader algorithm. Digital watermarks may be used to verify the authenticity or integrity of the carrier signal. However, there are some concerns with digital watermarking: (1) the location of the watermark in digital data must be imperceptible; (2) digital watermarking is easily lost due to signal distortion incurred by various unintentional or intentional signal processing; (3) editing information is easy to affect or even destroy the watermark; (4) there are some known attacks on digital watermarking, such as stego only attack, known cover attack, known message attack, and so on [2].

The blockchain is a new technology that combines distributed data storage, peer-to-peer (P2P) transmission, consensus mechanism, encryption algorithms, etc. It is a new distributed infrastructure and computing paradigm, which uses the block-chain data structure to store and verify data, exploit distributed node consensus algorithm to generate and update data, and apply cryptography to ensure data transmission and access security. In a narrow sense, the blockchain is a type of chain data structure that combines data blocks in chronological order and distributed ledger using cryptography to guarantee the data integrity [24]. Compared to traditional data storage systems, the nodes in the blockchain system do not need to trust each other or any third party [25,26,27,28], and hence the blockchain greatly increases the difficulty of tampering the data stored in the system. The blockchain achieves data tamper-proof with the following methods.

  1. (1)

    The blockchain uses SECP256K1 signature algorithm to generate public keys for identity authentication [29]. Unlike a signature algorithm based on the mathematical problem that a large prime number product is difficult to decompose, SECP256K1 is founded on the mathematical problem similar to the elliptic curve grounded on a discrete logarithm solution. The mathematical problem cannot be solved theoretically and it can hardly be solved by brute-force either, which guarantees an indispensable modification in the process of information transmission in the blockchain transaction process.

  2. (2)

    The blockchain is also a P2P network without a central node. All the nodes involved in the calculation are connected in a P2P mode, and each node saves a copy of the data block. The consistency of the block replicas can be guaranteed by consensus protocol so as to achieve the fault-tolerance and security of the blockchain system.

  3. (3)

    A block is linked to the previous block via a hash pointer so that one has to recalculate the hash value of all the successive blocks if he wants to modify a block, which increases the computational complexity [24].

The consensus protocol, the core of the blockchain, is a vital approach to preventing the data tamper. The state-of-the-art algorithms are PoW, PoS, DPoS, PBFT, Algorand, YAC, etc. With PoW [12], each block has a corresponding hash value calculated by the SHA256 algorithm, and the nodes achieve the consensus by comparing the speed of the hash value calculation. A node with high computing power has a good chance of working out the hash value and thereby has a high probability to obtain the right to generate the block [30]. Therefore, PoW faces the security risk of excessive concentration of power. Different from PoW, PoS determines the next node which can create the next block via various combinations of random selection and wealth or age [13]. With DPoS [14], a block is created by a trusted user node (trustee) elected by the community; the users vote to elect the trustee, and the users ranked the top 101 in the final vote become the trustee. Algorand has stake-weighted voting by validators and combines the stake values with a cryptographic sortition function to choose the voting committee [16]. In general, PoS, DPoS, and Algorand are used for financial purposes because of the stake.

With PBFT [15], the nodes are sequentially ordered with one node being the master and others referred to as backup nodes. PBFT consensus rounds are called views. The nodes that have the right to generate the blocks are determined by PBFT’s view rotation mechanism and Moore’s calculation formula. All nodes in the system communicate with one another with the goal being that all honest nodes will come to an agreement of the system state using a majority rule. Therefore, PBFT requires a huge communication cost.

The basic process of YAC [17] is as follows: (1) after generating and signing a transaction, the client sends the transaction to a peer; (2) the peer performs stateless validation of the transaction, and relays the transaction to the ordering service; (3) the ordering service generates a proposal containing an ordered list of transactions and sends the proposal to the peers; (4) a supermajority of peers determine whether a block can be generated, where the supermajority is defined to be a number greater than two-thirds of all peers in the network; (5) the peer commits the block to its local block store. YAC reduces the communication cost by using the centralized ordering service which is responsible for creating block proposals. However, YAC has the following concerns: (1) the proposal is generated and broadcast by the ordering service which is a centralized point with a strong assumption that the ordering service is safe and honest; (2) when a user sends a signed transaction to a peer, it is not randomly selected, which increases the possibility of system being compromised.

Some studies apply the blockchain to data storage. A blockchain-based data storage framework was built using Ethereum [31]. To cope with the problem of the slow speed of generating blocks in Ethereum, the system adds a centralized point to convert the data into transactions for packaged storage, while extra processor overhead is demanded and the new centralized point is prone to be attacked. A decentralized data storage model suitable for storing metadata was proposed to improve the security of cloud storage [6]. A decentralized privacy data management method was proposed, which uses the blockchain to control access rights and effectively protects the private data [7]. A secure P2P-type storage scheme was proposed to use the secret sharing and the blockchain; user data are divided into different parts, assigned to different nodes, and the node attacking the other nodes can be identified if the nodes supervise each other [8]. A blockchain-based access control system for cloud storage was proposed; the intelligent contract implemented with Ethereum stores all the information related to security operations in the blockchain, and the access control scheme is based on attribute encryption [9]. A blockchain-based data sharing verification scheme uses the blockchain to record the hash value and other necessary information of data sharing by other people and guarantee that the data that the user receives from a third-party source is the originally uploaded data by comparing the hash value of the corresponding data [10]. A consortium blockchain made of different autonomous vehicle manufacturers was proposed to ensure the authenticity and integrity of firmware updates, without using the centralized third parties to distribute the new firmware updates [32]. A decentralized ride-sharing service system based on the public blockchain was proposed to preserve the anonymity and privacy of the users and the data [33]. A decentralized charging coordination mechanism based on the blockchain was built to avoid a centralized charging coordinator and provide privacy protection [34]. A distributed privacy-preserving smart parking system was proposed to use a consortium blockchain created by different parking lot owners, and the system stores the parking offers on the blockchain to ensure security, transparency, and availability [35].

3 Rotating multiple random masters and error-correcting data storage framework

Our proposed Rotating multiple random Masters and Error-Correcting (RMEC) data storage framework based on blockchain adopts a three-layer hierarchical structure consisting of Application Layer, Processing Layer, and Infrastructure Layer, as shown in Fig. 1. The infrastructure layer includes a portal and N servers interconnected via P2P networks. The portal receives the data access requests and forwards the requests to the servers for processing. Each server establishes and maintains a blockchain for storing data. Each block in the blockchain includes data, the hash of the data, timestamp, user information, other data-related information, and the hash of the previous block. The processing layer consists of four modules, e.g. data upload, single-block validation, periodic blockchain verification, and data query, which realize the functions of data storage, data search, and data verification.

Fig. 1
figure 1

System architecture

The data upload module is responsible for appending the data submitted by the application layer to the blockchain in each server at the infrastructure layer. The single-block validation module is responsible for verifying the information of a specific block maintained in the blockchain in each server and correcting the data in the block if it is wrong. The periodic blockchain verification module verifies the correctness of the blockchain maintained by each server on a periodic basis. The data query module receives the data search request from the application layer, searches the corresponding block and the data in the blockchain, and returns the query result. Example 1 illustrates how the system is used to achieve data integrity for food production, distribution, and sale.

Example 1

Assume a canning factory uses the proposed framework to store the information of the produced cans. After producing a can, the cannery assigns the can an ID, which is uploaded with the other related information, such as cannery, production date, expiration date, etc., through the application layer of the system. The application layer passes the data storage request to the processing layer, which uses the data upload module to send the data to the infrastructure layer. Every sever in the infrastructure layer generates a block to store the data and appends the block to the blockchain maintained by the server. The can goes through some steps during the logistics and distribution process, and the information about each step is uploaded to the system where a new block is created for the uploaded information. Any modification to the information of the can will create a new block, and the blocks before the modification are kept in the system without being changed. The consumers use the can ID to search the relevant information of the can, such as the manufacturer, production date, expiration date, etc., through the data search function provided by the application layer, which gets the query results from the blockchain maintained by the servers and returns the results to the application layer.

The two-dimension chain data structure shown in Fig. 2 is designed to accelerate the data storage and search. The blockchain is divided into two parts: the horizontal chain and the vertical chain. The horizontal chain is generated from the basic information of the data. When a data is added in the blockchain, we create a new block and append the block to the horizontal chain. Each data has its own vertical chain. When the data is updated, we append a new block containing the updated data to the vertical chain. All the blocks containing the updated information of a data form a vertical blockchain. During the data search, instead of traversing the blocks one after another along with the blockchain, we first find the block for the data on the horizontal chain, and then search the specific data update on the vertical chain. Therefore, the number of blocks traversed can be reduced during the data query.

Fig. 2
figure 2

Two-dimension chain data structure

We propose a Rotating Multiple Random Sampling (RMRS) consensus to randomly select multiple temporary master nodes to handle each data access request so as to reduce the communication cost. K temporary master nodes are randomly selected upon the arrival of each client request. When storing data, the portal randomly selects K servers as temporary master nodes that own the right to generate and broadcast the block copy. The other nodes receive the block copy sent by each temporary master node, compare the K block copies, and keep the copy with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \). In the data query process, K servers are also randomly selected by the portal as the temporary master nodes. After verifying the queried block, the K temporary master nodes send the block data copies to the client. The client accepts the data copy with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \). The communication during the consensus process is encrypted using the combination of AES and ECC encryption schemes. The RMRS consensus process is shown in Fig. 3.

  1. Step 1.

    The user sends a data access request to the portal.

  2. Step 2.

    The portal randomly selects K temporary master nodes from all the N nodes in the network and sends the addresses of the K temporary master nodes to the user.

  3. Step 3.

    The user sends specific data access requests to the K temporary master nodes.

  4. Step 4.

    Each temporary master node broadcasts the data or the requests sent by the client to the other N − 1 nodes in the network.

  5. Step 5.

    After receiving K data copies, each node keeps the copy with the most repetitions and sends a response to the K temporary master nodes.

  6. Step 6.

    Each of the K primary nodes receives N − 1 responses, takes the data with the most repetitions as the result, and sends a response to the user.

Fig. 3
figure 3

The process of RMRS

4 Framework modules

The processing layer, the core of framework RMEC, includes the data upload module, the data query module, the single-block validation module, and the periodic blockchain verification module.

4.1 Data upload module

The data upload module receives the data storage request from the application layer, creates a block for the data, and stores the block in the blockchain in each server. After receiving the data storage request, the portal randomly selects K(K ≥ 3) servers as the temporary master nodes for the request and returns the addresses of the selected K servers to the client. The client uploads the data to the K servers which receive the data, generate blocks, and broadcast the hash of the generated block copies to the remaining N − 1 servers in the network. Each server in the network accepts the hash value with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \). The server then requests the generated block from one of the servers whose generated block has the same hash value as the accepted one. After verifying the received block and the last block in the local blockchain, the server appends the block to the blockchain the server maintains. The process of data uploading is depicted in Fig. 4.

  1. Step 1.

    The data upload module sends the client’s data storage request to the portal in the infrastructure layer.

  2. Step 2.

    The portal receives the data storage request, randomly selects K servers in the network as the temporary master nodes, and sends the addresses of the K servers to the client.

  3. Step 3.

    The client uploads the data to the K selected servers after necessary processing. According to different application requirements, the client’s processing of the data may include encryption, generating digital watermarks, etc.

  4. Step 4.

    A new block is generated for the uploaded data by each of the K selected servers, and the hash of the block is broadcast to the remaining N − 1 servers.

  5. Step 5.

    Each server accepts the hash value with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \).

  6. Step 6.

    Each server keeps requesting the generated block from one of the servers whose generated block has the same hash value as the accepted one, and calculating the hash value of the received block until it is the same as the previously received hash value.

  7. Step 7.

    Each server calculates the hash value of the last block in the local blockchain and compares the hash value with the previous hash value in the received block. If they are the same, Step 8 is performed; otherwise, the single-block validation module of the processing layer is executed to correct the last block on the local blockchain.

  8. Step 8.

    Each server appends the received block to the local blockchain.

Fig. 4
figure 4

Data upload module

4.2 Data query module

After receiving the data search request at the application layer, RMEC calls the data query module to complete the data search. Upon receiving the data search request, the portal randomly selects K(K ≥ 3) servers from N servers as the temporary master nodes. Each of the K servers finds the queried block in the server’s local blockchain. After verifying the block, each server sends the hash of the block to the client. The client accepts the hash value with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \) and requests the block from one of the servers whose queried block has the same hash value as the accepted one. The data in the received block is accepted as the query result. The operation of the data query module is illustrated in Fig. 5.

  1. Step 1.

    The data query module sends the data search request to the portal.

  2. Step 2.

    The portal randomly selects K servers from N servers as the temporary master nodes and returns the addresses of the K servers to the client.

  3. Step 3.

    The client sends the search requests to each of the K servers.

  4. Step 4.

    Each of the K servers finds the block corresponding to the search request and calls the single-block validation module to verify the block.

  5. Step 5.

    Each of the K servers sends the hash of the corresponding block after verifying the block. According to different application requirements, the processing of data may include encryption, generating digital watermarks, etc.

  6. Step 6.

    The client accepts the hash value with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \) as the query result.

  7. Step 7.

    The client keeps requesting the block from one of the servers whose block has the same hash value as the accepted one, and calculating the hash value of the received block until the calculated hash value is the same as the previously received one.

Fig. 5
figure 5

Data query module

4.3 Single-block validation module

The single-block validation module performs the integrity verification of the block in the server and corrects the wrong block. After receiving the call for the single-block validation module, the server sends a request to obtain the hash value of the corresponding block copy from each of the remaining N − 1 servers and accepts the hash value with the number of repetition times no less than \(\lceil {\frac {N}{2}} \rceil \). If the accepted hash is different from the hash of the corresponding block in the local server, the server requests the block from one of the servers whose block has the same hash value as the accepted one. The received block is used to update the corresponding local block. The operation of the single-block validation module is shown in Fig. 6.

  1. Step 1.

    Upon receiving a block validation request for a specific block, the server finds the corresponding block in the local blockchain the server maintains.

  2. Step 2.

    The server broadcasts the block validation request for the specific block to the remaining N − 1 servers.

  3. Step 3.

    The server receiving the block validation request returns the hash of the corresponding block in the blockchain maintained by the server.

  4. Step 4.

    The server sending the validation request accepts the hash value with the number of repetition times no less than \(\lceil {\frac {N}{2}} \rceil \).

  5. Step 5.

    The server checks whether the accepted hash is the same as the hash of the corresponding local block. If yes, the validation is successful; otherwise, the validation fails, and Step 6 is performed.

  6. Step 6.

    The server keeps requesting the block from one of the servers whose block has the same hash value as the accepted one, and calculating the hash value of the received block until the calculated hash value is the same as the previously accepted one. The server updates the block in the local blockchain as the received block.

Fig. 6
figure 6

Single-block validation module

4.4 Periodic blockchain verification module

Each server in the infrastructure layer periodically verifies the integrity of all the blocks in the local blockchain, starting from the first block. The operation of the periodic blockchain verification module is shown in Fig. 7.

  1. Step 1.

    The blocks the server maintains are numbered in sequence, starting from 1; denote the last block as the n-th block; initialize m = 1.

  2. Step 2.

    Calculate the hash value of the m-th block as h1, and compare h1 with the previous block hash value, h2, stored in the (m + 1)-th block; if they are the same, the m-th block is correct; otherwise, the m-th block is wrong and Step 4 is performed.

  3. Step 3.

    m = m + 1; if m > n − 1, the verification is completed; otherwise, return to Step 2.

  4. Step 4.

    Invoke the single-block validation module to correct the m-th block, and return to Step 2.

Fig. 7
figure 7

Periodic blockchain verification module

5 Performance analysis and evaluation

5.1 Prototype implementation

We implement a prototype of the proposed framework RMEC as shown in Fig. 8. We use laptops and desktops in LAN to simulate the blockchain servers and the portal in the infrastructure layer. The processing layer is developed based on the combination B/S and C/S modes. In the application layer, we use Java Web to implement the front-end pages. The two-dimension blockchain on each node is implemented via a two-dimension link list. Each block in the horizontal link list maintains the initial main information of a data, and the blocks in the vertical link list contain the subsequent update of the corresponding data. The block information includes data ID, timestamp, user ID, the hash of the data, and previous block hash, etc., as depicted in Table 1. The client can upload or update the data through a simple webpage as shown in Fig. 9, which requires little computation and storage capability.

Fig. 8
figure 8

A prototype of the proposed framework RMEC

Fig. 9
figure 9

Data upload or update webpage

Table 1 Block information

Assume a data has an error in a node in the system, the proposed error-correcting mechanisms can find and correct the error. For example, the data with ID 1 on node 2 has an error and the data is different from the other nodes as shown in Fig. 10. When the user submits a data query request for the data, the data query module in the processing layer will call the single-block validation module to correct the error, and hence the data can be corrected and the client can get the correct data as shown in Fig. 11.

Fig. 10
figure 10

An error of Data 1 on node 2

Fig. 11
figure 11

The error is corrected upon receiving data query request

5.2 Data tamper proof performance

The combination of distributed data storage, encryption algorithms, and consensus mechanism technologies makes the data in the blockchain computationally hard to tamper. The proposed data storage framework RMEC provides tamper-proof during the processes of data uploading, query, and storage.

  1. (1)

    Data tamper-proof during uploading

Assuming the client sends a data storage request, the portal receives the data storage request and randomly selects K(K ≥ 3) servers as the temporary master nodes for the request. The data are uploaded to the K servers. The data are not uploaded through the portal, and hence the attacker cannot acquire or modify the data by attacking the portal. The attacker cannot know the specific servers that need to be attacked in advance since the servers are randomly selected by the portal upon each data storage request. The K servers receive the data, generate blocks, and broadcast the hash of the generated block copies to the remaining N − 1 servers in the network. If there is a hash value with the number of duplicated times no less than \(\lceil {\frac {K}{2}} \rceil \), each server accepts the copy and keeps requesting the generated block from one of the servers whose created block has the same hash value as the accepted one, and calculating the hash value of the received block, until the calculated hash is the same as the previously received one.

  1. (2)

    Data tamper-proof during query

During data query, K(K ≥ 3) servers are randomly selected as the temporary master nodes, each performing the single-block validation to check the corresponding block and correct it if necessary. In addition, the data are transmitted to the client by the K servers directly without going through the portal, and hence the attacker cannot modify the data by attacking the portal. Before transmitting the data to the client, each of the K temporary master nodes may perform necessary processing, such as encryption, generating a digital signature or digital watermark, etc. At the same time, in the data query module, there is a step of calling the single-block validation module to correct the queried block, so the destroyed or falsified data does not affect the correctness of the information obtained by the user. If and only if at least \(\lceil \frac {K}{2} \rceil \) nodes in the master nodes are maliciously controlled, the user will get the wrong data.

  1. (3)

    Data tamper-proof during storage

The data tamper-proof during storage is guaranteed by the characteristics of the blockchain. That is, once a block is generated and added to the blockchain, no block modification or deletion is allowed. Each block in the blockchain is linked to the previous block via the hash. Therefore, if there is any modification on the m-th block, the hash value of the m-th block will be different from the previous hash value stored in the (m + 1)-th block.

When the client uploads data, each server calculates the last block hash value and compares it with the hash value of the previous block stored in the accepted block. If the two hash values are the same, the last block in the blockchain is correct; otherwise, the server performs the single-block validation to correct the wrong block.

Assuming that a server node has not successfully stored data or its data has something wrong, the node will synchronize or correct the data via periodic blockchain verification. Assuming that there are N servers in the system and the periodic time interval is T if the number of compromised servers is less than \(\lceil \frac {N}{2} \rceil - 1\), the error can be corrected by calling the single-block validation module within time T.

Suppose the probability of each server being compromised is p(p ≪ 1), and there are N servers in the system within which there are K temporary master nodes. Assume the total number of compromised nodes is X which follows a binomial distribution, and the number of compromised nodes in the K master nodes is Y. The possibility that the total number of compromised nodes in the current system is x can be calculated by Eq. (1) and the possibility that the current number of malicious nodes in the K temporary master nodes is y can be computed by Eq. (2).

$$ \begin{array}{l} P(X=x) = {C_{N}^{x}}(1-p)^{N-x}p^{x}. \end{array} $$
(1)
$$ \begin{array}{l} P(Y=y) = \frac{{C_{x}^{y}} \times C_{N-x}^{K-y}}{{C_{N}^{K}}}. \end{array} $$
(2)

Assume the nodes are compromised randomly, with each occurrence being independent of the others and following the same distribution. Only when there are at least \(\lceil \frac {K}{2} \rceil \) compromised nodes in the selected K master nodes, an error occurs in the system. The probability of system security can be calculated by Eq. (3). We can guarantee the system security when K ≥ 2pN. When K < 2pN, it is more difficult for the attacker to control at least half of the temporary master nodes with a larger K, and hence a larger K leads to a more secure system than a smaller K.

$$ \begin{array}{l} P(Y \ge \lceil \frac{K}{2} \rceil) = \sum\limits_{i=0}^{\lceil \frac{K}{2} \rceil} \frac{C_{p \times N}^{i} \times C_{(1-p) \times N}^{K-i}}{{C_{N}^{K}}}. \end{array} $$
(3)

5.3 Error-correcting mechanisms

Two data error-correcting mechanisms to ensure the integrity of data blocks are proposed: single-block validation and periodic blockchain verification. The single-block validation module is used in several scenarios, where the module detects and corrects the error. During data uploading, each server verifies the last block and correct it if it is wrong. Therefore, the last block is validated each time a new data is stored in the framework so that the correctness of the blockchain is achieved. During the data query, the temporary master nodes perform the single-block validation to check the corresponding block and correct it.

The main function of the periodic blockchain verification module is to periodically detect errors. If there is an error, the server calls the single block verification to fix it. Each server conducts the periodic blockchain verification to validate all the blocks in the local blockchain. When errors are detected, the server performs the single-block validation to correct the block. The periodic block verification enables the data storage framework to have the self-correcting capability and requires no communication overhead during the check, but only needs to calculate and compare the hash values of the local data.

If a node has not successfully stored the data or its data contain an error, the node invokes periodic blockchain verification. Assuming the periodic time interval is T and there are N servers in the system if the number of compromised servers is less than \(\lceil {\frac {N}{2}} \rceil - 1\), the error can be corrected within time T.

5.4 Performance of consensus mechanism

5.4.1 Performance of system security

Figure 12 shows the performance of system security under a different number of temporary master nodes and network nodes when p = 0.1, which means that the probability of each node to be compromised is 10%. The system security probability is the probability of the state that the system remains uncompromised. It can be observed that the security of the data storage system increases with the increase in the number of temporary master nodes. A big K indicates that it will be difficult for an attacker to control more than half of the temporary master nodes. When K ≥ 2pN, the system can achieve 100% security. The probability of the data storage system being compromised is almost 0 when K = 7 for all N in Fig. 12, which shows that we need only a few numbers of temporary master nodes to ensure the security of the system. A small number of total network nodes also lead to good system security performance. Given p, a small N leads to a small number of compromised nodes, and hence it is difficult to control half of the K randomly selected temporary master nodes.

Fig. 12
figure 12

System security probability w.r.t different number of temporary master nodes for various number of nodes

Figure 13 illustrates the performance of system security in terms of security probability with respect to the number of temporary master nodes and various p (probability of each server being compromised) assuming the number of nodes is 50. In general, a smaller p outperforms a bigger p. A small p indicates that each node in the system is hard to compromise, and hence the system can be more securely protected from the attack. When p = 0.05, five random temporary master nodes lead to almost zero probability of the system being compromised. Since p ≪ 1 in practice, we can achieve system security with a small number of temporary master nodes.

Fig. 13
figure 13

System security probability w.r.t. different number of temporary master nodes for various p

5.4.2 Communication overhead

The proposed framework achieves a good tradeoff between system decentralization and the amount of communication. The data tamper-proof is achieved via the decentralization of blockchain.

In the proposed framework RMEC, we introduce a portal to receive the data access requests from the clients. The portal finds the servers to accommodate the data requests. K temporary master nodes are randomly selected upon each data access request, so there is more than one master node in the system to increase the decentralization performance. The K temporary master nodes are responsible for storing the uploaded data in the blockchains and returning the query data by searching and validating the corresponding block. The data are transmitted from the client to the servers without being forwarded by the portal during the data storage process. The data are sent from the servers to the client without the portal being an intermediate node during the data query process. That is, the data are not handled by the portal. Therefore, by randomly selecting multiple temporary master nodes, we avoid having a node or some specific nodes which have all the rights to operate the data, such as block query and block broadcasting.

We evaluate the proposed consensus RMRS in terms of communication overhead against the state-of-art consensus algorithms of PBFT and YAC. PBFT requires heavy network communication during the pre-prepare, prepare and commit phases since all nodes in the system communicate with one another to achieve system security. (1) Pre-prepare process: Each node broadcasts the data to the entire network. The current master node broadcasts the data to all the nodes. Assume the size of the data is s KB, and the size of the communication overhead, including feedback, signature, timestamp, address and other information, is q KB. The communication cost of the pre-preparing process is (N − 1)(s + q) KB. (2) Prepare process: After each node receives the data, the hash of the new block is calculated and broadcast to the entire network. The broadcast of the block hash value consumes less overhead than the broadcast of the block itself. The communication cost of the preparation process is q(N − 1)2 KB. (3) Commit process: Each node broadcasts a commit message to the whole network after receiving more than 2f same data copies, where f is a tolerable Byzantine node number. The communication cost of the commit process is q(N − 1)2 KB too. Therefore, the total communication overhead of PBFT is (N − 1)(s + q) + q(N − 1)2 + q(N − 1)2 = 2q(N − 1)2 + (N − 1)(s + q) KB.

During the consensus of YAC, the communication cost that the user sends the generated transaction to a peer is s + q KB. The peer consumes s + q KB communication cost to send the transaction to the ordering service. It takes N(s + q) KB communication cost for the ordering service to send the proposal to the peers. The communication consumption of voting across the network and the commit message broadcast is 2Nq KB. Therefore, the total communication overhead of YAC is 2(s + q) + N(s + q) + 2Nq = (N + 2)s + (3N + 2)q KB.

With the proposed consensus algorithm RMRS, K nodes, instead of all the N servers, are randomly selected to participate in the consensus process. Only the hash value of the block is transmitted in the network during the consensus for data uploading, search, and validation. Therefore, the amount of communication is reduced. The communication overhead between the user and the portal is 2q KB. The K temporary master nodes receive K(s + q) KB data and broadcast the hash of the data with the size of Kq(N − 1) KB. In the process of requesting the block, the communication cost is (s + q)(NK) KB. Therefore, the total communication overhead of RMRS is 2q + K(s + q) + Kq(N − 1) + (s + q)(NK) KB.

For example, if N = 5, K = 3, s = 2KB, and q = 1KB, the communication overhead of PBFT, YAC, and RMRS is 44 KB, 31 KB, and 29 KB, respectively.

Figure 14 depicts the communication overhead of PBFT, YAC, and RMRS, under different numbers of temporary master nodes with various numbers of network nodes, assuming s= 10KB and q= 1KB. In general, the communication overhead of RMRS increases as the number of temporary master nodes increases. More temporary master nodes result in more broadcast information in the network, which increases the communication overhead. The communication overhead of PBFT and YAC keeps stable with the change in the number of temporary master nodes. PBFT always consumes more communication overhead than RMRS. All the nodes participate in the broadcast process with PBFT, and hence a big number of network nodes result in large communication overhead. When K = 1, that is, when both RMRS and YAC has only one master node, RMRS outperforms YAC. In most cases when K ≥ 3, YAC performs better than RMRS since YAC uses one ordering service entity to reduce the communication cost, while RMRS depends on K random master nodes to achieve consensus. However, it is hard to guarantee the security and honesty of the centralized ordering service, and the centralized point is in contradiction with the decentralization rationale of blockchain.

Fig. 14
figure 14

Communication overhead under different number of temporary master nodes

Figure 15 illustrates the communication overhead of PBFT, YAC, and RMRS, under various numbers of network nodes with different numbers of temporary master nodes, assuming s= 10KB and q= 1KB. In general, the communication overhead of all the three consensus algorithms increases as the number of network nodes increases, since the three consensus mechanisms require more broadcast communication to enable all the nodes to receive the same data with more network nodes. The communication overhead of the proposed consensus algorithm RMRS increases slowly with the increasing network size, which shows that RMRS is scalable. RMRS achieves much better performance than PBFT, since only part of the network nodes broadcast the data in the network, while all the nodes participate in the broadcast process with PBFT. It can also be observed that we need more communication with more temporary master nodes in RMRS since the communication overhead is closely related to the number of temporary master nodes. In most cases when K ≥ 3, YAC performs better than RMRS. However, when there is only one temporary master node in the network, which is the same as the assumption of YAC having the centralized ordering service, RMRS outperforms YAC. Note that even when K = 1, the temporary master node is chosen randomly, while the ordering service entity in YAC is fixed in the network.

Fig. 15
figure 15

Communication overhead under a various number of network nodes with a different number of temporary master nodes

6 Conclusions and discussion

In this paper, we proposed a three-layer framework to store data in the blockchain and a two-dimensional chain data structure that consists of horizontal and vertical chains. The horizontal chain stores the basic information of the data and maintains a vertical chain for the data. When the data are updated, a new block is generated to store the updated data in the vertical chain. All the blocks containing the updated information of the data form a vertical blockchain. A Rotating Multiple Random Sampling consensus mechanism was devised to balance decentralization and the computation and communication cost. Two error-correcting mechanisms (single-block validation and periodic blockchain verification) were developed to correct the wrong blocks and ensure data integrity. Our experiments demonstrated that the framework could achieve data tamper-proof and effectively reduce the communication cost.

The framework proposed in this paper can ensure the system security with a few numbers of temporary master nodes, assuming that all the nodes have the same probability of being compromised. Note that the number of temporary master nodes depends on both the number of network nodes and the probability of each node being compromised when the probabilities of the nodes to be compromised are different. In the future, we will optimize the number of temporary master nodes and develop a mechanism to adaptively adjust the number of temporary master nodes in different application scenarios to achieve data tamper-proof and efficient system costs. Traditionally, the blocks are generated sequentially due to the consensus process and distributed ledger. The sequential block generation limits the blockchain throughput since all system participants have to wait for the current consensus process to finish so as to start a new consensus for a new block. The proposed consensus mechanism allows different multiple temporary master nodes of different consensus processes to exist simultaneously, and the portal can select the temporary master nodes for multiple concurrent consensus processes. The blocks generated for different vertical chains do not interfere with each other using the proposed two-dimension blockchain structure. Therefore, the proposed consensus mechanism and blockchain structure in the paper make it possible to generate multiple blocks simultaneously, which can increase the system throughput. In our future work, we will discuss how to select the temporary master nodes to enable simultaneous block generation without degrading the system security and data consistency.