1 Introduction

With the continuing development of economic globalization, enterprises are expanding across regions and national borders, and the distributed management of enterprise data has become an inevitable trend (Mengelkamp et al. 2018). Enterprise data volumes grow day by day, and traditional centralized data management places a heavy load on the central node, whose failure can easily paralyze the whole network. More and more enterprises therefore store their data in a distributed manner across sites according to actual business requirements (Huckle et al. 2016). However, distributed data storage makes data cooperation and interaction between sites difficult and brings low communication efficiency, poor reliability, and other problems, which seriously affect the speed and timeliness with which enterprises can acquire data (Wang et al. 2017).

Enterprises are seeking a more effective way to solve the problems of distributed data management (Singh et al. 2019). Blockchain is one of the emerging information technologies supporting the development of management information systems, and it provides a solution for the storage, verification, transmission, and communication of distributed data. In the past 2 years, enterprises have gradually recognized the broad application prospects of blockchain technology in distributed data management and explored its potential in various industries. Currently, there is no widely accepted definition of blockchain (Hussein et al. 2018). In a distributed storage environment, users pay increasing attention to the timeliness of the interaction experience and the reliability of information. However, users’ efficiency is often limited by the low efficiency of data communication among distributed sites. One of the important goals of distributed data management is therefore to improve the efficiency and ensure the reliability of data transmission. A key issue in blockchain research is which communication model and which communication algorithm can effectively organize nodes to conduct transaction verification, thereby shortening the confirmation time of blockchain transactions and improving the transaction processing capacity of the blockchain (Meeuw et al. 2019; Puthal et al. 2018; Chung and Cha 2018).

The computing efficiency and storage space of blockchain will improve as computer hardware is continuously upgraded, whereas improving data transmission efficiency requires a reasonable communication model and an effective data communication algorithm. To optimize communication performance in terms of both communication topology and communication mechanism, this paper first constructs a multi-connection concurrent communication tree model. To improve blockchain communication technology, it then proposes an interactive design research method based on big data rule mining and blockchain communication technology, mainly to solve the problem of optimizing blockchain data transmission performance. Blockchain data transmission paths are planned so that data can be transmitted rapidly, reliably, and fairly in distributed data communication, thus accelerating users’ response speed, enhancing their working efficiency, and improving their experience.

2 Related work

Research on blockchain mainly starts from bitcoin, and the bitcoin system is the first typical application of blockchain technology. Although a prominent bitcoin developer publicly declared the project a failure in 2016, blockchain technology, as the underlying architecture of bitcoin, has in recent years been sought after by governments and the capital market. In 2015, blockchain became the sector attracting the most financing in American venture capital. Hyperledger is an open-source project launched by the Linux Foundation at the end of 2015 to promote blockchain transaction verification (Cocco et al. 2017). In April 2017, Samsung Electronics released a B2B (business to business) digital commercial blockchain platform, which can provide large-scale real-time transactions, detect control information, and conditionally isolate block data to ensure data security.

Blockchain is a technology that rose together with cryptographic digital currencies. Its main feature is a decentralized architecture that does not rely on a third party, which helps unify technical standards and meets the data storage needs of various social scenarios (Chang et al. 2019; Grover et al. 2019; Pustišek and Kos 2018). Blockchain can be regarded as a new technological revolution in the context of big data and the internet. It aims to build a participatory society based on a credit system, and the sharing economy is expected to be one of its best application scenarios. The consensus mechanism is a key component of blockchain: it realizes trust through code design, that is, it determines the block constructor and maintains data consistency throughout the network (Rahman et al. 2019). Different consensus mechanisms can be selected according to the underlying chain, the implemented applications, or the business model of the blockchain.

Blockchain technology was first written into the 13th five-year plan for national informatization issued by the State Council in December 2016. In May 2017, under the guidance of the Ministry of Industry and Information Technology of China, the China blockchain technology and industry development forum published The Reference Architecture of Blockchain and Distributed Ledger Technology, which is the first domestic blockchain standard issued under the guidance of the government. 2017 can be regarded as the first year of blockchain, and the application value of blockchain technology in various industries is being studied all over the world. At the application layer, blockchain technology originated in the financial industry, but it has been rapidly popularized in social management, efficiency improvement of energy technology, and other areas (Xie et al. 2019). The main reason is that centralization causes efficiency problems in these fields, whereas blockchain supports decentralized management of an industry.

In general, blockchain builds a value transmission network. To solve the problems of communication efficiency and reliability in the transmission layer of blockchain, this paper discusses the optimization of blockchain data communication performance, with the aim of effectively organizing blockchain nodes for transaction verification.

3 Structural model description

3.1 Communication structure and advantages

Blockchain usually adopts a linear or star structure for data verification, that is, the accounting node v0 distributes blockchain data to the other nodes in the blockchain for verification, so communication may be paralyzed if the communication source node fails. The algorithm studied in this paper adopts a tree communication structure, which is more suitable for blockchain data forwarding (Zhao et al. 2019). For the verification of a blockchain transaction, all nodes communicate according to the tree structure self-organized by the communication tree algorithm. The tree structure does not change the underlying physical network; it is a virtual logical structure abstracted from that network. Figure 1 shows the physical network topology of blockchain, which is a mesh structure. All nodes in the figure are distributed in geographically dispersed areas and connected by wired or wireless means. Each link in the figure represents an actual wired or wireless connection (Sikorski et al. 2017). In contrast to this physical mesh, the multi-connection concurrent communication tree abstracts all nodes of the blockchain into a tree structure for blockchain transaction verification. It is therefore a virtual logical structure that shields the underlying physical topology. The communication source node is responsible for building the communication tree, so the tree is a temporary self-organizing network. All the nodes participating in the verification jointly maintain the tree structure, and each node can route, store, and forward data (Grover et al. 2019; Muthanna et al. 2019). The edges of the communication tree are logical connections, and the edge between any two nodes ignores network relay devices such as routers, wireless access points (APs), and switches.

Fig. 1 Physical network topology of blockchain

3.2 Concurrent communication tree model

Blockchain adopts a P2P network, and the physical topology of the actual network is a mesh (Cho 2018). The tree structure studied in this paper does not change the physical connections of the network; it only organizes the nodes into a tree at the logical level to conduct blockchain transaction verification. The position of each node in the communication tree and the choice of parent–child relationships, edges, and weights all depend on the communication tree algorithm adopted. Different communication tree algorithms construct communication trees of different shapes, and blockchain communication carried out over different trees performs differently in terms of reliability, efficiency, and fairness. Therefore, multiple performance evaluation indexes should be adopted to evaluate communication algorithms.

In a blockchain, nodes are both providers and consumers of resources or services, and every node has equal status (Yang et al. 2018). Once a concurrent communication mechanism is introduced into blockchain communication, every node that has obtained the data can undertake communication tasks. The basic idea of the concurrent communication mechanism is to assign appropriate communication tasks to nodes, other than the communication source node, that have communication capability, thereby improving communication efficiency through the concurrent execution of communication tasks. As the communication scale expands, the degree of communication concurrency increases significantly and communication efficiency improves greatly (Goranović et al. 2017; Qian et al. 2018; Dubey et al. 2020). If only concurrent transmission is considered, the model is a single-connection concurrent communication tree model; once the heterogeneity of node communication ability is also considered, it becomes a multi-connection concurrent communication tree model, as shown in Fig. 2.

Fig. 2 Single-connection concurrent communication tree model evolves into multi-connection concurrent communication tree model

To describe the differences in forwarding capability between nodes, the concept of the node communication connection number is introduced. Since a node can forward data to multiple other nodes simultaneously, the single-connection concurrent communication tree model evolves into the multi-connection concurrent communication tree model (Kim 2018). The nodes in Fig. 2a become the node clusters in Fig. 2b. The number of nodes in any node cluster is equal to the sum of the communication connection numbers of all nodes in its parent node cluster; for example, the number of nodes in cluster 1 is \( l\left( {v_{0} } \right) \). The node clusters in the lower layers of the communication tree therefore contain a large number of nodes. To exploit the communication ability of the nodes to the maximum extent, each node \( v_{i} \) is required to communicate at full connection, namely the number of nodes it communicates with is equal to its maximum number of communication connections. In addition, each node \( v_{i} \) forwards the data immediately after receiving them, and there is no idle interval in the whole communication process.
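The growth of node clusters can be illustrated with a minimal sketch. It assumes one possible reading of the model: nodes join the tree in list order, each node \( v_{i} \) forwards once to at most \( l(v_{i}) \) children, and a cluster is one layer of the tree. The function name `cluster_sizes` and the example connection numbers are hypothetical and are not taken from the paper.

```python
def cluster_sizes(connections):
    """Return the size of each node cluster (tree layer) below the root.

    connections[i] is the communication connection number l(v_i);
    connections[0] belongs to the communication source node v_0.
    Assumption: nodes are attached in list order and forward only once.
    """
    total = len(connections)
    sizes = []
    placed = 1                 # the root is already in the tree
    start, end = 0, 1          # index range of the current parent cluster
    while placed < total:
        capacity = sum(connections[start:end])
        if capacity == 0:
            raise ValueError("parent cluster cannot forward; remaining nodes are unreachable")
        size = min(capacity, total - placed)
        sizes.append(size)
        start, end = end, end + size
        placed += size
    return sizes


# Root and its three children have 3 connections each; the other nodes have 1.
multi = cluster_sizes([3, 3, 3, 3] + [1] * 12)
# Every forwarding node has 2 connections; four leaves have none.
small = cluster_sizes([2, 2, 2, 0, 0, 0, 0])
```

In the first example, cluster 1 has \( l(v_{0}) = 3 \) nodes and cluster 2 has 3 + 3 + 3 = 9 nodes, matching the rule stated above under this reading.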

4 Parallel clustering based on MapReduce

4.1 System structure of MapReduce

Google first proposed the MapReduce computing model in 2004. It helped Google rebuild its index file system and is one of the three core technologies of Google (the other two are GFS and BigTable) (Saberi et al. 2019). The MapReduce programming model lets developers focus on data operations by encapsulating the details of data distribution, parallelization, fault tolerance, and load balancing in a library, thus making it easier to process large data sets. MapReduce is now widely used for sorting massive data, log analysis, searching for specific patterns in massive data, and other scenarios.

In Hadoop, two types of servers are mainly used to execute MapReduce tasks: one is the JobTracker, also known as the job server, and the other is the TaskTracker, also known as the task server (Hussen 2018). The JobTracker is mainly responsible for managing and scheduling all jobs under this architecture, while the TaskTrackers are mainly used for executing jobs. A Hadoop cluster allows only one JobTracker but may contain multiple TaskTrackers.

In actual operation, each MapReduce task is instantiated as a job, and each job is divided into two phases: the Map phase and the Reduce phase. These two phases are represented by the Map function and the Reduce function, respectively, and developers define tasks by implementing these two functions. The Map function takes a key-value pair <key, value> as input and produces a set of intermediate <key, value> pairs. The MapReduce library then gathers all intermediate values that share the same intermediate key and passes them to the Reduce function, whose input has the form <key, (list of values)>. After processing, the final results are output, again in the form <key, value>. The system structure of MapReduce in Hadoop is shown in Fig. 3.
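The data flow described above can be illustrated with a small in-memory sketch. It is not Hadoop code: `run_mapreduce`, `count_map`, and `count_reduce` are hypothetical names introduced here, the grouping step is simulated with a dictionary, and all data are assumed to fit in a single process.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every <key, value> record, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for mid_key, mid_value in map_fn(key, value):
            groups[mid_key].append(mid_value)      # <key, (list of values)>
    output = {}
    for mid_key in sorted(groups):
        out_key, out_value = reduce_fn(mid_key, groups[mid_key])
        output[out_key] = out_value
    return output


def count_map(key, value):
    # key: offset of the line in the input; value: the text of the line
    return [(word, 1) for word in value.split()]


def count_reduce(key, values):
    return key, sum(values)


records = [(0, "block chain block"), (18, "chain tree")]
result = run_mapreduce(records, count_map, count_reduce)
```

Word counting is used only because it is the smallest task that exercises both phases; any pair of functions with the same input and output forms can be substituted.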

Fig. 3 MapReduce framework

4.2 Parallel K-means algorithm

The K-means algorithm has been shown to support parallel processing of large data sets on the Hadoop platform. The MapReduce computation model is used to redesign the improved L + Cop-K-means algorithm in chapter 3, which requires designing and implementing a Map function, a Reduce function, and a Combine function that processes intermediate results. In this paper, only the allocation rules for data samples in the Map function need to be modified; the Combine function and the Reduce function work in the same way as in the parallel K-means algorithm.

First, K samples are randomly selected from the data set as the central points and stored in HDFS together with prior knowledge as global variables to guide the whole clustering process. The iterative process is mainly completed by Map function, Combine function, and Reduce function.

K-means algorithm

Initialize k cluster centers; update the cluster membership of all sample points by assigning each sample point to the cluster whose center is nearest; recalculate the center of each cluster; repeat the last two steps until the cluster centers no longer change or the maximum number of updates is reached.

Figure a Pseudocode of the K-means algorithm
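The three steps above can be sketched in plain Python as follows. This is a minimal stand-alone illustration of standard K-means, not the paper's pseudocode and not the L + Cop-K-means variant; the function name `kmeans`, the `seed` parameter, and the sample points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_updates=100, seed=0):
    """Plain K-means on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # initialize k cluster centers
    clusters = []
    for _ in range(max_updates):
        # Assign every sample point to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda j: math.dist(point, centers[j]))
            clusters[nearest].append(point)
        # Recalculate each center; an empty cluster keeps its old center.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:             # centers no longer change
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```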

Map function

The input form of Map function is <key, value> key-value pair. Key represents the offset of the current sample data in the entire input file, and value represents the string values composed of the dimensional coordinates of the current sample. First of all, the sample values are parsed from the value, and all samples are divided into appropriate classes according to the allocation rules of stand-alone L + Cop-K-means algorithm. Then <key’, value’> key-value pair is used as the output of Map function. Here, key’ represents the class label after dividing each sample, and value’ still represents the string value of this sample. The pseudocode of Map function is shown below:

Figure b Pseudocode of the Map function

To reduce the amount of communication (mainly data transfer between various nodes) in the iteration process, a Combine operation is required after Map function processing. The output of Map function is combined locally. Since the output of Map function is always stored locally first, the communication cost of executing Combine operation is very low.
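One K-means iteration split into Map, Combine, and Reduce steps might look like the sketch below. It assumes comma-separated coordinates in `value` and uses plain nearest-center assignment in place of the L + Cop-K-means allocation rules, which are not given in this text. The names `kmeans_map`, `kmeans_combine`, and `kmeans_reduce` are hypothetical and do not belong to the Hadoop API.

```python
import math


def kmeans_map(key, value, centers):
    """Map: parse the sample from value and emit <class label, sample string>."""
    point = tuple(float(x) for x in value.split(","))
    label = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    return label, value


def kmeans_combine(pairs):
    """Combine: merge one mapper's local output into per-class (coordinate sums, count)."""
    partial = {}
    for label, value in pairs:
        point = [float(x) for x in value.split(",")]
        sums, count = partial.get(label, ([0.0] * len(point), 0))
        partial[label] = ([s + x for s, x in zip(sums, point)], count + 1)
    return partial


def kmeans_reduce(label, partials):
    """Reduce: merge the partial sums of one class into its new center."""
    total = sum(count for _, count in partials)
    dims = len(partials[0][0])
    center = tuple(sum(sums[d] for sums, _ in partials) / total for d in range(dims))
    return label, center


centers = [(0.0, 0.0), (10.0, 10.0)]
split_a = [(0, "1,1"), (4, "9,9")]        # records held by one mapper
split_b = [(0, "0,2"), (4, "11,11")]      # records held by another mapper
local_a = kmeans_combine(kmeans_map(k, v, centers) for k, v in split_a)
local_b = kmeans_combine(kmeans_map(k, v, centers) for k, v in split_b)
new_centers = dict(kmeans_reduce(j, [local_a[j], local_b[j]]) for j in (0, 1))
```

Only one (sum, count) pair per class leaves each mapper, which is why the Combine step lowers the amount of data transferred between nodes.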

4.3 Blockchain communication performance optimization based on IFT algorithm and MMWT algorithm

The link first communication tree (LFT) algorithm makes full use of nodes with strong communication ability to forward communication tasks and therefore has high communication efficiency: f(t) can reach the optimal solution, but the trust degree of nodes is not considered. If the nodes are instead added to the communication tree in order of trust degree, the algorithm is called the trust first communication tree (TFT) algorithm. Considering the security and stability of the whole communication tree, the TFT algorithm improves the trust of the communication tree by improving the trust of the forwarding nodes, preferentially placing nodes with high trust in the top or upper layers of the communication tree. However, nodes with high trust may have poor communication ability (Khezr et al. 2019). Thus, the TFT algorithm can improve the stability of the communication tree, but it may lengthen the concurrent communication time and lead to low communication efficiency.

The MMWT algorithm makes a comprehensive trade-off between the number of communication connections, the degree of trust, and the communication weight. It selects target nodes and communication paths from the blockchain network to build a communication tree rooted at the accounting node. When constructing the communication tree, to greatly improve communication concurrency, the MMWT algorithm first adds the nodes with a large number of communication connections to the communication tree and then removes nodes with an excessively large communication weight from the top and upper layers of the communication tree, thus reducing the communication time (Zeng et al. 2020).

LFT: In the multi-connection concurrent communication tree model, the logical connections between the nodes in the communication tree become complex as the scale expands, making the communication relationship between nodes hard to define. Therefore, a communication tree representation method is proposed, which transforms the communication tree into a binary tree representation. The time taken for the communication data to travel from the root node to node \( v_{i} \) is called the time at which node \( v_{i} \) obtains the data, recorded as \( f_{{v_{i} }} (t) \). The cumulative communication time of a communication tree with N nodes can be expressed as \( f_{A} (t) \):

$$ f_{A} (t) = \sum\limits_{i = 1}^{N} {f_{{v_{i} }} (t)} $$
(1)

For a communication tree with N communication nodes, the ratio of the cumulative communication time \( f_{A} (t) \) to N is called the average end-to-end delay of the nodes, recorded as \( f_{\text{AED}} (t) \), which can be expressed as:

$$ f_{\text{AED}} (t) = \frac{{\sum\nolimits_{i = 1}^{N} {f_{{v_{i} }} (t)} }}{N} $$
(2)
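Equations (1) and (2) can be evaluated on a small tree as in the sketch below. It assumes that \( f_{{v_{i} }} (t) \) is the sum of the edge delays on the path from the root to \( v_{i} \) and that the sums run over the N nodes that receive data, excluding the root \( v_{0} \). The data layout (`parent` and `weight` dictionaries) and all function names are hypothetical.

```python
def arrival_times(parent, weight):
    """Return f_{v_i}(t) for every non-root node.

    parent maps each node to its parent in the communication tree (the root
    has no entry); weight maps a (parent, child) edge to its delay.
    """
    times = {}

    def time_of(node):
        if node not in parent:          # the root already holds the data
            return 0.0
        if node not in times:
            up = parent[node]
            times[node] = time_of(up) + weight[(up, node)]
        return times[node]

    for node in parent:
        time_of(node)
    return times


def cumulative_time(times):
    """f_A(t): Eq. (1)."""
    return sum(times.values())


def average_end_to_end_delay(times):
    """f_AED(t): Eq. (2)."""
    return cumulative_time(times) / len(times)


parent = {"v1": "v0", "v2": "v0", "v3": "v1"}
weight = {("v0", "v1"): 1.0, ("v0", "v2"): 2.0, ("v1", "v3"): 1.5}
times = arrival_times(parent, weight)
f_a = cumulative_time(times)
f_aed = average_end_to_end_delay(times)
```

Here \( f_{A} (t) = 1.0 + 2.0 + 2.5 = 5.5 \) and \( f_{\text{AED}} (t) = 5.5 / 3 \).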

The edges traversed from the communication root node \( v_{0} \) to a leaf node \( v_{i} \) that receives the data are called the communication branch of the leaf node \( v_{i} \), expressed as \( \text{Route}(v_{i}) \). The number of times each branch forwards the same group of data is called the link pressure of the communication branch, expressed as \( \text{Num}[\text{Route}(v_{i})] \). The average link pressure \( A_{\text{ls}} \) of the communication tree can be expressed as:

$$ A_{\text{ls}} = \frac{\sum\nolimits_{i = 1}^{R_{n}} \text{Num}[\text{Route}(v_{i})]}{R_{n}} $$
(3)

\( R_{n} \) indicates the number of communication branches needed to complete the whole communication.

The LFT algorithm makes full use of nodes with strong communication ability to forward communication tasks, so it has high communication efficiency and \( f(t) \) can reach the optimal solution. If, instead, the nodes are added to the communication tree in order of their trust degree, the result is the TFT algorithm, which improves the trust degree of the forwarding nodes and of the communication tree.
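The difference between the two orderings can be made concrete with a toy comparison. The sketch assumes that the accounting node stays at the root, that LFT sorts the remaining nodes by connection number and TFT by trust degree, and that each node forwards once to at most its connection number of children with equal weights, so the tree depth stands in for the communication time. The node data and the names `order_nodes` and `tree_depth` are hypothetical.

```python
def order_nodes(nodes, key):
    """Keep the root first and sort the remaining nodes by key, largest first."""
    root, rest = nodes[0], nodes[1:]
    return [root] + sorted(rest, key=key, reverse=True)


def tree_depth(connections):
    """Depth of the tree when nodes are attached layer by layer in list order."""
    depth, placed, start, end = 0, 1, 0, 1
    while placed < len(connections):
        capacity = sum(connections[start:end])
        if capacity == 0:
            raise ValueError("current layer cannot forward; ordering is infeasible")
        size = min(capacity, len(connections) - placed)
        start, end = end, end + size
        placed += size
        depth += 1
    return depth


# (name, communication connection number, trust degree); v0 is the accounting node.
nodes = [("v0", 2, 0.90), ("a", 1, 0.99), ("b", 1, 0.95), ("c", 2, 0.60),
         ("d", 2, 0.50), ("e", 1, 0.70), ("f", 0, 0.80)]

lft = order_nodes(nodes, key=lambda n: n[1])     # link first
tft = order_nodes(nodes, key=lambda n: n[2])     # trust first
lft_depth = tree_depth([n[1] for n in lft])
tft_depth = tree_depth([n[1] for n in tft])
```

With these made-up values the link-first tree is shallower than the trust-first tree, because the trusted nodes `a` and `b` occupy the upper layer while forwarding to only one child each.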

MMWT: According to actual needs, the blockchain network can be abstracted as a graph \( G = (V,E,W,L,T) \), where V is the node set, E is the edge set, W is the weight set, L is the set of communication connection numbers, and T is the node trust set. In a blockchain network, if the weight between any two nodes can be obtained, G is a weighted complete graph. However, when the blockchain network scale N is large, it is difficult to measure the transmission delay between every pair of nodes, and the computed values are distorted because the measurements quickly become outdated. For a blockchain network with N nodes, the final communication tree has N nodes and N − 1 edges, so there is more room for choice in communication weights than in communication connection numbers. In the MMWT algorithm, nodes with a large number of communication connections \( l(v_{i} ) \) are added to the communication tree first, and then nodes with an excessively large communication weight are removed from the top and upper layers of the communication tree, thus shortening the communication time.
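The node ordering used by MMWT can be sketched as a two-stage queue: rank by connection number and trust, then push nodes that fail a trust or weight threshold to the tail. The sketch is an illustration under stated assumptions, not the authors' implementation: each node is given a single scalar `weight` (the paper defines weights on edges), and the `Node` class, `mmwt_queue`, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    connections: int   # l(v_i)
    trust: float       # element of T
    weight: float      # assumed per-node communication cost


def mmwt_queue(candidates, trust_min, weight_max):
    """Order candidate nodes for insertion into the communication tree."""
    ranked = sorted(candidates, key=lambda n: (n.connections, n.trust), reverse=True)
    preferred = [n for n in ranked if n.trust >= trust_min and n.weight <= weight_max]
    demoted = [n for n in ranked if n.trust < trust_min or n.weight > weight_max]
    return preferred + demoted        # demoted nodes end up in the lower layers


candidates = [
    Node("a", 3, 0.9, 1.0),
    Node("b", 3, 0.2, 1.0),   # low trust: moved to the tail
    Node("c", 2, 0.8, 9.0),   # large weight: moved to the tail
    Node("d", 1, 0.7, 2.0),
]
queue = mmwt_queue(candidates, trust_min=0.5, weight_max=5.0)
order = [n.name for n in queue]
```

Nodes near the head of the queue would be attached to the top and upper layers of the tree first; how the tree itself is then filled is left out of this sketch.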

5 Experimental analysis

5.1 Simulation experiment environment

Block data in a blockchain consist of a block header and a block body. The block header includes the version number, parent block hash value, Merkle tree root, timestamp, difficulty value, nonce, and other information. The block body records the number of transactions in the block, the verified historical transactions, and the transactions generated in the block creation process. According to the transaction verification process of blockchain, the following assumptions are made:

  1. Each node has a “store-and-forward” function. Compared with the forwarding delay, the storage delay can be ignored, so the transmission delay of blockchain nodes is mainly caused by forwarding delay.

  2. For a transaction, the generated hash value is the same as the block data, that is, all nodes participating in the verification receive the same hash value or block data. Without considering the routing control information of nodes, the communication payload between any two nodes in the blockchain network is assumed to be the same.

  3. Because transactions are verified frequently in the blockchain and each transaction needs to be verified by all nodes, each node knows its own information and the information about the links between nodes in the blockchain.

  4. The node that obtains the bookkeeping right needs to send all information to the other nodes for verification. The transmission mode is point to multipoint. For an N-node blockchain network, exactly one node obtains the accounting right through computing power competition, and the block data must be sent to the other N − 1 nodes for verification.

  5. During a transaction verification task of the blockchain, the nodes do not participate in other tasks and only complete the blockchain verification task. Hence, the number of communication links of each node is constant throughout the communication period, and the node load and available bandwidth are not considered.

  6. According to whether they undertake communication tasks, nodes are divided into the communication root node, forwarding nodes, and leaf nodes. The accounting authority node needs to forward the block data to the other nodes, so it is the communication root node.

5.2 Comparison of three communication tree representation algorithms

Table 1 shows the comprehensive comparison of the communication performance evaluation indexes of LFT, TFT, and IFT. The proposed IFT algorithm considers both the trust degree of nodes and their number of communication connections. On the premise of ensuring the reliability and stability of the communication service, it can make the concurrent communication time of the communication tree close to or equal to the optimal value.

Table 1 Comparison of communication performance of three algorithms

For communication trees with the same number of nodes, the concurrent communication time of the IFT algorithm is slightly higher than that of the LFT algorithm, while that of the TFT algorithm is considerably larger, as shown in Fig. 4. The main reason is that the TFT algorithm preferentially adds nodes with high trust to the communication tree, and such nodes may be unable to undertake forwarding tasks in the subsequent communication process, so the concurrent communication time is large.

Fig. 4 Comparison of concurrent communication time of three algorithms

The trust degree of the communication tree reflects the reliability and stability of the whole communication process. The greater the trust degree of the communication tree is, the higher the reliability of the communication forwarding information is, and the higher the forwarding stability is. The smaller the trust degree of the communication tree is, the worse the credibility of the forwarding information is, and the worse the link stability is. Figure 5 shows a comparison of the trust degree of the three algorithms. It can be seen from Fig. 5 that the trust degree of communication tree constructed by TFT algorithm is the best, the trust degree of communication tree constructed by IFT algorithm is about 80%, and the trust degree of nodes is not considered by LFT algorithm, so the trust degree of communication tree is the worst, about 50%.

Fig. 5 Comparison of communication tree trust of three algorithms

5.3 Comparative analysis of communication performance of several algorithms of multifactor blockchain

The communication weights are given by a 10 × 10 weight matrix. The single-connection communication tree is represented in child-sibling form. In the actual communication of blockchain, the proportions of nodes with communication connection numbers 3 and 2 are set to 5:100 and 10:100, respectively. Because most nodes have weak forwarding ability, the proportions of nodes with communication connection numbers 1 and 0 are 45:100 and 40:100, respectively. The MMWT algorithm is compared with LFT and TFT in terms of communication performance. The Y axis in Fig. 6 represents the concurrent communication time; the Y axes in Figs. 7 and 8 represent the average end-to-end delay and the depth of the communication tree, respectively. The communication tree constructed by the MMWT algorithm proposed in this chapter is an unequal-weight communication tree. As shown in Fig. 6, the concurrent communication time of the MMWT and LFT algorithms increases gently with the number of nodes, and their values are close to each other, while the concurrent communication time of the TFT algorithm is significantly longer than that of both. Figure 6 also shows that the concurrent communication time of the MMWT algorithm is sometimes better than that of the LFT algorithm: when the weights are unequal, the concurrent communication time of the communication tree constructed by the LFT algorithm is not necessarily optimal. The reason is that, once communication weights are introduced, the concurrent communication time is jointly determined by the number of communication connections and the communication weight, instead of depending on a single factor.

Fig. 6 Parallel communication time

Fig. 7 Average end-to-end delay

Fig. 8 Depth of communication tree

The average end-to-end delay represents the average time consumed in sending block data from the root node to a target node, and it reflects the average transmission delay of completing the whole blockchain communication. As shown in Fig. 7, the average end-to-end delay of the MMWT and LFT algorithms is clearly better than that of the TFT algorithm. The reason is that the MMWT and LFT algorithms give priority to nodes with strong communication ability when constructing communication trees, which improves communication efficiency. The TFT algorithm, in contrast, first adds the most reliable nodes to the communication tree and ignores the number of communication connections and the size of the communication weights. It mainly aims at optimizing the reliability and stability of the communication links, which leads to a larger transmission delay for block data. In most cases, the average end-to-end delay of the MMWT algorithm is lower than that of the LFT algorithm. As the number of nodes N increases, this trend becomes more obvious, because when constructing communication trees the MMWT algorithm additionally considers communication weights and keeps nodes with a large communication weight out of the top and upper layers of the communication tree, which effectively shortens the average end-to-end communication delay.

The depth of the communication tree is also an important indicator for evaluating a communication tree algorithm. According to the performance evaluation index in 2.3.2, a good algorithm constructs a communication tree of low depth, which means that the data are forwarded fewer times on the logical link. For communication trees with the same number of nodes, a tree of small depth must be wide, with many branches and leaves. Figure 8 shows that the communication trees constructed by the MMWT and LFT algorithms have a small depth, while those constructed by the TFT algorithm have a large depth. In other words, the communication trees constructed by the MMWT and LFT algorithms are short and wide, while those constructed by the TFT algorithm are relatively slender. When the communication weight between any two nodes in the MMWT algorithm is 1, the problem studied degenerates into that of the IFT algorithm in chapter 3. Therefore, the IFT algorithm studied in chapter 3 can be regarded as a special case of the MMWT algorithm under the condition of constant weight.

6 Conclusion

Based on the multi-connection concurrent communication tree model and the above-mentioned evaluation method of node trust, a blockchain communication algorithm, IFT, that considers node trust is put forward. This algorithm considers both the number of communication connections and the trust degree of nodes. Nodes with strong communication ability and a high trust degree are preferentially placed in the top or upper layers of the communication tree, thereby improving the communication efficiency and reliability of the blockchain. Building on the same multi-connection concurrent communication tree model, a multi-connection multifactor communication tree algorithm, MMWT, based on communication weight is proposed. Considering that the transmission delay differs between pairs of nodes in a blockchain network, this paper introduces the concept of communication weight to represent the communication cost between nodes, and comprehensively considers node communication connections, node support degree, and communication weight. First, the nodes are ranked by node communication connections and node support degree in descending order; then, by setting threshold values for credibility and weight, the malicious nodes and the nodes with a large weight are moved to the tail of the queue. As a result, the nodes with strong communication ability, high credibility, and small weight are located in the top and upper layers of the communication tree, which improves the concurrency of communication and increases the reliability of blockchain communication. By setting a trust threshold and a weight threshold, the MMWT algorithm keeps malicious nodes and nodes with a large communication weight out of the top and upper layers of the communication tree and tries to place high-performance nodes there, thus improving the efficiency and reliability of blockchain transaction verification.

Simulation results and the related analysis show that, compared with existing algorithms, the MMWT algorithm achieves better communication performance in terms of concurrent communication time, average end-to-end delay, concurrency, communication tree credibility, average link pressure, and communication tree depth.

The algorithm studied in this paper is a static routing algorithm, so the communication tree must be reconstructed for every blockchain transaction. Frequent changes to the logical topology of the network increase the complexity and cost of operation. Therefore, follow-up work will consider a dynamic routing strategy to avoid network congestion caused by frequent link changes.