
1 Introduction

With the acceleration of information digitization and the spread of network services, digital resources have developed rapidly and vast numbers of electronic records have been produced. The acquisition, use, and sharing of electronic documents have unique advantages over traditional documentary resources, so they play an increasingly important role in people’s lives, learning, and work. However, the long-term preservation of electronic records has proved to be very difficult. First, electronic documents are stored as digital code on physical carriers that are mainly optical, electrical, or magnetic. These carriers place very high demands on the storage environment: high temperature, humidity, and magnetic fields all contribute to the loss of information, and their longevity is far shorter than that of traditional paper carriers, which last for hundreds of years. Magnetic carriers last about ten years, and optical disks about 10–20 years [1,2,3]. Second, reading electronic records depends on computer software. As technology evolves, operating systems and software upgrades may no longer support legacy record formats, leaving electronic records unusable. Electronic files stored in computers are also inevitably exposed to security threats. Moreover, the volume of electronic records is already enormous and continues to grow, which makes the cost of preserving them hard to estimate [4]. The preservation cost of electronic records includes not only the physical space and environmental control costs required for preserving traditional paper records, but also the expenses needed to keep electronic files reproducible, such as technology updates and digital migration. In addition, the long-term preservation of electronic records still lacks common standards and faces legal difficulties [5, 6].

Long-term preservation of electronic records can only be guaranteed by a secure storage environment that ensures the long-term validity and availability of the records. Cloud storage offers a new possibility for the long-term preservation of electronic records. Compared with traditional preservation strategies, cloud storage has a reliable storage architecture, complete backup measures, and an efficient migration mechanism, which reduces the initial investment, saves management costs, and lowers maintenance expenses. Cloud storage is therefore well suited to large-scale digital storage: it provides a portable service that effectively guarantees service continuity, and access responses are fast. It is a better way to preserve current electronic records for the long term [7, 8].

However, in the cloud storage environment, electronic records are stored on cloud servers for long periods and are handled entirely under the control of the cloud server; the record manager completely loses physical control of the electronic records. Once a problem occurs with the cloud service provider, an organization may be unable to retrieve its electronic records. In 2014, the international cloud company CodeSpace was hacked: all the records in the Apache Subversion repositories and Elastic Block storage on the company’s cloud service platform were permanently deleted and could not be recovered. Such a situation is disastrous for electronic records and archives that must be preserved permanently as part of human history [9, 10]. To enhance the reliability of electronic records, this paper proposes combining erasure codes and replica backup for electronic records stored in the cloud, and studies the corresponding technical implementation issues.

2 Related Work

Currently, magnetic disks are the core of electronic record storage systems, whether centralized or distributed. However, because of the mechanical characteristics of the disk itself, reliability has not improved substantially even though researchers have proposed RAID (Redundant Arrays of Inexpensive Disks) technology. In large-scale electronic record storage systems, disk failures and storage node failures have become routine. For example, in an electronic record center with roughly 4,000 nodes, an average of four disks fail each day. Google researchers studied disk corruption in its electronic record centers and found that about 1.7% to 8.6% of the disks in the system fail each year [11]. According to statistics from Carnegie Mellon University, the annual disk replacement rate in some systems is about 13% [12]. For a PB-scale system consisting of tens of thousands of disks, dozens or even hundreds of disk failures per year are commonplace; for larger EB-scale storage systems, thousands or even tens of thousands of disks fail every year. At the same time, the magnetic media on part of a disk platter’s surface is often damaged or suffers read and write errors, making the data on some sectors inaccessible or lost. NetApp compiled statistics on disk sector errors: within 32 months, the proportion of such errors in a disk population of 1.53 million reached 3.45% [13].

Besides ordinary disk damage, a storage node’s network card, memory, CPU, or other hardware may fail, or an entire rack in the storage system may lose power, making the electronic records on that rack temporarily unavailable. Electronic records can also become unreliable because of system software errors. Kroll Ontrack conducted a systematic statistical study of the causes of electronic record loss: system failures or hardware damage caused approximately 56% of losses; about 26% were due to human-induced system failures; software failures or virus intrusion caused about 16%; and about 2% of electronic records were lost to natural causes such as earthquakes and tsunamis. In other words, one out of every 500 electronic record centers experiences an electronic record disaster [14].

On the one hand, the exploding volume of electronic records increases the demand for primary storage devices; on the other hand, large-scale electronic record storage systems fail frequently, and the resulting loss of electronic records is enormous for their owners and users. All of this makes the reliability of the electronic record storage system a critical challenge.

Increasing redundancy is the standard way to improve the reliability of electronic records: when part of an electronic record fails, users can still satisfy their needs by accessing the redundant data. Distributed storage systems such as GFS, HDFS, and Amazon S3 use three-replica redundancy, which meets both the reliability and the load-balancing requirements of electronic records. The original intention of the three-copy strategy adopted by GFS/HDFS is to ensure that no electronic records are lost while preserving node hardware performance [15].

An electronic record storage cluster mainly consists of the following components: cluster management nodes, access nodes, and storage nodes. The cluster management node is responsible for the metadata of the system, such as the cluster configuration and the system namespace. When a block of electronic records in the cluster needs a redundancy change, the cluster management node also coordinates the encoding of record blocks. The access node is mainly responsible for responding to the I/O requests sent by users. When a user request arrives, the access node first interacts with the management node to obtain the state of the requested record block and its address on the storage nodes, and then returns the corresponding data from those storage nodes to the user. The storage nodes are responsible for the actual storage of the data, holding both the original content and the associated verification data. A large number of storage nodes are deployed on multiple racks and interconnected by switches. The three copies of each electronic record block are distributed in the cluster in a rack-aware, random layout, as sketched below: two copies are placed on two nodes in the same rack, and the third copy is placed on another rack.
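To make the rack-aware layout concrete, the following minimal sketch (in Python, with hypothetical rack and node names) picks the three storage nodes for a block as described above: two replicas on different nodes of one randomly chosen rack, and the third replica on a node of another rack.

```python
import random

def place_three_replicas(racks):
    """Pick three storage nodes for a block following the rack-aware layout
    described above: two replicas on two different nodes of one rack, and the
    third replica on a node of a different rack.

    `racks` maps a rack id to the list of storage nodes it contains, e.g.
    {"rack-1": ["node-1", "node-2", ...], ...} (names are hypothetical).
    """
    # The primary rack must host at least two nodes for the first two replicas.
    primary_rack = random.choice(
        [r for r, nodes in racks.items() if len(nodes) >= 2])
    first, second = random.sample(racks[primary_rack], 2)

    # The third replica goes to a randomly chosen node on another rack.
    other_rack = random.choice([r for r in racks if r != primary_rack])
    third = random.choice(racks[other_rack])
    return [first, second, third]

# Example usage with a toy cluster of three racks.
cluster = {
    "rack-1": ["n11", "n12", "n13"],
    "rack-2": ["n21", "n22", "n23"],
    "rack-3": ["n31", "n32", "n33"],
}
print(place_three_replicas(cluster))
```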

When the cluster is small, the storage space consumed by three copies is not particularly significant. In a large-scale application cluster, however, node utilization is often already high, and the needs of cost control cannot be satisfied simply by adding storage space to the nodes. More importantly, in multi-copy mode the cluster’s scalability is limited by the constraints of cluster metadata management [13,14,15].

Reliability enhancement based on multiple backups is intuitive, straightforward, and easy to implement, and it is the simplest and most widely used data redundancy mode in distributed storage systems. The strategy distributes multiple copies of the same electronic record to different storage nodes, and it clearly has a significant storage space overhead, since the redundant data consists of full copies of the original record. With the explosive growth of electronic records and the ever-increasing scale of storage devices, managing and operating the hardware becomes enormously expensive. Choosing a reliability preservation strategy for electronic records therefore requires considering the redundancy overhead of the backup strategy, load balancing, and the additional energy consumption.

3 Problem Statement

To address the data reliability problem of storage systems, researchers have in recent years explored a new storage path based on coding redundancy. Erasure codes are widely used in storage clusters such as archiving systems, data centers, and cloud storage. Among them, Reed-Solomon coding has become a typical data organization scheme in fault-tolerant clusters because of its stable operation and strong fault tolerance, and it guarantees data availability with very low storage overhead. Compared with data replicas, erasure codes provide equivalent fault tolerance at a lower storage cost [14]. Most data is accessed only for a short period of its life cycle; for example, more than 90% of data accesses in the Yahoo M45 Hadoop cluster occur on the first day after the data is created [15]. It is therefore economical to archive data replicas with erasure codes. Today, some practical storage systems (for example, WAS [16] and GFS II) adopt a mixed redundancy strategy: newly created data is stored as replicas, and data is archived with erasure codes once its access frequency drops.

Archiving improves storage utilization by reducing the storage overhead of infrequently accessed data. Existing distributed storage systems such as HDFS and GFS use multiple copies to store data in order to ensure data availability and increase the parallelism of operations, with a default data block size of 64 MB or 128 MB. As data grows exponentially, the storage pressure on existing data centers keeps increasing. In write-once, read-many scenarios, the access frequency of data is negatively correlated with its age after creation. Archiving data with low access frequency from multiple copies into an erasure-coded storage format, as sketched below, can ensure data availability, improve storage space utilization, and relieve data center pressure.
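The following sketch illustrates this archival policy: blocks whose access frequency has dropped below a threshold are selected for conversion from three replicas to erasure-coded storage. The thresholds, field names, and data structure are assumptions made for illustration, not values taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class BlockStats:
    block_id: str
    age_days: float        # days since the block was created
    daily_accesses: float  # average accesses per day over a recent window

def select_blocks_to_archive(blocks, max_daily_accesses=1.0, min_age_days=30):
    """Return the ids of blocks to migrate from three-replica storage to
    erasure-coded storage: old enough and rarely accessed (hypothetical
    thresholds)."""
    return [b.block_id for b in blocks
            if b.age_days >= min_age_days
            and b.daily_accesses <= max_daily_accesses]

# Example: only the second block qualifies for erasure-code archiving.
stats = [BlockStats("blk-1", age_days=2,  daily_accesses=120.0),
         BlockStats("blk-2", age_days=90, daily_accesses=0.2)]
print(select_blocks_to_archive(stats))
```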

4 Preservation Mechanism of Electronic Record Based on Erasure Code and Multi Copies

4.1 Erasure Code

RS-type erasure codes are the primary erasure coding technique applied in distributed storage systems [17]. Their earliest application in distributed record systems dates back to 1989, when Rabin proposed an information dispersal algorithm, whose core is an RS-type erasure code, to address network server faults and bandwidth problems. Reed-Solomon coding [18] uses Galois Field operations for encoding and decoding: Galois Field addition is an XOR operation, and multiplication is usually performed by looking up the corresponding Galois Field tables, as in the sketch below.
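As a concrete illustration of these Galois Field operations, the sketch below builds log/anti-log tables for GF(2^8) over the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (one common choice; the paper does not fix a polynomial) and implements addition as XOR and multiplication as a table lookup.

```python
# A minimal sketch of GF(2^8) arithmetic as used by Reed-Solomon coding:
# addition is XOR, multiplication is done through log/anti-log tables built
# from the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).

PRIMITIVE_POLY = 0x11D
GF_EXP = [0] * 512          # anti-log table, doubled to avoid modular reduction
GF_LOG = [0] * 256          # log table

x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1                 # multiply by the generator element 2
    if x & 0x100:           # reduce modulo the primitive polynomial
        x ^= PRIMITIVE_POLY
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]

def gf_add(a, b):
    """Galois Field addition: a plain XOR."""
    return a ^ b

def gf_mul(a, b):
    """Galois Field multiplication via the log/anti-log tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

# Quick check: multiplication distributes over XOR in GF(2^8).
a, b, c = 0x57, 0x83, 0x0E
assert gf_mul(a, gf_add(b, c)) == gf_add(gf_mul(a, b), gf_mul(a, c))
```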

The RS code is a block-based MDS error-correcting code that is widely used in communications and storage. In general, a \( \left( {k + r,k} \right) \) RS code means that each stripe of the code consists of k data blocks and r parity blocks. Redundant data is generated by multiplying the data with a generator matrix, which consists of a \( k \times k \) identity matrix stacked on an \( r \times k \) redundancy matrix. The RS encoding process is essentially a linear operation on the data blocks: the redundant blocks are obtained by multiplying the \( r \times k \) redundancy part of the generator matrix with the k data blocks. The encoding process of the Vandermonde-RS \( \left( {k + r,k} \right) \) algorithm is as follows.

$$ \underbrace {{\left[ {\begin{array}{*{20}c} {a_{1,1} } & {a_{1,2} } & \cdots & {a_{1,k} } \\ {a_{2,1} } & {a_{2,2} } & \cdots & {a_{2,k} } \\ \vdots & \vdots & & \vdots \\ {a_{n,1} } & {a_{n,2} } & \cdots & {a_{n,k} } \\ \end{array} } \right]}}_{A} \otimes \underbrace {{\left[ {\begin{array}{*{20}c} {d_{1} } \\ {d_{2} } \\ \vdots \\ {d_{k} } \\ \end{array} } \right]}}_{D} = \left[ {\begin{array}{*{20}c} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & {2^{r - 1} } & \cdots & {k^{r - 1} } \\ \end{array} } \right] \otimes \left[ {\begin{array}{*{20}c} {d_{1} } \\ {d_{2} } \\ \vdots \\ {d_{k} } \\ \end{array} } \right] = \underbrace {{\left[ {\begin{array}{*{20}c} {d_{1} } \\ {d_{2} } \\ \vdots \\ {d_{k} } \\ {p_{1} } \\ \vdots \\ {p_{r} } \\ \end{array} } \right]}}_{E} $$
(1)

If up to \( r \) blocks of the \( E \) vector are lost, the corresponding rows of the \( A \) matrix (which has \( n = k + r \) rows) and of \( E \) are deleted at recovery time, yielding a \( k \times k \) matrix \( A^{{\prime }} \) and a \( k \times 1 \) vector \( E^{{\prime }} \). Since \( A^{{\prime }} \) is non-singular, it can be inverted to obtain \( A^{{{\prime } - 1}} \), and the data is recovered as \( D = A^{{{\prime } - 1}} \cdot E^{{\prime }} \). Extracting the part of the computation that generates the redundant data \( p_{1} \sim p_{r} \), i.e., the redundancy (parity) generation process of the code, gives the following.

$$ \left[ {\begin{array}{*{20}c} {f_{1,1} } & {f_{1,2} } & \cdots & {f_{1,k} } \\ {f_{2,1} } & {f_{2,2} } & \cdots & {f_{2,k} } \\ \vdots & \vdots & & \vdots \\ {f_{r,1} } & {f_{r,2} } & \cdots & {f_{r,k} } \\ \end{array} } \right] \otimes \left[ {\begin{array}{*{20}c} {d_{1} } \\ {d_{2} } \\ \vdots \\ {d_{k} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & k \\ \vdots & \vdots & & \vdots \\ 1 & {2^{r - 1} } & \cdots & {k^{r - 1} } \\ \end{array} } \right] \otimes \left[ {\begin{array}{*{20}c} {d_{1} } \\ {d_{2} } \\ \vdots \\ {d_{k} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {p_{1} } \\ {p_{2} } \\ \vdots \\ {p_{r} } \\ \end{array} } \right] $$
(2)
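The sketch below implements the parity-generation step of formula (2) for a small \( (k + r, k) \) stripe, using Vandermonde rows with entries \( j^{i-1} \) and byte-wise GF(2^8) arithmetic. It is only illustrative: practical RS implementations usually adjust the Vandermonde matrix (or use a Cauchy matrix) so that every square submatrix needed for recovery is guaranteed to be invertible.

```python
def gf_mul(a, b, poly=0x11D):
    """Carry-less multiplication in GF(2^8), reduced by x^8+x^4+x^3+x^2+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result

def gf_pow(a, n):
    """a**n in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def vandermonde_parity(data, r):
    """Compute the r parity symbols of formula (2): p_i = sum_j j^(i-1) * d_j,
    with all arithmetic in GF(2^8). `data` holds the k data symbols d_1..d_k."""
    k = len(data)
    parity = []
    for i in range(1, r + 1):                 # parity rows 1..r
        p = 0
        for j in range(1, k + 1):             # columns 1..k
            p ^= gf_mul(gf_pow(j, i - 1), data[j - 1])
        parity.append(p)
    return parity

# Example: a (k + r, k) = (6, 4) stripe with one byte per block.
print(vandermonde_parity([0x12, 0x34, 0x56, 0x78], r=2))
```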

There are two drawbacks to using DiskReduce-style erasure coding directly. (1) Read performance is better in multi-copy mode, where several copies can serve data at the same time and the load can be balanced. (2) In erasure code storage mode, degraded reads of failed data blocks and the reconstruction of data blocks generate a large amount of disk I/O and network traffic, whereas in multi-copy mode only one data block needs to be transmitted.

With hybrid storage of replicas and erasure codes, only data with a low access frequency is encoded, which improves storage space utilization. In a real large-scale cluster, data is heavily used only for a short time after it is generated. Thus, once the data cools down, archiving the three-copy data with an erasure code can ensure data reliability and save considerable storage space without affecting data access speed.

4.2 Preservation Mechanism of Electronic Record Based on Erasure Code and Multi Copies

Heterogeneous Storage Cluster

With the arrival of the big data era, systems keep growing in scale. Because an old system cannot meet users’ increasing demands for capacity and performance, it must be upgraded. Replacing all of the hardware at once would waste many resources, so after several rounds of upgrades a system ends up with hardware devices of various models and performance levels. The nodes differ specifically in CPU computing power, memory capacity, network bandwidth, and disk speed. In addition, expanding the system increases the number of nodes, which are usually located in different racks, resulting in different bandwidths and latencies between nodes. In general, as the system and its hardware are upgraded and the storage system grows, the performance of the storage nodes becomes heterogeneous.

On the other hand, user access requests are unbalanced, which makes the heterogeneity between nodes even more complicated. When a node stores more hot data, a large number of users request that node over a period of time, while some nodes with better performance may receive few or no requests and sit idle. This imbalance lowers the quality of service of the system, especially when the heavily loaded nodes perform poorly: resources are not shared, the overloaded nodes are put under excessive pressure, and the resources of the idle nodes are wasted.

The heterogeneity between the nodes of a large-scale storage system is therefore an unavoidable issue that must be taken into account. It should be emphasized that existing enterprise-class data centers usually provide 24-hour service in order to improve resource utilization and the user experience, so the heterogeneity of a cluster becomes more and more pronounced as time passes.

Capability-Aware Erasure Code Archiving

During online archival, i.e., downgrading records from replicas to erasure-coded storage, performance problems arise from the uneven distribution of electronic records and of user loads across nodes, and these problems cannot be solved by traditional data-locality scheduling. This paper therefore proposes a mechanism that balances the preservation of electronic records according to the nodes’ capabilities. Based on the core metrics of node bandwidth and storage capacity, more coding tasks are allocated to nodes with stronger capabilities and fewer to nodes with weaker capabilities. In this way, when node hardware is heterogeneous and electronic records and user loads are unevenly distributed, the resources of the entire cluster, rather than those of a single node, can be used more fully.

To account for the heterogeneity of node performance and the differences in I/O capacity, bandwidth, and storage capacity among nodes, each node is given a capability value: capability value = I/O capability × time period / electronic record block size + remaining bandwidth (a sketch of this computation is given below). The capability value represents the maximum number of data blocks a node can transmit over the network during a period, and its size determines how many encoded electronic record blocks are allocated to that node. The system assigns fewer coding tasks to weaker nodes and more coding tasks to more capable nodes, which effectively prevents nodes that encode too slowly from becoming the bottleneck of the entire coding process. Nodes with heavy user loads are assigned fewer coding tasks and nodes with lighter user loads are assigned more, which effectively reduces the competition for resources between the coding process and user accesses and thereby improves the efficiency of the cluster.
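A minimal sketch of the capability-value computation quoted above follows; the parameter names and units (MB/s for the I/O capability, blocks per period for the remaining bandwidth) are assumptions of this sketch rather than definitions from the paper.

```python
def capability_value(io_capability_mb_per_s, period_s, block_size_mb,
                     remaining_bandwidth_blocks):
    """Capability value of a node, following the formula quoted above:

        capability = I/O capability * time period / block size
                     + remaining bandwidth

    The I/O capability is taken in MB/s and the remaining bandwidth is assumed
    to be expressed already as spare blocks per period; both unit choices are
    assumptions of this sketch.
    """
    io_blocks = io_capability_mb_per_s * period_s / block_size_mb
    return io_blocks + remaining_bandwidth_blocks

# Example: a node that can read/write 200 MB/s, a 60 s scheduling period,
# 64 MB blocks, and spare network bandwidth worth 50 more blocks.
print(capability_value(200, 60, 64, 50))   # ~237.5 blocks per period
```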

Coding tasks are assigned on the basis of the capability values over short periods: a long archival job is divided into small jobs across multiple time slices, and the load of each time window is predicted from the user load at the beginning of the small job, that is, the load is assumed to be constant within the slice. In this way the system converts dynamic node loads into a static schedule of encoding tasks. At the end of each period, the completion of the encoding tasks on each node is combined to correct that node’s capability value for the current period. This feedback continuously updates each node’s capability based on the most recent period.

Capability Value Initialization

Each node’s current capability value is the primary factor in selecting coding nodes, and it is determined by the node’s remaining bandwidth and its I/O capability. Specifically, first the number of tasks node i processes per unit time in the cluster is calculated, which is obtained by dividing the number of coded blocks the node stored in a certain period by the time consumed; second, the remaining bandwidth of each node in the cluster is calculated.

Coding Task Assignment

The allocation of coding tasks mainly depends on the current capability value of each node and on the distribution of electronic record blocks. Assume that \( W_{i} \) is the number of record blocks node i is expected to process within a unit of time, and let \( B_{i} \) denote the current remaining bandwidth of node i. The nodes are sorted from largest to smallest by their \( W \) and \( B \) values, and the nodes to which encoding is assigned are picked in this order; if a node’s remaining bandwidth is insufficient or its capability value does not meet the coding requirements, the node receives no new coding task (see the sketch below). This assignment process ensures that the sum of the encoding tasks and the user load on each node does not exceed the node’s own throughput. In addition, as long as no node is overloaded, record locality can be exploited so that the network bandwidth consumed by the encoding process is as small as possible.
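The following sketch shows one way such a greedy assignment could look: nodes are sorted by capability and remaining bandwidth, each node receives at most as many blocks as its spare capacity allows, and nodes without spare capacity are skipped. It is an illustration of the strategy described above, not the paper’s exact algorithm.

```python
def assign_coding_tasks(nodes, num_blocks):
    """Greedily assign `num_blocks` coding tasks among nodes.

    `nodes` is a list of dicts with hypothetical fields:
      {"id": ..., "W": expected blocks per period, "B": remaining bandwidth (blocks)}
    Each node receives at most min(W, B) blocks so that coding plus user load
    never exceeds its own throughput; nodes without spare capacity get nothing.
    """
    assignment = {}
    remaining = num_blocks
    for node in sorted(nodes, key=lambda n: (n["W"], n["B"]), reverse=True):
        if remaining == 0:
            break
        quota = int(min(node["W"], node["B"]))
        if quota <= 0:
            continue                      # insufficient bandwidth or capability
        take = min(quota, remaining)
        assignment[node["id"]] = take
        remaining -= take
    return assignment, remaining          # remaining > 0 means deferred blocks

nodes = [{"id": "n1", "W": 40, "B": 25},
         {"id": "n2", "W": 10, "B": 30},
         {"id": "n3", "W": 5,  "B": 0}]
print(assign_coding_tasks(nodes, 30))     # n1 gets 25, n2 gets 5, n3 gets none
```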

Update of Capability Value

Because the load on each node changes dynamically, a node’s capability value becomes obsolete after each task assignment. When a coding task completes, the number of record blocks coded in the period, \( R_{ti} \), and the coding processing time, \( T_{i} \), are obtained, so the new I/O capability value is \( R_{ti} /T_{i} \). The remaining bandwidth of the node during that period is also re-evaluated. This updating process corrects the idle bandwidth and I/O capability of a node in real time as its user load changes, so that coding tasks can be assigned accurately, as in the sketch below.
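A minimal sketch of this feedback update, with hypothetical parameter names:

```python
def updated_io_capability(blocks_completed, elapsed_seconds):
    """Feedback update described above: the new I/O capability of a node is
    the number of record blocks it actually encoded in the last period (R_ti)
    divided by the time it took (T_i)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return blocks_completed / elapsed_seconds

# A node that archived 120 blocks in a 60 s slice now has capability 2 blocks/s.
print(updated_io_capability(120, 60))
```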

The total encoding time T of each stripe is determined by the slowest of all the nodes participating in the encoding of that stripe.

$$ T = \mathop {\hbox{max} }\limits_{p = 1, \ldots ,P;\;p \ne En} \left\{ {T_{pDisk} + T_{pNet} ,\;T_{EnDisk} + T_{EnNet} + T_{Encode} } \right\} $$
(3)

Here the \( P \) nodes participating in the encoding of stripe \( i \) are \( \left( {SN_{1} , \ldots ,SN_{P} } \right) \), and \( SN_{En} \) is the coding node that receives the data blocks and computes the parity. \( T_{pDisk} \) is the time for node \( SN_{p} \) to read its \( p_{q} \) data blocks of the stripe from the storage medium, and \( T_{pNet} \) is the time for node \( SN_{p} \) to transmit those \( p_{q} \) data blocks over the network. For the non-coding nodes participating in the encoding of this stripe, reading the \( p_{q} \) data blocks from the storage medium and sending them through the network card can proceed concurrently, and the network transmission is slower than reading from the storage medium, so the formula can be simplified to:

$$ T = \mathop {\hbox{max} }\limits_{p = 1, \ldots ,P;\;p \ne En} \left\{ {T_{pNet} ,\;T_{EnDisk} + T_{EnNet} + T_{Encode} } \right\} $$
(4)

Because the encoding computation takes far less time than the network transmission, the formula can be further approximated as:

$$ T \approx \mathop {\hbox{max} }\limits_{p = 1, \ldots ,P;\;p \ne En} \left\{ {T_{pNet} ,\;T_{EnDisk} + T_{EnNet} } \right\} $$
(5)

For a non-coding node \( p \), \( T_{pNet} \) can be estimated from its capability (weight) value \( W_{p} \):

$$ T_{pNet} \approx \frac{{p_{q} }}{{W_{p} }} $$
(6)

For the coding node, \( SN_{En} \) reads its \( P_{En} \) local blocks from the storage medium while its network card receives the \( \sum\nolimits_{p = 1,p \ne En}^{P} {p_{q} } = k - P_{En} \) blocks sent by the other nodes. Under the rack-aware random layout of three copies, when \( k \ge 6 \) we have \( k - P_{En} \gg P_{En} \), so receiving data blocks over the network accounts for most of the coding node’s time:

$$ T_{EnDisk} + T_{EnNet} \approx T_{EnNet} $$
(7)

Similarly, estimating \( T_{EnNet} \) from the coding node’s capability value \( W_{En} \):

$$ T_{EnNet} = \sum\nolimits_{p = 1,p \ne En}^{P} {\frac{{p_{q} }}{{W_{En} }}} = \frac{{k - P_{En} }}{{W_{En} }} $$
(8)

Substituting the formulas above yields:

$$ T \approx \mathop {\hbox{max} }\limits_{p = 1, \ldots ,P;\;p \ne En} \left\{ {\frac{{p_{q} }}{{W_{p} }},\;\frac{{k - P_{En} }}{{W_{En} }}} \right\} $$
(9)
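The sketch below evaluates formula (9) for a toy stripe: each non-coding node contributes its transmission time \( p_{q} /W_{p} \), the coding node contributes its reception time \( (k - P_{En})/W_{En} \), and the estimated encoding time is the maximum of these. Node names and capability values are hypothetical.

```python
def stripe_encoding_time(blocks_per_node, coding_node, capability):
    """Estimate the stripe encoding time of formula (9).

    `blocks_per_node[p]` is the number of the stripe's data blocks p_q held by
    node p, `coding_node` names the node SN_En that gathers the blocks and
    computes the parity, and `capability[p]` is the weight value W_p (blocks
    per second). Structure and names are assumptions of this sketch.
    """
    k = sum(blocks_per_node.values())
    # Time for each non-coding node to push its blocks over the network.
    send_times = [q / capability[p]
                  for p, q in blocks_per_node.items() if p != coding_node]
    # Time for the coding node to receive the (k - P_En) remote blocks.
    receive_time = (k - blocks_per_node.get(coding_node, 0)) / capability[coding_node]
    return max(send_times + [receive_time])

blocks = {"n1": 2, "n2": 3, "n3": 1}          # a k = 6 stripe
caps = {"n1": 2.0, "n2": 1.5, "n3": 4.0}      # W_p in blocks per second
print(stripe_encoding_time(blocks, coding_node="n3", capability=caps))  # 2.0 s
```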

5 Conclusion

To ensure the safety and reliability of electronic records on cloud storage servers, this paper proposes a preservation mechanism for electronic records based on erasure codes and multiple copies. The paper focuses on the erasure code archiving of electronic records, puts forward a capability-aware erasure code archiving scheme, and describes the corresponding implementation algorithm and steps.