
1 Introduction

1.1 Target Settings

Information systems have been used in an increasing range of activity spheres in recent decades, and in the last few years special attention has been paid to wireless communications and the Internet of Things [1]. These technologies require particular attention to the reliability and security of the data channels that nodes of information systems use to interact with each other and with other information systems. The number of devices connected to the Internet is growing at an incredible rate, which is the main driver behind the adoption of the IPv6 protocol, since the address space must be expanded to accommodate this growth. Many Internet of Things devices have been connected to the global network in the last few years. The combination of these factors creates the need to research, develop and improve data transmission methods in telecommunication systems [2, 3].

The error detection and correction process is typically implemented as a part of low-level protocols that consider transmitted data as a stream or blocks of bits/bytes, without taking into account the high-level structure and semantics of messages [4, 5].

As a result, some information that could theoretically be used for more successful data decoding is ignored. This approach is generally proven and mostly correct, since the protocols are low-level and should not be aware of the data transmitted: they operate on a set of bits only. However, these protocols do not always cope with the tasks assigned to them, especially in the presence of interference and noise during transmission. It is therefore advisable, where possible, to use every appropriate means of improving error detection and correction, including using as much of the available information as possible for recovery. One way to improve the process is to use knowledge about the transmitted data and messages, which are then treated not just as a set of bits, but as a structure with its own characteristics.

The TCP/IP protocol stack is the most common in computer networks. It consists of several layers: each protocol is built on top of a protocol of a lower layer, and each protocol has its own specific task. The lower-layer protocols (which belong to the physical/data-link layers of the OSI model) deal with, among other things, the detection and correction of errors that may occur during data transmission. These protocols treat data as a stream of bits/bytes and do not take its semantics, structure or content into account. Consequently, some of the information that could be used to improve the reliability of data transmission is not utilized. On the one hand, this approach simplifies the implementation of communication channels; on the other hand, when transmission is affected by interference, the speed and reliability of the algorithms and technical solutions decrease [6, 7].

Therefore, it is necessary to improve the data transfer process. This article proposes to use knowledge about the transmitted data and messages, which are treated as a structure with its own characteristics rather than just an encoded binary data stream.

1.2 Analysis of Scientific Research and Issues

To survey the achievements in the field of error control coding, articles and developments concerning the combination of different coding methods, adaptive coding and the peculiarities of their implementation, as well as the modelling of their operation, were considered [8,9,10]. In particular, article [8] presents the use of turbo codes in power-line communication (PLC) channel networks. An example of the use of adaptive codes to correct errors is given in article [9], which also emphasizes the effectiveness of adapting algorithms to the current state of a channel. Article [10] presents the features of an implementation of error control coding based on FPGA boards. Such implementations are more complex, but their advantage is a higher information processing speed due to the parallel computing that FPGAs provide.

In addition, the peculiarities of data processing with the protocols of the OSI model are considered, with the aim of utilizing as much information about the transmitted data as possible to increase the probability of successful message transmission over network channels [11]. Article [11] states and substantiates the importance of applying error search and correction algorithms during data transmission not only in large enterprise networks, but also in personal and small user networks.

It is worth noting that this article considers lossless data encoding (i.e., it does not address audio or video streams, where partial data loss is permissible).

1.3 The Goals

The goal is to propose and test a more integrated approach to data preparation before transmission over network channels, one that includes analysis of the message structure and of the data contained in the message. The reliability of structured data transmission over network channels should be increased with the help of the revealed regularities. The research evaluates the effectiveness of analyzing the structural peculiarities and content of messages in order to use the results of this analysis to improve data transmission performance. This article proposes to improve existing data encoding methods by using information about the content and nature of the messages themselves, which in theory can increase the reliability of structured data transmission.

2 Method of Adaptive Error Control Coding

2.1 The Typical Composition of Structured Data

Abstracting from specific formats and implementations (JSON, XML, ProtoBuf, etc.), they share one common trait: messages are divided into components, often called “tokens”. These components have a certain semantic meaning (often the corresponding token name is specified in the message itself), are usually unique within the message type, and each token has a set of valid values (a type). An information system can have any number of message types and any number of tokens in one message: their number depends on the complexity of the system. In different programming languages, such messages are mapped to different data structures, most commonly classes (as in Java) or structures (as in C). Messages are serialized before being sent over network channels and deserialized after receipt.

The previous article [12] discussed a way to use these peculiarities of structured messages to improve the quality of data transmission over network channels. In this article, we consider the features and implementation details of the proposed method.

2.2 The Client–server Communication

Let us suppose that the information system is built on the client–server architecture. The server responds to client requests, and there are k > 0 types of requests in the system. Each request has its own structure and a corresponding unique response structure. All requests and responses are structured messages consisting of a set of tokens. The scheme of a request (ask) with index k has the form Ak = {ak1, ak2, …, akn}, n > 0. The corresponding scheme of the reply (response) is Rk = {rk1, rk2, …, rkm}, m > 0. Note that the number and types of tokens in the request and response may differ. In addition, in more complex systems the tokens are hierarchical and may themselves contain embedded messages; for example, the response to request number 1 may look like this: R1 = {r11, r12}, where token r11 is a numerical identifier and token r12 is composite: r12 = S = {s1, s2, …}.
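Such request/response schemas can be sketched as follows (a minimal illustration; all class and field names here are hypothetical, chosen only to mirror the Ak/Rk notation above):

```python
from dataclasses import dataclass

# Each message type is a fixed set of named tokens; a token may itself
# be a nested (composite) message, like r12 = S = {s1, s2, ...} above.

@dataclass
class Sensor:                # composite token S = {s1, s2}
    name: str                # s1
    reading: int             # s2

@dataclass
class Request1:              # A1 = {a11, a12}
    device_id: int           # a11
    channel: str             # a12

@dataclass
class Response1:             # R1 = {r11, r12}
    request_id: int          # r11: numerical identifier
    sensor: Sensor           # r12: composite token

resp = Response1(request_id=1, sensor=Sensor(name="t", reading=25))
assert resp.sensor.reading == 25
```

Before transmission, such an object would be serialized (e.g., to JSON or ProtoBuf) and deserialized on receipt.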

2.3 Dynamic Subtypes

Let us use the above features of structured messages and requests to increase the reliability of data transmission over a network channel. We introduce the concept of a “dynamic subtype”: statistical information about the data transmitted in a specific token, which may change and be supplemented over time. In the simplest case, consider a numerical token type (int, 32 bits). If, over a certain period of time, the messages containing this numeric token carried values only in the range 100–150, the current dynamic subtype of the token is a number encoded by w = 6 bits of information (the range 0–50) with an offset of s = 100. Thus, such a simple dynamic subtype for a numeric data type can be encoded by two numbers: the bit width and the offset. Consider another case, text strings. If a text token has contained only a small set of constants (an enum) whose number does not exceed 256, the dynamic subtype can encode all text values with a single-byte numeric value. In this case, the dynamic subtype is encoded as a dictionary of “key”–“value” pairs, where the key is a 1-byte number and the value is the corresponding text string. Another way to encode a text (or byte) string is to use three values: an array of admissible characters A and the bounds of the string length {Lmin, Lmax}.
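The simplest numeric dynamic subtype can be sketched as follows (an illustrative sketch, not the paper's reference implementation; function names are assumptions):

```python
# A numeric dynamic subtype is just (offset, width): observed values in
# [100, 150] are stored as (value - 100) in w = 6 bits instead of 32.

def subtype_params(observed):
    """Derive (offset, width) from the token values seen so far."""
    lo, hi = min(observed), max(observed)
    width = max(1, (hi - lo).bit_length())   # bits needed for the range
    return lo, width

def encode(value, offset, width):
    assert 0 <= value - offset < (1 << width)
    return value - offset                    # fits in `width` bits

def decode(code, offset):
    return code + offset

offset, width = subtype_params([100, 123, 150])
assert (offset, width) == (100, 6)           # range 0-50 needs 6 bits
assert decode(encode(137, offset, width), offset) == 137
```

The enum-of-strings case would replace `(offset, width)` with a one-byte-key dictionary in the same spirit.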

The current state of the dynamic subtype for each token can be controlled in three ways:

  • for each client of the information system separately;

  • for all clients in general;

  • for each client separately with a timeout (TTL).

In the first case, the encoding/decoding process can take into account the peculiarities of each client device separately. The disadvantage is the greater complexity of the algorithm and greater demands on the computing resources and memory of the device, especially on the server side, because it must store information about dynamic subtypes for each client, and the clients must be identified by a unique identifier (e.g., a UUID).

In the second case, the implementation is simpler, but the operational efficiency of the method will be lower. The third way is a compromise between the first two methods. This paper considers the simplest option, i.e., shared state of dynamic subtypes for all client devices.

2.4 Method of Encoding

So, let us have a message of a certain type, consisting of a finite set of tokens. We introduce the concept of the message structure history, which contains information about the current dynamic subtypes of tokens, as well as about old subtypes, because synchronization between hosts does not happen instantly. The history can be represented as a list or an array, where the index of an element is its identifier. The main requirements for such a list are:

  • the ability to quickly make changes (add a new history element in case of a subtype change detection with one of the message tokens);

  • the possibility of direct and fast access to each history element with constant complexity O(1).

General linked lists are not suitable for this task (because they do not meet the second requirement), so it is better to use array-backed lists, which are expanded by allocating a new array when necessary, with a growth rate calculated by the formula \(S_{new} = 3 \cdot S_{old}\).

The initial value is Sold = 10. It is also possible to use a hash table, which in most cases allows adding and searching for elements with almost constant complexity O(1), provided a suitable hash function is selected. The hash function should be fast to compute and has the form y = f(x), where x is an array of data of arbitrary size and y is a number/array of fixed size. The number of collisions should be as small as possible; that is, the distribution of the function values should ideally approach a uniform distribution [13].
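The array-backed history described above can be sketched like this (an illustrative sketch; the class name and layout are assumptions, not the paper's implementation):

```python
# Indices are stable identifiers (O(1) access); the backing array grows
# by a factor of 3 (S_new = 3 * S_old, with S_old = 10 initially).

class SubtypeHistory:
    def __init__(self):
        self._size = 10                   # initial capacity S_old
        self._store = [None] * self._size
        self._count = 0

    def add(self, subtype):
        if self._count == self._size:     # full: grow, S_new = 3 * S_old
            self._size *= 3
            grown = [None] * self._size
            grown[: self._count] = self._store
            self._store = grown
        self._store[self._count] = subtype
        self._count += 1
        return self._count - 1            # the element's identifier

    def get(self, ident):                 # direct O(1) access by id
        return self._store[ident]

h = SubtypeHistory()
ids = [h.add(f"subtype-{i}") for i in range(35)]
assert h.get(ids[13]) == "subtype-13"
assert h._size == 90                      # grew 10 -> 30 -> 90
```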

Also, in order to verify the correctness of the data contained in the message, a checksum must be added to it; with this checksum it is possible to answer, with high probability, whether the decoding process completed successfully. There are many methods for calculating checksums. The parity bit is the simplest option: one bit is appended to the message, equal to “1” if the number of ones in the message is odd and “0” otherwise, so that the total number of ones becomes even. In terms of coding theory, the parity bit can be considered a code (n, n − 1) with the matrices:

$$ \begin{aligned} H_{1,n} & = \left( {\begin{array}{*{20}c} 1 & 1 & \cdots & 1 \\ \end{array} } \right) \\ G_{n - 1,n} & = \left( {\begin{array}{*{20}c} 1 & 0 & \cdots & 0 & 1 \\ 0 & 1 & \cdots & 0 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 1 \\ \end{array} } \right) \\ \end{aligned} $$
(1)

It is used, for example, in the UART protocol. But it is not reliable: it can only detect an odd number of bit errors, and it cannot correct any [14].

IP packets use a 16-bit one's-complement sum as the header checksum. The header is divided into 16-bit blocks and their sum is calculated; any carries out of the 16 bits are wrapped around and added back to the sum. This method is more reliable and is used in the IP protocol at the network layer of the OSI model.
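Both checksums mentioned above can be sketched as follows (an illustrative sketch in the style of the RFC 1071 Internet checksum, not taken from the paper):

```python
def parity_bit(data: bytes) -> int:
    """Even parity: 1 if the count of one-bits is odd, else 0."""
    ones = sum(bin(b).count("1") for b in data)
    return ones & 1

def inet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum with wrapped carries (IP-style)."""
    if len(data) % 2:                         # pad odd-length data
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]       # 16-bit word
        total = (total & 0xFFFF) + (total >> 16)    # wrap the carry
    return ~total & 0xFFFF                    # one's complement

assert parity_bit(b"\x01") == 1               # one '1' bit -> parity 1
assert inet_checksum(b"\x00\x00") == 0xFFFF
```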

Then the typical structure of the encoded message will look like (Fig. 1).

Fig. 1
figure 1

Logical structure (a), physical structure (b)

The error control coding method performed by the encodeN(…) function can be any method with arguments:

  • data to be encoded (d);

  • number of additional bits for encoding (Δc).

Furthermore, if the encoding method does not provide an easy way to set and change the code parameters, the probability of successful decoding can instead be increased with a callback function, which the decoding algorithm calls when it detects that part of the message cannot be decoded. In the simplest case, such a function performs forecasting based on the current scheme of dynamic subtypes, enumerates the possible variants of the symbol to be decoded, and checks the result against the checksum. The disadvantage of this approach is that the decoding algorithm must be changed; therefore, the first variant is used here. The general algorithm is depicted in Fig. 2.

Fig. 2
figure 2

Block scheme of message transmission algorithm

We consider the results based on the Reed-Solomon code (implementation taken from the ZXing open-source library) and the communication protocol of the ThingSpeak service for Internet of Things devices, developed by IoBridge (since acquired by MathWorks).

Reed-Solomon codes are a special case of BCH codes. BCH codes are cyclic codes of length n over the field GF(q) with distance d, under the condition that, for some b ≥ 0, the generator polynomial is:

$$ g(x) = {\text{lcm}}\{ M_i (x),\;i = b,b + 1, \ldots ,b + d - 2\} $$
(2)

If we decrease the degree of the minimal polynomials that make up the generator polynomial of the cyclic code, the redundancy of the BCH code also decreases. The degree cannot be less than one, and it equals one when the code symbols take values in the same field GF(2^m) that is used to build the check matrix of the code. In such a field, the minimal polynomial of an element β is x − β. Since the exponents of the variable x correspond to positions in the codeword, the length of the code must not exceed the number of elements of the multiplicative group of the field.

Reed-Solomon codes are often used as component codes in concatenated constructions. The method considered in this example has no such structure in the usual sense, but there is a layer on top of the code in the form of a variable code-rate parameter and a pre-conversion of the message into a more efficient format, which increases the probability of successful data transmission.

The advantage of BCH codes, and of Reed-Solomon codes in particular, is the simplicity of their encoding and decoding algorithms in a hard-decision channel. The disadvantage is the lack of an efficient decoding algorithm for soft-decision channels. This disadvantage is partially eliminated by the “minimum generalized distance” method, which reduces to multiple rounds of hard-decision decoding [15, 16].

If the division of the information message into parts and the interleaving of encoded tokens with the encoded scheme are not taken into consideration, the procedure can be depicted as follows (Fig. 3):

Fig. 3
figure 3

The procedure of calculating the amount of possible redundant bits of information

$$ R_1 = K_1 /N;\quad R_2 = \left( {K_2 + K_{\text{S}} } \right)/N;\quad \Delta C = K_1 - K_2 - K_{\text{S}} $$

That’s why:

$$ R_2 = \left( {K_1 - \Delta C} \right)/N $$

where

K1 — the number of data bits of the message encoded by the Reed-Solomon encoder, bits;

N — the total number of bits in the message, bits;

K2 — the number of data bits encoded by the state encoder (the message without state and schema), bits;

KS — the number of bits representing the state and schema, bits;

ΔC — the amount of space that can be used for the adjusted error-correction code;

R1 — the code rate of the original Reed-Solomon code;

R2 — the maximum possible code rate after adjusting the encoder parameters.
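The bookkeeping above can be sketched numerically (an illustrative sketch; the function name is an assumption, and N = 660 is taken from the encoded-message length in the worked example below):

```python
# Freed space ΔC = K1 - K2 - KS can be handed back to the
# error-correcting code as extra parity, lowering the code rate.

def rate_adjustment(K1, K2, KS, N):
    delta_c = K1 - K2 - KS        # ΔC
    r1 = K1 / N                   # R1: original code rate
    r2 = (K2 + KS) / N            # R2 = (K1 - ΔC) / N
    return r1, r2, delta_c

# Parameter values from the worked example: K1 = 560, K2 = 176, KS = 136.
r1, r2, dc = rate_adjustment(560, 176, 136, 660)
assert dc == 248
assert abs(r2 - 312 / 660) < 1e-12   # R2 = (K2 + KS) / N
```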

Let us take the following Reed-Solomon code: (15,13) over GF(2^4), with the corresponding generator polynomial:

$$ (15,13):g(x) = x^2 + 6x + 8 $$
(3)

The primitive element is α = 2. The code (15,13) can correct 1 error (d = 3). Since the code operates not on individual bits but on 4-bit symbols, one symbol error covers any number of bit errors within that symbol (from 1 to 4 flipped bits still count as a single error), which improves the corrective capability of the code.

We take a set of the following messages of the same size:

figure a

If the extra characters are removed, we obtain the token IDs and values:

figure b

The length of seven messages is 560 bits.

Let us construct the scheme of dynamic token types for the set of messages (let its identifier be 1). We have tokens with keys 0x68, 0x74 and 0x77, which can be matched to single-byte identifiers of the varint type (0x1, 0x2, 0x3). The token with subtype 0x1 has a value in the range [0; 2] (0x2) with an offset of 0x31. The token 0x74 has a value in the range [122; 202] (converting from ASCII to decimal), or, with an offset, [0; 80] (0x50) with an offset of 122 (0x7A). Performing a similar operation for the third token, we obtain the range [0; 22] (0x16) with an offset of 38 (0x26). So, the message scheme will look like:
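Deriving the offset/range subtype for each token from a sample of messages can be sketched as follows (an illustrative sketch; the sample values below merely mimic the ranges in the example, not the actual dataset):

```python
# Infer (offset, range_size) per token key from observed integer values.

def infer_scheme(samples):
    """samples: {token_key: [observed values]} ->
       {token_key: (offset, range_size)}"""
    return {key: (min(vals), max(vals) - min(vals))
            for key, vals in samples.items()}

samples = {
    0x68: [0x31, 0x31 + 2],     # range [0; 2], offset 0x31
    0x74: [122, 150, 202],      # range [0; 80] (0x50), offset 0x7A
    0x77: [38, 38 + 22],        # range [0; 22] (0x16), offset 0x26
}
scheme = infer_scheme(samples)
assert scheme[0x74] == (122, 0x50)
assert scheme[0x77] == (38, 22)
```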

figure c

If all numeric values are encoded as varints, the text values are encoded as zero-terminated (0x0) strings, and the number of tokens is indicated at the beginning of the scheme, the resulting scheme will look as follows:

figure d

The scheme is transmitted once, and the messages encoded using this scheme will look like this (the scheme number comes first, followed by the message):

figure e

So we have: the original message size is 560 bits; the size of the scheme is 136 bits; the size of the resulting message is 168 bits without an identifier.

The first data transfer contains a synchronization packet, so its size will be at most 176 + 136 = 312 bits. If the scheme does not change, only the identifier is transmitted instead of the scheme, so 168 + 8 = 176 bits need to be transferred. Hence, we have the parameter values: K1 = 560, K2 = 176, KS = 136, K2 + KS = 312.

We encode the original message with the Reed-Solomon code over GF(2^4), α = 2 (formula 3). Dividing the data array into blocks of 13 information symbols, we obtain the following data symbols:

figure f

The result of encoding with the above code will look like:

figure g

The size of the encoded message obviously increases by the number of redundant symbols: for this code, by 2 symbols per block. The resulting encoded message size is 165 symbols, or 165 · 4 = 660 bits, of which 80 bits are parity symbols. The message with the scheme of subtypes contains 312 data bits. Therefore, an encoded message of equivalent size can include 660 − 312 = 348 bits that can be used to improve transmission reliability.

The codeword length is N = 15. Ntotal = 660 is the total number of bits of the encoded message, ktotal = 312 is the total number of data bits, and V = Ntotal/N = 660/15 = 44 is the total number of codewords (each codeword is 15 bits).

So each codeword produced by the adjusted encoder should carry a minimum of 8 data bits:

k = ktotal/V = 312/44 = 8 (rounding up) is the minimal number of data bits in each codeword.

Thus, the maximum possible code is (15,8), but such a code does not satisfy the condition (n − k) mod 2 = 0. Therefore, it is necessary to round up again, so the code that can be used is (15,9).
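The parameter selection above can be sketched as follows (a minimal illustration of the arithmetic in this example, not a library implementation; the function name is an assumption):

```python
import math

# Spread the data bits over the V codewords, round k up, then round up
# again so that (n - k) is even, as this Reed-Solomon setup requires.

def adjusted_code(n_total, k_total, n=15):
    v = n_total // n                   # number of codewords, V
    k = math.ceil(k_total / v)         # minimum data bits per codeword
    if (n - k) % 2:                    # (n - k) must be even
        k += 1
    return n, k

assert adjusted_code(660, 312) == (15, 9)   # (15,8) rounds up to (15,9)
```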

$$ g(x) = x^6 + 7x^5 + 9x^4 + 3x^3 + 12x^2 + 10x + 12 $$
(4)

This code can correct 3 errors (d = 7) instead of 1, so its corrective capability is three times greater. This means that the probability of successful transmission of a data packet encoded with (1) the use of information about the structure of the message content and (2) the Reed-Solomon code (15,9) is much higher than that of a data packet of the same size encoded only with the (15,13) code:

figure h

The result of the coding is shown in Table 1.

Table 1 Results for the example above
figure i

The code-parameter selection function in this case is obviously a step function. Therefore, the value of the additional corrective capability (as a percentage) depends on how the message is divided into parts. In particular, if the original message was N bytes in size and the internal encoder/decoder supports the (15,x) codes, the function is as depicted in Fig. 4.

Fig. 4
figure 4

Additional acceptable percentage of errors which can be corrected

Let us explain Fig. 4. It shows the additional portion of errors in a message that can be safely detected and corrected by the decoder after reception, depending on the amount of data encoded by encoder 2 (the state encoder). For example, consider an original message of 1001 bits. The Reed-Solomon (15,13) encoder produces an encoded message 1155 bits long, which can be transmitted over the network channel. The state encoder can free space within these 1155 bits for additional parity bits by reorganizing the message content. If the state encoder reports to the Reed-Solomon encoder that an additional 1001 · 0.16 ≈ 160 bits are available for parity, the Reed-Solomon encoder may generate a code with a decreased code rate, e.g. (15,11) instead of (15,13), without increasing the amount of data to be transmitted. So, within the same 1155 bits, the number of errors that the decoder can safely correct increases by 6.6%. Note that Fig. 4 shows the theoretical limit; in practice, additional synchronization packets necessarily increase the number of bits required for decoding during transmission.

3 Conclusions and Suggestions

The proposed method of adaptive error control coding analyzes and prepares data for transmission over communication channels by combining two encoders: a Reed-Solomon encoder (which can in fact be replaced by any parameterized encoder) and a state encoder. The reliability of structured data transmission over a network channel is increased by this combined approach, as the corresponding decoder can accept and correct more errors than when only the Reed-Solomon encoder is used.

  1.

    The article demonstrates the effective use of the considered coding method by example: (a) the Reed-Solomon code and (b) encoding of the message structure. The increase in corrective capability provided by the proposed encoding method was tested using a test dataset and the “communications” package in GNU Octave [17].

  2.

    This method can be combined with various encoding methods; in particular, any other code with adjustable parameters can be used instead of the Reed-Solomon code. In the simplest case, this can be an ordinary linear code (if the goal is to use the simplest possible encoding/decoding algorithm, e.g., on devices with limited resources), or a more complex code, such as one from the family of convolutional codes or turbo codes.

  3.

    Further research can study the coding efficiency for data of different types and structures, since the choice of parameters depends on how the original messages are broken down into parts, which in turn depends on the internal structure of the messages.

  4.

    The algorithm for finding and selecting tokens can theoretically be improved and optimized for faster operation. It can also be made more versatile by dividing it into two parts, by analogy with the approach adopted in the Apache Thrift libraries: (a) token encoding/decoding; (b) classification of subtypes and generation of the range of acceptable values.

  5.

    In addition, depending on the specific implementation of the method, other data structures (binary trees, arrays, etc.) can be used on different computing architectures. There is also the theoretical possibility of parallel computation, or even full parallelization based on FPGA.