1 Introduction

Networked cyber-physical systems (CPS) are naturally composed of resource-constrained devices [1] communicating over dynamic channels that do not provide performance guarantees [2, 3]. As a result, fulfilling application demands in terms of tolerable delay and packet loss rate, packet sizes, and sending data rate is a challenging task [1, 4,5,6,7]. A changing environment implies frequent changes to communication parameters [8], such as end-to-end delay, loss rate, or data rate, meaning that key control functions, such as congestion [5, 6, 8,9,10], rate [9], or error control [11,12,13,14], must be able to react quickly enough to adapt to these changes.

Reliable and timely communication for such demanding applications in changing environments is typically ensured with hybrid error coding, often referred to as Hybrid ARQ (HARQ) [15, 16]. However, to apply HARQ, an appropriate coding configuration must be computed from the given application and measured channel parameters [13, 17,18,19]. Under changing conditions, this computation must be repeated at regular intervals, incurring non-negligible computing overhead on constrained devices. When HARQ is applied at the physical layer of cellular networks [20], typically only the code rate, as part of the selected modulation and coding scheme (MCS), is adapted; the incremental redundancy follows a fixed schedule with a fixed number and sequence of redundancy versions (RVs). On higher layers, specifically the transport layer, the parameterization must additionally consider and fulfill application requirements, which makes it a complex task. Finding such configurations has been mathematically well understood for decades [17, 18, 21, 22], including finding optimal configurations that fulfill application requirements and minimize redundancy overhead. However, the task does not allow for a closed-form solution whose complexity is independent of the channel parameters. Instead, it is a search problem whose complexity depends on its input parameters: a linear increase in round-trip time, for example, leads to a more than linear increase in the number of configurations to evaluate. Executing this search for realistic channel parameters on realistic CPS computing devices proved intractable [23].

Starting from an efficient, but still intractable, reimplementation of the full search [21], we set out to bring hybrid error coding to resource-constrained devices. In one branch, we approached the problem using machine learning [23], in particular supervised learning with deep neural networks. In a second branch, we successfully decomposed the search problem into stages and improved individual stages algorithmically, achieving optimal redundancy efficiency with shorter inference time [24]. In this article, we build on the decomposed search and combine algorithmic and learning-based approaches into DeepSHARQ: a search with minimal run-time and high efficiency.

The contribution of this article is threefold:

  (a) We describe a decomposition of the HARQ coding configuration search, allowing for optimizations at different stages.

  (b) We implement the search algorithm DeepSHARQ, which leverages both algorithmic and learning-based approaches to infer efficient coding configurations in real time.

  (c) We evaluate DeepSHARQ and compare it against existing solutions, showing its usability on resource-constrained devices.

The remainder of this article is structured as follows: first, we describe related approaches to our work (Sect. 2) and give background on error control at the transport layer of packet networks (Sect. 3). How optimal HARQ configurations can be determined is explained in Sect. 4. Our approach, DeepSHARQ, is described in detail in Sect. 5. This is extended by a description of the model training process (Sect. 6) and an evaluation of the search (Sect. 7). Section 8 outlines directions for future research and Sect. 9 concludes the paper.

2 Related work

The end-to-end design paradigm [25] has led to many proposals that complement error coding in the lower layers with coding at the transport layer in order to improve reliability without prohibitively increasing the delay [11, 12, 21, 22, 26,27,28,29,30]. Maximum Distance Separable (MDS) block codes ensure that the number of correctable losses equals the number of transmitted parity packets. MDS codes have been used to provide predictable reliability under time constraints [21], reduce delay in multimedia communication [28], and avoid feedback implosion in multicast [16, 31]. Despite their higher loss rate floor, and hence the redundancy transmission overhead they require to match the performance of MDS codes, binary codes have also been a mechanism of choice due to their reduced coding complexity [11, 29, 30]. Finally, making the end-to-end delay independent of the block length is possible with windowed Random Linear Codes (RLC), which evenly distribute the parity packets over the source packets. RLC codes have been shown to reduce the in-order delay, and hence the tail delay, in fully reliable protocols [15, 32]. However, this delay reduction comes at the cost of lower code rates than block codes [33], and the run-time complexity of their matrix inversion function hinders their deployment in packetized layers [29, 34, 35]. Michel et al. [12] have extended QUIC with the three aforementioned code families, showing that RLC codes achieve the lowest delay. In this paper, we have nevertheless opted for block codes, which in principle have a larger delay, because i) we implement a delay-aware scheme that ensures the delay of no packet exceeds the application's target delay, and ii) we target code configurations that approach the theoretical minimum under timing constraints [36], whereas windowed RLC codes are limited from the code rate standpoint [33].

As in almost any other field, the significant advances in Deep Learning (DL) have made their way into networked communications [37], e.g., adaptive video streaming [38, 39], channel state information prediction [27, 40], congestion control [6, 8, 10], and protocol optimization [26, 27, 41]. In the context of error control, Chen et al. [26] use reinforcement learning to select the code rate of an FEC scheme in order to improve the quality of experience in real-time video streaming. Cheng et al. [27] implement an LSTM network that predicts the future loss pattern in a block of data packets and, based on it, selects the amount of redundancy to transmit. Hu et al. [19] also use LSTM networks to predict loss patterns, but propose a model compression method to enable fast inference and compensate for the large complexity of LSTM networks.

Non-learning-based approaches have also been proposed to implement adaptive error control [13, 17, 18, 22]. Tickoo et al. [22] implement a loss-tolerant TCP that uses an adaptive FEC scheme based on MDS codes and, similar to our approach, adjusts the transmitted redundancy to the channel characteristics. Adaptive, RLC-based error control is proposed in [17], where the authors show that the proposed mechanism is on par with pure ARQ in throughput- and delay-bound scenarios. The authors of [13] propose a new code construction for low-delay stream codes and present an adaptive algorithm that outperforms MDS codes. Michel et al. [18] implemented adaptive FEC in QUIC and evaluated the algorithm's performance for applications with different requirements, showing the benefit of FEC over QUIC's purely reactive error control.

3 Background

Error control is a key function in the most common transport protocols, as it compensates for losses in the lower layers in order to provide the desired reliability level. This section introduces the different building blocks of error control.

3.1 Transport layer error control

Networked systems experience packet losses for multiple reasons, e.g., buffer overflows at congested links, channel noise and fading, and medium-access collisions. PHY/MAC layers already implement error correction mechanisms that transmit some form of redundancy to allow for loss recovery. However, these mechanisms fail to provide predictable reliability and end-to-end guarantees [20]. Therefore, error control in the upper layers must complement them [25].

Automatic Repeat reQuest (ARQ) has traditionally been the scheme of choice in the most widely deployed transport protocols, i.e., TCP and QUIC. ARQ requires a feedback mechanism to signal either the reception of packets with acknowledgments (ACK) or packet losses with negative acknowledgments (NAK). TCP implements cumulative ACKs referring to the last correctly received byte, whereas QUIC implements a selective packet-based mechanism in which every received and processed packet is ACKed. Although an ACK could be issued for every packet, both TCP and QUIC implement ACK aggregation mechanisms that reduce the receiver-side traffic; see, e.g., delayed ACKs in TCP [42] and ACK aggregation in QUIC [43]. On the other hand, NAKs have typically been implemented for multicast [44, 45] to avoid the feedback implosion problem, i.e., the sender in a multicast group being overwhelmed by the ACKs from all receivers, both in terms of received traffic and processing time [16]. When exactly packet retransmissions are triggered depends on the implemented loss detection algorithm [14, 46,47,48]. TCP was originally designed with a purely time-based retransmission mechanism. However, more recent algorithms also use duplicate ACKs/NAKs as packet loss signals, which provides faster reactions than timers at the risk of wrongly deeming a packet lost due to packet reordering in the network. Regardless of the implemented algorithm, retransmissions are never triggered before the round-trip time (RTT) required to collect feedback for a packet, and hence we say that ARQ's delay is RTT-dependent.

Obtaining feedback is not always possible: i) the application's target delay may not be large enough to wait for feedback, or ii) a feedback channel may not exist (e.g., television broadcasting). In such cases, Forward Error Coding (FEC) is more suitable for the task. Unlike ARQ, FEC proactively transmits redundancy information (RI). As no information about lost packets is available at the time of transmitting the redundancy, FEC must encode parity packets, which are linear combinations of data packets, so that losses can be recovered by solving a linear equation system at the receiver (see Sect. 3.2 for a detailed description of how these packets are encoded). As a result, the loss recovery delay is no longer RTT-dependent, but proportional to the source packet interval, since the sender must wait to collect packets before encoding.

As the ARQ and FEC delays differ in nature, it stands to reason that both approaches should be combined to provide optimal predictable reliability under delay constraints. When combined, the optimal balance between proactive (FEC) and reactive (ARQ) redundancy can be found such that the transmitted RI is minimized. Hybrid ARQ (HARQ) implements precisely that behavior: parity packets can be transmitted in the proactive or reactive cycles, and the sender stops transmitting redundancy either when the receiver signals that it has enough to recover the losses or when it is too late to recover them in time. Figure 1 provides a graphical comparison of the three aforementioned schemes.

Fig. 1 Comparison of the different redundancy transmission schemes for error control

3.2 Packet coding

When implemented in the transport layer, HARQ transmits parity packets (or, more generally, parity symbols) to recover the losses. A block code \({\mathcal {C}}(n,k): \mathbb {F}_q^{k} \rightarrow \mathbb {F}_q^n\) transforms a message vector \(\vec {m}\) into a code word \(\vec {c} \in {\mathcal {C}}\). The finite field \(\mathbb {F}_q\) has size q. Typically, the field is selected from the family of Galois Fields \(GF(2^m)\) for binary representation, where m is the number of bits per symbol in the alphabet. Here, k is the block length (the number of symbols in \(\vec {m}\)) and n the codeword length (the number of symbols in \(\vec {c}\)). The symbols are encoded by performing a matrix–vector multiplication with the generator matrix G (\(\vec {c} = \vec {m} \cdot G\)). At the receiver, the original message vector is recovered by performing the inverse operation (\(\vec {m} = \vec {\hat{c}} \cdot \hat{G}^{-1}\)), where \(\hat{G}\) is a \(k \times k\) submatrix of G whose columns have been selected based on the positions of the received symbols \(\vec {\hat{c}}\). Figure 2 shows how the encoding operation is performed. We assume a systematic code is used, i.e., the \(k \times k\) identity matrix is part of G, and thus the code word contains a verbatim copy of the message vector. Systematic codes reduce the coding complexity, as only \(p=n-k\) symbols are encoded instead of n. They also achieve better error correction capabilities: if the linear system cannot be solved, e.g., because it is underdetermined after fewer than k packets were received, the receiver can still forward the received verbatim data without decoding. Finally, they allow for data transmission before all k packets are collected for encoding, which reduces the end-to-end delay.
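To make the encoding step concrete, the following is a minimal sketch of systematic block encoding. It operates in GF(2), where parity symbols reduce to XORs, instead of the \(GF(2^8)\) arithmetic an MDS code would use; the parity submatrix P is randomly chosen and purely illustrative.

```python
# Minimal sketch of systematic encoding c = m * [I | P] over GF(2).
# An MDS code in GF(2^8) would replace the mod-2 arithmetic below.
import numpy as np

def systematic_encode(message: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Encode k symbols into n = k + p symbols over GF(2)."""
    parity = message @ P % 2                    # p parity symbols
    return np.concatenate([message, parity])    # verbatim data + parity

rng = np.random.default_rng(0)
P = rng.integers(0, 2, size=(4, 2))   # hypothetical parity part of G (k=4, p=2)
m = np.array([1, 0, 1, 1])            # message vector
c = systematic_encode(m, P)           # first k symbols equal m (systematic)
print(c)
```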

Fig. 2 Encoding process of a systematic code with block length k, p parity packets, and a generator matrix G. Symbols are packets of MTU bytes

While the physical layer performs the coding operation at the symbol level, i.e., directly on bits, IP networks are packetized erasure channels: full packets are lost in the network, either because packets with uncorrectable bit flips are not forwarded to the upper layer or because packets are dropped due to buffer overflows. As a result, HARQ at the transport layer must be capable of recovering full packets. Assume an IP packet is MTU bytes long.Footnote 1 With virtual interleaving, packets are split into smaller symbols of m bits, k packets are grouped in the interleaver buffer, and the coding operations are iterated throughout the complete packet length. In [29], we showed that this packetization directly impacts the complexity of the system: while the matrix inversion has typically dominated the run-time complexity of coding in the physical layer, the matrix–vector multiplication dominates in the packetized layers. As a result, a different code construction may be the best option depending on the channel conditions and the platform the protocol runs on.

3.3 Code construction

Three different families of codes have been proposed for the transport layer: MDS [21, 22, 31], binary [11, 29, 30], and RLC codes [18, 34, 35]. They vary in error correction capabilities, underlying field size, and generator matrix construction, and they have different algorithmic tools at their disposal for efficient implementation [31, 49].

Maximum Distance Separable (MDS) codes [31] guarantee that the minimum distance between codewords is \(d_{min}=e+1\), i.e., they meet the Singleton Bound with equality, where \(e=n-k\) is the number of correctable erasures [50]. For this property to hold, any \(k \times k\) submatrix of G must be invertible. Cauchy and Vandermonde matrices fulfill this property, and thus they are frequently used to construct this type of code, usually in \(GF(2^8)\) so that symbols are one byte long.

The matrix inversion is, at the symbol level, the main contributor to the run-time complexity. Binary codes [51,52,53] overcome this limitation by decoding without an explicit matrix inversion. However, operating in GF(2) does not guarantee the invertibility of every square submatrix. As a result, the loss rate floor is higher than for MDS codes; conversely, binary codes require excess parity packets to achieve the same loss rate as MDS codes. It can be shown that the excess portion of the transmitted redundancy shrinks for very large block lengths [52]. Hence, binary codes have dominated physical layer deployments, e.g., LDPC [51] codes in 4G and 5G, or polar [53] codes in 5G, where such large block lengths are common. However, they can also perform well in the transport layer when running on resource-constrained devices. Since most CPUs do not directly support operations in high-order Galois Fields, binary codes, which can be implemented with simple XORs, can significantly reduce the run-time complexity [29].

Finally, random linear codes (RLC) follow a random code construction, similar to some binary codes [51, 52] (which are in fact a sub-family of RLC codes), but in high-order Galois Fields, so that the rows of the generator matrix are linearly independent with high probability, which decreases the loss rate floor of random codes. However, these codes need many resources for matrix inversion [34, 35], and it remains an open research question whether they can be used efficiently on embedded devices, the natural component of CPS.

In the following, this paper assumes systematic MDS codes are used. However, the presented algorithms are code-agnostic, as long as the probability of losing a packet and the probability of triggering retransmission rounds (see Eqs. 8 and 2 in Sect. 4.1) are adapted to model other codes' properties (e.g., random binary, polar, or RLC codes, and non-systematic codes).

4 Predictably reliable, delay-aware error control

Providing predictably reliable, delay-aware error control is only possible with precise models of the communication channel, which must be used to find the optimal configuration subject to application and network constraints. In this section, we introduce SHARQ, an algorithm that finds the optimal configuration in polynomial time.

4.1 Problem statement

The performance of every HARQ scheme is governed by two parameters: the block length k, i.e., how many data packets are encoded, and the repair schedule \(N_P\), which dictates how the p parity packetsFootnote 2 are distributed among the \(N_C\) repair cycles (see Fig. 3). The objective is to find the HARQ configuration that minimizes the transmitted RI (see Eq. 1) while meeting the application and network constraints. Minimizing the RI is essential for any communication system: otherwise, resources, i.e., energy and bandwidth, are wasted on the increased throughput, which is unfair to the other systems sharing the communication channel. Formally,

$$\begin{aligned} k^{*},N_P^{*} =&\, \underset{k,N_P}{\mathrm {arg\,min}}\,RI(k,N_P) \\ \text {such that:}&\, D_{HARQ}(k,N_P) \le D_T \\&\, PLR_{HARQ} (k,\Vert N_P\Vert _1) \le PLR_T \\&\, R_{HARQ}(k,N_P) \le R_C \end{aligned}$$

which considers three constraints: (i) every data packet must be received within the application's target delay, (ii) the average number of lost packets cannot be greater than the application's target loss rate, and (iii) the transmission data rate must not increase beyond the bottleneck data rate of the communication channel.

The redundancy information is a weighted sum over the entries of the repair schedule \(N_P\) (see Eq. 1). The weight is the probability of that cycle being required:

$$\begin{aligned} RI(k, N_P) = \frac{1}{k} N_P[0] + \frac{1}{k} \sum _{c=1}^{N_C} w^R[p[c-1]] \cdot N_P[c] \end{aligned}$$
(1)

where \(p[c] = n[c] - k\) is the cumulative number of parity packets up to round c, and \(w^R[c]\) is the weight for \(N_P[c]\), i.e., the probability that cycle c is triggered in a multicast group with R receivers.Footnote 3 Formally,

$$\begin{aligned} w[i] = \sum _{j=max(0,k-p+i)}^{k-1} \left( {\begin{array}{c}k+i\\ j\end{array}}\right) (1-p_e)^{j} p_e^{k+i-j} \end{aligned}$$
(2)

and consequently \(w^R[i] = 1 - (1 - w[i])^R\). It can be shown that, for sufficiently large block lengths, the probability of triggering a new retransmission in a binary erasure channel decreases exponentially with the number of cycles. In that case, the optimal repair schedule can be built straightforwardly: \(N_P\) is an all-ones vector except for the last entry, which is \(p-N_C+1\). However, in the short block length regime, such a naive repair schedule construction may be suboptimal [24]: if the probability that cycle \(N_C - 1\) fails is sufficiently high, accumulating packets in later rounds approaches FEC behavior, i.e., all parity packets are transmitted with very high probability. In such cases, parity packets should be brought forward to reduce the probability of later cycles in the schedule; see Sect. 4.3 for an algorithm that efficiently finds the optimal schedule.
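For concreteness, the following sketch evaluates Eqs. 1 and 2 directly: w computes the probability that the cycle following i cumulative parity packets is needed, and ri accumulates the weighted schedule entries. The example values (k, schedule, loss rate) are illustrative only.

```python
# Sketch of Eqs. (1) and (2) on an i.i.d. erasure channel with loss
# probability p_e and R receivers; math.comb keeps the terms exact.
from math import comb

def w(i: int, k: int, p: int, p_e: float) -> float:
    """Eq. (2): probability that the cycle after i cumulative parity packets is needed."""
    return sum(comb(k + i, j) * (1 - p_e) ** j * p_e ** (k + i - j)
               for j in range(max(0, k - p + i), k))

def ri(k: int, N_P: list[int], p_e: float, R: int = 1) -> float:
    """Eq. (1): redundancy information of a repair schedule N_P."""
    p = sum(N_P)
    total = N_P[0] / k          # FEC cycle is always transmitted
    cum = N_P[0]                # p[c-1]: parity packets sent before cycle c
    for c in range(1, len(N_P)):
        w_R = 1 - (1 - w(cum, k, p, p_e)) ** R
        total += w_R * N_P[c] / k
        cum += N_P[c]
    return total

print(ri(10, [0, 1, 1, 2], p_e=0.01))   # example: k=10 under 1% loss
```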

Fig. 3 HARQ delay budget. We analyze the impact of the repair schedule \(N_P\) on the achievable capacity of HARQ in the transport layer

While the FEC delay (Eq. 3) depends on the source packet interval \(T_s\), since k data packets must be collected before encoding, the ARQ delay (Eq. 4) is RTT-dominated due to the ACK-triggered retransmission process.Footnote 4 The HARQ delay (Eq. 5) can be represented as the combination of its FEC and ARQ components, as depicted in Fig. 3. \(D_{RS}\) is the response delay of the system and models operating system delays, e.g., packet management or scheduling. Although a more precise adaptation can be achieved by feeding dynamic response delays into the algorithm [54], we have opted for a rather conservative constant value (\(D_{RS}=1\) ms) to reduce the dimensions of the input dataset (see Sect. 6.1). The model also considers the upper bound on the time required to detect that a packet is lost (\(D_{PL}\)), which is the maximum time the system needs to mark a packet as lost after its transmission and hence determines when a new retransmission round is triggered. \(D_{PL}\) solely depends on the loss detection algorithm implemented in the transport protocol [14, 46,47,48]. For the remainder of the paper, \(D_{PL} = 4.5 \cdot T_{s}\), which assumes the mechanism in [46] is implemented (see Sect. 5.4 for more details on why this is the case):

Table 1 Model parameters
$$\begin{aligned} D_{FEC}(k,N_P) = \frac{RTT+D_{RS}}{2} + k\cdot T_{s} + N_{P}[0]\cdot D_{tx} \end{aligned}$$
(3)
$$\begin{aligned} D_{ARQ}(N_P) = \frac{RTT}{2} + N_C (RTT + D_{RS} + D_{PL}) + (\Vert N_P\Vert _1 - N_P[0]) \cdot D_{tx} \end{aligned}$$
(4)
$$\begin{aligned} D_{HARQ}(k,N_P) = D_{FEC}(k,N_P) + D_{ARQ}(N_P) - \frac{RTT}{2} \end{aligned}$$
(5)

The error control presented in this paper assumes some periodicity in the application data arrival, e.g., video streaming with a constant frame rate or CPS sensors with a constant sampling rate. Equation 5 accordingly assumes that the inter-packet time is constant over the optimization time window \(D_T\). However, the proposed mechanisms can also be applied to bursty, time-aware traffic: the \(T_s\) estimation function must detect a burst, e.g., when the application does not provide further data after one \(T_s\), in which case a new constraint is added that caps k at the maximum achievable block length for such a burst. The model also assumes a symmetrical network delay for simplicity. In the future, we intend to integrate DeepSHARQ into the time-aware protocol introduced in [55] to also provide predictable error control over networks with asymmetrical delays.
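Expressed in code, the delay model of Eqs. 3–5 is a handful of arithmetic operations. The sketch below assumes \(D_{tx}\) is the per-packet transmission time \(P_L/R_C\) and uses \(D_{PL} = 4.5 \cdot T_s\) from Sect. 5.4; all numeric example values are illustrative.

```python
# Sketch of the delay model in Eqs. (3)-(5); all times in seconds.
def d_harq(k, N_P, rtt, T_s, d_rs, d_tx):
    d_pl = 4.5 * T_s                                    # loss detection bound
    n_c = len(N_P) - 1                                  # retransmission cycles
    d_fec = (rtt + d_rs) / 2 + k * T_s + N_P[0] * d_tx  # Eq. (3)
    d_arq = (rtt / 2 + n_c * (rtt + d_rs + d_pl)
             + (sum(N_P) - N_P[0]) * d_tx)              # Eq. (4)
    return d_fec + d_arq - rtt / 2                      # Eq. (5)

# Example: 20 ms RTT, 5 ms packet interval, 1 ms response delay,
# 2 ms per-packet transmission time, schedule [0, 1, 2].
print(d_harq(10, [0, 1, 2], rtt=0.020, T_s=0.005, d_rs=0.001, d_tx=0.002))
```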

The packet loss rate is given in (6), where \(Pr(I_k=i)\) is the probability of being unable to decode exactly i data packets, i.e., the loss rate as seen by the application, when a systematic MDS code is used, and \(b=max(p+1,i)\). Although we have already applied the framework presented here to channels with memory, such as the Gilbert–Elliot channel [21], in this paper we limit ourselves to the more tractable i.i.d. channels in order to support intuition and plausibility for the reader. This is motivated by the fact that, if the protocol reacts to channel changes fast enough, the underlying channel can be modeled as a binary erasure channel with packet loss probability \(p_e\):

$$\begin{aligned} PLR_{HARQ}(k,p) = \frac{1}{k}\sum _{i=1}^{k} i\cdot Pr(I_{k}=i) \end{aligned}$$
(6)
$$\begin{aligned} Pr(I_{k}=i) = \sum _{e=b}^{p+i}\left( {\begin{array}{c}n\\ e\end{array}}\right) \cdot p_{e}^{e}\cdot (1-p_{e})^{n-e}\cdot p_{d}\left( {\begin{array}{c}e\\ i\end{array}}\right) \end{aligned}$$
(7)
$$\begin{aligned} p_{d}\left( {\begin{array}{c}e\\ i\end{array}}\right) = \frac{\left( {\begin{array}{c}k\\ i\end{array}}\right) \left( {\begin{array}{c}n-k\\ e-i\end{array}}\right) }{\left( {\begin{array}{c}n\\ e\end{array}}\right) } \end{aligned}$$
(8)
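The following sketch evaluates Eqs. 6–8 directly for a systematic MDS code over an i.i.d. erasure channel; the example parameters are illustrative.

```python
# Sketch of the PLR model in Eqs. (6)-(8) with n = k + p.
from math import comb

def plr_harq(k: int, p: int, p_e: float) -> float:
    n = k + p
    def pr_unrecovered(i: int) -> float:        # Pr(I_k = i), Eq. (7)
        b = max(p + 1, i)
        total = 0.0
        for e in range(b, p + i + 1):
            p_d = comb(k, i) * comb(n - k, e - i) / comb(n, e)  # Eq. (8)
            total += comb(n, e) * p_e ** e * (1 - p_e) ** (n - e) * p_d
        return total
    return sum(i * pr_unrecovered(i) for i in range(1, k + 1)) / k  # Eq. (6)

print(plr_harq(10, 3, 0.01))   # example: k=10, p=3, 1% channel loss
```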

While the previous two constraints deal purely with application constraints, the data rate constraint avoids network congestion by ensuring that the transmitted data rate (9) is below the bottleneck data rate of the network \(R_C\):

$$\begin{aligned} R_{HARQ} (k,N_P) = (1 + RI(k,N_P)) \cdot \frac{P_L}{T_s} \end{aligned}$$
(9)

Once the formal model is defined, an algorithm is needed that finds the optimum fast enough to react to changes within the channel coherence time, i.e., the time during which the channel properties remain unchanged.

4.2 SHARQ

Algorithm 1 (SHARQ)

Scheduled HARQ (SHARQ) is a search algorithm that, given the application delay and loss rate constraints and the channel state information, finds the \((k,N_P)\) that minimizes the RI. SHARQ (see Alg. 1) takes as input the maximum block length (\(k_{max}\)) and the maximum number of parity packets (\(p_{max}\)). Given the maximum block lengths allowed by the delay and loss rate constraints, \(k_{max}\) is the minimum of the two: \(k_{max}=min(k_{lim}^{D_T}, k_{lim}^{PLR_T})\), where

$$\begin{aligned} k_{lim}^{D_T} = \Bigg \lfloor \frac{D_T - \frac{RTT + D_{RS}}{2}}{T_s} \Bigg \rfloor \\ k_{lim}^{PLR_T} = max\{k \,\vert \, PLR_{HARQ}(k,255-k) \le PLR_T\} \end{aligned}$$

Let \(p_{opt}(k) = min\{p \,\vert \, PLR_{HARQ}(k,p) \le PLR_T \}\) be the optimal number of parity packets for a block length k to fulfill the packet loss rate constraint; it can be shown that \(p_{opt}\) is a monotonically increasing function of k. The loss rate constraint solely depends on k and p (see Eq. 6). As the block length increases, the RI decreases if p is kept constant: \(RI(k,p) > RI(k+1,p)\). Conversely, the PLR increases because the same number of parity packets carries information from more data packets: \(PLR_{HARQ}(k,p) < PLR_{HARQ}(k+1,p)\). As a result, \(p_{opt}(k-1) \le p_{opt}(k) \,\forall \, k \in [1,k_{max}]\), with equality holding if the PLR increase is not large enough to surpass \(PLR_T\). It directly follows that the maximum number of parity packets is \(p_{max} = p_{opt}(k_{max})\). Due to the monotonically increasing nature of the PLR, \(k_{max}\) and \(p_{max}\) can be found with a binary search with run-time complexity \({\mathcal {O}}(m \cdot {\mathcal {C}}_{PLR})\), with m the number of bits per symbol in the Galois Field and \({\mathcal {C}}_{PLR} = {\mathcal {O}}(k + log(p))\) the complexity of obtaining the PLR (see Appendix A for more details on the PLR complexity).
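A minimal sketch of this binary search is shown below, reusing plr_harq from the sketch after Eq. 8. The 255 − k cap assumes \(GF(2^8)\) codes, matching the \(k_{lim}^{PLR_T}\) definition above.

```python
# Sketch: binary search for p_opt(k) = min{p : PLR_HARQ(k, p) <= PLR_T},
# exploiting that the PLR is monotonically decreasing in p.
def p_opt(k: int, plr_target: float, p_e: float):
    lo, hi = 0, 255 - k                  # n = k + p <= 255 in GF(2^8)
    if plr_harq(k, hi, p_e) > plr_target:
        return None                      # constraint unsatisfiable for this k
    while lo < hi:
        mid = (lo + hi) // 2
        if plr_harq(k, mid, p_e) <= plr_target:
            hi = mid                     # mid parity packets suffice
        else:
            lo = mid + 1
    return lo

print(p_opt(10, 1e-4, 0.01))   # example: smallest p for PLR_T = 1e-4
```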

If (k, p) is known, \(N_C\) can be obtained directly: as long as there are enough parity packets to fill later cycles and the delay budget allows it, an additional cycle can only reduce the RI, because every newly transmitted parity packet reduces the probability of later cycles. Therefore, \(N_C\) is the maximum number of cycles that fit in the remainder of the delay budget:

$$\begin{aligned} N_{c,max}(k,p) = \Bigg \lfloor \frac{D_T - k \cdot T_s - p \cdot \frac{P_L}{R_C} - \frac{RTT + D_{RS}}{2}}{RTT + D_{RS} + D_{PL}} \Bigg \rfloor \end{aligned}$$

SHARQ clearly decouples the delay and PLR constraints, resulting in a more structured and efficient exploration of the search space. For every block length, p solely depends on the PLR constraint, whereas \(N_C\) solely depends on the delay constraints. Finally, the graph search in Sect. 4.3 is used to find the optimal \(N_P\). The graph search has run-time complexity \({\mathcal {C}}_{GS} = {\mathcal {O}}(p^2 N_C)\), and hence the run-time complexity of the SHARQ search algorithm is in \({\mathcal {O}}(N_{C,max} \cdot k_{max} \cdot p^2_{max})\).

4.3 Graph search

Algorithm 2 (Graph search)

The objective of the graph search algorithm is to find the schedule \(N_P\) with minimum RI, given a (kp) pair and \(N_C\). As seen in Eq. 1, the RI is a weighted sum over the entries of \(N_P\). Each weight is the probability that the corresponding retransmission round is required. This structure creates a trade-off: packets in the later rounds are less likely to be transmitted and hence have a lower cost in terms of RI. However, putting fewer packets into the early rounds increases the probability that the later rounds are needed.

The key observation for efficiently finding the optimal schedule is that the weight for round c only depends on the number of packets in the rounds before c, not on how they are scheduled. In other words, if we have already scheduled x packets into y rounds, the cost of assigning dx packets to the next round is the same regardless of how the x packets were scheduled before. This structure can be expressed as a graph (Eq. 10):

$$\begin{aligned} \begin{aligned} G&= (V,E,w_E) \\ V&= \{ start \} \cup \big \{ (x,y) \mid 0 \le x \le p, 0 \le y \le N_C \big \} \\ E&= \big \{ (start, (x,1)) \mid 0 \le x \le p - N_C \big \} \\&\cup \big \{ ((x',y), (x,y+1)) \mid 0 \le y< N_C - 1\\&\quad \wedge x - x' \ge 1 \wedge y \le x' \le p - N_C + y\big \}\\&\cup \big \{((x, N_C-1),(p, N_C)) \mid N_C - 1 \le x < p\big \} \\ \end{aligned} \end{aligned}$$
(10)

with edge weights reflecting the RI cost (Eq. 11):

$$\begin{aligned} \begin{aligned} w_E(((x',y-1),(x,y)))&= \frac{1}{k} \cdot w^R[x'] \cdot (x - x'), \\ w_E((start, (x,1)))&= \frac{1}{k} \cdot x \end{aligned} \end{aligned}$$
(11)

The edges are chosen to enforce that every retransmission round (i.e., \(N_P[c]\) for \(c > 0\)) is assigned at least one packet. Consequently, we also need to ensure that we do not assign too many packets to one round, as we need at least one packet for every following round. An example of the resulting graph is shown in Fig. 4.

Fig. 4 Graph for \(p=6\) and \(N_C=3\). Each path represents a choice of \(N_P\). The edge weights are set such that the \(N_P\) with the lowest RI corresponds to the shortest path. The highlighted edges represent \(N_P = [0, 3, 1, 2]\)

Each path through the graph from the start node to \((p,N_C)\) corresponds to a schedule. Since the edge weights are equal to the required RI, the schedule achieving the minimal RI corresponds to the shortest path. This shortest path can be computed using a dynamic programming approach, shown in Alg. 2. For each layer, we relax the nodes between the lower and upper bounds on the number of packets admissible for the corresponding round, as per the restriction above. We store both the minimum distance in D and a parent pointer, allowing us to reconstruct the shortest path at the end.
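The sketch below reconstructs this dynamic program from the description above (it is not the authors' exact Alg. 2); it reuses w from the sketch after Eq. 2 and tracks, per cycle, the minimum cost of having placed a cumulative number of parity packets.

```python
# Sketch of the layered shortest-path DP: distribute p parity packets over
# one FEC cycle plus n_c retransmission cycles (each >= 1 packet) so that
# the RI of Eq. (1) is minimized. Assumes 0 <= n_c <= p.
def optimal_schedule(k, p, n_c, p_e, R=1):
    INF = float("inf")
    w_R = [1 - (1 - w(i, k, p, p_e)) ** R for i in range(p + 1)]
    # dist[c][x]: min RI after placing x packets into cycles 0..c
    dist = [[INF] * (p + 1) for _ in range(n_c + 1)]
    parent = [[None] * (p + 1) for _ in range(n_c + 1)]
    for x in range(p - n_c + 1):                 # FEC cycle: N_P[0] = x
        dist[0][x] = x / k
    for c in range(1, n_c + 1):
        hi = p if c == n_c else p - (n_c - c)    # leave >= 1 per later cycle
        for x_prev in range(p + 1):
            if dist[c - 1][x_prev] == INF:
                continue
            for x in range(x_prev + 1, hi + 1):  # relax edge (x_prev -> x)
                cost = dist[c - 1][x_prev] + w_R[x_prev] * (x - x_prev) / k
                if cost < dist[c][x]:
                    dist[c][x], parent[c][x] = cost, x_prev
    sched, x = [], p                             # walk parent pointers back
    for c in range(n_c, 0, -1):
        x_prev = parent[c][x]
        sched.append(x - x_prev)
        x = x_prev
    return [x] + sched[::-1], dist[n_c][p]

print(optimal_schedule(k=10, p=6, n_c=3, p_e=0.05))  # setting of Fig. 4
```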

The edge weights can be obtained in \({\mathcal {O}}(p)\). Each layer has \({\mathcal {O}}(p)\) nodes with \({\mathcal {O}}(p)\) predecessors each. Since there are \(N_C\) layers, the time complexity is \({\mathcal {O}}(p^2 N_C)\).

5 DeepSHARQ

Based on SHARQ’s search structure, DeepSHARQ applies learning algorithms to estimate the block length and implements a simple schedule construction to reduce the run-time complexity compared to algorithms that use purely learning and algorithmic solutions.

5.1 Design principles

DeepSHARQ is designed with two main principles in mind: i) in contrast to purely learning-based approaches, DeepSHARQ exploits SHARQ’s search structure to simplify the learning problem, thereby requiring smaller neural networks to achieve similar inference accuracy, and ii) DeepSHARQ relaxes the optimality constraint to achieve predictably low inference times.

SHARQ quickly finds p with a binary search and \(N_C\) with a closed-form expression. Thanks to SHARQ’s structured search, it becomes apparent that the iteration over all possible block lengths and the graph search are responsible for most of the inference time. DeepSHARQ tackles the problem by inferring the block length with a neural network and using a simple repair schedule construction that, despite being suboptimal, does not produce significant RI increases.

5.2 Output space regularization

Fig. 5 Optimal block length as a function of the input parameters, showing that the output space does not react smoothly to small changes in the input space. A different input parameter is linearly increased in each figure, while the baseline is kept constant (\(PLR_T=0.0001\), \(D_T=400\) ms, \(R_C=6\) Mbps, \(p_e=0.001\), and \(T_s=5\) ms). In particular, it can be observed how linearly increasing \(p_e\) produces a quasi-random behavior in the output

The quantization of the output space makes small variations in the input produce significantly different block lengths in the output. Consider, e.g., an increase in the delay budget: the extra time may be enough to fit yet another retransmission cycle, which may significantly drop the maximum block length that fits in the remaining time. Figure 5 shows how significant the block length variations are. In each of the figures, a different input parameter is linearly increased, while the baseline is kept the same (\(PLR_T=0.0001\), \(D_T=400\) ms, \(R_C=6\) Mbps, \(p_e=0.001\) and \(T_s=5\) ms). Changes in \(D_T\), \(PLR_T\), and \(T_s\) produce relatively smooth variations in k that could be easily learned. However, a linear increase in \(p_e\) results in quasi-random behavior of the optimal block length. Although the block length variations may differ for other application and channel models, Fig. 5 clearly illustrates how difficult learning the output space can be. We propose a training mechanism that tackles this problem by simplifying the output space via regularization. Instead of predicting the optimal label for each configuration, we train networks that may predict any block length out of a set of valid block lengths. The set of valid k's is selected so that the RI deviation from the optimal RI stays within certain limits. Formally, given the set \({\mathcal {K}}_{v}\) of all block lengths that fulfill the requirements (see Sect. 4.1), the neural network is allowed to predict any block length in the set \({\mathcal {K}}_{v}^{\delta } \subset {\mathcal {K}}_{v}\) such that \(RI(k,p_{opt}(k)) \le (1+\delta ) \cdot RI_{opt} \,\forall \, k \in {\mathcal {K}}_{v}^{\delta }\).
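The construction of \({\mathcal {K}}_{v}^{\delta }\) can be sketched as follows; ri_of_k is a hypothetical helper that returns the RI of the best configuration for a given block length, or None if no configuration for that k fulfills the constraints.

```python
# Sketch: the set of labels accepted as "correct" during training is the
# set of feasible block lengths whose RI is within (1 + delta) of optimal.
def valid_block_lengths(k_max: int, delta: float, ri_of_k) -> set[int]:
    ris = {k: ri_of_k(k) for k in range(1, k_max + 1)}
    feasible = {k: ri for k, ri in ris.items() if ri is not None}
    if not feasible:
        return set()                       # empty solution space
    ri_opt = min(feasible.values())
    return {k for k, ri in feasible.items() if ri <= (1 + delta) * ri_opt}
```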

5.3 Repair schedule construction

It can be proved that, when error control is provided without any timing constraints, the probability of decoding failure decreases exponentially with every newly transmitted parity packet. The repair schedule in such a case is an all-ones vector, so that the contribution of every new packet to the RI also decreases exponentially (see Eq. 1 in Sect. 4). SHARQ's simple schedule is based on this theoretical optimum: for \(N_C=0\), all p packets are transmitted in the FEC cycle, whereas for \(N_C \in [1,p]\) the FEC cycle is set to 0, followed by all-ones entries and \(p-N_C+1\) in the last cycle. Despite being suboptimal [24], such a naive repair schedule has the advantage that it can be constructed in \({\mathcal {O}}(1)\) and, as we show in Sect. 7.3, the RI increase it produces is negligible.
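This construction is a one-liner, as the following sketch shows:

```python
# Sketch of the O(1) simple schedule: pure FEC for n_c = 0; otherwise an
# empty FEC cycle, all-ones retransmission cycles, and the rest in the last.
def simple_schedule(p: int, n_c: int) -> list[int]:
    if n_c == 0:
        return [p]
    assert 1 <= n_c <= p
    return [0] + [1] * (n_c - 1) + [p - n_c + 1]

# simple_schedule(6, 0) -> [6]; simple_schedule(6, 3) -> [0, 1, 1, 4]
```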

5.4 System architecture

Fig. 6 Architecture and information flow for the different search algorithms. Each column represents one algorithm implementation. M represents a data structure that includes the application constraints and channel model

DeepSHARQ’s pipeline is depicted in Fig. 6, where the neural network has 4 hidden layers with 150 neurons and leaky ReLU activation function, and a softmax output layer (see Fig. 7). DeepSHARQ inherits some of its algorithmic components from SHARQ [24], namely the binary search for p, the closed-form expression for \(N_C\), and the constraint fulfillment check once the configuration is found in order to notify the application that the channel supports its requirements. On the other hand, the graph search in Sect. 4.3 is substituted by the simple repair schedule construction in Sect. 5.3, so that the run-time complexity of finding the schedule is reduced from \({\mathcal {O}}(p_{max}^2 N_{C,max})\) to \({\mathcal {O}}(1)\), and the block length selection goes from a full search for \(k \in [1,k_{max}]\) in SHARQ to neural network prediction with run-time complexity \({\mathcal {O}}(1)\) in DeepSHARQ. As a result, the major contributors to DeepSHARQ’s complexity are the binary search for p, with \({\mathcal {O}}(m \cdot (k_{max} + log(p_{max})))\), and the RI calculation, with \({\mathcal {O}}(k_{max} + p_{max})\). Here m is the maximum number of steps in the binary search, which directly depends on the number of elements in the employed Galois Field \(GF(2^m)\)—see Sect. 3.2. In the transport layer, \(m=8\) so that symbols are one-byte long and \(k_{max} = p_{max} = 2^m\),Footnote 5 resulting in DeepSHARQ’s run-time complexity \({\mathcal {O}}(m \cdot k_{max} + p_{max})\). Table 2 shows how the run-time complexity of the search has been reduced with every newly proposed algorithm.

Fig. 7 DeepSHARQ's neural network architecture

Table 2 Search algorithms run-time complexity

Although DeepSHARQ has not been designed for a specific transport protocol, it assumes the implemented transport layer functions fulfill certain requirements. In the following, we describe such assumptions and how DeepSHARQ interacts with the other transport functions.

5.4.1 Loss detection

DeepSHARQ triggers the repair cycles in the schedule \(N_P\) if losses are detected in a block. This paper assumes the algorithm presented in [46] is implemented, which maintains a packet loss count at the receiver that is increased if (i) an out-of-order packet arrives, or (ii) a packet timeout expires. The timeout is configured between 1 and 2 times the inter-packet time (\(T_s\)). A new cycle is triggered when the loss count reaches a configurable threshold; the higher the threshold, the higher the algorithm's robustness against in-network packet reordering. For time-bound scenarios with target delays in the same order of magnitude as the RTT and \(T_s\) (see Table 3), packet reordering is equivalent to packet loss if the packets arrive outside of the time budget, i.e., \(D_T\) milliseconds after the transmission of the first packet in the block. Therefore, packet reordering has little impact in such scenarios. We consider a low threshold of three loss counts and a packet timeout of \(1.5 \cdot T_s\), which results in \(D_{PL} \le 3 \cdot 1.5 \cdot T_s = 4.5 \cdot T_s\). The delay model in Sect. 4.1 considers this worst-case detection delay to ensure the parity packets arrive at the receiver in time. Recently, new algorithms have been proposed that perform better in channels with significant packet reordering [47, 48]. In future work, we plan to integrate more recent algorithms into our model for faster and more accurate loss detection.

5.4.2 Congestion control

DeepSHARQ ensures the transmitted data rate does not exceed the channel data rate (see Eq. 9 in Sect. 4.1). However, it does not implement any mechanisms to sample the bottleneck data rate and ensure it is not exceeded but relies on congestion control for that. Congestion control is available in most transport layer protocols because it is key for a fair share of the available network resources. Although DeepSHARQ is congestion-control-agnostic and it could in principle coexist with any of the many proposed algorithms, we recommend BBR-like algorithms [9, 54] that try to operate at the Bandwidth-Delay Product (BDP). Operating at the BDP is crucial for CPS as it keeps network buffers empty, thereby minimizing the end-to-end delay while the data rate is close to the bottleneck data rate.

5.4.3 Channel estimation

DeepSHARQ’s ability to fulfill the application requirements depends on the precision of the estimated channel model. Another benefit of implementing BBR-like congestion control is that it provides an estimate of two of DeepSHARQ’s input parameters: RTT and \(R_C\) [56]. An estimation of the remaining parameter, the channel loss rate, is proposed in [21], which uses gaps in the data stream to estimate \(p_e\). In addition, the tolerated delays in CPS are so small that they are typically in the same order of magnitude as the channel coherence time, or even smaller. In other words, the channel can be considered constant during the time budget and [21, 56] provide an estimation precise enough for most IP deployments. Nevertheless, fast-changing, dynamic channels can have a coherence time in the single-millisecond range which pose a more challenging scenario. Machine-learning-based solutions seem promising for such a small granularity as well [26, 27]. We believe this is an interesting parallel research path that could enable DeepSHARQ even in the most demanding channels.

6 Model training

Finding the right hyperparameters is essential to achieve good performance in data-intensive tasks. This section analyzes the different components used in the learning process to shed some light on the model selection process, as well as to ensure the results are reproducible.

6.1 Dataset generation

We have designed the dataset with two objectives in mind: i) it must represent current deployments faithfully, and ii) it must generalize for any of the included deployments. The model uses six input parameters:

  • Application parameters: target erasure rate \(PLR_T\), target delay \(D_T\), and source packet interval \(T_s\).

  • Network parameters: channel data rate \(R_C\), channel erasure rate \(p_e\), and round-trip time RTT.

We have considered traces obtained in the wild for the most common network deployments, i.e., broadband,Footnote 6 4G [57], 5G [58, 59], and WiFi [60, 61] deployments. For the application-related parameters, we have used delay and reliability constraints of traditional applications [62]Footnote 7 as well as of more demanding applications still under deployment [7, 63]. For each of the parameters, an order of magnitude is selected from Table 3 with equal probability, and a randomly selected number between 1 and 9 is prepended to that order of magnitude. Finally, Alg. 1 is executed for the input with a slight modification: not only the optimal block length \(k_{opt}\) is logged, but \(k_{min}\) and \(k_{max}\) are also obtained, which are, respectively, the minimum and maximum block lengths that ensure the RI deviates at most \(\delta \) from the optimal RI (see Sect. 5.2). The complete dataset consists of 2.5 million inputs, 42.4% of which do not have an empty solution space, i.e., there is at least one configuration fulfilling all the constraints at the same time (see Sect. 4.1). The neural networks have only been trained with these 1,060,527 inputs without an empty solution space, split into training (60%), validation (20%), and test (20%) sets. Such a dataset simplification reduces the time and resources spent in training without a negative impact on the system's performance, as DeepSHARQ nevertheless discards any predicted k that does not meet the constraints.
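The sampling procedure can be sketched as follows. The orders of magnitude below are illustrative placeholders only, since Table 3 is not reproduced here.

```python
# Sketch of the input sampling: per parameter, pick an order of magnitude
# uniformly, then prepend a uniform leading digit in [1, 9].
import random

ORDERS = {                       # hypothetical exponents, not Table 3's values
    "PLR_T": [-5, -4, -3], "D_T": [-2, -1, 0], "T_s": [-3, -2],
    "R_C": [6, 7, 8], "p_e": [-4, -3, -2], "RTT": [-3, -2, -1],
}

def sample_input() -> dict:
    return {name: random.randint(1, 9) * 10.0 ** random.choice(exps)
            for name, exps in ORDERS.items()}

print(sample_input())   # e.g., {'PLR_T': 3e-05, 'D_T': 0.4, ...}
```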

Table 3 Selected orders of magnitude for the generation of the parameter dataset

The models in Sect. 4 consider three other parameters that are not included in the dataset: the packet length \(P_L\), the processing delay \(D_{RS}\), and the loss detection delay \(D_{PL}\). We assume the packet length is fixed to the MTU, and hence \(P_L=1,500\) bytes. We also considered a rather conservative constant value for the processing delay \(D_{RS}=1\) ms to reduce the dimensions of the dataset. Finally, \(D_{PL}\) is linearly dependent on \(T_s\), and hence it adds no new information as an input.

6.2 Loss definition

Unlike common classification problems, in which the neural network learns the mapping from the input to a single valid output, DeepSHARQ’s neural network is trained to accept as correct any label within a range. Therefore, we propose a new loss that accounts for the fact that the true label is a set and not a single value. Given the true label \(k_i\), and the neural network prediction \(\hat{k}_i\), the proposed loss is based on the binary cross-entropy \(H(k_i, \hat{k}_i)\), which is defined as follows:

$$\begin{aligned} \begin{aligned} H(k_i,\hat{k}_i)&= - \big ( p[k_i \in {\mathcal {K}}_{v}^{\delta }] \cdot log(p[\hat{k}_i \in {\mathcal {K}}_{v}^{\delta }]) \\&\quad + (1-p[k_i \in {\mathcal {K}}_{v}^{\delta }]) \cdot log(1-p[\hat{k}_i \in {\mathcal {K}}_{v}^{\delta }]) \big ) \end{aligned} \end{aligned}$$

where \(p[\hat{k}_{i} \in {\mathcal {K}}_{v}^{\delta }]\) is the probability that the neural network predicts any block length that belongs to the accepted range, and \(p[k_{i} \in {\mathcal {K}}_{v}^{\delta }]=1\) because the true label always belongs to that range by definition:

$$\begin{aligned} L(k,\hat{k}) = - \frac{1}{N} \sum _{i=1}^{N} log(p[\hat{k}_{i} \in {\mathcal {K}}_{v}^{\delta }]) \end{aligned}$$
(12)

The loss \(L(k, \hat{k})\) in Eq. 12 is evaluated for every batch of size N. In Sect. 7, we show that this loss allows the model to correctly learn the mapping from input parameters to any label in the set of valid labels.
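A minimal PyTorch sketch of this loss is shown below: the probability mass that the softmax output assigns to the whole valid range \([k_{min}, k_{max}]\) is aggregated per sample before taking the logarithm, which corresponds to \(p[\hat{k}_{i} \in {\mathcal {K}}_{v}^{\delta }]\) in Eq. 12. Tensor shapes and the clamping constant are assumptions.

```python
# Sketch of the set-based loss in Eq. (12): logits has shape (N, K); k_min
# and k_max hold the inclusive class-index bounds of K_v^delta per sample.
import torch

def range_cross_entropy(logits, k_min, k_max):
    probs = torch.softmax(logits, dim=1)
    classes = torch.arange(logits.shape[1], device=logits.device)
    valid = (classes >= k_min.unsqueeze(1)) & (classes <= k_max.unsqueeze(1))
    p_valid = (probs * valid).sum(dim=1).clamp_min(1e-12)  # p[k_hat in K_v^delta]
    return -p_valid.log().mean()

# Example: batch of 2 samples, 8 classes, valid ranges [2, 4] and [5, 5].
loss = range_cross_entropy(torch.randn(2, 8),
                           torch.tensor([2, 5]), torch.tensor([4, 5]))
```

The mask encodes each sample's own valid range, which reflects why this loss is more expensive to evaluate than the standard cross-entropy (see Sect. 6.3).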

6.3 Ablation study

Tuning the learning rate hyperparameter is instrumental to successfully training neural networks. PyTorch implements various learning rate policies that can be configured to schedule the learning rate, such as plateauFootnote 8 or super-convergenceFootnote 9 [64]. Both policies benefit from large maximum learning rates, which allow for a longer exploration phase, and from low learning rates to fine-tune the model weights. The super-convergence scheduler begins with a rising phase that goes from \(start\_lr\) to \(max\_lr\), after which it decays towards \(end\_lr\), which is substantially lower than \(start\_lr\). The plateau learning rate policy monitors the validation loss to estimate the effectiveness of the current learning rate: if after patience epochs the validation loss has not decreased by at least the threshold amount, it decays the learning rate by a constant factor until \(min\_lr\) is reached. The super-convergence policy requires an optimizer with momentum, and hence we trained all the models with momentum-enabled stochastic gradient descent. Super-convergence varies the momentum between 0.85 and 0.95, while it is constant at 0.9 for plateau.
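In PyTorch, the two policies correspond to the OneCycleLR and ReduceLROnPlateau schedulers. The configuration below is a sketch; apart from \(max\_lr=0.04\) and the 0.85–0.95 momentum band, the hyperparameter values are illustrative, and only one scheduler would be attached to the optimizer at a time.

```python
# Sketch of the two learning rate policies compared in this section.
import torch

model = torch.nn.Linear(6, 256)   # stand-in for DeepSHARQ's network
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Super-convergence: ramp up to max_lr, then decay far below the initial
# lr, cycling momentum between 0.85 and 0.95 (stepped once per batch).
one_cycle = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.04, total_steps=1000,
    base_momentum=0.85, max_momentum=0.95)

# Plateau: multiply the lr by `factor` when the validation loss has not
# improved by `threshold` for `patience` epochs, down to min_lr.
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.7, patience=10, threshold=1e-4, min_lr=2e-9)
```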

Figure 8 shows both policies' accuracy and learning rate evolution for DeepSHARQ's neural network, with training limited to 1,000 epochs. The plateau policy reaches lower learning rates faster than super-convergence, resulting in higher initial accuracy but convergence towards lower accuracy in the second half of the training. On the other hand, super-convergence surpasses plateau in accuracy over the last hundred epochs due to its extended high learning rate exploration phase. We selected super-convergence with \(max\_lr=0.04\), as it achieves the best performance.Footnote 10

Training the models for 1,000 epochs takes approximately 24 h on a PC with an Intel Core i7-7700 CPU at 3.6 GHz and 8 cores, with an average core load of approximately 50%. The main bottleneck is the calculation of the loss function in Sect. 6.2, which, unlike the traditional cross-entropy loss, must be independently evaluated for every sample in the batch, because every input may consider a different \({\mathcal {K}}_{v}^{\delta }\) set. However, the significantly smaller models (see Sect. 7.2) counteract the impact of the longer training phase when DeepSHARQ is deployed at scale on a significant number of end devices. In addition, thanks to the broad set of channels and applications considered in Sect. 6.1, DeepSHARQ is readily deployable on the most common networks today without lengthy re-training for fine-tuned adaptation.

Table 4 Ablation study. The super-convergence learning rate policy with \(max\_lr=0.04\) has been used for all the models. The accuracy is calculated for a range of valid labels and \(\delta =0.3\)
Fig. 8 Validation accuracy and learning rate evolution for different learning rate policies. Three maximum learning rates have been used for super-convergence (i.e., 0.02, 0.03, and 0.04), whereas patience values (10 and 15) and learning rate reduction factors (0.7 and 0.8) have been used for plateau

Table 4 presents the ablation study we performed to select the final hyperparameters. All the presented results are for models trained with a range of valid labels and \(\delta =0.3\) (see Sect. 6.2). The regularization factor is a key parameter for super-convergence: high learning rates already act as a form of regularization [64], and combining them with L2 regularization with a high factor can be detrimental to performance (see rows 2 and 6 of Table 4). We also tested multiple epoch counts and selected 1,000, as it strikes the right balance between good performance and training time.

Fig. 9 Evolution of the validation accuracy and loss for a neural network with 4 hidden layers with 150 neurons each

For DeepSHARQ, we have selected the model with 4 hidden layers and 150 neurons because, for the selected \(\delta \), it achieves a good compromise between accuracy and model size. However, the method proposed here is flexible enough to allow for different model selections: while a large model with close-to-optimal accuracy may be a good option on powerful PCs, smaller models, e.g., trained with larger \(\delta \)'s and hence at the cost of an RI increase, may be desirable for resource-constrained platforms.

7 Evaluation

In the following, we evaluate the newly proposed neural network approach, as well as DeepSHARQ’s real-time response to channel changes.

7.1 Methodology

All the models evaluated in this section have been trained following Sect. 6 with PyTorch 1.12.1. Only the test dataset has been used to generate the accuracy results and Cumulative Distribution Functions (CDFs) presented here, and the algorithms have been executed on a PC running Ubuntu 22.04 LTS on an Intel Core i7-7700 CPU at 3.6 GHz with 32 GB RAM. For the inference time evaluation, the PyTorch-trained models have been ported to TensorFlow 2.11 using the TensorFlow Backend for ONNXFootnote 11 and executed with TensorFlow Lite via the tfliteFootnote 12 Rust crate. Two different neural networks are considered: i) \(k_{opt}\), trained to predict the optimal block length, and ii) \(k_{range}\), trained to predict any block length in a range of valid block lengths (see Sect. 5).

Fig. 10 Model test accuracy as a function of the number of parameters. Four models are depicted with different numbers of hidden layers and neurons: (4,100), (4,150), (5,200), and (5,250)

Fig. 11 Cumulative Distribution Function (CDF) of DeepSHARQ's inference time (log scale), absolute data rate increase (log scale), and data rate increase as a percentage of the optimum data rate (linear scale)

7.2 Model performance

Figure 9 compares \(k_{opt}\) and two \(k_{range}\) versions (\(\delta =0.1\) and \(\delta =0.3\)) in terms of validation accuracy and loss. The models were trained for 1,000 epochs using the super-convergence learning rate policy configured with \(max\_lr= 4 \cdot 10^{-3}\), \(start\_lr=2 \cdot 10^{-2}\), and \(min\_lr = 2 \cdot 10^{-9}\). The smoother output space allows the \(k_{range}\) models to converge faster towards high accuracy (conversely, low loss) and to experience lower variance during super-convergence's high learning rate phase (see Sect. 6). This result is somewhat expected, since more labels are accepted as "correct" for \(k_{range}\) models, and hence the accuracy as defined in the learning process increases. However, extending the range of valid labels also improves the model performance from an information-theoretical standpoint (see Fig. 10). If a valid configuration is defined as any configuration with an information rate below the time-bound channel capacity, i.e., any configuration meeting all the constraints, then \(k_{range}\) models are able to find more valid configurations over the test dataset the larger the range of valid labels is. Such a trade-off between RI optimality and model size is particularly beneficial for resource-constrained devices, in which the bottleneck is the CPU rather than the network, both in terms of processing speed and energy consumption [29], especially when connected to 5G networks [59]. In contrast, the DeepHEC model presented in [23] needs 5 hidden layers with 250 neurons each to reach 99.59% valid configurations on the test dataset. Bear in mind that this DeepHEC model predicts k, p, and \(N_C\) with \(99.75\%\), \(99.82\%\), and \(99.94\%\) accuracy, respectively. In other words, DeepHEC needs \(3.54 \times \) more parameters than DeepSHARQ (i.e., 107,656 DeepSHARQ vs. 381,63 DeepHEC) in order to support the same configurations.

7.3 The cost of optimality

DeepSHARQ slightly deviates from the optimization problem defined in Sect. 4.1, as it introduces two sources of suboptimal configurations: i) the neural network, which can either be directly trained to allow for suboptimal block lengths, or produce misclassifications even when trained to predict the optimum, and ii) the simple schedule construction. Figure 11 compares \(k_{opt}\) with 5 layers, 250 neurons, and \(k_{range}\) with 4 layers, 150 neurons, each model implementing two different repair schedule constructions: the graph search in Sect. 4.3 (graph) and the simple repair schedule in Sect. 5 (simple). The results show that allowing for suboptimal configurations reduces the tail and average inference time, and increases its predictability. SHARQ’s graph search for optimal repair schedules results in a long tail in the inference time due to its quadratic complexity, while increasing the unpredictability of the system at the same time. On the other hand, \(k_{range}\) reduces the average delay by 40% in comparison to \(k_{opt}\) due to the smaller NN.

The faster inference time comes at the cost of a data rate increase. As expected, \(k_{opt}\) produces no increase in significantly more cases than \(k_{range}\)—69% vs. 32%. However, in both cases, the increase is below 100 kbps in 87% of the cases and below 1 Mbps for the 97th percentile, and thus they do not seem prohibitively large when looking at the data rates in current deployments [57, 59, 61]. When it comes to the tail data rate increase, \(k_{opt}\) performance is worse than \(k_{range}\), which shows that learning a range instead of a single label acts as a regularization mechanism that improves generalizability. Finally, although the graph search makes a slight difference for \(k_{opt}\), it does not make any significant difference for \(k_{range}\), which further supports the design decision of opting for a suboptimal but faster scheduler.

7.4 Inference time

Fig. 12 Cumulative Distribution Function (CDF) of the inference time. DeepSHARQ's neural network has 4 hidden layers and 150 neurons per layer, and it has been trained with a dataset ensuring that \(RI \le (1+\delta ) \cdot RI_{opt}\) with \(\delta = 0.1\)

Table 5 Mean, median, standard deviation, and 99th percentile of the inference time in microseconds for the different algorithms

Figure 12 compares DeepSHARQ's inference time with that of three previously published algorithms: SHARQ [24], Fast Search [23], and DeepHEC [23]. Although Fast Search outperforms every other model in 43% of the cases, it also has a tail delay, i.e., the largest inference time experienced by the algorithm, six orders of magnitude higher than DeepSHARQ's. The three other models trade some inference time in the lower percentiles for a much more predictable inference time over the complete dataset. More precise statistics on the inference time are collected in Table 5, which shows that DeepSHARQ not only outperforms the other algorithms in terms of mean and median inference time, but also has a standard deviation at least two orders of magnitude smaller.

SHARQ shows that a purely algorithmic solution is able to achieve high predictability. However, deep learning solutions are the only ones able to consistently achieve low delay (see DeepHEC and DeepSHARQ). DeepSHARQ outperforms all other models in terms of tail inference time and predictability thanks to i) its smaller neural network, which is the component consuming most of the delay budget, and ii) its simple schedule, which avoids spending precious time on finding a better schedule that nevertheless produces no significant RI reduction (see Sect. 7.3).

The results presented here show that modeling the problem purely with deep learning results in excessively large models. DeepHEC learns \((k,p,N_C)\), and hence needs larger neural networks to achieve similar performance in terms of supported coding configurations (5 hidden layers and 250 neurons per layer, see [23]). DeepSHARQ halves the inference time compared to DeepHEC, and its tail delay is an order of magnitude smaller. The repair schedule construction is the only algorithmic component that remains in DeepHEC. Although it could be learned as well, e.g., by applying reinforcement learning, the neural network would be expected to grow even larger due to the increased complexity of the problem to solve. Combining both approaches, as DeepSHARQ does, simplifies the problem enough for small neural networks to learn it, while minimizing the time spent on deriving the remaining parameters and on verifying constraint fulfillment. DeepSHARQ's average inference time is an order of magnitude smaller than the end-to-end delay requirements of tactile applications [7], so that it can react to channel changes faster than they are detected, even for the applications with the most demanding delay requirements. Two orders of magnitude of headroom until the target delay of the most demanding applications is reached leave plenty of room for increased inference time when the algorithm is executed on more constrained devices.

8 Future work

While initial works have evaluated the coding configuration search in a concrete transport protocol [21], this and other recent articles have looked at the search problem in isolation. Future work includes integrating DeepSHARQ into existing transport layer protocols (e.g., QUIC [43] or CoAP [65]) and evaluating its performance under changing channel conditions. Such practical evaluations would assess the ability of the protocol to meet the application constraints in a changing environment, and they would also involve embedded devices, testing whether they are capable of running the search in real time. In addition, we intend to further integrate DeepSHARQ with state-of-the-art loss detection algorithms [47, 48].

In [29], we proved that a priori there is no single code that is optimal in terms of energy efficiency; rather, optimality depends on the hardware the code is executed on, as well as on the channel conditions and application requirements. Given the recently increasing interest in energy-aware systems [66, 67], a further line of research involves making DeepSHARQ energy-aware as well. This involves extending the search problem to incorporate energy as an input metric, i.e., an application-defined energy limit per application packet, and as an output metric, i.e., how much energy a coding configuration demands.

From a communication-theoretical perspective, a further line of research involves results on finite block length coding [36]. These results allow computing the optimal block length based on channel parameters. Central to this is, however, the computation of the dispersion metric, something that has not yet been done practically in network protocols.

9 Conclusion

In this article, we presented DeepSHARQ, an approach to finding optimal hybrid error coding configurations in real time. Starting from the computationally complex search problem, we presented a decomposed search algorithm that improves the complexity of the search using algorithmic as well as learning-based methods. We proposed a new training methodology that, by exploiting the quantized nature of the HARQ configurations, improves the neural network performance both in learning terms (i.e., achieved accuracy) and in communication terms (i.e., yielded configurations supported by the channel capacity). Our evaluations show that DeepSHARQ delivers both efficient inferred configurations and fast inference. To the best of our knowledge, this is the best approach so far to finding coding configurations, and it makes it possible to execute this demanding task on devices with limited resources, a common trait of cyber-physical system hardware.