Next-generation wireless systems are expected to support an ever-increasing number of wireless connections with better quality-of-service (QoS), e.g., higher data rates and smaller delays [1, 2]. As a result, energy consumption, energy cost, and greenhouse gas emissions increase, which poses challenges in the design of wireless systems. One promising method to tackle this issue is energy harvesting (EH), where wireless nodes can harvest energy from renewable sources in the surrounding environment (e.g., solar and thermoelectric) and store the harvested energy in batteries to carry out their functions. In this chapter, we explore power allocation problems for such EH systems to support delay-sensitive communications.

More specifically, this chapter considers an EH system communicating over a fading channel. We study stochastic power control problems for source arrival rate maximization under EH and delay constraints. The EH constraint ensures that the randomly available (random in time and amount) renewable energy cannot be spent until it is harvested and stored in the battery, and that the source cannot spend more energy than is currently available in the battery. Moreover, in addition to the average delay constraint model considered in Chap. 3, we also consider a delay-outage constraint model. In the latter case, we convert the original problem into an effective capacity maximization problem using asymptotic delay analysis. We formulate the problems as infinite-horizon constrained Markov decision process (MDP) problems. We employ the post-decision state-value function approach in MDP to study the structural properties of the optimal policies, i.e., the monotonicity of the power allocation with respect to the channel, EH, and battery states. Throughout this work, it is assumed that the statistics of the system random processes (channel fading and EH) are unknown to the source. For the model under consideration, reinforcement learning techniques such as Q-learning could be employed to allocate the resources dynamically. However, the post-decision state approach is much more appealing than Q-learning, as the former requires less storage and converges faster [3, 4]. Towards this end, we develop online power allocation algorithms that do not require the statistics of the random processes to be known. Illustrative results demonstrate the advantages of the proposed approach over existing approaches, i.e., larger arrival rates can be supported under similar channel and EH conditions and delay constraints.

4.1 System Model and Problem Formulations

4.1.1 Model Description

We consider a point-to-point communication system of bandwidth B (Hz), where the source communicates with the destination, as illustrated in Fig. 4.1. The source is equipped with an EH module, which can harvest renewable energy from the surrounding environment and store the harvested energy in an energy queue (or battery). Data is assumed to arrive at the source buffer with the constant rate μ. We consider that the transmission happens over frames of equal duration T (seconds). For notational simplicity, we normalize the frame duration T and bandwidth B in the following. We next describe the different parts of the system and their assumptions in detail.

Fig. 4.1 A source-destination communications link with an EH transmitter

1. Channel fading model: We assume block-fading channels with fading duration equal to the frame duration. The channel power gain h[t] in frame t = 1, 2, ⋯ represents the channel state in frame t. The channel fading process \({\bigl \{h[t]\bigr \}} \in \mathcal{H}\) is assumed to be ergodic, stationary, and i.i.d. with probability distribution function (pdf) \(p_{\mathcal{H}}(h)\) over the channel state space \(\mathcal{H}\), which can be discrete or continuous.

2. EH and battery model: The source harvests an energy amount e[t] from its surroundings during frame t; e[t] is then stored in a battery and is available for use from frame t + 1 onwards. The random EH process \({\bigl \{e[t]\bigr \}} \in \mathcal{E}\) is modeled as a stationary, ergodic i.i.d. process with pdf \(p_{\mathcal{E}}(e)\) over the EH state space \(\mathcal{E}\). Let \(\bar{E}\) denote the average harvested energy in each frame.

Let \(b[t] \in \mathcal{B}\) denote the energy amount currently stored in the battery in frame t, where \(\mathcal{B}\) denotes the battery (energy queue) state space. Let P[t] ∈ [0, b[t]] denote the transmit power of the source in frame t. We assume that the power required for signal processing is negligible compared to the transmit power; hence, the energy stored in and drawn from the battery is used only for data transmissions. The battery state is updated as follows:

$$\displaystyle\begin{array}{rcl} b[t + 1] =\varphi (b[t],P[t],e[t]),\forall t.& &{}\end{array}$$
(4.1)

Here φ(⋅) represents a function that depends on the battery dynamics, e.g., storage efficiency, leakage effects, etc. We consider a battery with infinite storage capacity. This assumption is in line with the current trend of battery technology, where a large amount of energy can be stored in the battery with negligible leakage effect, e.g., a super-capacitor [5]. Therefore, as a good approximation in practice, the battery state in (4.1) increases and decreases linearly as follows [6-8]:

$$\displaystyle\begin{array}{rcl} b[t + 1] = b[t] - P[t] + e[t],\forall t,& &{}\end{array}$$
(4.2)

We can see that the battery dynamics \({\bigl \{b[t]\bigr \}}\) follows a first-order Markov chain that depends only on the present and immediate past conditions. Moreover, when transmitting with power P[t] under channel state h[t], the achievable throughput r[t] is assumed to be given by Shannon’s formula:

$$\displaystyle{ r[t] = r(h[t],P[t]) =\log _{2}(1 + P[t]h[t]),\forall t. }$$
(4.3)

Our considered model can be extended to correlated channel fading and correlated EH processes with necessary modifications. In that case, the control actions and state-value functions (considered in Sects. 4.2 and 4.3) would also include the immediate past channel and/or EH states.

The EH and channel fading processes can vary on different time-scales. In practice, the incoming energy typically varies more slowly than the channel state. Throughout this work, we consider the scenario of fast EH variation, where the incoming energy varies on the same time-scale as the channel state. The proposed approaches can be applied, with appropriate modifications, to the case of slow EH variation.

3. Data queue dynamics: The source utilizes its data buffer to store the traffic arriving with a constant rate μ. Note that the service process of the data queue is \({\bigl \{r[t]\bigr \}}\) in (4.3). Let \(q[t] \in \mathcal{Q}\) denote the data queue length in frame t, where \(\mathcal{Q}\) denotes the queue length state space. So, the queue length dynamics can be expressed as follows:

$$\displaystyle{ q[t + 1] = q[t] -\min {\bigl \{ q[t],r[t]\bigr \}}+\mu,\forall t. }$$
(4.4)

We assume that the queue is stable, i.e., the steady-state queue length random variable Q is bounded. The average queue length \(\bar{Q}\) can be expressed as:

$$\displaystyle\begin{array}{rcl} \bar{Q} =\lim _{t\rightarrow \infty }\sup \frac{1} {t} \mathbb{E}\Bigg\{\sum \limits _{\tau =1}^{t}q[\tau ]\Bigg\}.& &{}\end{array}$$
(4.5)
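
To make the model concrete, the following is a minimal simulation sketch of the per-frame dynamics in (4.2)-(4.4). The unit-mean exponential fading, the uniform EH profile, and the half-battery policy used here are assumptions for illustration only (the first two match the setup later used in Sect. 4.4).

```python
import numpy as np

def step(b, q, P, h, e, mu):
    """One frame: throughput (4.3), battery update (4.2), queue update (4.4)."""
    P = min(P, b)                        # EH constraint: cannot spend more than stored
    r = np.log2(1.0 + P * h)             # achievable throughput (4.3)
    b_next = b - P + e                   # battery dynamics (4.2), infinite capacity
    q_next = q - min(q, r) + mu          # queue dynamics (4.4), constant arrival rate mu
    return b_next, q_next, r

rng = np.random.default_rng(0)
E_bar, mu = 2.0, 0.5
b, q = 0.0, 0.0
for t in range(1, 6):
    h = rng.exponential(1.0)             # assumed unit-mean channel power gain
    e = rng.uniform(0.0, 2 * E_bar)      # assumed uniform EH profile with mean E_bar
    b, q, r = step(b, q, P=0.5 * b, h=h, e=e, mu=mu)   # illustrative half-battery policy
    print(f"t={t}  r={r:.3f}  b={b:.3f}  q={q:.3f}")
```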

4.1.2 Problem Formulations

We formulate the stochastic power control problem to maximize the constant arrival rate μ under the maximum average delay constraint as follows:

$$\displaystyle\begin{array}{rcl} \qquad \qquad \mathop{\max }\limits_{\mu,P[t] \leq b[t],\forall t}\ \ \ \ \!\!\mu \quad \mathtt{s.t.:}\quad \bar{Q} \leq \bar{ Q}^{\max },& &{}\end{array}$$
(4.6)

where \(\bar{Q}^{\max }\) is the average queue length bound.

Similarly, the corresponding optimization problem under the delay-outage constraint can be formulated as follows:

$$\displaystyle\begin{array}{rcl} \qquad \quad \mathop{\max }\limits_{\mu,P[t] \leq b[t],\forall t}\ \ \ \ \!\!\mu \quad \mathtt{s.t.:}\quad \Pr (Q> Q^{\max }) \leq \zeta _{Q},& &{}\end{array}$$
(4.7)

where Q max ∈ (0, ∞) and ζ Q ∈ (0, 1] are the queue length bound and the queue-length-outage probability, respectively.

We assume that the pdfs of the channel fading and EH processes are unknown to the source. Such an assumption makes the solution approach much more challenging compared to the scenario with known pdfs, for example, in [6, 9]. We solve problems (4.6) and (4.7) optimally in the next two sections and provide intuitive explanations of how to optimally control the transmit power while satisfying the delay and EH constraints without knowing the pdfs of the random processes.

4.2 Power Allocation Under Average Delay Constraint

4.2.1 Optimal Allocation Solution

We observe that problem (4.6) is an infinite-horizon MDP. Therefore, it is sufficient to focus on policies that are independent of time, i.e., stationary policies. The stationary policy π A can be represented by a function \(\pi _{\text{A}}: \mathcal{B}\times \mathcal{Q}\times \mathcal{H}\rightarrow \mathbb{R}^{+}\) specifying the power control action in frame t as P[t] = π A(b[t], q[t], h[t]) such that P[t] ∈ [0, b[t]], where \(\mathbb{R}^{+}\) represents the set of non-negative real numbers. Furthermore, from (4.4), we can impose another constraint on P[t] such that r[t] ≤ q[t] is satisfied. This implies P[t] ∈ [0, P max(b[t], q[t], h[t])], where \(P_{\max }(x,y,z) =\min \big\{ (2^{y} - 1)/z,x\big\}\). According to [10, Theorem 12.7], the optimal solution of the constrained MDP problem (4.6) can be obtained by the Lagrangian approach as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\min }\limits_{\lambda \geq 0}\,\,\Bigg\{\mathop{\max }\limits_{\mu,P[t] \leq P_{\max }(b[t],q[t],h[t]),\forall t}\,\,{\Biggl \{\mu -\lambda \bar{ Q}\Biggr \}} +\lambda \bar{ Q}^{\max }\Bigg\},& &{}\end{array}$$
(4.8)

where λ ≥ 0 represents the Lagrange multiplier associated with the average delay constraint. Therefore, to study (4.8), we first study the inner maximization for a given λ ≥ 0 as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\max }\limits_{\mu,P[t] \leq P_{\max }(b[t],q[t],h[t]),\forall t}\,\,{\Biggl \{\mu -\lambda \bar{ Q}\Biggr \}}.& &{}\end{array}$$
(4.9)

We update λ by the sub-gradient method [11]. In the following, we discuss the structural properties of the optimal power allocation policy π A for (4.9) and show how to allocate the power optimally in each frame t.

Let J(b, q, h) denote the (pre-decision) state-value function for problem (4.9) for a fixed λ > 0. In particular, J(b, q, h) is the optimal value of problem (4.9) with the initial state (b[1], q[1], h[1]) = (b, q, h). Bellman's optimality equation for problem (4.9) can be written as follows [12]:

$$\displaystyle\begin{array}{rcl} J(b,q,h) =\mathop{\max }\limits_{\mu,\,P\leq P_{\max }(b,q,h)}\Bigg\{\mu -\lambda q +\sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})\,J{\bigl (b - P +\hat{e},q - r(h,P)+\mu,\hat{h}\bigr )}\Bigg\} - J(b_{0},q_{0},h_{0}),& &{}\end{array}$$
(4.10)

for some fixed state (b 0, q 0, h 0). The optimal policy π A is the optimal solution of (4.10).

We now adopt the post-decision state-value function approach in Chap. 3 for the problem under consideration. Similar to (3.13), we define the (post-decision) state-value function \(J_{\text{post}}(\check{b},\check{q})\) from the (pre-decision) state-value function J(b, q, h) as follows:

$$\displaystyle\begin{array}{rcl} J_{\text{post}}(\check{b},\check{q}) =\sum _{\hat{h}\in \mathcal{H}}\sum _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})J(\check{b} +\hat{ e},\check{q},\hat{h})& &{}\end{array}$$
(4.11)

for (post-decision) states \((\check{b},\check{q}) \in \mathcal{B}\times \mathcal{Q}\). We have the following relationships on the dynamics of the energy and data queues: \(\check{b}[t] = b[t] - P[t]\) and \(\check{q}[t] = q[t] - r[t]+\mu\); and \(b[t + 1] =\check{ b}[t] + e[t]\) and \(q[t + 1] =\check{ q}[t]\).

Using (4.10) and (4.11), the optimal policy π A can be computed using \(J_{\text{post}}(\check{b},\check{q})\) as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\arg \max }\limits_{\begin{array}{c}\mu,P\leq P_{\max }(b,q,h)\end{array}}\Bigg\{\mu -\lambda q + J_{\text{post}}(b - P,q - r(h,P)+\mu )\Bigg\}.& &{}\end{array}$$
(4.12)

Before we study the monotonicity of the optimal policy with respect to the data queue length and battery states, we need the following results.

Lemma 4.1

\(J_{\text{post}}(\check{b},\check{q})\) is a concave decreasing function of \(\check{q}\) for a given \(\check{b}\) .

Proof

First, we prove the monotonic decreasing property of \(J_{\text{post}}(\check{b},\check{q})\). The monotonicity is obvious since \(\mu -\lambda \check{q}\) is decreasing in \(\check{q}\). We use the induction method to prove the concavity of \(J_{\text{post}}(\check{b},\check{q})\) with respect to \(\check{q}\). In particular, we show that \(J_{\text{post}}(\check{b},\check{q})[t]\) in (4.17) is concave in \(\check{q}\) for \(t = 1,2,\mathop{\ldots }\), and since \(\lim \limits _{t\rightarrow \infty }J_{\text{post}}(\check{b},\check{q})[t] = J_{\text{post}}(\check{b},\check{q})\), we conclude that \(J_{\text{post}}(\check{b},\check{q})\) is concave.

We initialize \(J_{\text{post}}(\check{b},\check{q})[1] = 0\) and assume \(J_{\text{post}}(\check{b},\check{q})[t]\) to be concave in \(\check{q}\) for fixed \(\check{b} \in \mathcal{B}\) and \(\hat{h} \in \mathcal{H}\). By the induction method, we then have to prove that \(J_{\text{post}}(\check{b},\check{q})[t + 1]\) is concave in \(\check{q}\). Note that \(\mu -\lambda \check{q}\) is linear in \(\check{q}\), and since \(J_{\text{post}}(\check{b},\check{q})[t]\) is assumed concave in \(\check{q}\), the expression \(\mu -\lambda \check{q} + J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )[t]\) is concave in \(\check{q}\) as well. The partial maximization of a concave function over a convex set is also concave. Hence,

$$\displaystyle\begin{array}{rcl} \mathop{\max }\limits_{\begin{array}{c}\mu,P\leq \check{P}_{\max }(\check{b}+\hat{e},\check{q},h)\end{array}}\Bigg\{\mu -\lambda \check{ q} + J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )[t]\Bigg\}& &{}\end{array}$$
(4.13)

is concave in \(\check{q}\). Since the expectation operation preserves concavity, we conclude that \(J_{\text{post}}(\check{b},\check{q})[t + 1]\) in (4.17) is concave in \(\check{q}\). Therefore, \(J_{\text{post}}(\check{b},\check{q})\) is a concave decreasing function of \(\check{q}\) for a given \(\check{b}\).

Lemma 4.2

\(J_{\text{post}}(\check{b},\check{q})\) is a concave function of \(\check{b}\) for a given \(\check{q}\) .

Proof

We show the concavity of \(J_{\text{post}}(\check{b},\check{q})\) in \(\check{b}\) by the induction method. By following steps similar to those in the proof of Lemma 4.1, we can show that \(J_{\text{post}}(\check{b},\check{q})[t]\) in (4.17) is concave in \(\check{b}\) for \(t = 1,2,\mathop{\ldots }\), and since \(\lim \limits _{t\rightarrow \infty }J_{\text{post}}(\check{b},\check{q})[t] = J_{\text{post}}(\check{b},\check{q})\), we conclude that \(J_{\text{post}}(\check{b},\check{q})\) is concave.

We initialize \(J_{\text{post}}(\check{b},\check{q})[1] = 0\) and assume \(J_{\text{post}}(\check{b},\check{q})[t]\) to be concave in \(\check{b}\) for given values of \(\check{q} \in \mathcal{Q}\) and \(\hat{h} \in \mathcal{H}\). By the induction method, we then have to prove that \(J_{\text{post}}(\check{b},\check{q})[t + 1]\) is concave in \(\check{b}\). As \(\mu -\lambda (\check{q}+\mu )\) is independent of \(\check{b}\) and \(J_{\text{post}}(\check{b},\check{q})[t]\) is assumed to be concave in \(\check{b}\), the expression \(\mu -\lambda (\check{q}+\mu ) + J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )[t]\) is also concave in \(\check{b}\). Using arguments similar to those in the proof of Lemma 4.1, we conclude that \(J_{\text{post}}(\check{b},\check{q})[t + 1]\) is concave in \(\check{b}\) for a given \(\check{q}\).

For convenience, let us drop the time index [t] and denote \(f(\check{b} - P) = J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )\). We apply Topkis' monotonicity theorem [13, Theorem 2] to prove that P is a non-decreasing function of \(\check{b}\). To this end, we first have to prove that, for given \(\hat{h}\) and \(\check{q}\), \(f(\check{b} - P)\) has increasing differences in (\(\check{b},P\)) for \(P \in [0,\check{b}]\). In particular, we need to show

$$\displaystyle\begin{array}{rcl} f(\check{b}' - P') - f(\check{b} - P') \geq f(\check{b}' - P) - f(\check{b} - P),\quad \forall \check{b}' \geq \check{ b},\forall P' \geq P.& &{}\end{array}$$
(4.14)

From the concavity of \(J_{\text{post}}\) in \(\check{b}\) established above, \(J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(\hat{h},P)+\mu )\), i.e., \(f(\check{b} - P)\), is concave in \(\check{b}\). Hence, from the fundamental property of concave functions, we have [11]:

$$\displaystyle\begin{array}{rcl} f(u+\delta ) - f(u) \geq f(v+\delta ) - f(v),\quad u \leq v,\delta \geq 0.& &{}\end{array}$$
(4.15)

Substituting \(u =\check{ b} - P'\), \(v =\check{ b} - P\), and \(\delta =\check{ b}' -\check{ b}\), we obtain (4.14). Thus, \(J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(\hat{h},P)+\mu )\) has increasing differences in (\(\check{b},P\)) for \(P \in [0,\check{b}]\) and a given \(\check{q}\).

We now study the monotonicity of the optimal power control policy.

Theorem 4.1

The optimal power control policy π A has the following properties:

  1. π A (b, q, h) is a non-decreasing function of q for given h and b.

  2. π A (b, q, h) is a non-decreasing function of b for given h and q.

Proof

Consider Lemma 4.1. As \(J_{\text{post}}(\check{b},\check{q})\) is a concave decreasing function of \(\check{q}\), \(\mu -\lambda \check{q} + J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )\) is supermodular in \((\check{q},r(h,P))\) for \(r(h,P) \in [0,\check{q}]\). As r(h, P) is a concave function of P, we can say that \(\mu -\lambda (\check{q}+\mu ) + J_{\text{post}}(\check{b} - P +\hat{ e},\check{q} - r(h,P)+\mu )\) is supermodular in \((\check{q},P)\) for \(P \in [0,(2^{\check{q}} - 1)/\hat{h}]\) for given \(\check{b}\) and \(\hat{h}\). As the pre-decision and post-decision states are directly related through the dynamics given in Sect. 4.2.1, we conclude that π A (b, q, h) is a non-decreasing function of q for given h and b.

Next, to show that π A (b, q, h) is a non-decreasing function of b for given h and q, consider Lemma 4.2. By applying Topkis’ monotonicity theorem [13, Theorem 2] and representing the parameters in terms of pre-decision state, we conclude that π A (b, q, h) is a non-decreasing function of b for given h and q.

Theorem 4.1 prescribes that, for a given channel state and available energy, more power should be used for transmission when there are more data bits in the data queue. In other words, with increasing buffer occupancy q, more data should be scheduled to provide more 'room' for new incoming data traffic without violating the delay constraint. We also observe from Theorem 4.1 that, for a given channel and data-queue condition, we should increase the transmit power if we have more energy in the battery. These findings help to reduce the search space when solving (4.12) by restricting the search to one direction, as sketched below. Intuitively, Theorem 4.1 helps to reduce the data queue length in order to meet the average delay constraint.
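
The following sketch illustrates one way to exploit Theorem 4.1 when solving (4.12) over a discretized power grid: as states are swept in increasing order of q (the same idea applies to b), the search can be warm-started at the optimizer found for the previous, smaller state. The grid, the objective callback, and the granularity are assumptions for illustration.

```python
import numpy as np

def p_max(b, q, h):
    """Feasible power bound P_max(b, q, h) = min{(2^q - 1)/h, b}."""
    return min((2.0 ** q - 1.0) / h, b)

def monotone_sweep(q_values, power_grid, objective, b, h):
    """Exploit pi_A(b, q, h) non-decreasing in q: restart each grid search at the
    optimizer of the previous (smaller) q instead of scanning from zero."""
    policy, start = {}, 0
    for q in sorted(q_values):
        best_idx, best_val = start, -np.inf
        for i in range(start, len(power_grid)):
            P = power_grid[i]
            if P > p_max(b, q, h):
                break                              # beyond the feasible range
            val = objective(b, q, h, P)            # bracketed term in (4.12)
            if val > best_val:
                best_val, best_idx = val, i
        policy[q] = power_grid[best_idx]
        start = best_idx                           # monotone warm start
    return policy
```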

4.2.2 Online Algorithm

We now propose an online algorithm to obtain the optimal policy π A without requiring the statistics of the underlying random processes to be known. This is equivalent to learning \(J_{\text{post}}(\check{b},\check{q})\), since we can obtain π A from \(J_{\text{post}}(\check{b},\check{q})\) using (4.12).

From the relationship (4.11), we first write the optimality equation for \(J_{\text{post}}(\check{b},\check{q})\) as follows:

$$\displaystyle\begin{array}{rcl} J_{\text{post}}(\check{b},\check{q}) =\sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})\mathop{\max }\limits_{\mu,\,P\leq P_{\max }(\check{b}+\hat{e},\check{q},\hat{h})}\Bigg\{\mu -\lambda \check{q} + J_{\text{post}}{\bigl (\check{b} +\hat{e} - P,\check{q} - r(\hat{h},P)+\mu \bigr )}\Bigg\} - J_{\text{post}}(\check{b}_{0},\check{q}_{0})& &{}\end{array}$$
(4.16)

for some fixed state \((\check{b}_{0},\check{q}_{0})\). Notice that when the statistics of the channel fading and EH processes are known, \(J_{\text{post}}(\check{b},\check{q})\) can be computed using the sequential relative value iteration algorithm (RVIA) as follows

$$\displaystyle\begin{array}{rcl} J_{\text{post}}(\check{b},\check{q})[t + 1] =\sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})\mathop{\max }\limits_{\mu,\,P\leq P_{\max }(\check{b}+\hat{e},\check{q},\hat{h})}\Bigg\{\mu -\lambda \check{q} + J_{\text{post}}{\bigl (\check{b} +\hat{e} - P,\check{q} - r(\hat{h},P)+\mu \bigr )}[t]\Bigg\} - J_{\text{post}}(\check{b}_{0},\check{q}_{0})[t],& &{}\end{array}$$
(4.17)

for \(t = 1,2,\mathop{\ldots }\) with initial value function \(J_{\text{post}}(\check{b},\check{q})[1]\).

Using the post-decision approach helps reduce the number of states required to compute the state-value function, as we do not need to keep track of the channel states over the time intervals in \(J_{\text{post}}(\check{b},\check{q})[t]\) to achieve the optimal state-value function. Learning the value function J(b, q, h)[t], as in the conventional Q-learning approach, would increase the computational complexity to a large extent.

We now resort to an online time-averaging algorithm to obtain (4.17) without requiring the fading and EH distributions to be known. The implementation strategy of the proposed online algorithm is as follows:

  • Initialization phase: Initialize \(J_{\text{post}}(\check{b},\check{q})[1]\) and λ[1] ≥ 0, and fix \((\check{b}_{0},\check{q}_{0}) \in \mathcal{B}\times \mathcal{Q}\).

  • Transmission phase: For \(t = 1,2,\mathop{\ldots }\), based on the current state (b[t], q[t], h[t]), the optimal power control action P^∗[t] is determined by solving the following problem:

    $$\displaystyle\begin{array}{rcl} \mathop{\arg \max }\limits_{\begin{array}{c}\mu,P[t]\leq P_{\max }(b[t],q[t],h[t])\end{array}}\Bigg\{\mu -\lambda q[t] + J_{\text{post}}(b[t] - P[t],q[t] - r(h[t],P[t])+\mu )[t]\Bigg\}.& & {}\end{array}$$
    (4.18)
  • State-value function updating phase: We update the state-value function as:

    $$\displaystyle\begin{array}{rcl} J_{\text{post}}(\check{b},\check{q})[t + 1] = (1 -\phi [t])J_{\text{post}}(\check{b},\check{q})[t] +\phi [t]\Bigg(\mathop{\max }\limits_{\mu,\,P\leq P_{\max }(\check{b}+e[t],\check{q},h[t])}\Bigg\{\mu -\lambda [t]\check{q} + J_{\text{post}}{\bigl (\check{b} + e[t] - P,\check{q} - r(h[t],P)+\mu \bigr )}[t]\Bigg\} - J_{\text{post}}(\check{b}_{0},\check{q}_{0})[t]\Bigg).& &{}\end{array}$$
    (4.19)
  • Multiplier update: The multiplier λ[t] is updated as follows:

    $$\displaystyle\begin{array}{rcl} \lambda [t + 1] ={\Bigl [\lambda [t] +\nu [t](q[t] -\bar{ Q}^{\max })\Bigr ]}_{0}^{L}& & {}\end{array}$$
    (4.20)

    where \([x]_{a}^{b}\) denotes the projection of x onto the interval [a, b] for a ≤ b, and L is a sufficiently large number to ensure boundedness of the multiplier.

The learning rate sequences ϕ[t] and ν[t] represent the decreasing step-size parameters for the value-iteration function and the Lagrange multiplier update equation, respectively. The step-size parameters satisfy the following properties [3]:

$$\displaystyle\begin{array}{rcl} \sum \limits _{t=1}^{\infty }\phi [t] =\sum \limits _{ t=1}^{\infty }\nu [t] = \infty;\qquad \sum \limits _{ t=1}^{\infty }(\phi [t])^{2} + (\nu [t])^{2} <\infty;\qquad \lim \limits _{ t\rightarrow \infty }\frac{\nu [t]} {\phi [t]} = 0.& &{}\end{array}$$
(4.21)

We can see that (4.19), being a stochastic estimate of (4.17), is updated based on the instantaneous realizations of the underlying random processes without requiring their statistics. Moreover, the algorithm is applicable to any distributions of the channel fading and EH processes, and hence is robust to variations of the channel fading and EH models. The convergence of the proposed online algorithm to \(J_{\text{post}}(\check{b},\check{q})\) satisfying (4.16) can be shown by following steps similar to those described in [3, Appendix].
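
A compact sketch of the online algorithm (transmission via (4.18), state-value update via (4.19), multiplier update via (4.20)) is given below. The discretization of the post-decision state space, the grid granularity, the horizon, and the channel/EH distributions are assumptions for illustration; a table lookup with nearest-neighbor rounding stands in for whatever value-function representation is used in practice.

```python
import numpy as np

B_GRID = np.linspace(0.0, 8.0, 17)       # post-decision battery levels (assumed grid)
Q_GRID = np.linspace(0.0, 8.0, 17)       # post-decision queue lengths (assumed grid)
P_GRID = np.linspace(0.0, 8.0, 33)       # candidate transmit powers (assumed grid)
MU, Q_BAR_MAX, L = 0.5, 3.0, 50.0        # arrival rate, average-queue bound, multiplier cap

J = np.zeros((len(B_GRID), len(Q_GRID))) # post-decision state-value table J_post
lam = 0.1                                # Lagrange multiplier lambda[1]
REF = (0, 0)                             # fixed reference state (b0, q0) in (4.19)

def idx(grid, x):
    return int(np.argmin(np.abs(grid - np.clip(x, grid[0], grid[-1]))))

def p_max(b, q, h):
    return min((2.0 ** q - 1.0) / h, b)

def greedy(b, q, h, lam, table):
    """Transmission phase (4.18): grid search over feasible powers."""
    best_P, best_val = 0.0, -np.inf
    for P in P_GRID:
        if P > p_max(b, q, h):
            break
        r = np.log2(1.0 + P * h)
        val = MU - lam * q + table[idx(B_GRID, b - P), idx(Q_GRID, q - r + MU)]
        if val > best_val:
            best_val, best_P = val, P
    return best_P

rng = np.random.default_rng(1)
b, q = 0.0, 0.0
for t in range(1, 2001):
    h = rng.exponential(1.0)              # assumed unit-mean fading
    e = rng.uniform(0.0, 4.0)             # assumed uniform EH with mean E_bar = 2
    phi, nu = (1.0 / t) ** 0.70, (1.0 / t) ** 0.85   # step sizes satisfying (4.21)

    P = greedy(b, q, h, lam, J)           # power used in frame t
    r = np.log2(1.0 + P * h)

    # State-value updating phase (4.19): time-averaged stochastic estimate of (4.17)
    J_prev = J.copy()
    for i, bc in enumerate(B_GRID):
        for j, qc in enumerate(Q_GRID):
            target = -np.inf
            for Pc in P_GRID:
                if Pc > p_max(bc + e, qc, h):
                    break
                rc = np.log2(1.0 + Pc * h)
                cand = MU - lam * qc + J_prev[idx(B_GRID, bc + e - Pc),
                                              idx(Q_GRID, qc - rc + MU)]
                target = max(target, cand)
            J[i, j] = (1 - phi) * J_prev[i, j] + phi * (target - J_prev[REF])

    lam = float(np.clip(lam + nu * (q - Q_BAR_MAX), 0.0, L))   # multiplier update (4.20)
    b, q = b - P + e, q - min(q, r) + MU                       # physical state (4.2), (4.4)
```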

4.2.3 Baseline Transmission Schemes

To show the effectiveness of the developed optimal power control scheme for the average delay model by simulations in Sect. 4.4, we propose two baseline schemes in this subsection based on results available in the existing literature. The first baseline scheme, namely the benchmark scheme, does not keep track of the battery state in each time interval to achieve the optimal throughput. Instead of constraining the instantaneous transmit power, the average power consumption is upper bounded by the average harvested energy. We show by simulations that our developed scheme and the benchmark scheme result in the same optimal throughput for a given average delay requirement. The second baseline scheme, denoted as the naive scheme, heuristically calculates the transmit power as a function of the amount of energy remaining in the battery and the number of bits to be transmitted.

1. Benchmark scheme: It has been shown in [14] that the optimal utility of an EH system can be calculated by knowing the average harvested energy only without requiring the dynamics of the battery to be considered. Based on this finding, we formulate an optimization problem for the benchmark scheme as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\max }\limits_{\mu,P[t]}\ \ \ \ \!\!\mu \quad \mathtt{s.t.:}\quad \mathbb{E}{\bigl \{P[t]\bigr \}} \leq \bar{ E},\quad \bar{Q} \leq \bar{ Q}^{\max },& &{}\end{array}$$
(4.22)

where we recall that \(\bar{E}\) is the average harvested energy in each frame. Note that the difference between problems (4.22) and (4.6) is that problem (4.6) keeps track of both the battery and queue length states in each time interval, whereas problem (4.22) keeps track of the queue length state only. Therefore, the computational complexity of solving problem (4.22) is much lower than that of solving problem (4.6). The optimal solution of problem (4.22) can be obtained by using the Lagrangian approach as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\min }\limits_{\zeta \geq 0,\,\psi \geq 0}\,\,\Bigg\{\mathop{\max }\limits_{\mu,P[t]}\,\,{\Bigl \{\mu -\zeta \bar{ Q} -\psi \mathbb{E}{\bigl \{P[t]\bigr \}}\Bigr \}} +\zeta \bar{ Q}^{\max } +\psi \bar{ E}\Bigg\},& &{}\end{array}$$
(4.23)

where ζ and ψ are the multipliers. The optimal solution for the benchmark scheme can be obtained by exploiting the same post-decision state approach as used in the scheme developed in Sect. 4.2.1.

Problem (4.23) involves a single-dimensional MDP, as opposed to the two-dimensional MDP in (4.9). Therefore, for given ζ and ψ, the post-decision state-value function is a function of the queue length state q only. As such, the computational complexity of updating the post-decision state-value function is much lower than that of the developed scheme.

In this benchmark scheme, it is implicitly assumed that the available energy in each time interval is infinite, even though the average transmit power is constrained by the finite \(\bar{E}\). It is worth mentioning that the benchmark scheme is a theoretical abstraction used to obtain the optimal throughput of an EH system. Hence, this scheme cannot be applied in real-time systems (as it may not be feasible in certain time intervals), where the available energy in each time interval depends on the random EH process and the past control actions. Nonetheless, the reason for considering the benchmark scheme in this chapter is to show that our scheme can achieve the same optimal throughput as that offered by the benchmark scheme. In fact, our developed online algorithm in Sect. 4.2.2 takes into account the dynamics of the available energy, unlike the benchmark scheme, while providing the same optimal solution. We show this comparison in detail in Sect. 4.4.

2. Naive Scheme: In this scheme, in each frame t, we assign \(P[t] =\min {\Bigl \{ b[t], \frac{2^{q[t]}-1} {h[t]} \Bigr \}}\), as sketched below. This scheme neither takes into account the channel and energy arrival statistics nor applies any learning technique to improve the power control policy. The purpose of considering the naive scheme is to show the effectiveness of controlling the transmit power intelligently over the transmission frames, as developed in our proposed scheme.
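
The naive rule is a one-liner; a sketch for reference:

```python
def naive_power(b, q, h):
    """Naive scheme: transmit just enough to empty the data queue, capped by
    the energy currently available in the battery."""
    return min(b, (2.0 ** q - 1.0) / h)
```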

4.3 Power Allocation Under Delay-Outage Constraint

In this section, we propose an approach to solve problem (4.7).

4.3.1 Effective Capacity Maximization

In order to handle the delay-outage constraint, we need to study the tail distribution of the steady-state queue length random variable Q, which is very cumbersome. To overcome this difficulty, we assume a large-delay regime, i.e., Q max is sufficiently large, and employ asymptotic delay analysis. More specifically, using (2.6), problem (4.7) can be reformulated as the following effective capacity maximization problem:

$$\displaystyle\begin{array}{rcl} \mathop{\max }\limits_{P[t] \leq b[t],\forall t}\ \ \ \ \!\!\! - \frac{1} {\theta ^{\mathrm{tar}}}\log \mathbb{E}{\Bigl \{e^{-\theta ^{\mathrm{tar}}r[t] }\Bigr \}},\quad \theta ^{\mathrm{tar}} \triangleq -\log (\zeta _{ Q})/Q^{\mathrm{max}},& &{}\end{array}$$
(4.24)

where r[t] is given by (4.3). In the following, for convenience, let us denote the normalized delay exponent as \(\theta =\theta ^{\mathrm{tar}}/\log (2)\). Using the monotonicity of log(⋅), problem (4.24) can be re-expressed as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\min }\limits_{P[t] \leq b[t],\forall t}\ \ \ \ \!\!\!\mathbb{E}\Big\{(1 + h[t]P[t])^{-\theta }\Big\}.& &{}\end{array}$$
(4.25)

We can now observe that problem (4.25) is an infinite-horizon MDP. In the following, we present an approach to solve and analyze problem (4.25).

We focus on stationary policies π S for problem (4.25). The policy π S can be represented by a function \(\pi _{\text{S}}: \mathcal{B}\times \mathcal{H}\rightarrow \mathbb{R}^{+}\) specifying the power control action in frame t as P[t] = π S(b[t], h[t]) such that P[t] ∈ [0, b[t]]. Note that, in contrast to π A (the policy for the average delay constraint model), π S does not depend on \(\mathcal{Q}\) and hence is not a function of q[t]. The optimal value of problem (4.24), i.e., the maximum supportable constant arrival rate μ under the optimal policy π S, represents the effective capacity of the considered EH system [15, 16]. Note that when θ → 0, i.e., no constraint is imposed on the delay requirement, the solution of problem (4.25) can also be obtained from the classical optimal online schemes described in [6, 9] for a sufficiently large number of transmission frames and for known channel fading and EH statistics. Similar to the average delay constraint model, we assume that the channel fading and EH statistics are unknown for the delay-outage constraint model as well.
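
A small numerical sketch of the reformulation (4.24)-(4.25): compute the delay exponent θ^tar from (Q max, ζ Q), then estimate the objective of (4.25) and the corresponding effective capacity for a fixed transmit power by Monte Carlo. The constant-power policy, the unit-mean exponential fading, and the sample size are assumptions for illustration; here log(2) in θ = θ^tar/log(2) is taken as the natural logarithm of 2, which makes the exponent in (4.25) consistent with e^{−θ^tar r[t]} for r[t] in bits.

```python
import numpy as np

Q_MAX, ZETA_Q = 8.0, 1e-2
theta_tar = -np.log(ZETA_Q) / Q_MAX            # delay exponent in (4.24)
theta = theta_tar / np.log(2.0)                # normalized exponent used in (4.25)

rng = np.random.default_rng(0)
h = rng.exponential(1.0, size=200_000)         # assumed unit-mean channel power gain
P = 2.0                                        # illustrative constant transmit power
obj = np.mean((1.0 + h * P) ** (-theta))       # objective of (4.25)
eff_cap = -np.log(obj) / theta_tar             # effective capacity in (4.24), bits/frame
print(f"theta_tar={theta_tar:.3f}  effective capacity={eff_cap:.3f}")
```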

Let V (b, h) denote the (pre-decision) state-value function for problem (4.25), i.e., V (b, h) is the optimal value of problem (4.25) with the initial state (b[1], h[1]) = (b, h). Bellman's optimality equation for problem (4.25) can be written as follows [12]:

$$\displaystyle\begin{array}{rcl} V (b,h) =\mathop{\min }\limits_{ \begin{array}{c}P\leq b\end{array}}\Bigg\{(1 + hP)^{-\theta } +\sum \limits _{\hat{ h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})V (b - P +\hat{ e},\hat{h})\Bigg\} - V (b_{0},h_{0})& &{}\end{array}$$
(4.26)

for a fixed state (b 0, h 0). The optimal policy π S is the optimal solution of (4.26).

4.3.1.1 Post-decision State-Value Function Approach

Similar to the average delay model, we adopt the post-decision state-value function approach for the delay-outage constraint model to optimally control the transmit power. The post-decision state-value function \(V _{\text{post}}(\check{b})\) for the delay-outage constraint model is defined as follows:

$$\displaystyle\begin{array}{rcl} V _{\text{post}}(\check{b}) =\sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})V (\check{b} +\hat{ e},\hat{h})& &{}\end{array}$$
(4.27)

for post-decision states \(\check{b} \in \mathcal{B}\). The dynamics of the battery can be represented as \(\check{b}[t] = b[t] - P[t]\), and \(b[t + 1] =\check{ b}[t] + e[t]\). Using (4.26) and (4.27), π S can be computed as:

$$\displaystyle\begin{array}{rcl} \mathop{\arg \min }\limits_{\begin{array}{c}P\leq b\end{array}}\Bigg\{(1 + hP)^{-\theta } + V _{\text{post}}(b - P)\Bigg\}.& &{}\end{array}$$
(4.28)

Lemma 4.3

\(V _{\text{post}}(\check{b})\) is a convex decreasing function of \(\check{b}\) .

Proof

First, we prove the monotonic decreasing property of \(V _{\text{post}}(\check{b})\). The monotonicity follows since \((1 +\hat{ h}P)^{-\theta }\) is decreasing in P, and a larger stored energy allows a larger feasible P. We use the induction method to prove the convexity of \(V _{\text{post}}(\check{b})\). In particular, we show that \(V _{\text{post}}(\check{b})[t]\) in (4.31) is convex for \(t = 1,2,\mathop{\ldots }\), and since \(\lim \limits _{t\rightarrow \infty }V _{\text{post}}(\check{b})[t] = V _{\text{post}}(\check{b})\), we conclude that \(V _{\text{post}}(\check{b})\) is convex.

We initialize \(V _{\text{post}}(\check{b})[1] = 0\) and assume \(V _{\text{post}}(\check{b})[t]\) to be convex for a given \(\hat{h} \in \mathcal{H}\). By the induction method, we then have to prove that \(V _{\text{post}}(\check{b})[t + 1]\) is convex. Note that \((1 +\hat{ h}P)^{-\theta }\) is convex in P, and as \(V _{\text{post}}(\check{b})[t]\) is assumed to be convex in \(\check{b}\), we conclude that \((1 +\hat{ h}P)^{-\theta } + V _{\text{post}}(\check{b} - P +\hat{ e})[t]\) is jointly convex in P and \(\check{b}\) for \(P \in [0,\check{b}]\) [11]. Moreover, the partial minimization of a jointly convex function over a convex set is convex. Hence,

$$\displaystyle\begin{array}{rcl} \mathop{\min }\limits_{\begin{array}{c}P\leq \check{b}+\hat{e}\end{array}}\Bigg\{(1 +\hat{ h}P)^{-\theta } + V _{\text{post}}(\check{b} +\hat{ e} - P)\Bigg\}& &{}\end{array}$$
(4.29)

is convex in \(\check{b}\). Then, from (4.31), we conclude that \(V _{\text{post}}(\check{b})[t + 1]\) is convex, since the expectation operation preserves convexity. Therefore, \(V _{\text{post}}(\check{b})\) is a convex decreasing function of \(\check{b}\).

Theorem 4.2

The optimal control policy π S (b, h) is a non-decreasing function of b for a given h.

Proof

Using [13, Lemma 1] and Lemma 4.3, we can show that \((1 +\hat{ h}P)^{-\theta } + V _{\text{post}}(\check{b} - P +\hat{ e})\) has increasing differences in \((\check{b},P)\) for \(P \in [0,\check{b} +\hat{ e}]\). Then, by applying Topkis' monotonicity theorem, we deduce that the control action P is non-decreasing in \(\check{b}\) for a given \(\hat{h}\). Intuitively, this is expected: when \(\check{b}\) increases, the optimization domain \([0,\check{b} +\hat{ e}]\) for P becomes larger, and the larger set allows \((1 + hP)^{-\theta } + V _{\text{post}}(\check{b} - P +\hat{ e})\) to be reduced further. Hence, expressing the parameters in terms of the pre-decision state, we conclude that π S (b, h) is non-decreasing in b for a given h.

Similar to Theorem 4.1, we observe that we should allocate more power for transmission to increase the throughput if we have more energy available in the battery. This finding helps to reduce the computational complexity of solving (4.28), as we restrict the search space in one direction to reach the optimal solution.

Lemma 4.3 and Theorem 4.2 provide insights into the structural properties of the post-decision state-value function \(V _{\text{post}}(\check{b})\) and the optimal power control policy towards developing the online algorithm in Sect. 4.3.2. Moreover, since the post-decision state-value function \(V _{\text{post}}(\check{b})\) is a convex decreasing function of \(\check{b}\), we can follow the approximation method developed in [4] to approximate \(V _{\text{post}}(\check{b})\), alleviating the computational complexity and yielding a suboptimal online algorithm.

4.3.2 Online Algorithm

We propose an online algorithm to obtain the optimal policy π S under the delay-outage constraint.

From (4.26) and (4.27), we can write the optimality equation for delay-outage constraint model as follows:

$$\displaystyle\begin{array}{rcl} V _{\text{post}}(\check{b})\! =\!\sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})\mathop{\min }\limits_{\begin{array}{c}P\leq \check{b}+\hat{e}\end{array}}\Bigg\{(1 +\hat{ h}P)^{-\theta } + V _{\text{post}}(\check{b} +\hat{ e} - P)\Bigg\}\! -\! V _{\text{post}}(\check{b}_{ 0})& &{}\end{array}$$
(4.30)

for some fixed state \(\check{b}_{0}\).

Notice that when the statistics of the channel and EH processes are known, \(V _{\text{post}}(\check{b})\) can be computed using the sequential RVIA as follows for t = 1, 2, ⋯ :

$$\displaystyle\begin{array}{rcl} V _{\text{post}}(\check{b})[t + 1]& =& \sum \limits _{\hat{h}\in \mathcal{H}}\sum \limits _{\hat{e}\in \mathcal{E}}p_{\mathcal{H}}(\hat{h})p_{\mathcal{E}}(\hat{e})\min _{P\leq \check{b}+\hat{e}}\Bigg\{(1 +\hat{ h}P)^{-\theta } + V _{\text{post}}(\check{b} +\hat{ e} - P)[t]\Bigg\} \\ & & \qquad \qquad \qquad - V _{\text{post}}(\check{b}_{0})[t], {}\end{array}$$
(4.31)

with initial value function \(V _{\text{post}}(\check{b})[1]\). For online implementation, we can follow the same procedures, i.e., the initialization, transmission, and state-value function updating phases, as described for the average delay model in Sect. 4.2 (there is no multiplier update here, since problem (4.25) involves no Lagrange multiplier); a sketch is given below.
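
A minimal sketch of these phases for the delay-outage model: the transmission phase solves (4.28) with the current estimate of V post, and the updating phase applies a time-averaged stochastic version of (4.31). The discretization, horizon, and distributions are assumptions for illustration.

```python
import numpy as np

B_GRID = np.linspace(0.0, 8.0, 33)        # post-decision battery levels (assumed grid)
P_GRID = np.linspace(0.0, 8.0, 33)        # candidate transmit powers (assumed grid)
THETA = 0.5                               # normalized delay exponent (illustrative)
V = np.zeros(len(B_GRID))                 # post-decision state-value estimates V_post
REF = 0                                   # fixed reference state b0 in (4.30)

def idx(x):
    return int(np.argmin(np.abs(B_GRID - np.clip(x, B_GRID[0], B_GRID[-1]))))

def greedy_power(b, h, table):
    """Transmission phase (4.28): min over P <= b of (1 + hP)^(-theta) + V_post(b - P)."""
    feas = P_GRID[P_GRID <= b]
    costs = (1.0 + h * feas) ** (-THETA) + np.array([table[idx(b - P)] for P in feas])
    return float(feas[np.argmin(costs)])

rng = np.random.default_rng(2)
b = 0.0
for t in range(1, 3001):
    h = rng.exponential(1.0)              # assumed unit-mean fading
    e = rng.uniform(0.0, 4.0)             # assumed uniform EH with mean 2
    phi = (1.0 / t) ** 0.70               # step size satisfying (4.21)

    P = greedy_power(b, h, V)             # power used in frame t

    # Updating phase: time-averaged stochastic estimate of the RVIA step (4.31)
    V_prev = V.copy()
    for i, bc in enumerate(B_GRID):
        target = min((1.0 + h * Pc) ** (-THETA) + V_prev[idx(bc + e - Pc)]
                     for Pc in P_GRID if Pc <= bc + e)
        V[i] = (1 - phi) * V_prev[i] + phi * (target - V_prev[REF])

    b = b - P + e                         # battery dynamics (4.2)
```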

4.3.3 Baseline Transmission Schemes

Based on the concepts behind developing the baseline schemes for the average delay model, we describe two similar types of baseline schemes for the delay-outage constraint model to show the effectiveness of our developed scheme.

1. Benchmark Scheme: It has been shown in [14] that the optimal utility of an EH system can be calculated by knowing the average harvested energy only without requiring the exact distribution of the EH process to be known. Hence, with a given average harvested energy \(\bar{E}\), we formulate an optimization problem for the ‘benchmark scheme’ as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\min }\limits_{P[t] \geq 0}\ \ \mathbb{E}\Bigg\{(1 + h[t]P[t])^{-\theta }\Bigg\}\quad \mathtt{s.t.:}\quad \mathbb{E}{\bigl \{P[t]\bigr \}} \leq \bar{ E}.& &{}\end{array}$$
(4.32)

The Lagrangian of problem (4.32) is given by

$$\displaystyle\begin{array}{rcl} \mathcal{L} = \mathbb{E}\Big\{(1 + h[t]P[t])^{-\theta }\Big\} +\eta (\mathbb{E}\{P[t]\} -\bar{ E}),& &{}\end{array}$$
(4.33)

where η represents the Lagrange multiplier associated with the only constraint of problem (4.32). Applying the Karush-Kuhn-Tucker (KKT) optimality conditions [11], we obtain the optimal P[t] as follows:

$$\displaystyle{ P^{{\ast}}[t] = \left \{\begin{array}{ll} \Big( \frac{\theta }{\eta (h[t])^{\theta }}\Big)^{ \frac{1} {1+\theta } } - \frac{1} {h[t]},\qquad &\text{if}\quad h[t] \geq \frac{\eta } {\theta } \\ 0, &\text{otherwise.} \end{array} \right \}. }$$
(4.34)

From the KKT optimality conditions, we can show that the constraint of problem (4.32) is satisfied with equality at the optimal point. Hence, the optimal solution of η can be obtained numerically by solving the following equation

$$\displaystyle\begin{array}{rcl} \int \limits _{\frac{\eta }{\theta } }^{\infty }\Bigg(\Big( \frac{\theta } {\eta h^{\theta }}\Big)^{ \frac{1} {1+\theta } } -\frac{1} {h}\Bigg)p_{\mathcal{H}}(h)dh =\bar{ E}.& &{}\end{array}$$
(4.35)

We can see that combining (4.34) and (4.35) yields the same result as that obtained for a non-EH system, e.g., [16, Eqs. (8), (9)], if the average available energy is replaced by the average harvested energy \(\bar{E}\). Note that in this benchmark scheme, it is assumed that the available energy in each frame is infinite (see (4.34)), even though the average energy is constrained by \(\bar{E}\). Hence, similar to the average delay model, this scheme cannot be applied in real-time systems, where the available energy in each frame depends on the realizations of the random EH process and the past control actions.
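
A sketch of how the benchmark allocation can be computed numerically for exponentially distributed channel power gains (the distribution assumed in Sect. 4.4): evaluate (4.34) for a candidate η, estimate the left-hand side of (4.35) by Monte Carlo, and bisect on η until the average-power constraint is met with equality. The bracket for η and the sample size are assumptions.

```python
import numpy as np

THETA, E_BAR = 0.5, 2.0
rng = np.random.default_rng(0)
h_samples = rng.exponential(1.0, size=500_000)   # assumed unit-mean channel power gain

def power_kkt(h, eta):
    """Per-state optimal power (4.34)."""
    P = (THETA / (eta * h ** THETA)) ** (1.0 / (1.0 + THETA)) - 1.0 / h
    return np.where(h >= eta / THETA, P, 0.0)

def avg_power(eta):
    """Monte-Carlo estimate of the left-hand side of (4.35)."""
    return power_kkt(h_samples, eta).mean()

lo, hi = 1e-6, 1e3                 # assumed bracket; avg_power is decreasing in eta
for _ in range(80):
    mid = np.sqrt(lo * hi)         # bisection on a logarithmic scale
    if avg_power(mid) > E_BAR:
        lo = mid                   # too much power spent -> eta must increase
    else:
        hi = mid
eta = np.sqrt(lo * hi)
print(f"eta={eta:.4f}  avg power={avg_power(eta):.3f}  (target E_bar={E_BAR})")
```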

2. Offline Scheme: In this baseline scheme, we formulate an offline optimization problem, motivated by the contributions in [6, 9], for performance comparison. This offline scheme was originally proposed for a finite number of transmission time intervals in [6, 9], whereas we consider an infinite time horizon. Hence, to make a fair comparison in the numerical results, we consider a large number of time intervals for this scheme when comparing its performance with the developed and benchmark schemes. We formulate the optimization problem as follows:

$$\displaystyle\begin{array}{rcl} & & \mathop{\min }\limits_{P[t] \geq 0,\forall t}\qquad \qquad \frac{1} {\mathcal{T}}\sum \limits _{t=1}^{\mathcal{T}}(1 + h[t]P[t])^{-\theta } \\ & &\text{s. t.:}\qquad \qquad \ \ \ \ \sum \limits _{k=1}^{t}P[k] \leq \sum \limits _{ k=0}^{t-1}e[k],t = 1,\mathop{\ldots },\mathcal{T}{}\end{array}$$
(4.36)

where \(\mathcal{T}\) denotes the maximum number of time intervals. Problem (4.36) is a convex optimization problem and hence can be solved optimally and efficiently [11]. Applying the KKT optimality conditions to problem (4.36), we obtain the optimal power allocation P^∗[t] as follows:

$$\displaystyle{ P^{{\ast}}[t] = \left \{\begin{array}{l} \Bigg( \frac{\theta }{(h[t])^{\theta }\sum \limits _{ k=t}^{\mathcal{T}}\lambda [k]}\Bigg)^{ \frac{1} {1+\theta } } - \frac{1} {h[t]},\qquad \text{if}\quad h[t] \geq \sum \limits _{k=t}^{\mathcal{T}}\lambda [k]/\theta \\ 0,\qquad \qquad \qquad \qquad \qquad \quad \quad \,\,\,\text{otherwise,} \end{array} \right. }$$

where λ[k], \(k = 1,\mathop{\ldots },\mathcal{T}\), denote the Lagrange multipliers associated with the constraints in (4.36). Note that when θ → 0, i.e., any amount of delay is allowed, the above power allocation provides the same solution as that obtained in [9] for fading channels.
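
A sketch of the offline problem (4.36) solved directly as a convex program with an off-the-shelf solver; the realizations, the horizon, and the use of SLSQP are assumptions for illustration (any convex solver would do). The t-th cumulative constraint encodes that the energy spent up to frame t cannot exceed the energy harvested before frame t.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T, THETA = 50, 0.5
h = rng.exponential(1.0, T)                  # assumed fading realizations h[1..T]
e = rng.uniform(0.0, 4.0, T)                 # assumed EH realizations e[0..T-1], mean 2

def objective(P):
    """Time-averaged objective of (4.36)."""
    return np.mean((1.0 + h * P) ** (-THETA))

# Cumulative EH constraints of (4.36): sum_{k=1}^{t} P[k] <= sum_{k=0}^{t-1} e[k].
cons = [{"type": "ineq",
         "fun": (lambda P, t=t: e[:t + 1].sum() - P[:t + 1].sum())}
        for t in range(T)]

res = minimize(objective, x0=np.zeros(T), bounds=[(0.0, None)] * T,
               constraints=cons, method="SLSQP")
print(res.fun, res.x[:5])
```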

3. Naive Scheme: In this naive scheme, in each transmission frame t, we assign P[t] = e[t], irrespective of the channel condition. Note that this scheme, being overly aggressive in spending energy, does not take into account the impact of channel and energy arrival statistics, and hence the long-term effect of the power allocation policy is completely ignored.

4.4 Illustrative Results

In this section, we evaluate the performance of the developed power control schemes and the baseline schemes for both the average delay and delay-outage constraint models. We assume an exponentially distributed channel power gain with an average value of 0 dB. We assume a random energy profile that is uniformly distributed between 0 and \(2\bar{E}\) [6]. Note that our developed scheme is general enough to accommodate any ergodic energy distribution. To incorporate the delay-outage constraint, we consider the maximum queue length Q max = 8 in Figs. 4.4 and 4.5. Further, the step-size parameters for the learning rate sequences and the Lagrange multiplier update equations are chosen as \(\phi [t] = (1/t)^{0.70}\) and \(\nu [t] = (1/t)^{0.85}\), respectively.

4.4.1 Average Delay Constraint

Figure 4.2 shows the optimal supportable throughput versus queue length bound trade-off curves for the proposed and baseline schemes under the average delay constraint model. We set the average EH rate \(\bar{E} = 2\), and evaluate the optimal throughputs for a given range of maximum time-averaged queue length \(\bar{Q}^{\max }\). We observe that the throughput increases with increasing queue length bound for both proposed and baseline schemes. However, the increasing rate of the throughput is high for smaller values of queue length bound, while the (increasing) rate slows down for higher values of the bound. Figure 4.2 also shows that we achieve the same throughput for the proposed and the benchmark schemes. Recall that the benchmark scheme for the average delay model does not keep track of the battery state in each time interval, and hence the control action taken in each time interval may not always be feasible. For instance, the calculated optimal power in a given time interval may be greater than the amount of remaining energy in the battery. Therefore, in spite of the lower computational complexity offered by the benchmark scheme, this scheme is not implementable in practice. In contrast, our developed scheme keeps track of both the battery and data queue length states, takes optimal control actions in each time interval, and still achieves the same optimal throughput as is achieved by the benchmark scheme. Further, we observe that the proposed scheme outperforms the naive scheme and the performance gap between the proposed and naive schemes increases with increasing queue length bound requirement. The naive scheme does not learn the channel and energy statistics over the transmission time and yields deteriorated performance by spending a large amount of energy that the battery contains in each time interval. Therefore, we can conclude that although the proposed scheme incurs higher complexity compared to the naive scheme, it is worth implementing the former scheme because of the large performance gap between the two schemes, particularly in the range of \(\bar{Q}^{\max } \geq 3\).

Fig. 4.2 Throughput versus maximum average queue length \(\bar{Q}^{\max }\)

In Fig. 4.3, we show throughput versus average harvested energy \(\bar{E}\) for the proposed scheme for different values of the maximum average queue length \(\bar{Q}^{\max }\). In particular, we consider \(\bar{Q}^{\max } =\{ 0.75,1.5,2.5,4.5\}\). We observe that the throughput increases with increasing \(\bar{E}\) for a given \(\bar{Q}^{\max }\). The higher available energy helps to transmit more data bits even when there is a stringent delay-requirement. However, the increasing rate of throughput is low for smaller values of \(\bar{Q}^{\max }\), whereas the (increasing) rate is high for higher values of \(\bar{Q}^{\max }\). For instance, increasing the average harvested energy \(\bar{E}\) from 0.5 to 5 increases the throughput by 0.15 when \(\bar{Q}^{\max } = 0.75\). On the other hand, with the same amount of incremental harvested energy, the throughput increases by 0.52 when \(\bar{Q}^{\max } = 4.5\).

Fig. 4.3 Throughput versus average harvested energy \(\bar{E}\)

4.4.2 Delay-Outage Constraint

We compare the performance of our proposed scheme with that of the baseline schemes in Fig. 4.4 for \(\bar{E} = 2\). Note that similar to the conventional non-EH system [16], the effective capacity of a single link EH system increases with ζ Q . This result has already been shown in [17] for a single link EH system with known channel and energy profiles. Moreover, similar to Fig. 4.2, we observe that the proposed scheme provides the same optimal result as that obtained from the benchmark scheme for all the considered values of ζ Q . Therefore, we can conclude for the delay-outage constraint model that by considering the dynamics of the battery and applying the optimal control action according to the proposed learning algorithm, we can still achieve the same optimal effective capacity even for unknown channel and energy statistics. Moreover, the performance gap between our scheme and the naive baseline scheme for the considered range of ζ Q exemplifies the impact of an intelligent power allocation strategy over a heuristic one, which does not take into account the channel and energy statistics to allocate the transmit power.

Fig. 4.4 Effective capacity versus ζ Q

In Fig. 4.5, we show the behavior of the effective capacity with \(\bar{E}\) for ζ Q = {10^{-6}, 10^{-4}, 10^{-2}, 10^{-1}, 0.4, 1}. We observe that the effective capacity increases with increasing \(\bar{E}\) for a given queue-length-outage probability ζ Q. Note that ζ Q,A denotes the (simulated) outage probability for the average delay model, and \(\bar{Q}_{\text{S}}\) denotes the (simulated) time-averaged queue length for the delay-outage model. For instance, in the case of ζ Q = 1, i.e., unconstrained delay, the effective capacity can be increased by 1.53 if \(\bar{E}\) is increased from 0.5 to 5. However, the rate of increase of the effective capacity is comparatively smaller when there is a stringent delay constraint. For example, when ζ Q = 10^{-6}, the effective capacity can be improved by 0.82 if we increase \(\bar{E}\) from 0.5 to 5. Furthermore, changing ζ Q has a small impact on the effective capacity for small \(\bar{E}\) (e.g., the effective capacity decreases by 0.31 when ζ Q is changed from 1 to 10^{-6} for \(\bar{E} = 0.5\)) and a larger impact for large \(\bar{E}\) (e.g., the effective capacity decreases by 1.02 when ζ Q is changed from 1 to 10^{-6} for \(\bar{E} = 5\)).

Fig. 4.5 Effective capacity versus average harvested energy \(\bar{E}\)

4.4.3 Average Delay Versus Delay-Outage Constraints

We compare the performance of the average delay and delay-outage constraint models in Tables 4.1 and 4.2. We fix Q max = 8 for both tables and consider two cases of queue-length-outage probabilities; precisely, we set ζ Q = 0.1 and ζ Q = 0.01 for Tables 4.1 and 4.2, respectively, for the delay-outage constraint model. We vary \(\bar{E}\) from 0.5 to 5 in steps of 0.5 and evaluate the effective capacity and the average queue length of the delay-outage constraint model for each value of \(\bar{E}\). Then, we set the maximum average queue length for the average delay model in such a way that the achieved throughput equals the effective capacity obtained by the delay-outage constraint model. We calculate the outage probability for the average delay model by counting the events in which the instantaneous queue length exceeds Q max = 8 (the queue length bound of the delay-outage constraint model). By setting Q max = 8, we ensure that the maximum queue length is sufficiently large compared to the average queue length, so that the delay-outage constraint is satisfied for all the considered values of \(\bar{E}\). It is worth mentioning that we can show via the KKT optimality conditions that the maximum average queue length is the same as the time-averaged queue length for the average delay model.

Table 4.1 ζ Q = 0.1
Table 4.2 ζ Q = 0.01

We observe that the average queue length for the delay-outage constraint model is higher than that for the average delay constraint model. This is because the former does not aim to minimize the average queue length, which is the main objective of the latter. On the contrary, the latter yields a higher queue-length-outage probability than the former. In the average delay constraint model, we cannot control the events in which the queue length exceeds a certain threshold, and hence we end up with a higher outage probability. Therefore, we conclude that both the average delay and delay-outage constraint models are important, depending on the system application. For instance, in real-time applications, where a stringent delay-outage probability is required, the delay-outage constraint model is more appealing. On the contrary, in the case of a tight average delay requirement, the average delay model is the better choice.

4.4.4 Convergence Study of the Online Algorithms

We show the convergence behavior of the proposed online power allocation algorithms. In order to avoid redundancy, we only show the results for the delay-outage constraint model.

In Fig. 4.6, we show the convergence of the effective capacity for the proposed learning algorithm under the delay-outage constraint model for two scenarios of the queue-length-outage probability ζ Q, namely ζ Q = 0.90 and ζ Q = 10^{-6}. Further, we adopt \(\bar{E} = 2\) and determine the running average \(R^{\text{av}}[t] = \frac{t-1} {t} R^{\text{av}}[t - 1] + \frac{1} {t} (1 + h[t]P[t])^{-\theta }\) to evaluate the effective capacity R EC[t] in each time interval t ≥ 1 as \(R_{\text{EC}}[t] = - \frac{1} {\theta \log (2)}\log R^{\text{av}}[t]\), as sketched below. The results confirm that the proposed method converges to the optimal solution after 6000 transmission frames for both scenarios.
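
The running averages used in Figs. 4.6 and 4.7 can be accumulated on the fly; a short sketch with placeholder realizations and an illustrative exponent:

```python
import numpy as np

theta = 0.5                                  # illustrative normalized delay exponent
rng = np.random.default_rng(3)
R_av, P_av = 0.0, 0.0
for t in range(1, 6001):
    h, P = rng.exponential(1.0), 1.0         # placeholder channel realization and power
    R_av = (t - 1) / t * R_av + (1.0 / t) * (1.0 + h * P) ** (-theta)
    P_av = (t - 1) / t * P_av + (1.0 / t) * P
    R_ec = -np.log(R_av) / (theta * np.log(2.0))   # effective capacity R_EC[t]
print(f"R_EC[6000]={R_ec:.3f}  P_av[6000]={P_av:.3f}")
```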

Fig. 4.6 Online allocation algorithm convergence: effective capacity

We further show the convergence of the average transmit power for two scenarios of the average harvested energy, \(\bar{E} = 4\) and \(\bar{E} = 2\), with ζ Q = 0.90. Similar to Fig. 4.6, we evaluate the running average of the transmit power in each time frame t by \(P^{\text{av}}[t] = \frac{t-1} {t} P^{\text{av}}[t - 1] + \frac{1} {t} P[t]\), t ≥ 1 (Fig. 4.7). We observe that the average transmit powers converge after 6000 time frames for both scenarios. It is worth mentioning that the average transmit powers converge to the average harvested energy. This finding complies with the fact that the average transmit power of the proposed scheme achieves the same average value as that obtained from the benchmark scheme, because the constraint in (4.32) is always met with equality at the optimal point [14].

Fig. 4.7 Online allocation algorithm convergence: average power