1 Introduction

Nowadays, video applications such as Zoom, YouTube and Netflix are increasingly prevalent over the mobile Internet. The Cisco Visual Networking Index predicts that mobile video traffic will grow 9-fold from 2017 to 2022 and account for nearly four-fifths of total mobile data traffic by the end of the forecast period [8]. The major drivers behind this growth are the proliferation of powerful mobile devices, such as iPhone™-based and Android™-based smartphones, and the explosive demand for high-quality video streaming on them. By and large, these mobile video streaming services require high throughput and low latency. Therefore, efforts should be made to optimize the delivery of mobile video streaming applications.

Among different video streaming standards and technologies, Dynamic Adaptive Streaming over HTTP (DASH) is one of the dominant video streaming technologies over the mobile Internet [29]. Various proprietary proposals of DASH have been developed, such as Apple HTTP Live Streaming, Microsoft Smooth Streaming and Adobe HTTP Dynamic Streaming [24]. The basic idea of DASH is that a video sequence is partitioned into multiple segments/chunks of constant playback length, and replicas of each segment, encoded at different resolutions and qualities, are stored at different sites in a Content Delivery Network (CDN). DASH aims to adapt and optimize video streaming over time to offer the best possible video quality to the end user, by considering device capabilities, network conditions and content characteristics.

However, it is still challenging to stream video over wireless cellular networks with guaranteed Quality of Experience (QoE), due to the limited capacity of the cellular network and the massive growth in mobile video traffic. A straightforward way to sustain the explosive growth of video traffic in the mobile network is to upgrade the current cellular network to next-generation networks such as LTE-Advanced and 5G. Nevertheless, simply increasing the capacity of the cellular network might not always be economical [1]. Therefore, novel solutions for video streaming optimization must be continuously explored in order to deliver an enhanced QoE for a wide range of mobile video applications.

With the development of techniques for simultaneous utilization of multiple network interfaces at mobile devices, higher quality videos can be supported by using multiple wireless access networks simultaneously [29]. For example, in an area covered by both 802.11ac and LTE networks, a possible way to further enhance the video streaming performance is to download video chunks via the LTE and 802.11ac interfaces simultaneously. Therefore, we propose to combine the DASH technique with multipath video streaming by delivering a video as a sequence of small, independent segments encoded at different bitrates and allowing a single video segment to be transported over several wireless links for bandwidth aggregation. To achieve this, HTTP range retrieval requests are adopted so that a video segment can be logically partitioned and its parts downloaded through the different wireless network interfaces separately [3, 12, 29].
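As an illustration of the range-request idea, the following minimal sketch splits one segment into two contiguous byte ranges, one per interface. The function name `split_byte_ranges` and the parameter `share_lte` (the fraction of the segment fetched over LTE) are hypothetical and only serve the example; the resulting strings are ordinary HTTP `Range` header values.

```python
from typing import Tuple


def split_byte_ranges(segment_size: int, share_lte: float) -> Tuple[str, str]:
    """Return two HTTP Range header values that together cover one segment.

    segment_size: size of the segment in bytes.
    share_lte:    hypothetical fraction (0..1) of the bytes fetched over LTE.
    """
    if segment_size < 2 or not 0.0 < share_lte < 1.0:
        raise ValueError("need at least 2 bytes and a share strictly between 0 and 1")
    # First byte index of the second part, clamped so both parts are non-empty.
    cut = min(max(int(segment_size * share_lte), 1), segment_size - 1)
    lte_range = f"bytes=0-{cut - 1}"                 # requested over LTE
    wifi_range = f"bytes={cut}-{segment_size - 1}"   # requested over 802.11ac
    return lte_range, wifi_range


lte_header, wifi_header = split_byte_ranges(1000, 0.25)
```

Each string would be sent as the `Range` header of a separate HTTP GET issued on the corresponding interface, and the client concatenates the two responses to rebuild the segment.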

In addition to the adaptive bitrate at the application layer, parameters at the physical layer, such as the Modulation and Coding Scheme (MCS) in both LTE and 802.11ac networks, can be utilized to further optimize DASH-based video streaming. Thus, a DASH-based cross-layer video optimization scheme is proposed in this paper to improve the perceptual video quality for end-to-end video streaming over multiple wireless access networks. Specifically, the tuning of the MCS at the physical layer in both 802.11ac and LTE networks and the video bitrate switching at the application layer are jointly performed by the cross-layer optimization controller (see Fig. 1) according to feedback information such as the Signal-to-Interference-plus-Noise Ratio (SINR) and the buffer occupancy rate. The major contributions of this paper can be summarized as follows:

  • A DASH-based cross-layer optimization scheme is proposed for multipath video streaming over LTE and 802.11ac wireless networks. The MCS mode at the downlink physical layer and the video segment bitrate at the application layer are jointly adapted to enhance the video streaming performance.

  • The playback buffer occupancy rate is also considered for bitrate selection and rate allocation between the LTE and 802.11ac networks. A logarithmic quality function is proposed to model the perceived QoE of each requested segment. Then we formulate this DASH-based cross-layer multipath video streaming problem as a nonlinear optimization problem with mixed discrete-continuous constraints and try to find the optimal bitrate, MCS and rate allocation values to maximize the nonlinear and non-differentiable objective function for each segment.

  • To reduce the complexity, we propose an efficient online heuristic algorithm to find the sub-optimal solution to maximize the expected quality of the requested video segment and further evaluate its performance through a trace-driven simulation.

Fig. 1 The proposed DASH-based cross-layer optimization framework over multiple wireless networks

The rest of this paper is organized as follows. In Section 2, we discuss the related works concerning DASH-based video streaming in 802.11ac and LTE downlink networks. Section 3 describes the proposed DASH-based cross-layer multipath video streaming optimization framework, the tuning of parameters at the physical layer of LTE and 802.11ac networks, followed by the formulation of the optimization problem and the corresponding solution. In Section 4, we evaluate the performance of the proposed algorithm by trace-driven simulations, followed by the concluding remarks in Section 5.

2 Related work

To improve the quality of wireless video streaming from a cross-layer perspective, a variety of optimization schemes have been proposed. Zhao et al. [33] proposed a Structural SIMilarity index (SSIM)-based cross-layer optimized video streaming scheme over the LTE downlink wireless network. The MCS mode at the physical layer is selected to improve the perceptual video quality by jointly taking the characteristics of the video slice into account. In [2], Argyriou et al. investigated the performance of video streaming in heterogeneous cellular networks when the time-domain resource partitioning mechanism is employed. The perceived video quality for the subscribers is maximized by jointly optimizing the selected video quality transmitted to a user, the rate allocated to each specific user at the application layer, and the time-domain resource partitioning at the physical layer. In IEEE 802.11ac wireless local area networks, Chang et al. [6] proposed a cross-layer quality-adaptive strategy to maximize the perceived H.264/AVC video streaming quality. A multi-polling controlled access (MPCA) scheme at the MAC layer and the video frame types at the application layer are jointly considered to guarantee the latency for the critical video frames and reduce transmission overhead. However, the above studies [2, 6, 33] attempt to improve the video streaming performance through cross-layer methods within a single wireless network, without taking advantage of the aggregated bandwidth offered by multipath video streaming.

At the application layer, HTTP-based adaptive video streaming (standardized as DASH [24]) is being widely adopted as a form of Internet video delivery. In [24], the standards and design principles of the DASH specifications are presented and implementation examples are also provided. In DASH, the adaptive bitrate (ABR) algorithm in the client is critical to ensure a desirable QoE, and various ABR algorithms have been proposed. Previous ABR algorithms can typically be grouped into three classes: rate-based, buffer-based and reinforcement-learning-based methods. Rate-based algorithms [18, 25] usually request video segments at the highest bitrate that the network is predicted to support. However, these methods first estimate the available bitrate by observing past segment downloads, and are often hindered by biased throughput prediction on top of HTTP. In contrast, buffer-based methods merely keep track of the client’s playback buffer occupancy while selecting the bitrates for later video segments. These methods strive to keep the buffer occupancy above a pre-configured threshold which balances video quality and rebuffering events. The most advanced buffer-based methods, the Buffer-Based Approach (BBA) [15] and Bitrate Adaptation for Online Video (BOLA) [23], optimize a specified video quality metric based only on the observed buffer occupancy. Yin et al. [31] proposed a Model Predictive Control (MPC) algorithm which combines the rate-based and buffer-based techniques to select bitrates that are expected to maximize the QoE over several future video segments. Nevertheless, MPC still suffers from inaccurate throughput estimation, which is critical for its performance. The most recent reinforcement-learning-based approach, Pensieve [20], trains a neural network model to learn an ABR algorithm and selects bitrates automatically over a horizon of several future segments. Pensieve in the client learns the control policy for video bitrate adaptation purely through experience, without relying on any specific assumptions or pre-configured models of the environment. To summarize, the above papers utilize adaptive bitrate algorithms to make video quality decisions based on the predicted bandwidth or the buffer state of one wireless link; these decisions can be further optimized by jointly considering parameters at different protocol layers or by using multipath video streaming for bandwidth aggregation.

Leveraging both LTE and Wi-Fi links simultaneously can enhance the performance of video streaming services, and therefore numerous DASH-based multipath video streaming schemes have been studied. In [19], the authors proposed a video segment request policy called REQUEST for DASH-based video streaming in a smartphone utilizing both Wi-Fi and LTE interfaces. REQUEST enables better video quality and fewer rebuffering events than other existing schemes under given budgets of LTE data usage and battery energy. In a multi-user scenario, Ho et al. [14] presented a game-theoretic scalable offloading framework that enables seamless video streaming over LTE and Wi-Fi networks concurrently. In this framework, fountain encoding together with the progressive second price auction mechanism are employed to improve the video streaming performance among multiple smartphones. At the transport layer, Multipath TCP (MPTCP) and the Multipath QUIC protocol [28] are designed to offer significant benefits to DASH-based multihomed video streaming. However, the congestion control algorithms in these original multipath transport protocols are not well suited to multipath video streaming. James et al. [16] discussed whether MPTCP is always beneficial for video streaming over DASH. They found that without sufficient bandwidth on the secondary path, video streaming over MPTCP would suffer from degraded performance. Further, Han et al. [13] proposed a multipath framework called MP-DASH for video streaming over multiple network interfaces. MP-DASH strategically schedules video segments to satisfy user preferences. In order to provide a general framework, Chen et al. [7] proposed a client-side DASH-based video streaming solution, called MSPlayer, that exploits multiple CDN nodes and network interfaces. MSPlayer provides aggregated bandwidth for high-definition video streaming and reduces start-up latency. However, MSPlayer does not assume multipath video streaming over MPTCP, in which multiple transport links are presented as one logical link to the application layer. In addition, MSPlayer does not provide a strategy to select the wireless link. To address this, Elgabli et al. [11] proposed a preference-aware multipath video streaming algorithm over HTTP using MPTCP. However, these MPTCP-based multipath video streaming strategies cannot be deployed without modifying the original congestion control algorithms. Therefore, the MPQUIC protocol, which runs over UDP at the transport layer, is more suitable for multipath video streaming. As a baseline of our scheme, Viernickel et al. [28] proposed the Multipath-enabled QUIC (MPQUIC) solution to leverage multiple network interfaces for bandwidth aggregation. In this paper, we further improve the performance of multipath video streaming by adjusting the MCS mode at the physical layer in a cross-layer manner.

In summary, most existing video streaming solutions either rely purely on one network interface, or leverage multiple network interfaces without cross-layer optimization. Moreover, some researchers mainly seek an optimal ABR algorithm by tuning the policy agent in the client to cater to the new environment. Motivated by the above analyses, we attempt to take advantage of the aggregated bandwidth from LTE and 802.11ac network interfaces, and exploit the cross-layer scheme to further improve the performance of DASH-based multipath video streaming. In the next section, we describe the proposed DASH-based cross-layer optimization framework and the formulation of the optimization problem over LTE and 802.11ac networks.

3 DASH-based multipath cross-layer optimization

3.1 DASH-based cross-layer optimization framework

Figure 1 shows the proposed DASH-based cross-layer multipath video streaming optimization framework. In this framework, the multi-interfaced (LTE and 802.11ac) client sequentially requests video segments stored in different CDN nodes via the DASH technique over the LTE and 802.11ac wireless network interfaces simultaneously. On the CDN side, the video sequence is partitioned into multiple independent segments, and each segment has multiple replicas encoded at various bitrates [7]. To take full advantage of the aggregated bandwidth, each segment is logically divided into multiple subsegments, which can be requested through multiple wireless interfaces via HTTP range retrieval requests [12, 29]. In such a scenario, two crucial issues should be considered in the client to ensure good video streaming performance: how to select the bitrate for the newly requested segment, and how to slice each segment into two subsegments that are delivered through the LTE and 802.11ac networks respectively.

To achieve this, the segment bitrate and the rate allocation at the application layer and the MCS mode at the downlink physical layer are jointly adjusted by the cross-layer optimization controller embedded in the client side. When requesting a new segment, link adaptation, including the adjustment of the MCS mode, should be performed to adapt to the time-varying wireless channel states. Accordingly, the segment bitrate adjustments, comprising bitrate selection and rate allocation among separate links, are dynamically tuned to match the integrated channel goodput that the selected MCS can support. In addition, the buffer occupancy in the client is also considered by the controller to avoid rebuffering events.

The wireless channel is usually accompanied by time-varying characteristics and frequency-selective fading. To accommodate this, the Adaptive Modulation and Coding (AMC) is utilized to select the most suitable MCS mode based on the estimated channel state and Bit Error Rate (BER) /Block Error Rate (BLER). In practice, the MCS mode for a specific User Equipment (UE) is determined by the eNodeB/AP with the help of periodical feedback of Channel Quality Indicator (CQI) from the UE, which is represented by the Signal-to-Interference-plus-Noise-Ratio (SINR). For example, the MCS is selected to maintain the BLER of each resource block smaller than 10 percent for the LTE downlink channel adaptation [2, 33]. However, in our paper, the MCS mode is selected by considering both the SINR and the effect of its achieving goodput on the perceived video quality. In other words, the new segment bitrate value at the application layer should be selected up to the integrated bandwidth that LTE and 802.11ac downlink networks can support. Further, the rate allocation that determines the subsegment size transferred by the corresponding access networks is tuned to the selected MCS mode.

3.2 Video quality model for DASH

Two overarching goals have to be balanced in DASH-based video streaming applications. On the one hand, they attempt to maximize the video quality of each video segment by selecting the highest video rate that the networks can support while maintaining smooth video playback. On the other hand, they try to avoid rebuffering events, which halt video playback when the client’s receive buffer runs empty [6, 15, 20, 23, 31]. In this paper, the video is modelled as a sequence of consecutive video segments, \(\mathcal {V}=\{1,2,\cdots ,K\}\), each of which contains T seconds of video and is encoded at different bitrates. The player can choose to request a new segment with bitrate \(r_{i} \in \mathcal {R} ,i \in \mathcal {V}\), where \(\mathcal {R} \) is the set of all available bitrate values. This information, which characterizes the various representations of the media components (bitrates, resolutions, codecs, etc.), is contained in the media presentation description (MPD) file, which is requested by the client during the initialization phase [24].

Neglecting the impact of rebuffering events and the quality variations between two consecutive segments, we denote by \(q(\cdot ): \mathcal {R} \to \mathbb {R}_{+} \) the function which maps the selected video rate \(r_{i}\) of segment i to the perceptual video quality. According to [28], the perceptual video quality increases with the video bitrate. The slope is quite steep in the low bitrate region and gradually flattens at high bitrate values. The logarithmic function matches this characteristic well and is utilized to represent the video quality \(q(\cdot)\) in this paper. Therefore, the perceptual video quality is expressed as

$$ q(r_{i}) = \log~(1+\alpha \cdot r_{i}),~~ r_{i} \in \mathcal{R}, i \in \mathcal{V} $$
(1)

where α is a fitting parameter for a specific video codec and video sequence. It can be estimated from three or more trial encodings using nonlinear regression techniques.
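A minimal sketch of (1) and of one way α could be estimated from a few trial encodings is given below. It assumes the natural logarithm, bitrates in kbit/s, and a coarse log-spaced grid search in place of a full nonlinear regression routine; the function names and the trial data are hypothetical.

```python
import math
from typing import Sequence


def quality(bitrate: float, alpha: float) -> float:
    """Eq. (1): q(r) = log(1 + alpha * r); the natural logarithm is assumed."""
    return math.log(1.0 + alpha * bitrate)


def fit_alpha(bitrates: Sequence[float], scores: Sequence[float]) -> float:
    """Least-squares estimate of alpha via a coarse log-spaced grid search."""
    grid = [10.0 ** (e / 100.0) for e in range(-400, 201)]  # 1e-4 .. 1e2

    def squared_error(alpha: float) -> float:
        return sum((quality(r, alpha) - s) ** 2 for r, s in zip(bitrates, scores))

    return min(grid, key=squared_error)


# Hypothetical trial encodings (kbit/s), scored here with a known alpha of 0.01
# so that the fit can be checked against the value used to generate the data.
trial_bitrates = [300.0, 1200.0, 4300.0]
trial_scores = [quality(r, 0.01) for r in trial_bitrates]
estimated_alpha = fit_alpha(trial_bitrates, trial_scores)
```

With measured quality scores in place of the synthetic ones, the same search returns the α that best fits the logarithmic model on the chosen grid.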

To avoid rebuffering events, which strongly impair the user’s experience, the currently requested segment has to arrive at the client before the playback buffer runs empty. Let \(t_{b}\) be the buffer occupancy at the time \(t_{i}\) at which the client starts to request segment i, i.e., the playback time of the downloaded yet unviewed segments remaining in the buffer. The value of \(t_{b}\) can be obtained via periodic feedback from the client to the optimization controller. We also denote by \(C_{s}\) the average total goodput provided by all the access networks from moment \(t_{i}\) to \(t_{i} + T\). Note that if \(T \cdot r_{i} / C_{s} \ge t_{b}\), the buffer runs empty while the client is still downloading segment i, resulting in rebuffering events [15, 20, 29]. We define a tradeoff function to balance the impairment of rebuffering against the video playback quality. A tradeoff coefficient λ is introduced to weight the impairment of the rebuffering events. This modified perceived video quality function can be represented as

$$ Q(r_{i})=q(r_{i})-\lambda \cdot \text{I}~({T\cdot r_{i} / C_{s} }- t_{b}) $$
(2)

where \(\text{I}(\cdot)\) is the step function: \(\text{I}(\cdot) = 1\) if \(T \cdot r_{i} / C_{s} \ge t_{b}\), and \(\text{I}(\cdot) = 0\) otherwise.

Each segment is logically divided into two subsegments, which are requested over the LTE and 802.11ac downlinks simultaneously via HTTP range retrieval requests [12, 29]. A rebuffering event occurs if either of the subsegments cannot arrive at the client before the playback buffer runs out. Let \(r_{i,1}\) and \(r_{i,2}\) be the bitrates allocated to the LTE and 802.11ac wireless networks respectively. Their sum equals the selected bitrate \(r_{i}\) of segment i, that is, \(r_{i} = r_{i,1} + r_{i,2}\). The average downlink goodputs provided by the LTE and 802.11ac wireless networks while downloading the subsegments are denoted by \(C_{i,1}\) and \(C_{i,2}\) respectively. In this case, a rebuffering event occurs if \(\max \limits ({T \cdot r_{i,1} / C_{i,1}},{T \cdot r_{i,2} / C_{i,2}}) \ge t_{b}\). Thus, the ultimate quality function for segment i can be defined as

$$ Q(r_{i},r_{i,1},r_{i,2}) = q(r_{i}) - \lambda \cdot \text{I}~(\max({T \cdot r_{i,1} \over C_{i,1}},{T \cdot r_{i,2} \over C_{i,2}})- t_{b}) $$
(3)
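The quality function (3) can be evaluated directly once the two goodputs are known. The sketch below assumes the natural logarithm in (1) and uses illustrative default values for T, α and λ that are not taken from the paper.

```python
import math


def segment_quality(r1: float, r2: float, c1: float, c2: float, t_b: float,
                    T: float = 4.0, alpha: float = 0.01, lam: float = 5.0) -> float:
    """Eq. (3): log quality of the segment minus a rebuffering penalty.

    r1, r2: bitrates allocated to LTE and 802.11ac (same unit as c1, c2).
    c1, c2: average goodputs of the two links.
    t_b:    buffer occupancy in seconds when the request is issued.
    T, alpha, lam: illustrative defaults, assumed for this example only.
    """
    download_time = max(T * r1 / c1, T * r2 / c2)   # slower subsegment decides
    penalty = lam if download_time >= t_b else 0.0  # step function I(.)
    return math.log(1.0 + alpha * (r1 + r2)) - penalty


no_stall = segment_quality(1000.0, 1000.0, 2000.0, 2000.0, t_b=4.0)
stall = segment_quality(1000.0, 1000.0, 2000.0, 2000.0, t_b=1.0)
```

In this example both subsegments need 2 s to download, so the first call incurs no penalty while the second, with only 1 s of buffered video, is charged the full λ.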

3.3 The goodput estimation of LTE downlink

In the LTE downlink, the achieved goodput depends on the wireless channel condition, the selected MCS mode and the resource allocation algorithm. To estimate the effective average goodput Ci,1 while downloading the corresponding subsegment through LTE downlink, the mutual information effective SNR mapping (MIESM) is utilized to measure the LTE downlink channel quality in this paper. For the selected MCS mode \(m_{1} \in {\mathscr{M}}_{1}\), where \({\mathscr{M}}_{1}\) is the candidate MCS mode set in the first column of Table 1, the effective SNR mapping γmieff(m1) based on the mutual information can be calculated as [17]

$$ \gamma_{mieff}(m_{1})=\tau(m_{1})\left[J^{-1}\left( {1\over S_{n}} \sum\limits_{k=1}^{S_{n}}J\left( \sqrt{\gamma_{k} \over \tau(m_{1})}\right)\right)\right]^{2} $$
(4)

where \(S_{n}\) is the number of allocated subcarriers for subsegment i, \(\tau(m_{1})\) is the calibration factor for MCS mode \(m_{1}\) listed in Table 1, and \(\gamma_{k}\) is the SINR at the kth subcarrier. The functions \(J(\cdot)\) and \(J^{-1}(\cdot)\) are defined in (5) and (6). For more details, please refer to the references [10, 17, 33].

$$ J(x) \approx \left\{ \begin{array}{ll} -0.04210610x^{3} + 0.209252x^{2} - 0.00640081x, & 0 < x < 1.6363\\ 1 - \exp (0.00181491x^{3} - 0.142675x^{2} - 0.08220540x + 0.0548608), & x \ge 1.6363 \end{array} \right. $$
(5)
$$ J^{-1}(y) \approx \left\{ \begin{array}{ll} 1.09542y^{2} + 0.214217y + 2.33727\sqrt{y}, & 0 < y < 0.3646\\ -0.706692\log (-0.386013(y - 1)) + 1.75017y, & 0.3646 \le y \le 1 \end{array} \right. $$
(6)
Table 1 The candidate LTE downlink MCS modes
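The mapping in (4) to (6) can be sketched in a few lines. The code below assumes that the per-subcarrier SINR values are given on a linear scale and that the logarithm in (6) is the natural logarithm; the clamping of J to [0, 1] and the saturation for arguments of 10 or more are assumptions added to keep the approximations inside their useful range.

```python
import math
from typing import Sequence


def J(x: float) -> float:
    """Piecewise approximation (5), clamped to [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 10.0:  # assumption: treat very large arguments as saturated
        return 1.0
    if x < 1.6363:
        value = -0.04210610 * x**3 + 0.209252 * x**2 - 0.00640081 * x
    else:
        value = 1.0 - math.exp(0.00181491 * x**3 - 0.142675 * x**2
                               - 0.08220540 * x + 0.0548608)
    return min(max(value, 0.0), 1.0)


def J_inv(y: float) -> float:
    """Piecewise approximation (6) of the inverse of J."""
    y = min(max(y, 0.0), 1.0 - 1e-9)  # keep the logarithm finite at y = 1
    if y < 0.3646:
        return 1.09542 * y**2 + 0.214217 * y + 2.33727 * math.sqrt(y)
    return -0.706692 * math.log(-0.386013 * (y - 1.0)) + 1.75017 * y


def effective_snr(sinr_linear: Sequence[float], tau: float) -> float:
    """Eq. (4): MIESM effective SNR for calibration factor tau."""
    mean_info = sum(J(math.sqrt(g / tau)) for g in sinr_linear) / len(sinr_linear)
    return tau * J_inv(mean_info) ** 2


flat_channel = effective_snr([2.0] * 12, tau=2.0)   # all subcarriers equal
mixed_channel = effective_snr([1.0, 4.0], tau=1.0)  # two unequal subcarriers
```

For a flat channel the effective SNR should reproduce the common per-subcarrier SINR up to the accuracy of the two approximations, which gives a quick sanity check of an implementation.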

Based on the MIESM γmieff(m1) defined in (4), the Block Error Rate (BLER) BLER(γmieff(m1)) for the RB with MCS mode m1 can be precisely predicted as

$$ BLER(\gamma_{mieff}(m_{1}))={{1\over2} {erfc({\gamma_{mieff}(m_{1})-b(m_{1})\over{\sqrt{2}c(m_{1})}})}} $$
(7)

where erfc(⋅) is the complementary error function, and \(b(m_{1})\) and \(c(m_{1})\), listed in Table 1, are the “transition center” and “transition width” respectively, each of which can be obtained by fitting \(J^{-1}(\cdot)\) to the exact BLER in a specific communication system. In this paper, a MIMO 2×1 AWGN LTE downlink channel is simulated using the generic LTE system-level simulator in [26].
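Equation (7) maps an effective SNR to a block error rate through a shifted and scaled complementary error function. A minimal sketch follows; the values passed for b and c are placeholders rather than entries of Table 1, and the effective SNR is assumed to be expressed in the same unit as b and c.

```python
import math


def bler(gamma_eff: float, b: float, c: float) -> float:
    """Eq. (7): BLER = 0.5 * erfc((gamma_eff - b) / (sqrt(2) * c))."""
    return 0.5 * math.erfc((gamma_eff - b) / (math.sqrt(2.0) * c))


# Placeholder transition center and width, assumed for this example only.
at_center = bler(3.0, b=3.0, c=0.5)
above_center = bler(5.0, b=3.0, c=0.5)
below_center = bler(1.0, b=3.0, c=0.5)
```

At the transition center the predicted BLER is exactly one half, and it falls towards zero as the effective SNR rises above b at a rate set by the transition width c.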

Due to the truncated ARQ mechanism implemented in the data link layer, resource blocks that are received in error during the original transmission may be retransmitted until each block has been sent at most \(N_{r}\) times in total. For notational simplicity, let us define \( \epsilon (m_{1}) \overset {\text {def}}{=} BLER(\gamma _{mieff}(m_{1})) \). A block needs exactly i transmissions with probability \((1-\epsilon(m_{1}))\cdot \epsilon(m_{1})^{i-1}\) for \(i < N_{r}\), and \(N_{r}\) transmissions whenever the first \(N_{r}-1\) attempts fail, so the average number of transmissions per resource block can be derived as

$$ \begin{array}{@{}rcl@{}} \overline{N}(\epsilon(m_{1}),N_{r})&=&\sum\limits_{i=1}^{N_{r}-1} i\cdot (1-\epsilon(m_{1}))\cdot \epsilon(m_{1})^{i-1} + N_{r}\cdot \epsilon(m_{1})^{N_{r}-1} \\ &=&1+\epsilon(m_{1})+\epsilon(m_{1})^{2}+\cdots+\epsilon(m_{1})^{N_{r}-1}\\ &=&{{1-\epsilon(m_{1})^{N_{r}}}\over{ 1-\epsilon(m_{1})}} \end{array} $$
(8)
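The closed form \((1-\epsilon^{N_{r}})/(1-\epsilon)\) in (8) can be checked numerically by enumerating the possible outcomes. The sketch below assumes that \(N_{r}\) caps the total number of transmissions and that the last attempt is counted whether or not it succeeds; the function names are hypothetical.

```python
def avg_transmissions(eps: float, n_max: int) -> float:
    """Closed form of (8): (1 - eps**n_max) / (1 - eps)."""
    if eps >= 1.0:
        return float(n_max)  # every attempt fails, so all n_max are used
    return (1.0 - eps ** n_max) / (1.0 - eps)


def avg_transmissions_by_enumeration(eps: float, n_max: int) -> float:
    """Expected number of transmissions obtained by summing over outcomes."""
    # Success at attempt i < n_max happens with probability (1 - eps) * eps**(i - 1).
    early_success = sum(i * (1.0 - eps) * eps ** (i - 1) for i in range(1, n_max))
    # Otherwise the first n_max - 1 attempts failed and attempt n_max is the last.
    last_attempt = n_max * eps ** (n_max - 1)
    return early_success + last_attempt


closed = avg_transmissions(0.1, 3)
enumerated = avg_transmissions_by_enumeration(0.1, 3)
```

For a BLER of 0.1 and at most three transmissions, both routes give 1 + 0.1 + 0.01 = 1.11 transmissions per resource block.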

To evaluate the achieved channel goodput, the number of information bits carried by each transmitted symbol is calculated as \(r(m_{1}) =R_{c} \cdot \log _{2}(M_{m_{1}})\) and listed in Table 1, where Rc is the FEC code rate and \(M_{m_{1}}\) refers to a \(M_{m_{1}}\)-QAM constellation for MCS mode m1.

In the LTE downlink physical layer, the available spectrum resource is divided into individual resource blocks (RBs) along the frequency and time domains. Each RB occupies the duration of one slot (0.5 ms) and contains 7 OFDM symbols with normal cyclic prefix in the time domain and 12 subcarriers (180 kHz) in the frequency domain. However, three downlink control channels are defined in the LTE downlink in order to support the data transmission, namely the Physical Control Format Indicator Channel (PCFICH), the Physical HARQ Indicator Channel (PHICH) and the Physical Downlink Control Channel (PDCCH). In the normal configuration, these channels occupy the first three OFDM symbols in each sub-frame (1 ms) in the time domain and the whole bandwidth in the frequency domain, as depicted by the grey square blocks in Fig. 2. We can see in Fig. 2 that there are eight resource elements reserved for reference signals in each resource block [22]. Therefore, the available data bits carried by two adjacent RBs in one sub-frame, as a function of MCS mode \(m_{1}\), can be expressed as \(\xi(m_{1},N_{r}) = N_{rb} \cdot r(m_{1})\), where \(N_{rb} = 120\) denotes the number of resource elements allocated for the data transmission in two adjacent RBs.

Fig. 2 The frequency and time domains in the LTE downlink

In each Transmission Time Interval (TTI), the Proportional Fair Scheduling (PFS) algorithm [4] is used for resource block scheduling among multiple users in a single cell. Suppose that the total number of RBs allocated for the delivery of the subsegment in the LTE downlink equals \(B_{n}\) and that all the RBs adopt the same MCS mode. When the truncated ARQ is adopted, each resource block is transmitted \(\overline {N}(\epsilon (m_{1}),N_{r})\) times on average. Therefore, the achieved goodput can be computed as

$$ C_{i,1}(m_{1})= {\xi(m_{1},N_{r}) \cdot B_{n} \over \overline{N}(\epsilon(m_{1}),N_{r})} $$
(9)
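A small sketch of (9) is shown below. It assumes that \(B_{n}\) counts the RB pairs granted in each 1 ms sub-frame, so that multiplying by 1000 sub-frames per second converts the result to bits per second; this unit convention, the function name and the example inputs are assumptions made for illustration.

```python
def lte_goodput_bps(bits_per_symbol: float, n_rb_pairs: int, eps: float,
                    n_max: int, n_re: int = 120,
                    subframes_per_s: int = 1000) -> float:
    """Eq. (9): goodput = xi * B_n / N_bar, scaled to bits per second.

    bits_per_symbol: r(m1), information bits per transmitted symbol.
    n_rb_pairs:      B_n, assumed to be RB pairs granted per sub-frame.
    eps, n_max:      BLER and cap on transmissions for the truncated ARQ.
    n_re:            data resource elements per RB pair (120 in the text).
    """
    xi = n_re * bits_per_symbol                      # data bits per RB pair
    n_bar = float(n_max) if eps >= 1.0 else (1.0 - eps ** n_max) / (1.0 - eps)
    return xi * n_rb_pairs / n_bar * subframes_per_s


error_free = lte_goodput_bps(2.0, n_rb_pairs=50, eps=0.0, n_max=3)
lossy = lte_goodput_bps(2.0, n_rb_pairs=50, eps=0.1, n_max=3)
```

With these illustrative inputs the error-free link carries 12 Mbit/s, and a 10% BLER with up to three transmissions reduces that by the factor 1/1.11.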

3.4 The goodput estimation of 802.11ac network

In the 802.11ac downlink physical layer, OFDM is selected as the modulation scheme and ten MCS modes with different modulation schemes and coding rates are provided for link adaptation. Specifically, BPSK, QPSK, 16-QAM, 64-QAM and 256-QAM are the supported modulation schemes listed in Table 2. In the MAC layer, to share the wireless channel between multiple compatible stations, the contention-based Distributed Coordination Function (DCF) that uses the algorithm of Carrier-Sense Multiple Access with Collision Avoidance (CSMA/CA) is implemented as a mandatory medium access control (MAC) mechanism. In CSMA/CA, each successful frame transmission duration of DCF consists of a backoff delay \(\bar {T}_{b}\), the data transmission time Tdata(l,m2), a Short InterFrame Space (SIFS) time TSIFS = 16μs, the ACK transmission time Tack(m2) and a Distributed InterFrame Space (DIFS) time TDIFS = 34μs [9]. Suppose that a frame with l bits data payload is to be transmitted using MCS mode \(m_{2} \in {\mathscr{M}}_{2}\), where \({\mathscr{M}}_{2}\) is the candidate MCS mode set in the first column of Table 2. According to [30], the data transmission duration can be calculated as

$$ T_{data}(l,m_{2})= 20\mu s + {\lceil {30.75+l \over r(m_{2})} \rceil}\cdot 4\mu s $$
(10)

where \(r(m_{2})\), the number of information bits carried per symbol for MCS mode \(m_{2}\), can be computed from the code rate given in Table 2. For simplicity, the same MCS mode is assumed to be used for the ACK frame transmission. The duration of an ACK frame can be expressed as follows [30],

$$ T_{ack}(m_{2})= 20\mu s + {\lceil {16.75 \over r(m_{2})} \rceil}\cdot 4\mu s $$
(11)
Table 2 Data rates (Mbps)-ten MCS modes, 1 spatial stream, normal guard interval

In the backoff period, a random integer is assigned to the station according to a uniform distribution over the interval [0, CW], where CW is the contention window size and its initial value is \(CW_{min}\). Based on the formulation in [21], the average backoff time is given by

$$ \bar{T}_{b} = {CW_{min}\cdot T_{\text{slot}} \over 2} $$
(12)

where \(T_{\text{slot}}\) is the slot time in 802.11ac and is equal to 9 μs.

A frame transmission is considered successful only upon receiving the corresponding ACK frame correctly. Therefore, the probability of a successful frame transmission with wireless channel state γ2 and MCS mode m2 can be calculated by

$$ P_{s}(l,\gamma_{2},m_{2}) =[1-P_{data}(l,\gamma_{2},m_{2})][1-P_{ack}(\gamma_{2},m_{2})] $$
(13)

where \(P_{data}(l,\gamma_{2},m_{2})\) and \(P_{ack}(\gamma_{2},m_{2})\) are the data error probability and the ACK error probability, respectively. Their values vary with the wireless channel model and are estimated over the AWGN channel in this paper. Since the data frame is normally much longer than the ACK frame, the probability of losing the ACK frame is much smaller than that of losing the data frame. Thus, we have the following approximation

$$ P_{s}(l,\gamma_{2},m_{2}) \approx 1-P_{data}(l,\gamma_{2},m_{2}) $$
(14)

An upper bound is given on the packet error probability, under the assumption that hard-decision Viterbi decoding with independent errors and binary convolutional coding are used at the channel input. The data packet error probability with l octets using MCS mode m2 is bounded by

$$ P_{data}(l,\gamma_{2},m_{2}) \le 1-(1-P_{u}(\gamma_{2},m_{2}))^{8l} $$
(15)

where \(P_{u}(\gamma_{2},m_{2})\) is the union bound on the first-event error probability, given by

$$ P_{u}(\gamma_{2},m_{2}) = \sum \limits_{d=d_{free}(m_{2})}^{\infty} a_{d} \cdot P_{d}(\gamma_{2}) $$
(16)

where \(d_{free}(m_{2})\) is the free distance of the convolutional code in MCS mode \(m_{2}\), \(a_{d}\) is the total number of error events of weight d, and \(P_{d}(\gamma_{2})\) is the probability that an incorrect path at distance d from the correct path is chosen by the Viterbi decoder, which is given as follows,

$$ P_{d}(\gamma_{2}) = \left\{ \begin{array}{ll} \sum\limits_{i=(d+1)/2}^{d} \binom{d}{i} \cdot \rho^{i} \cdot (1-\rho)^{d-i}, & \text{if } d \text{ is odd},\\ \frac{1}{2} \cdot \binom{d}{d/2} \cdot \rho^{d/2} \cdot (1-\rho)^{d/2} + \sum\limits_{i=d/2+1}^{d} \binom{d}{i} \cdot \rho^{i} \cdot (1-\rho)^{d-i}, & \text{if } d \text{ is even} \end{array} \right. $$
(17)

Note that ρ is the bit-error-rate as a function of the symbol SNR γ2 for the MCS mode m2 and can be approximated by (18) [32].

$$ \rho = \frac{\sqrt{m_{2}} - 1}{\sqrt{m_{2}}\log_{2}\sqrt{m_{2}}}\, erfc\left( \sqrt{\frac{3\log_{2}\left(m_{2} \cdot \gamma_{2}\right)}{2(m_{2} - 1)}} \right) $$
(18)
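The chain from the bit error rate ρ to the packet error bound in (15) to (17) can be sketched as follows. The bit error rate is taken as an input here, the infinite sum in (16) is truncated to the terms supplied by the caller, and the weight spectrum used in the example is a hypothetical placeholder rather than the tabulated spectrum of a particular code.

```python
from math import comb
from typing import Dict


def pairwise_error(d: int, rho: float) -> float:
    """Eq. (17): probability of choosing a wrong path at distance d."""
    def tail(start: int) -> float:
        return sum(comb(d, i) * rho**i * (1.0 - rho) ** (d - i)
                   for i in range(start, d + 1))

    if d % 2 == 1:
        return tail((d + 1) // 2)
    half = d // 2
    tie = 0.5 * comb(d, half) * rho**half * (1.0 - rho) ** half
    return tie + tail(half + 1)


def union_bound(rho: float, spectrum: Dict[int, int]) -> float:
    """Eq. (16), truncated to the distances in `spectrum` and capped at 1."""
    return min(1.0, sum(a_d * pairwise_error(d, rho) for d, a_d in spectrum.items()))


def packet_error_bound(l_octets: int, rho: float, spectrum: Dict[int, int]) -> float:
    """Eq. (15): upper bound on the error probability of an l-octet packet."""
    p_u = union_bound(rho, spectrum)
    return 1.0 - (1.0 - p_u) ** (8 * l_octets)


# Hypothetical weight spectrum {distance: a_d}, assumed for illustration only.
example_spectrum = {10: 11, 12: 38, 14: 193}
clean = packet_error_bound(1500, 0.0, example_spectrum)
noisy = packet_error_bound(1500, 1e-2, example_spectrum)
```

Truncating the sum drops only non-negative terms, so the truncated value is slightly optimistic compared with the full bound; more terms of the spectrum can be passed in when they are available.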

Based on the above analysis, the effective goodput Ci,2(m2) of IEEE 802.11ac network can be calculated by

$$ \small{ C_{i,2}(m_{2})={ P_{s}(l,\gamma_{2},m_{2})\cdot l \over \bar{T}_{b} + T_{data}(l,m_{2})+T_{ack}(m_{2})+T_{\text{SIFS}}+T_{\text{DIFS}} }} $$
(19)
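Putting (10), (11), (12) and (19) together gives a compact goodput estimate for one frame exchange. In the sketch below all durations are in microseconds, the success probability is passed in directly, and the default `cw_min=15` as well as the example value of `r_m2` are assumptions made for illustration; the result is expressed in payload units per microsecond, in whatever unit l and `r_m2` share.

```python
import math

T_SIFS_US = 16.0   # SIFS duration from the text
T_DIFS_US = 34.0   # DIFS duration from the text
T_SLOT_US = 9.0    # slot time from the text


def wifi_goodput(l: float, r_m2: float, p_success: float, cw_min: int = 15) -> float:
    """Eq. (19): expected payload delivered per microsecond of channel time.

    l:         payload length of the data frame.
    r_m2:      payload units carried per OFDM symbol for MCS m2 (same unit as l).
    p_success: P_s, probability that the frame exchange succeeds.
    cw_min:    minimum contention window (assumed value).
    """
    t_backoff = cw_min * T_SLOT_US / 2.0                     # eq. (12)
    t_data = 20.0 + math.ceil((30.75 + l) / r_m2) * 4.0      # eq. (10)
    t_ack = 20.0 + math.ceil(16.75 / r_m2) * 4.0             # eq. (11)
    cycle = t_backoff + t_data + t_ack + T_SIFS_US + T_DIFS_US
    return p_success * l / cycle


ideal = wifi_goodput(1500.0, 26.0, p_success=1.0)
degraded = wifi_goodput(1500.0, 26.0, p_success=0.5)
```

For these inputs one exchange lasts 67.5 + 256 + 24 + 16 + 34 = 397.5 μs, which shows how much of the channel time is spent on overhead rather than on payload.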

3.5 Optimization formulation and solution

With the feedback of the effective goodput estimation and the buffer occupancy, the controller attempts to maximize the perceptual video quality for each segment without causing the rebuffering events. The decision variables include the requested segment bitrate ri, the rate allocation ri,1 and ri,2 from the application layer and the MCS mode m1 and m2 at the physical layer of different networks. In addition, to offer a smooth playback, the quality variation between two consecutive segments should be smaller than a threshold μ preferred by the user. Therefore, the cross-layer optimization problem can be formulated as

$$ \begin{array}{@{}rcl@{}} &&{}\max_{\{r_{i},r_{i,1},r_{i,2},m_{1},m_{2}\}} \left\{q(r_{i})-\lambda \cdot \text{I}~\left( \max\left( {T \cdot r_{i,1} \over C_{i,1}(m_{1})},{T \cdot r_{i,2} \over C_{i,2}(m_{2})}\right)-t_{b}\right)\right\} \\ \text{s.t.}~~ r_{i} &=& r_{i,1}+r_{i,2}\\ |q(r_{i})-q(r_{i-1})| &\le& \mu \\ && r_{i} \in \mathcal{R} \\ && m_{1} \in \mathcal{M}_{1} \\ && m_{2} \in \mathcal{M}_{2} \end{array} $$
(20)

where the effective goodput Ci,1(m1) and Ci,2(m2), as a function of MCS mode and SNR, can be calculated by (9) and (19), respectively. For more information about the parameters in this paper, please refer to Table 3.

Table 3 Summary of terminology

It can be seen that (20) contains both discrete and continuous variables: m1 and m2 are discrete, while ri,1 and ri,2 are continuous. Furthermore, the objective function in (20) is nonlinear and non-differentiable. Therefore, the cross-layer optimization problem in (20) is a typical nonlinear optimization problem with mixed discrete and continuous variables. Problems of this kind are generally NP-hard, and no polynomial-time solution is known. To solve the problem formulated in (20), we construct a heuristic algorithm that finds near-optimal decision variables \((r_{i}^{*},r_{i,1}^{*}, r_{i,2}^{*}, m_{1}^{*},m_{2}^{*})\) to maximize the perceived video quality of segment i. That is, we seek \(Q(r_{i}^{*},r_{i,1}^{*}, r_{i,2}^{*}, m_{1}^{*},m_{2}^{*}) \ge Q(r_{i},r_{i,1}, r_{i,2}, m_{1},m_{2}), \forall ~r_{i},r_{i,1}, r_{i,2}, m_{1},m_{2}\) subject to the constraints defined in (20). In the algorithm, we first select a candidate bitrate set \(\mathcal {R}_{candidate}\) whose elements satisfy |q(ri) − q(ri− 1)|≤ μ and then sort the elements of the candidate set in descending order. In other words, the quality variation between two consecutive segments is tolerable if the bitrate of segment i is one of the elements in \(\mathcal {R}_{candidate}\).

Since the video quality function q(⋅) increases with the bitrate r, we aggressively request segment i at the highest bitrate in the candidate set. That is, \(r^{*}_{i} = \arg \max \limits _{r_{i}} r_{i}, r_{i} \in \mathcal {R}_{candidate}\). After selecting the bitrate \(r_{i}^{*}\) at the application layer, we determine the MCS mode at the physical layer of each network. An MCS mode with a small constellation and a powerful channel code can maintain reliability under poor channel conditions. Therefore, we initially select the MCS mode with the smallest constellation and the most powerful channel code, and estimate the achievable goodput \(C_{i,1}(m_{1}^{*})\) and \(C_{i,2}(m_{2}^{*})\) based on the selected MCS mode and the symbol SNR.

The rate allocation, which determines the size of each subsegment, is based on the goodput of each network. That is, \(r^{*}_{i,1} = {C_{i,1}(m^{*}_{1})\over {C_{i,1}(m^{*}_{1})+C_{i,2}(m^{*}_{2})}}\cdot r^{*}_{i}\) and \(r^{*}_{i,2} = {C_{i,2}(m^{*}_{2})\over {C_{i,1}(m^{*}_{1})+C_{i,2}(m^{*}_{2})}}\cdot r^{*}_{i}\). At this point, a tentative set of decision variables \((r_{i}^{*},r_{i,1}^{*}, r_{i,2}^{*}, m_{1}^{*},m_{2}^{*})\) has been obtained. However, such a set might lead to a rebuffering event. We assume that a relatively large λ, which indicates that the user is more concerned about rebuffering, is used in this algorithm, so every subsegment has to arrive at the client before the receive buffer runs out. Note that if \( \max \limits {({{r^{*}_{i,1}\over C_{i,1}(m^{*}_{1})},{r^{*}_{i,2}\over C_{i,2}(m^{*}_{2})}})} \le t_{b }\), no rebuffering event occurs and the decision variables are verified. Otherwise, the effective goodput under the selected MCS mode cannot sustain video segment i at the bitrate \(r_{i}^{*}\), and an MCS mode with a larger constellation size and a more powerful channel code is selected instead. If no MCS mode can sustain this bitrate level for segment i, the algorithm selects a smaller bitrate level for segment i. The details of the proposed heuristic algorithm for cross-layer multi-path streaming are shown in Algorithm 1.

To evaluate our proposed heuristic algorithm, we construct an offline mapping table between the network goodput C and the candidate MCS mode m. Based on this table, our heuristic algorithm has polynomial time complexity. Specifically, in the first phase of the algorithm, we determine the candidate bitrate set \(\mathcal {R}_{candidate}\) from the available video bitrates; the time complexity of this phase is \(\mathcal {O}(R)\), i.e., linear. In the second phase of the algorithm, we search for the video bitrate allocated to each network and the corresponding MCS modes. By using the aforementioned offline mapping table, we can obtain the goodput C in constant time. Therefore, the total complexity of our heuristic algorithm is \(\mathcal {O}(R\cdot M_{1} \cdot M_{2})\), which is polynomial.
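The two phases described above can be sketched in Python as follows. This is a minimal illustration and not a reproduction of Algorithm 1: the function name `select_segment`, the representation of the offline mapping table as two lists of goodput values indexed by MCS mode (ordered from the smallest constellation upward), and the nested-loop order in which the MCS pairs are tried are all assumptions.

```python
from typing import Callable, Optional, Sequence, Tuple

# (r_i, r_i1, r_i2, m1, m2)
Decision = Tuple[float, float, float, int, int]


def select_segment(
    bitrates: Sequence[float],
    q: Callable[[float], float],
    q_prev: float,
    mu: float,
    goodput1: Sequence[float],
    goodput2: Sequence[float],
    t_b: float,
) -> Optional[Decision]:
    """Heuristic search over bitrate and MCS modes (illustrative sketch).

    goodput1[m] and goodput2[m] play the role of the offline mapping table:
    the goodput of each network for MCS index m at the current SNR.
    """
    # Phase 1: bitrates whose quality change is tolerable, highest first.
    candidates = sorted(
        (r for r in bitrates if abs(q(r) - q_prev) <= mu), reverse=True
    )

    # Phase 2: for each bitrate, try MCS pairs from the most robust upward.
    for r in candidates:
        for m1, c1 in enumerate(goodput1):
            for m2, c2 in enumerate(goodput2):
                total = c1 + c2
                if total <= 0:
                    continue  # neither path can carry data with this pair
                # Goodput-proportional split of the segment bitrate.
                r1 = c1 / total * r
                r2 = c2 / total * r
                # With this split, r1 / c1 and r2 / c2 both equal r / total,
                # so a single comparison covers the max(...) <= t_b check.
                if r / total <= t_b:
                    return (r, r1, r2, m1, m2)

    # No bitrate and MCS combination avoids rebuffering.
    return None
```

The loop nest visits at most R·M1·M2 combinations with constant-time table lookups, which matches the complexity stated above. Returning `None` leaves the fallback policy (for example, requesting the lowest bitrate anyway) to the caller.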

Algorithm 1 Heuristic algorithm for cross-layer multi-path streaming

4 Evaluation

4.1 Experimental setup

At the application layer, the video sequences are encoded with the H.264/AVC reference software JM18.6 [27] using different quantization parameters (QP). In our setup, each video segment is encoded at the bitrates {350, 700, 1200, 1800, 2800, 4500} kbps, corresponding to the resolutions {240p, 360p, 480p, 720p, 1080p, 1440p}. We can also see from (2) that the video quality is related to two factors: the video segment bitrate and the rebuffering events. Hence, besides setting the QP at the application layer, which determines the video bitrate, we also slice each video sequence, with a total duration of 200 seconds, into 50 segments, so that each segment corresponds to approximately 4 seconds of playback. In the simulation, we assume that the video player at the client is configured with a buffer capacity that holds a sufficient playback duration.

Additionally, the LTE and 802.11ac downlink wireless channels are simulated in MATLAB based on [26] and [9], respectively. We then use the generated traces to evaluate the performance of the proposed algorithm. The main experimental parameters for both video coding and the wireless network environment are shown in Table 4. To evaluate the proposed scheme, we compare it with state-of-the-art schemes, namely the Multipath QUIC protocol (MPQUIC) scheme [28] and the SSIM-based Cross-layer optimization with Error-resilient RDO (SSIM-CL-w-ERDO) scheme [33]. For the SSIM-CL-w-ERDO scheme, we split each segment into two subsegments of equal size, which are optimized by the SSIM-CL-w-ERDO scheme over the LTE and 802.11ac downlinks, respectively. The three schemes are evaluated under different wireless channel conditions (Rayleigh distribution with average SINR \(\overline \gamma \) of 4dB, 9dB and 14dB) [5].
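For readers who want to reproduce the flavour of these channel conditions without the MATLAB models, the sketch below draws instantaneous SINR samples for a given average SINR. It relies on the standard property that, under Rayleigh fading, the received power (and hence the SINR) is exponentially distributed around its mean. The function names are hypothetical, the samples are independent, and the sketch does not reproduce the temporal correlation or the link-level detail of the simulators in [26] and [9].

```python
import random
from typing import List


def db_to_linear(value_db: float) -> float:
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (value_db / 10.0)


def rayleigh_sinr_trace(mean_sinr_db: float, n: int, seed: int = 0) -> List[float]:
    """Draw n independent linear-scale SINR samples with the given mean.

    Simplifying assumption: i.i.d. samples, exponential distribution of
    the SINR, as implied by a Rayleigh-distributed signal envelope.
    """
    rng = random.Random(seed)  # local generator keeps the trace reproducible
    mean_linear = db_to_linear(mean_sinr_db)
    # expovariate takes the rate parameter, i.e. 1 / mean.
    return [rng.expovariate(1.0 / mean_linear) for _ in range(n)]


# One illustrative trace per average SINR used in the evaluation.
traces = {db: rayleigh_sinr_trace(db, 50, seed=db) for db in (4, 9, 14)}
```

Each sample could then be fed to a goodput lookup for every MCS mode, one sample per video segment.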

Table 4 Experimental parameters

4.2 Experimental results

In the proposed DASH-based cross-layer multi-path video streaming scheme, the video quality experienced by the end user is optimized by adaptively selecting the video bitrate and the MCS mode for each segment according to the wireless channel state of both the LTE and 802.11ac downlinks. First, we investigate whether our proposed scheme behaves as anticipated. The adaptively selected MCS modes and bitrates for all 50 segments of the video sequence ElephantsDream under different channel conditions (average SINR \(\bar \gamma =4dB, \bar \gamma =9dB, \bar \gamma =14dB\) for both the LTE and 802.11ac downlink channels) are shown in Fig. 3.

Fig. 3 The selected MCS modes and bitrates for all 50 segments of the video sequence ElephantsDream under different channel conditions of LTE and 802.11ac

MCS mode adaptation

We can see from Fig. 3 that, at a good channel condition of \(\bar \gamma =14dB\), MCS modes with large constellations (large MCS mode indexes) in both the LTE and 802.11ac downlinks and large video segment bitrates are selected to improve the perceptual video quality. On the other hand, at a poor channel condition of \(\bar \gamma =4dB\), MCS modes with small constellations (small MCS mode indexes) in both networks and small bitrates are selected to avoid rebuffering events and guarantee smooth video playback. This illustrates that our proposed scheme can effectively adjust the MCS mode at the physical layer of both LTE and 802.11ac, together with the bitrate of each video segment, to improve the video streaming performance in a cross-layer manner.

Segment-level analysis

To evaluate the streaming performance of the proposed scheme, Fig. 4 shows the segment-level average PSNR values for the 50 segments of the three video sequences Parkrun, Shield and ElephantsDream transmitted at the channel condition of \(\bar \gamma =9dB\). It can be observed that our proposed scheme achieves higher average PSNR values than the two baseline schemes (MPQUIC and SSIM-CL-w-ERDO) for most of the video segments. However, Fig. 4 also shows that our scheme does not outperform the two baseline schemes on every segment. This is because our scheme balances the bitrate utility against the rebuffering penalty. Overall, the average PSNR over all 50 segments of the Shield sequence obtained by our proposed scheme is 1.16dB and 2.52dB higher than that of the two baseline schemes, respectively. For the Parkrun sequence, our proposed scheme outperforms the baseline schemes by 0.80dB and 1.91dB, respectively. For the ElephantsDream sequence, the average PSNR improvement is 1.28dB and 2.80dB, respectively.

Fig. 4 Average PSNR curves with increasing segment numbers for the sequences (a) Shield, (b) Parkrun and (c) ElephantsDream at the condition of \(\bar \gamma = 9dB\)

Video sequence level analysis

Figure 5 shows the average PSNR curves of the video sequences Shield, Parkrun and ElephantsDream at the wireless channel conditions of \(\bar \gamma =2dB, \bar \gamma =4dB, \bar \gamma =9dB, \bar \gamma =14dB, \bar \gamma =20dB\). It can be seen that our proposed scheme achieves higher average PSNR values than the two baseline schemes in all wireless channel conditions. On average, for the Shield sequence, the average PSNR achieved by our proposed scheme is approximately 1.55dB and 2.94dB higher than that of the two baseline schemes, respectively. For the Parkrun sequence, the improvement is about 1.03dB and 2.78dB, respectively. For the ElephantsDream sequence, our scheme outperforms the two baseline schemes by 1.10dB and 2.31dB, respectively. Additionally, the size of the PSNR gain varies with the channel condition. It can be observed from Fig. 5 that when the average SINR is low, i.e., when the channel condition is poor, our proposed scheme achieves a larger PSNR improvement than when the average SINR is high (the wireless channel quality is good). For instance, when streaming the ElephantsDream sequence at \(\bar \gamma =4dB\), the average PSNR achieved by our scheme is approximately 1.32dB and 3.21dB higher than that of MPQUIC and SSIM-CL-w-ERDO, respectively. However, when the wireless channel quality is good, for example at \(\bar \gamma = 14dB\), the improvement over the two baseline schemes is only 0.80dB and 1.30dB, respectively. This could be due to the adaptively selected MCS modes at the physical layers of both LTE and 802.11ac, which meet the bitrate and rebuffering requirements for the streaming of each video segment.

Fig. 5 Average PSNR curves of our scheme and two baseline schemes for the sequences (a) Shield, (b) Parkrun and (c) ElephantsDream at different channel conditions

Delay performance

To evaluate the delay performance of the proposed scheme, Fig. 6 compares the download time of each video segment under different channel conditions between our scheme and the two baseline schemes (MPQUIC and SSIM-CL-w-ERDO). We notice from Fig. 6 that as the channel condition improves, the video segments are downloaded in less time under all three schemes. Furthermore, although our scheme achieves shorter download times than the SSIM-CL-w-ERDO approach under the different channel conditions, it suffers from slightly longer download times than the MPQUIC approach. MPQUIC achieves better delay performance because it uses UDP at the transport layer. In contrast, the TCP protocol used by our scheme and by other MPTCP-based approaches introduces additional delay through its acknowledgement mechanism.

Fig. 6 Download time of each segment at different channel conditions

QoE

To better understand the QoE gains obtained by our scheme, we evaluate its performance on the individual terms of the QoE model defined in (3). Specifically, Fig. 7 compares our scheme with the two baseline schemes (MPQUIC and SSIM-CL-w-ERDO) under different channel conditions in terms of the playback bitrate utility, given by the first term in (3), and the rebuffering penalty, given by the second term in (3). More precisely, the QoE value is calculated by subtracting the rebuffering penalty from the bitrate utility.

The performance gains of our scheme are attributable to the aggregated bandwidth and the adaptive selection of the MCS mode, which support higher video bitrates, and to its ability to avoid rebuffering events caused by fluctuations in network bandwidth. As shown in Fig. 7, all the schemes obtain higher bitrate utility values, i.e., the first term of (3), as the channel condition improves. However, the gap in bitrate utility among the three schemes decreases as the channel state improves, which is consistent with Fig. 5. With respect to the rebuffering penalty, the QoE gap among the three schemes increases as the network state improves. This indicates that the two baseline schemes aggressively request video segments with bitrates exceeding the network bandwidth, which might lead to more rebuffering events. In other words, our scheme achieves a better balance between bitrate utility and rebuffering penalty than the two baseline schemes.

Fig. 7 Comparing our scheme with the baseline algorithms on the QoE metrics in terms of bitrate utility and rebuffering penalty

Finally, we evaluate the performance on the overall QoE metric defined in (3). A normalized QoE metric is obtained with min-max normalization, which maps each original QoE value to a value between 0 and 1. Figure 8 shows the Cumulative Distribution Function (CDF) of the normalized QoE value across three different channel conditions. Two key points emerge from these results. First, the proportion of high normalized QoE values achieved by our scheme is larger than that of the baseline schemes in all three channel conditions. Second, our scheme outperforms the two baseline schemes (MPQUIC and SSIM-CL-w-ERDO) with an improvement in average normalized QoE of 6.5%, 8.3% and 10.2% at average SINR \(\bar \gamma =4dB\), \(\bar \gamma =9dB\), and \(\bar \gamma =14dB\), respectively.
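The normalization and CDF steps can be illustrated with a short Python sketch. The function names are hypothetical, the input is assumed to be a list of per-segment QoE values, and the handling of a constant input (all values mapped to 0) is an arbitrary choice made for this example.

```python
from typing import List, Sequence, Tuple


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Map values linearly onto [0, 1] using their minimum and maximum."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def empirical_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, F(x)) pairs, where F(x) is the fraction of samples <= x.

    Assumes distinct sample values; with ties, the last pair for a given
    x carries the correct CDF value.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (k + 1) / n) for k, x in enumerate(xs)]
```

A curve that lies further to the right in such a CDF plot indicates that a larger share of segments reaches high normalized QoE values.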

Fig. 8 The Cumulative Distribution Function (CDF) of the normalized QoE value across three different channel conditions

5 Conclusion and limitation

In this paper, a cross-layer DASH-based multipath video streaming scheme is proposed to improve the performance of video streaming. Two wireless access networks, the LTE and 802.11ac downlinks, are used to achieve bandwidth aggregation. The cross-layer method is combined with multipath video streaming by jointly optimizing the MCS modes at the physical layer of each network and, at the application layer, the video bitrate, the playback buffering and the bitrate allocation for each segment. Experimental results show that our proposed scheme outperforms other state-of-the-art schemes in terms of PSNR, playback smoothness and normalized QoE.

Compared with MPQUIC, which runs on top of UDP, the video segment download time of our scheme is slightly longer. This is mainly attributed to the TCP protocol used by our scheme and by other MPTCP-based approaches, which introduces additional delay through its acknowledgement mechanism. In future work, we will focus on scheduling algorithms for multipath video streaming over MPQUIC in order to further improve the video streaming performance.