1 Introduction

In computer and communication systems, when a task is subject to failures or unpredictable delays, aborting the current attempt and launching a new one can speed up task completion. This mechanism is generally called restart and is widely used for fault tolerance. Restart is a simple yet effective technique that can be applied even if the distribution of the task completion time exhibits high variance. Examples of such tasks include randomized search algorithms, distributed data queries and data transmission over unreliable network connections. As modern computer systems become more interconnected and diverse, a large amount of management work has to be completed autonomously by the system itself, so such tasks are also widespread in autonomic computing [18, 19]. For instance, to monitor the system state, the collected system data must first be transmitted to management agents; the transmission time can exhibit high variance in a network with intermittent connectivity. The data then have to be queried against a large historical database and mapped to a system state. Once the state signals a problem, a search process is launched to find a solution. Both the query and the search must be completed within a given time, but they too may suffer from high variance. As shown in [43], restart allows an efficient trade-off between the mean and the variance of the time a task takes.

By migrating heavy computation to powerful cloud servers, mobile offloading systems can circumvent constraints of the client hardware, e.g. energy-hungry microchips and low-capacity batteries. Smooth offloading of computation, however, depends on a fast and stable network connection that guarantees seamless communication. Unfortunately, the quality of a network is often not constant across space and time. In [46] the impact of unreliable network connections on mobile offloading has been experimentally confirmed. The execution of offloading tasks may suffer from long delays or even occasional failures. When an offloading task fails, the client may retry offloading. Various events can cause delays, for example the overload of a single node on the path between the mobile device and the server. The offloading data can be blocked or even lost at this node. In this situation, repeated offloading restarts cannot solve the problem; on the contrary, they increase congestion. To avoid this dilemma, the task can be completed using the resources of the local device instead of those in the Cloud. As shown in [44], simulations of a stochastic model indicate that if an offloading task needs an unknown time to migrate computation over an unreliable network connection, restarting and completing the computation locally on the mobile device can save both time and energy.

This paper is an extended version of our paper published at ICPE 2015 [46]. For completeness of presentation, the experiment exploring the impact of unreliable network connectivity in [46] is retained. In [46], only local restart was introduced. In this paper, offloading restart and local restart are used jointly to improve system efficiency. We call this new scheme adaptive restart: a job first attempts to restart with offloading, and if the number of restarts exceeds a threshold, the job is completed locally. The main problem to be solved is setting an optimal threshold that limits the number of allowed restart tries.

We use the task completion time as a metric to evaluate system performance under different thresholds. Besides the number of tries, the timeout is also essential: it controls the moment at which a restart is launched. We theoretically identify the optimal threshold and timeout and experimentally confirm their advantages. Since both the threshold and the timeout depend on the network and system conditions, setting their values is an adaptation process.

The remainder of this paper is organized as follows: in Sect. 2 we briefly review the background of mobile offloading systems and the restart algorithm. In Sect. 3 we introduce an experiment to study the impact of packet loss and delay in the network on the offloading task completion time. The experimental results confirm the need for applying restart. Next, in Sect. 4 the workflow of the adaptive restart scheme is introduced. In Sect. 5 we describe the mathematical derivation of a condition used to determine the optimal threshold and timeout. The dynamic adaptive restart scheme implemented in a practical offloading platform is presented in Sect. 6. The experiments and results in Sect. 7 confirm the efficiency of the adaptive restart scheme. Section 8 concludes the paper.

2 Background and related work

Mobile offloading as a concept has been around for more than a decade. Thin clients using a remote infrastructure for compute-intensive tasks have long been seen as a method for addressing the challenges of distribution and mobility, as in pervasive computing [34]. Powerful distributed systems, as in Cloud computing, aim at turning computing into a utility [5]. Recently, mobile offloading has been developed to merge Cloud computing and mobile computing. Research in offloading methods can be divided into three main directions [13]: client–server communication, virtual machine migration and mobile agents. We now discuss related work in all three areas.

  1.

    Client–server communication: communication can be supported by pre-installation of the application in both the mobile client and the server. In this case one can benefit from existing stable protocols for process communication between mobile and surrogate devices. This is the basis for the systems in [2, 11, 14, 21, 23].

  2.

    Virtual machine migration: offloading can be implemented as the migration of the complete virtual machine executing the application. The most fascinating property of this method is that no code is changed for offloading of a program. The memory image of a mobile client is copied and transferred to the destination server without interrupting or stopping any execution on the client. Although this method has clear advantages as it avoids having two versions of a program, it requires a high volume of transferred data [4, 6, 36].

  3.

    Mobile agents: Scavenger [22] introduces a framework that partitions and distributes heavy jobs to mobile surrogates in the vicinity rather than to cloud servers. Offloading to more than one surrogate is the merit of this framework.

All offloading systems mentioned so far may suffer from poor network conditions, so the application of well-designed fault-tolerance methods is advisable. Restart is a popular preventive maintenance method to mitigate network failures. It can be very effective for certain types of failures, and its performance has been widely studied [1, 12, 15]. Markov chain models and Laplace transforms have been developed to analyse the performance of restart for improving the expected task completion time [1, 3, 20, 28, 37]. These analyses strongly support the efficiency of restart if the best restart timeout is known, but their implementation in an online algorithm for practical applications is not straightforward. A fast method based on iteration theory to identify the optimal restart time is presented in [24]. The algorithm is improved in [41–43] and tailored to Internet applications in [33].

Based on the assumption that service times are phase-type distributed, Fourneau et al. [15] have proposed an open queueing network based on the G-network formalism to study the effectiveness of the restart method in a system with multiple clients. By comparing analytical and simulation results, the authors showed that in more realistic scenarios the arrival process of restart signals is not independent of the job arrival process. In [10], a simulation-based framework, SFERA, is introduced for the investigation of restart algorithms and their impact on request completion times in service-oriented systems. SFERA observes systems from the client's point of view and evaluates the timing behaviour of a service-oriented architecture.

Restart has also been widely deployed in high-performance computing (HPC) systems, where it is generally combined with checkpointing [12, 16]. In [26], a checkpoint–restart framework, BlobCR, is proposed specifically for optimizing HPC applications that need to be ported to clouds. Since recent trends show an increasing rate of soft errors, Ni et al. [27] introduced an automatic checkpoint–restart framework, ACR, to detect soft or hard faults and recover the system with minimal application intervention. In [9], the authors provide a formula to compute the optimal number of checkpoints for cloud jobs under various distributions of failure events, together with an adaptive algorithm that optimizes the impact of the checkpoint–restart scheme with respect to various costs such as overhead.

In addition, restart is exploited in numerical optimization. A heuristic adaptive restart technique is proposed in [29] to improve the convergence rate of accelerated gradient schemes, which reduce the complexity of first-order algorithms for minimizing smooth convex functions. As the iterates generated by an accelerated gradient scheme exhibit periodic behaviour, the optimal restart interval is proportional to this period.

3 Offloading over an unreliable network

To observe and analyse the impact of an unreliable network connection on a mobile offloading system, we designed an experiment that shows how system performance changes with network quality. We assume that the task completion time consists of the remote execution time and the data transmission time. The execution time is assumed constant for a given task and device, so delays are added by the data transfer. In particular, we assume that the task completion times on the mobile device and on the cloud server may differ, but both are more or less constant for identical tasks at different times. The offloading completion time (OCT) varies greatly because data transmission times differ over time. The impact of heavy load on the server is not considered in this paper; overload conditions may persist for longer durations and induce greater correlation in the task completion times, an effect we neglect. In our analysis, delays in the server cannot be distinguished from delays caused by the network, and we attribute all variability to the latter.

In the remainder of this section we first introduce our mobile offloading system and the sample application we use for demonstration purposes. Then we experimentally demonstrate how system performance varies over the day due to changing load in our wireless network. The task completion time is described by fitting a distribution to selected subsets of the data. This shows that the variance of the task completion time distribution increases significantly for certain subsets of the data.

3.1 Experiment configuration

Offloading can be beneficial if two conditions hold: first, the task must have a heavy computation requirement, and second, only a small amount of data must be transmitted between the mobile device and the server. An application that meets both requirements is optical character recognition (OCR), but there are many more. OCR recognizes the characters on a binary image with optional polygonal text regions. Generally, the recognition algorithm consists of three steps: (1) the layout of the image is analysed to find the baselines of the text regions; (2) the text regions are chopped into components based on the gaps along the baselines; (3) each component is recognized as one or more characters by comparing its shape with a trained database. For details of OCR, the interested reader is referred to [38].

All three steps of OCR require heavy computation: a series of complicated operations on the image, such as rotation, segmentation and comparison, has to be performed. Performing these tasks on the mobile device consumes a lot of energy, whereas for the powerful remote server energy usage is not a critical metric. In addition, most text images can be stored in small files of at most a few kilobytes, so the amount of data to transmit from the mobile device to the remote server is small. Still, the time needed for the transmission depends on the quality of the network connection.

For the experiments a mobile phone (Samsung GT-S7568, Android 4.0) and a server (4 cores: Intel Xeon CPU E5649 2.53 GHz) have been used. The mobile phone is placed in a dormitory room and connects to the Internet through Wifi (54 Mbps, provided by a local telecom operator). The server is in the laboratory of the university campus and connects to the Internet through a 100 Mbps LAN port. We have used the Linux command “traceroute” to track the route from the mobile phone to the server. Normally, the route passes 12 hops to reach the destination, and the total round-trip time is around 82 ms. The offloading engine as introduced in [44] includes an Android application (App) for the mobile client and a website project for the server. In our experiment, the Tesseract OCR Engine is implemented in both parts of the offloading engine. An image (\(1160\times 391\) px, 8.1 KiB) with a rectangular text region, as shown in Fig. 1, is used for image recognition. Only 100 bytes are needed to represent the deciphered words.

Fig. 1

Image to be recognized

Completion of an offloaded OCR task can be divided into three phases: (1) the Android application transmits the image from the mobile device to the server, (2) the words on the image are recognized using the OCR engine on the server, and (3) the mobile device receives and displays the result from the server. The OCT is the time needed to complete these three steps. The same offloading task has been repeated more than 58,000 times in approximately 24 h in order to observe the OCT under different network conditions. The results are stored in a text file on the mobile device. The memory of the mobile phone used for caching is cleared after each task completion and reused in the next task.

In addition, we conducted a different experiment where the image recognition is performed on the mobile device. We call this local execution, as all the processing steps (e.g. analysis, chopping and recognition) are completed by the mobile device itself. The completion time is called local completion time (LCT). The same image (Fig. 1) is repeatedly recognized 8400 times by local execution. In the next subsection, we will show that although local execution is slower than offloading, it is considerably more stable.

Figure 2 shows a scatter plot of the entire 58,000 samples over a 24-h period starting at 8 am on 14 January 2014. Under the assumption of a constant processing time, a large total completion time can be attributed to a long transmission time, i.e. poor network performance. The majority of the samples fall into the range between 980 and 1380 ms, corresponding to the 0.05 and 0.75 quantiles of all samples. Obviously, the distribution of the sample values is not identical at different times. While we do not know the reason for systematic changes in network transmission times, there are clearly several types of typical behaviour that should be distinguished.

Fig. 2

Scatter plot of all OCT samples

We have selected three subsets of our observations, as indicated by the shaded areas in Fig. 2, each containing 2000 samples, which corresponds to a time window of 40 min. This number of samples is enough to decently fit a distribution, and each subset captures one type of network behaviour: the normal, the deteriorated and the bad state.

3.2 Experimental results

Table 1 shows the mean, the quantiles and the variance of the three subsets. In the normal subset the completion time exhibits low variability, as the 0.9 quantile is only 15 % higher than the mean. It is also worth mentioning that for the given application and set-up fast offloading takes in total only half as long as local computation, because remote servers are much faster than mobile devices.

Table 1 Statistics of completion times (ms)

Very roughly speaking, it seems like the network degrades most in the early afternoon and in the evening. We do not try to explain this, as finding the cause of network delays is not within the scope of this paper. Rather we argue that offloading as well as a local restart make sense for certain network conditions and since we rightfully assume that network conditions change over time a sliding window estimate is needed and appropriate.

It should be noted that, on average, even in poor network conditions the offloaded task completes faster than one computed locally. However, in the bad network period enough outliers skew the distribution and increase the sample variance, so the variability in the data is high enough to justify the use of restart.

The completion time measurement of the local computation (LCT) is shown in Fig. 3. Local computation is usually stable, with very few outliers. Most samples fall into a narrow range between 2338 and 2411 ms, corresponding to the 0.05 and 0.9 quantile.

Fig. 3

Scatter plot of all LCT samples

In summary, in the best case offloading can provide a solution in approximately half the time needed for local processing. On the other hand, local execution times are very stable, albeit longer than offloaded processing, which suffers from high variability and hence sometimes takes very long.

3.3 Data analysis

In this section the sampled data will be analysed to determine whether the theoretical conditions for successful restart are met. It can be shown [43] that restart is beneficial if the task completion time follows a distribution with sufficiently high variance or heavy tail. Therefore, the distribution of the experimental data and its variability will be determined. The log–log complementary distribution plot is used to illustrate the weight of the tail of the distribution [8].

Figure 4 shows the completion times of the three subsets and the LCT versus their complementary cumulative distributions on a log scale. Clearly, for the subset of the bad network state the curve has an approximately constant slope of \(-2\), indicating a heavy tail [8]. For the subset in deteriorated condition the tail decays exponentially for long task completion times, so in this case we cannot clearly diagnose a heavy-tailed distribution. For the normal subset the decrease is steep, and for the local computation completion times it is almost vertical. This certainly indicates no heavy tail in the latter two subsets.

Fig. 4

Log–log complementary distribution of the completion time

Completion times of the local computation are almost constant; there is very little variation in the measurements. This means that once local computation has started, restart will certainly not be beneficial. However, during a phase of poor network quality, a local restart may speed up the solution. This does not yet answer the question of what a good choice of the restart timeout could be.

Figures 5, 6 and 7 show the histograms of the three subsets and the density of the fitted distribution. For convenient fitting of phase-type distributions, the histograms have been shifted to the origin by subtracting the minimum value from all observations. The distribution fitting will be discussed in the next section.

Fig. 5

Histogram and PH distribution of the normal subset

Fig. 6

Histogram and PH distribution of the deteriorated subset

Fig. 7

Histogram and PH distribution of the bad subset

3.4 Distribution fitting

In this section, we describe the fitting process for the OCT, shown as histograms and densities in Figs. 5, 6 and 7. Let the random variable \(T_{o}\) represent the OCT of an offloading task without restart. The distribution of \(T_{o}\) is fitted with the cluster-based fitting algorithm [32], which fits a phase-type (PH) distribution to the data. The fitting procedure clusters the data and fits an Erlang distribution to each cluster. The full distribution is then a mixture of those Erlang distributions, a hyper-Erlang distribution.

The hyper-Erlang distribution is suitable for situations where restarts succeed [31]. This distribution takes values from different random variables with different probabilities: with probability \(\alpha _i\) a value is drawn from an Erlang distribution with \(m_i\) phases and parameter \(\lambda _i > 0\), \(i = 1,2,\ldots ,M\), where M is the number of clusters. In general, the hyper-Erlang distribution is represented by a vector–matrix tuple (\(\varvec{\alpha }\), \(\varvec{Q}\)).

$$\begin{aligned} \varvec{Q} = \mathrm {diag}(\varvec{Q}_1, \varvec{Q}_2, \ldots , \varvec{Q}_M) \end{aligned}$$
(1)
$$\begin{aligned} \varvec{\alpha }= (\underbrace{\alpha _1,0,\ldots ,0}_{m_1}, \alpha _2,0, \ldots ,\underbrace{\alpha _M,0,\ldots ,0}_{m_M}) \quad \sum _{i=1}^{M} \alpha _{i} = 1 \end{aligned}$$
(2)

\(\varvec{Q}_i \in {\mathbb {R}}^{m_i \times m_i}, i = 1,\ldots ,M\) is a square matrix with size \(m_i\). The probability density function and cumulative distribution function are defined as:

$$\begin{aligned} f(t)= & {} \varvec{\alpha }e^{\varvec{Q}t}(-\varvec{Q}\cdot \mathbf 1 ) \ = \sum _{i=1}^{M}\alpha _i \dfrac{(m_i \lambda _i)^{m_i} t^{m_i-1}}{(m_i-1)!} e^{- m_i \lambda _i t} \quad (t \geqslant 0) \nonumber \\\end{aligned}$$
(3)
$$\begin{aligned} F(t)= & {} 1 - \varvec{\alpha }e^{\varvec{Q}t} \cdot \mathbf 1 \ = 1 - \sum _{i=1}^{M} \alpha _i \left( \sum _{k=0}^{m_i-1} \dfrac{(m_i \lambda _i t)^k}{k!} e^{-m_i \lambda _i t} \!\right) \! ,\nonumber \\ \end{aligned}$$
(4)

where 1 is the column vector of ones with the appropriate size.
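The density and distribution function of Eqs. (3) and (4) can be evaluated directly from the parameters; the following sketch (function names are ours) does so for given branch probabilities \(\alpha _i\), phase counts \(m_i\) and parameters \(\lambda _i\):

```python
import math

def hyper_erlang_pdf(t, alphas, ms, lams):
    """Density f(t) of Eq. (3): branch i is an Erlang with m_i phases and
    phase rate m_i * lambda_i, chosen with probability alpha_i."""
    return sum(
        a * (m * lam) ** m * t ** (m - 1) / math.factorial(m - 1)
        * math.exp(-m * lam * t)
        for a, m, lam in zip(alphas, ms, lams)
    )

def hyper_erlang_cdf(t, alphas, ms, lams):
    """Distribution function F(t) of Eq. (4)."""
    return 1.0 - sum(
        a * sum((m * lam * t) ** k / math.factorial(k) for k in range(m))
        * math.exp(-m * lam * t)
        for a, m, lam in zip(alphas, ms, lams)
    )
```

With \(M=1\) and \(m_1=1\) both functions reduce to the exponential distribution, which is a convenient sanity check of the parameterization.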

Although the hyper-Erlang distribution has exponentially decaying tails, its variance can still be large enough to fulfil the requirements for successful restart as formally introduced in Sect. 5.1.

Since the completion times of a task have a lower bound greater than zero, as can be seen in Fig. 2, while PH distributions preferably have nonzero density at the origin, we shifted the observations to the origin by subtracting the minimum observed value \(T_{min}^{o}\) of \(T_{o}\), fitted a PH density \(f'_{o}(t)\) to the shifted data, and obtained the density of \(T_{o}\) as \(f_{o}(t)=f'_{o}(t-T_{min}^{o})\).

Figures 5, 6 and 7 show the histograms and the PH results of the shifted \(T_{o}\) of the normal, deteriorated and bad subset. We used three clusters to fit the data, \(M=3\). Since we grouped the data into three categories, this seemed a natural choice. Of course, one could have chosen more clusters, which might have increased the accuracy of the fit. The parameter results are shown in Table 2. Table 3 shows the error measures of the PH results of the three subsets. We use the area difference between densities, \(\bigtriangleup f\), and the relative error in the first moment, \(e_1\), to measure the error: \(\bigtriangleup f = \int _0^\infty |\hat{f}(t)-f(t)|\mathrm{d}t\) and \(e_1 = \dfrac{|\hat{c_1}-c_1|}{c_1}\), where f(t) denotes the empirical pdf of the distribution to be fitted, \(\hat{f}(t)\) is the pdf of the PH result, and \(c_1\) and \(\hat{c_1}\) are the first moments of the empirical and the fitted PH distribution, respectively.

Table 2 Hyper-Erlang parameters
Table 3 Error

4 Failure handling with restart

Our concept of mobile offloading is to provide both options, i.e. executing a given application locally on the mobile device as well as remotely on a server, and selecting the more suitable of the two. In the normal system state, the mobile device can establish a reliable connection with the server to send parameters and receive results. When the connection is interrupted or degraded, the system is in some kind of failed state. The failure handling consists of restart after expiry of a timeout. The workflow of normal offloading and restart is shown in Fig. 8.

Fig. 8

Process of the offloading execution with adaptive restart

4.1 Offloading retry

We assume that during the offloading process, if the network connection is disturbed and cannot support the data transmission, the mobile device is informed and waits. A timeout value restricts the waiting time. If the network recovers before the timeout expires, the offloading process resumes. If the network is not robust enough, however, the connection is unable to recover in a short time; this is assumed to happen when the waiting time exceeds the timeout. In this case the previous task is abandoned and a new try of the same offloading task is started.

4.2 Adaptive restart

At times the offloading process may meet an unpredictable delay in data transmission. If the connection quality suffers a long-lasting disturbance, the mobile device has to restart several times and wait long for an up-time of the connection sufficient to complete the data transmission. Such redundant restarts consume not only time but also energy. To avoid this long waiting time, the previously offloaded task is restarted locally on the mobile device. Although the local restart may take longer than offloading, execution continuity is maintained.
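The control flow of the scheme can be summarized as follows. This is a schematic sketch, not our implementation; `offload`, `run_locally` and the convention that a timed-out attempt returns `None` are illustrative assumptions:

```python
def adaptive_restart(offload, run_locally, timeout, max_tries):
    """Adaptive restart sketch: retry offloading up to max_tries times,
    each attempt bounded by the timeout; afterwards fall back to local
    execution, which is never aborted.

    offload(timeout) is assumed to return the result, or None when the
    attempt timed out or failed.
    """
    for attempt in range(1, max_tries + 1):
        result = offload(timeout)
        if result is not None:
            return result, attempt      # offloading succeeded
    return run_locally(), max_tries     # threshold exceeded: local fallback
```

Choosing `timeout` and `max_tries` is exactly the optimization problem addressed in Sect. 5.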

5 Optimal adaptive restart

When using restart one has to decide whether and when to abort a running task and restart it. Obviously, there is a trade-off between waiting for the offloading task to complete and terminating the attempt to try again locally. In [43], an iterative solution for an infinite number of possible retries has been derived. In this section, we adapt the solution for computing the optimal timeout from [43] to the expected task completion time of the adaptive restart scheme. Based on this theoretical analysis, the optimal timeout for every restart and the optimal threshold for the number of offloading restarts are derived.

For a given random variable T describing task completion time restart after a timeout \(\tau \) is promising if the following condition holds [43]:

$$\begin{aligned} E[T]<E[T-\tau |T>\tau ] \end{aligned}$$
(5)

Condition (5) states that for restart to be beneficial the expected completion time when restarting from scratch must be less than the expected time still needed when waiting for completion. It can be shown [43] that condition (5) holds if the task completion time follows a distribution with sufficiently high variance or a heavy tail.
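Condition (5) can also be checked empirically from measured completion times. The sketch below (ours; the function name is illustrative) compares the overall mean with the mean residual time beyond a candidate timeout \(\tau \):

```python
def restart_beneficial(samples, tau):
    """Empirical check of condition (5): restart at timeout tau is promising
    if the mean completion time is smaller than the mean residual time
    E[T - tau | T > tau] of tasks still running at tau."""
    mean_all = sum(samples) / len(samples)
    residuals = [t - tau for t in samples if t > tau]
    if not residuals:
        return False        # no task survives past tau; restart never fires
    mean_residual = sum(residuals) / len(residuals)
    return mean_all < mean_residual
```

For samples with a few large outliers the check succeeds, while for nearly constant samples (such as the LCT of Sect. 3) it fails, matching the discussion above.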

5.1 Derivation of the restart condition

Remember that \(T_{o}\) represents the OCT of an offloading task without restart; its density is \(f_{o}(t)\) and its distribution function is \(F_{o}(t)\). Let \(\tau \) be the restart time, at which the previous offloading try is aborted and a new try is issued. Correspondingly, \(T_{l}\) represents the local computation time LCT of the same task, with density \(f_{l}(t)\) and distribution \(F_{l}(t)\). We assume that \(F_{o}(t)\) and \(F_{l}(t)\) are continuous probability distribution functions on \([0, \infty )\) with \(F_{o}(t) > 0\) and \(F_{l}(t) > 0\) for \(t > 0\). We introduce T to denote the completion time when restart is allowed and write f(t) and F(t) for its density and cumulative distribution function, respectively. We are interested in the expectation of T using the optimal timeout \(\tau ^{*}\). The value of \(\tau \) could be changed in real time according to system performance, but to simplify the theoretical analysis we assume that \(\tau \) is constant while a given task is completed. In the next subsection we then use individual timeout values for the restarts with offloading and with local execution. With at most n offloading tries, each aborted after the timeout \(\tau \), followed by local execution, the distribution function and the density of T are:

$$\begin{aligned} F(t)= & {} \left\{ \begin{array}{ll} F_{o}(t) &{}\quad 0 \leqslant t < \tau \\ 1-(1-F_{o}(\tau ))(1-F_{o}(t-\tau )) &{}\quad \tau \leqslant t < 2\tau \\ \qquad \qquad \vdots &{} \\ 1-(1-F_{o}(\tau ))^{n-1} (1-F_{o}(t-(n-1)\tau )) &{} \quad (n-1)\tau \leqslant t < n\tau \\ 1-(1-F_{o}(\tau ))^n (1-F_{l}(t-n\tau )) &{}\quad n\tau \leqslant t\\ \end{array} \right. \end{aligned}$$
(6)
$$\begin{aligned} f(t)= & {} \left\{ \begin{array}{ll} f_{o}(t) &{}\quad 0 \leqslant t < \tau \\ (1-F_{o}(\tau ))f_{o}(t-\tau ) &{}\quad \tau \leqslant t < 2\tau \\ \qquad \qquad \vdots &{} \\ (1-F_{o}(\tau ))^{n-1}f_{o}(t-(n-1)\tau ) &{}\quad (n-1)\tau \leqslant t < n\tau \\ (1-F_{o}(\tau ))^nf_{l}(t-n\tau ) &{}\quad n\tau \leqslant t \\ \end{array} \right. \end{aligned}$$
(7)

Note that Eqs. (6) and (7) hold for \(n\geqslant 2\); the case \(n=1\) is the single local restart mode analysed in [46]:

$$\begin{aligned} F(t)= & {} \left\{ \begin{array}{ll} F_{o}(t) &{} \quad (0 \leqslant t < \tau ) \\ 1-(1-F_{o}(\tau ))(1-F_{l}(t-\tau )) &{} \quad (\tau \leqslant t)\\ \end{array} \right. \end{aligned}$$
(8)
$$\begin{aligned} f(t)= & {} \left\{ \begin{array}{ll} f_{o}(t) &{} \quad (0 \leqslant t < \tau ) \\ (1-F_{o}(\tau ))f_{l}(t-\tau ) &{} \quad (\tau \leqslant t) \\ \end{array} \right. \end{aligned}$$
(9)

We define the partial expectation \(M(\tau )\) of the completion time T to determine its expectation E[T].

$$\begin{aligned} M(\tau ) = \int _0^\tau tf(t)\mathrm{d}t = \int _0^\tau tf_{o}(t)\mathrm{d}t \end{aligned}$$
(10)

The respective densities of T and \(T_{o}\) are identical between 0 and \(\tau \), so their partial expectations are equal as well.

$$\begin{aligned} E[T]= & {} \int _0^\tau tf_{o}(t)\mathrm{d}t + (1-F_{o}(\tau ))\int _\tau ^{2\tau }tf_{o}(t-\tau )\mathrm{d}t + \cdots \nonumber \\&+\, (1-F_{o}(\tau ))^{n-1}\int _{(n-1)\tau }^{n\tau } tf_{o}(t-(n-1)\tau )\mathrm{d}t \nonumber \\&+ \,(1-F_{o}(\tau ))^n\int _{n\tau }^\infty tf_{l}(t-n\tau )\mathrm{d}t \nonumber \\= & {} M(\tau ) + (1-F_{o}(\tau ))(M(\tau ) + \tau F_{o}(\tau )) + \cdots \nonumber \\&+\,(1-F_{o}(\tau ))^{n-1}(M(\tau ) + (n-1) \tau F_{o}(\tau )) \nonumber \\&+\,(1-F_{o}(\tau ))^n(n\tau + E[T_{l}]) \nonumber \\= & {} \sum _{k=0}^{n-1}(1-F_{o}(\tau ))^k(M(\tau )+k \tau F_{o}(\tau )) \nonumber \\&+\, (1-F_{o}(\tau ))^n(n\tau +E[T_{l}]), \end{aligned}$$
(11)

\(\because \)

$$\begin{aligned}&(1-F_{o}(\tau ))^n n \tau + (1-F_{o}(\tau ))^{n-1} (n-1) \tau F_{o}(\tau )\nonumber \\&\quad = (1-F_{o}(\tau ))^n\tau + (1-F_{o}(\tau ))^n(n-1)\tau \nonumber \\&\qquad + \,(1-F_{o}(\tau ))^{n-1}(n-1)\tau F_{o}(\tau ) \nonumber \\&\quad = (1-F_{o}(\tau ))^n\tau + (1-F_{o}(\tau ))^{n-1}(n-1)\tau , \end{aligned}$$
(12)

\(\therefore \)

$$\begin{aligned}&\sum _{k=0}^{n-1}(1-F_{o}(\tau ))^k k \tau F_{o}(\tau ) + (1-F_{o}(\tau ))^n n \tau \nonumber \\&\quad = \tau \sum _{k=1}^{n}(1-F_{o}(\tau ))^k. \end{aligned}$$
(13)

Substituting (13) in Eq. (11), we get

$$\begin{aligned} E[T]= & {} M(\tau ) + \sum _{k=1}^{n-1}(1-F_{o}(\tau ))^k(M(\tau )+\tau ) \nonumber \\&+ \,(1-F_{o}(\tau ))^n(\tau +E[T_{l}]). \end{aligned}$$
(14)

Let \(E[T_{i}]\) denote E[T] for \(n=i\), and let \(E[T_{\tau }]\) denote the limit \(n \rightarrow \infty \). The case \(n\rightarrow \infty \) is the mode with an infinite number of offloading restarts, for which [43] derived:

$$\begin{aligned} E[T_{\tau }]= \dfrac{M(\tau )+\tau }{F_{o}(\tau )} - \tau . \end{aligned}$$
(15)

When \(n = 1\),

$$\begin{aligned} E[T_{1}]= M(\tau ) + (1-F_{o}(\tau ))(\tau + E[T_{l}]). \end{aligned}$$
(16)

The criterion to decide whether to restart or not can be formulated as \(E[T] < E[T_{o}]\). With (14), this condition can be written as the following inequality:

$$\begin{aligned} E[T_{l}] < \dfrac{\int _\tau ^\infty tf_{o}(t)\mathrm{d}t}{(1-F_{o}(\tau ))^n} - \sum _{k=1}^{n-1}\dfrac{M(\tau )+\tau }{(1-F_{o}(\tau ))^k} - \tau \end{aligned}$$
(17)

In [46], the restart criterion for \(n=1\) has been derived as

$$\begin{aligned} E[T_{l}] < \dfrac{\int _{\tau }^\infty tf_{o}(t)\mathrm{d}t}{1-F_{o}(\tau )} - \tau \end{aligned}$$
(18)

For \(n \rightarrow \infty \), the restart criterion is

$$\begin{aligned} \dfrac{M(\tau )+\tau }{F_{o}(\tau )} - \tau < E[T_{o}]. \end{aligned}$$
(19)

Theorem 1

When the criterion of exclusive local restart (\(n=1\), Eq. 18) is satisfied, the criterion of the adaptive restart scheme (\(n>1\), Eq. 17 or \(n\rightarrow \infty \), Eq. 19) is also satisfied.

Proof

Since the prerequisite for using offloading is \(E[T_{o}]<E[T_{l}]\), inequality (18) can be extended as:

$$\begin{aligned} E[T_{o}] < E[T_{l}] < \dfrac{E[T_{o}]-M(\tau )}{1-F_{o}(\tau )} - \tau \end{aligned}$$
(20)

\(\Longrightarrow \)

$$\begin{aligned} E[T_{o}](1-F_{o}(\tau )) < E[T_{o}]-M(\tau ) - \tau (1-F_{o}(\tau )) \end{aligned}$$

\(\Longrightarrow \)

$$\begin{aligned} M(\tau ) + \tau (1-F_{o}(\tau )) < E[T_{o}]F_{o}(\tau ) \end{aligned}$$

\(\Longrightarrow \)

$$\begin{aligned} \dfrac{M(\tau ) + \tau }{F_{o}(\tau )} - \tau < E[T_{o}]. \end{aligned}$$

So when \(n \rightarrow \infty \), Theorem 1 is true. Let

$$\begin{aligned} g_{1}(\tau ) = \dfrac{\int _{\tau }^\infty tf_{o}(t)\mathrm{d}t}{1-F_{o}(\tau )} - \tau . \end{aligned}$$
(21)

When \(n\geqslant 2\),

$$\begin{aligned}&g_{n}\left( \tau \right) = \dfrac{\int _\tau ^\infty tf_{o}\left( t\right) \mathrm{d}t}{\left( 1-F_{o}\left( \tau \right) \right) ^n} - \sum _{k=1}^{n-1}\dfrac{M\left( \tau \right) +\tau }{\left( 1-F_{o}\left( \tau \right) \right) ^k} - \tau . \end{aligned}$$
(22)
$$\begin{aligned}&g_{n}\left( \tau \right) - g_{n-1}\left( \tau \right) = \dfrac{\int _\tau ^\infty tf_{o}\left( t\right) \mathrm{d}t}{\left( 1-F_{o}\left( \tau \right) \right) ^{n-1}}\left( \dfrac{1}{1-F_{o}\left( \tau \right) }-1\right) \nonumber \\&\qquad \qquad \qquad \qquad \quad \qquad -\, \dfrac{M\left( \tau \right) +\tau }{\left( 1-F_{o}\left( \tau \right) \right) ^{n-1}} \nonumber \\&\qquad \qquad \qquad \;\quad \qquad = \dfrac{E[T_{o}]-M\left( \tau \right) }{\left( 1-F_{o}\left( \tau \right) \right) ^n} F_{o}\left( \tau \right) \nonumber \\&\qquad \qquad \qquad \;\quad \qquad \quad \,-\, \dfrac{M\left( \tau \right) +\tau }{\left( 1-F_{o}\left( \tau \right) \right) ^{n-1}} \end{aligned}$$
(23)

Recalling the result of Eq. (19) and substituting for \(E[T_{o}]\), we have

$$\begin{aligned} g_{n}(\tau ) - g_{n-1}(\tau )> & {} \dfrac{\dfrac{M(\tau ) + \tau }{F_{o}(\tau )} - \tau -M(\tau )}{(1-F_{o}(\tau ))^n} F_{o}(\tau ) \nonumber \\&-\, \dfrac{M(\tau )+\tau }{(1-F_{o}(\tau ))^{n-1}} = 0. \end{aligned}$$
(24)

Hence, we have proved that \(g_{n}(\tau ) > g_{n-1}(\tau )\) when \(n\geqslant 2\). Thus, if \(g_{1}(\tau )>E[T_{l}]\), inequality (17) holds for any \(n>1\), including \(n\rightarrow \infty \).

Figure 9 shows the result of (22), calculated according to \(f_{o}(t)\) of the bad subset from Table 2; it also illustrates Theorem 1. An interesting point worth mentioning is that when \((M(\tau )+\tau )/F_{o}(\tau ) - \tau = E[T_{o}]\), \(\tau \) is at its critical value and \(g_{n}(\tau ) = g_{n-1}(\tau ) = \cdots = g_{1}(\tau ) = E[T_{o}]\). When \(\tau \) exceeds this critical value, \(g_{n}(\tau )\) increases with n for the same \(\tau \). Thus, if \(g_{1}(\tau ) > E[T_{l}]\), then \(g_{n}(\tau ) > E[T_{l}]\). This proves Theorem 1.
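The monotonicity of \(g_{n}(\tau )\) in n, which underlies Theorem 1, can also be checked numerically from (22). The hyperexponential offloading distribution below is an assumed example, not the data of Table 2:

```python
import math

# Assumed hyperexponential offloading time: fast path w.p. 0.9, slow w.p. 0.1
P, LAM = [0.9, 0.1], [1.0, 0.01]

def F_o(t):
    return 1.0 - sum(p * math.exp(-l * t) for p, l in zip(P, LAM))

def M(t):                                   # partial moment int_0^t u f_o(u) du
    return sum(p * ((1.0 - math.exp(-l * t)) / l - t * math.exp(-l * t))
               for p, l in zip(P, LAM))

E_To = sum(p / l for p, l in zip(P, LAM))   # E[T_o]

def g(n, tau):
    # Eq. (22); the tail integral equals E[T_o] - M(tau)
    q = 1.0 - F_o(tau)
    return ((E_To - M(tau)) / q**n
            - sum((M(tau) + tau) / q**k for k in range(1, n)) - tau)

tau = 3.0
assert (M(tau) + tau) / F_o(tau) - tau < E_To      # restart criterion (19) holds
vals = [g(n, tau) for n in range(1, 5)]
assert all(b > a for a, b in zip(vals, vals[1:]))  # g_n strictly increasing in n
```

With these parameters, criterion (19) holds at \(\tau = 3\), so \(g_{n}(\tau )\) grows with n, matching the monotonicity used in the proof.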

Fig. 9

Restart timeout with 0–4 offloading retries and 1 local retry

Theorem 2

When the restart criterion is satisfied, i.e. inequality (19) holds, the expected task completion time \(E[T_{n}]\) decreases with the number of restarts n at the same \(\tau \).

Proof

Using the prerequisite of offloading, \(E[T_{o}]<E[T_{l}]\), inequality (19) can be extended as:

$$\begin{aligned} \dfrac{M(\tau ) + \tau }{F_{o}(\tau )} - \tau < E[T_{l}]. \end{aligned}$$
(25)

When \(n > 1\), we have \(E[T_{n}] < E[T_{n-1}]\):

$$\begin{aligned}&E[T_{n}] - E[T_{n-1}] \nonumber \\&\quad =\left( 1-F_{o}\left( \tau \right) \right) ^{n-1}\left( \tau + M\left( \tau \right) \right) + \left( 1-F_{o}\left( \tau \right) \right) ^{n-1}\left( \tau \right. \nonumber \\&\qquad \left. +\,E[T_{l}]\right) \left( 1-F_{o}\left( \tau \right) -1\right) \nonumber \\&\quad =\left( 1-F_{o}\left( \tau \right) \right) ^{n-1}\left( \left( \tau + M\left( \tau \right) \right) - \left( \tau + E[T_{l}]\right) F_{o}\left( \tau \right) \right) \nonumber \\&\quad = \left( 1-F_{o}\left( \tau \right) \right) ^{n-1}F_{o}\left( \tau \right) \left( \dfrac{\tau + M\left( \tau \right) }{F_{o}\left( \tau \right) } - \tau - E[T_{l}]\right) <0.\nonumber \\ \end{aligned}$$
(26)

When \(n \rightarrow \infty \):

$$\begin{aligned}&E[T_{\tau }] - E[T_{n}] \nonumber \\&\quad = \dfrac{M(\tau )+\tau }{F_{o}(\tau )} - \tau - M(\tau ) - \sum _{k=1}^{n-1}(1-F_{o}(\tau ))^k(M(\tau )+\tau ) \nonumber \\&\qquad -\, (1-F_{o}(\tau ))^n(\tau +E[T_{l}]), \end{aligned}$$
(27)

\(\because \)

$$\begin{aligned} \sum _{k=1}^{n-1}(1-F_{o}(\tau ))^k = \dfrac{(1-F_{o}(\tau ))-(1-F_{o}(\tau ))^n}{F_{o}(\tau )} , \end{aligned}$$
(28)

\(\therefore \)

$$\begin{aligned}&E[T_{\tau }] - E[T_{n}] \nonumber \\&\quad =\dfrac{M\left( \tau \right) +\tau }{F_{o}\left( \tau \right) } - \tau - M\left( \tau \right) - \dfrac{1-F_{o}\left( \tau \right) }{F_{o}\left( \tau \right) }\left( M\left( \tau \right) +\tau \right) \nonumber \\&\qquad + \,\dfrac{\left( 1-F_{o}\left( \tau \right) \right) ^n}{F_{o}\left( \tau \right) }\left( \tau +M\left( \tau \right) \right) \nonumber \\&\qquad - \,\left( 1-F_{o}\left( \tau \right) \right) ^n\left( \tau +E[T_{l}]\right) \nonumber \\&\quad = \left( 1-F_{o}\left( \tau \right) \right) ^n\left( \dfrac{M\left( \tau \right) + \tau }{F_{o}\left( \tau \right) } - \tau - E[T_{l}]\right) <0. \end{aligned}$$
(29)

Thus if inequality (25) holds, we have \(E[T_{\tau }]<E[T_{n}]<\cdots <E[T_{1}]<E[T_{o}]<E[T_{l}]\). This proves Theorem 2.
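The ordering \(E[T_{\tau }]<E[T_{n}]<\cdots <E[T_{1}]<E[T_{o}]<E[T_{l}]\) can be confirmed numerically. The following sketch evaluates Eqs. (14) and (15) for an assumed hyperexponential offloading time with \(E[T_{o}] < E[T_{l}]\); all parameters are illustrative:

```python
import math

# Numeric check of Theorem 2 under an assumed hyperexponential offloading time
P, LAM = [0.9, 0.1], [1.0, 0.01]
E_TL, TAU = 20.0, 3.0                      # chosen so that E[T_o] < E[T_l]

def F_o(t):
    return 1.0 - sum(p * math.exp(-l * t) for p, l in zip(P, LAM))

def M(t):                                  # partial moment int_0^t u f_o(u) du
    return sum(p * ((1.0 - math.exp(-l * t)) / l - t * math.exp(-l * t))
               for p, l in zip(P, LAM))

E_To = sum(p / l for p, l in zip(P, LAM))
q = 1.0 - F_o(TAU)

def E_T(n):                                # Eq. (14)
    return (M(TAU) + sum(q**k * (M(TAU) + TAU) for k in range(1, n))
            + q**n * (TAU + E_TL))

E_T_inf = (M(TAU) + TAU) / F_o(TAU) - TAU  # Eq. (15)

assert E_T_inf < E_TL                      # criterion (25) holds
assert E_To < E_TL                         # offloading prerequisite
chain = [E_T_inf] + [E_T(n) for n in range(5, 0, -1)] + [E_To, E_TL]
assert chain == sorted(chain)   # E[T_tau] < E[T_5] < ... < E[T_1] < E[T_o] < E[T_l]
```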

5.2 Optimal restart timeout

The optimal restart timeout is the value of \(\tau \) at which \(E[T_{i}]\) is minimal. To compare the system performance under the adaptive restart scheme with different numbers of restarts, Fig. 10 shows \(E[T_{i}]\ (i = 1\)–5), \(E[T_{\tau }]\) and \(E[T_{o}]\) for the same data \(f_{o}(t)\) of the bad subset from Table 2. As expected, \(E[T_{\tau }]\) has the best performance: its minimum is the lowest. The most important observation in Fig. 10 is that for \(i = 2\), \(E[T_{2}]\) is already quite close to the ideal performance of \(E[T_{\tau }]\). To evaluate the performance of different numbers of restarts, we define a metric measuring the distance between the optimal performance \(E[T_{\tau }]\) and \(E[T_{i}]\). As we have proved that \(E[T_{i}] < E[T_{o}]\) for \(i = 1,2,\ldots , \infty \), the value of \(E[T_{i}]\) can only vary in the interval \((E[T_{\tau }], E[T_{o}])\). So we define \(d_{i} = (E[T_{o}]-E[T_{i}])/(E[T_{o}]-E[T_{\tau }])\) to measure how close \(E[T_{i}]\) is to \(E[T_{\tau }]\).
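Computing \(d_{i}\) is straightforward once \(E[T_{i}]\), \(E[T_{\tau }]\) and \(E[T_{o}]\) are available. The sketch below uses an assumed hyperexponential offloading time, so the resulting ratios differ from the measured values in Table 4:

```python
import math

# d_i under an assumed hyperexponential offloading time (illustrative parameters)
P, LAM = [0.9, 0.1], [1.0, 0.01]
E_TL, TAU = 20.0, 3.0                       # satisfies E[T_o] < E[T_l]

def F_o(t):
    return 1.0 - sum(p * math.exp(-l * t) for p, l in zip(P, LAM))

def M(t):                                   # partial moment int_0^t u f_o(u) du
    return sum(p * ((1.0 - math.exp(-l * t)) / l - t * math.exp(-l * t))
               for p, l in zip(P, LAM))

E_To = sum(p / l for p, l in zip(P, LAM))
q = 1.0 - F_o(TAU)

def E_T(n):                                 # Eq. (14)
    return (M(TAU) + sum(q**k * (M(TAU) + TAU) for k in range(1, n))
            + q**n * (TAU + E_TL))

E_T_inf = (M(TAU) + TAU) / F_o(TAU) - TAU   # Eq. (15)
d = [(E_To - E_T(i)) / (E_To - E_T_inf) for i in range(1, 6)]
assert all(0.0 < x < 1.0 for x in d) and d == sorted(d)
assert d[1] > 0.9                           # two restarts are already near-optimal
```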

Fig. 10

Expectation of the task completion time versus \(\tau \) under different restart polices

From Table 4, we can easily see that the performance of using two restarts closely approaches the best performance of infinite offloading restarts: \(E[T_{2}]\) reaches 87 % of the performance of \(E[T_{\tau }]\), whereas restarting exclusively with local execution, \(E[T_{1}]\), reaches merely 50 %. Accordingly, we conclude that the adaptive restart scheme is better than the exclusive local restart, because it uses the offloading restart to speed up the task completion. At the same time, by applying the local restart, the adaptive restart scheme avoids the unpredictable waiting time of redundant offloading restarts. We will confirm this conclusion experimentally in Sect. 7.

Table 4 Performance comparison under identical restart timeout

5.3 Individual restart timeout

In the previous subsection, we have evaluated the performance of the adaptive restart scheme with an identical optimal timeout value. In this subsection, we analyse applying a different timeout \(\tau _{l}\) to the last offloading attempt, which triggers the local restart. \(T^{\prime }\) denotes the task completion time when \(\tau _{l}\) is used.

$$\begin{aligned}&F(t) \nonumber \\&\quad =\left\{ \begin{array}{ll} F_{o}(t) &{} \quad 0 \leqslant t < \tau \\ 1-(1-F_{o}(\tau ))(1-F_{o}(t-\tau )) &{} \quad \tau \leqslant t < 2\tau \\ \qquad \qquad \vdots \\ 1-(1-F_{o}(\tau ))^{n-1} (1-F_{o}(t-(n-1)\tau )) &{} \quad (n-1)\tau \leqslant t < (n-1)\tau + \tau _{l}\\ 1-(1-F_{o}(\tau ))^{n-1}(1-F_{o}(\tau _{l})) (1-F_{l}(t-(n-1)\tau -\tau _{l})) &{} \quad (n-1)\tau + \tau _{l} \leqslant t\\ \end{array} \right. \end{aligned}$$
(30)
$$\begin{aligned}&f(t)=\left\{ \begin{array}{ll} f_{o}(t) &{} \quad 0 \leqslant t < \tau \\ (1-F_{o}(\tau ))f_{o}(t-\tau ) &{} \quad \tau \leqslant t < 2\tau \\ \qquad \qquad \vdots \\ (1-F_{o}(\tau ))^{n-1} f_{o}(t-(n-1)\tau ) &{} \quad (n-1)\tau \leqslant t < (n-1)\tau + \tau _{l} \\ (1-F_{o}(\tau ))^{n-1} (1-F_{o}(\tau _{l})) f_{l}(t-(n-1)\tau -\tau _{l}) &{} \quad (n-1)\tau + \tau _{l} \leqslant t\\ \end{array} \right. \end{aligned}$$
(31)

As for functions (6) and (7), \(n \geqslant 2\) holds in functions (30) and (31). When \(n=1\), the situation is identical to (8) and (9), so we do not repeat them here.

Using a derivation similar to Eq. (11), we calculate the expectation of \(T^{\prime }\) as

$$\begin{aligned} E[T^{\prime }]= & {} \sum _{k=0}^{n-2}(1-F_{o}(\tau ))^k(M(\tau )+k \tau F_{o}(\tau )) \nonumber \\&+\,(1-F_{o}(\tau ))^{n-1}(M(\tau _{l})+(n-1)\tau F_{o}(\tau _{l})) \nonumber \\&+\, (1-F_{o}(\tau ))^{n-1}(1-F_{o}(\tau _{l}))((n-1)\tau \nonumber \\&+\,\tau _{l}+E[T_{l}]). \end{aligned}$$
(32)

Unfortunately, it is impossible to give a simplified expression for \(E[T^{\prime }]\) as for (14). Figure 11 shows \(E[T^{\prime }_{2}]\) for various values of \(\tau \) with different \(\tau _{l}\). The bottom of the graph indicates the best timeout combination (parameters in Table 5). Comparing Tables 4 and 5, we find that using individual timeout values for the offloading and the local restart can improve the performance: the minimum of \(E[T^{\prime }_{2}]\) is less than that of \(E[T_{2}]\). But considering the complexity of computing the optimal \(\tau \) and \(\tau _{l}\), this performance increase is expensive. In addition, in real applications it is difficult for a mobile device to estimate the optimal \(\tau \) and \(\tau _{l}\) online, as it requires searching a two-dimensional parameter space. A large amount of time and energy is required for calculating the individual timeout values, but the benefit is limited. Thus, we cannot recommend the use of individual timeouts for the adaptive restart scheme.
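The comparison can be reproduced by a brute-force grid search. The sketch below evaluates the expected completion time for \(n=2\), written out directly from the completion paths of (30) and (31), for an assumed hyperexponential offloading time; setting \(\tau _{l}=\tau \) recovers Eq. (14):

```python
import math

# Grid evaluation of E[T'] for n = 2 under an assumed hyperexponential
# offloading time; all parameters are illustrative
P, LAM, E_TL = [0.9, 0.1], [1.0, 0.01], 20.0

def F_o(t):
    return 1.0 - sum(p * math.exp(-l * t) for p, l in zip(P, LAM))

def M(t):                                   # partial moment int_0^t u f_o(u) du
    return sum(p * ((1.0 - math.exp(-l * t)) / l - t * math.exp(-l * t))
               for p, l in zip(P, LAM))

def E_T_prime(tau, tau_l):
    # first attempt: timeout tau; second attempt: timeout tau_l; then local
    q = 1.0 - F_o(tau)
    return (M(tau)
            + q * (M(tau_l) + tau * F_o(tau_l))
            + q * (1.0 - F_o(tau_l)) * (tau + tau_l + E_TL))

def E_T2(tau):                              # Eq. (14) with n = 2
    q = 1.0 - F_o(tau)
    return M(tau) + q * (M(tau) + tau) + q**2 * (tau + E_TL)

grid = [0.5 * i for i in range(1, 41)]
assert abs(E_T_prime(3.0, 3.0) - E_T2(3.0)) < 1e-9   # tau_l = tau recovers Eq. (14)
best_same = min(E_T2(t) for t in grid)
best_indiv = min(E_T_prime(t, tl) for t in grid for tl in grid)
assert best_indiv <= best_same              # individual timeouts can only help
```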

Fig. 11

Expectation of the task completion time versus \(\tau \) and \(\tau _{l}\)

Table 5 Performance comparison with optimal timeout \(\tau \) and \(\tau _{l}\)

Before moving on to the next section, we briefly review the conclusions obtained in this section.

1 :

Theoretically, infinite offloading restart is the optimal option with the lowest mean completion time. But applying the adaptive restart scheme with two restarts (one offloading and one local) can reach almost \(90\,\%\) of the best performance.

2 :

Although triggering the offloading and the local restart with individual timeout values can slightly reduce the expected completion time, we do not implement this scheme in the experiments due to the high complexity of calculating the optimal individual timeout values.

6 Dynamic restart scheme

The procedure of fitting a theoretical distribution and computing the optimal restart timeout from this distribution is computationally very expensive. Various algorithms and tools exist for fitting PH distributions to empirical data [17, 39, 40, 45], and the fitted distributions approximate the data in many cases very well. For efficiency reasons we use a direct method [33] to estimate \(g(\tau )\) and E[T] from the histogram. We dynamically build and update a histogram and then repeatedly determine the optimal restart timeout as discussed in the following subsections.

6.1 Dynamic histogram

A histogram simply divides up the range of possible observations into intervals, which we call buckets, and counts the number of observations that fall into each bucket. Buckets can have a variable or a constant width; we choose the latter for simplicity. Histograms initially hold too few samples to provide a good approximation of a probability distribution. After collecting data for a while a stationary distribution is represented increasingly well. However, if the distribution changes, old samples will never be dismissed from the histogram and will forever bias the new probability distribution.

There are several options for handling changes in the distribution: the histogram can be repeatedly flushed to build up a new histogram for the respective current state of the system. This introduces many initial periods with insufficient data. Another option is to transform the buckets into dripping buckets that lose samples constantly over time. It is not easy to adjust the dripping speed such that the histogram holds sufficient but not too many samples at all times [25, 30, 35].

We propose a partial flush which is tuned using two parameters, the total number of samples in the histogram when executing the partial flush and the percentage of samples to equally flush from all buckets.


Algorithm 1 shows the algorithm to initialize the histogram prior to run time. The parameters are the following:

\(T_{min}^{o}\)::

The lower bound of the histogram.

\(T_{max}^{o}\)::

The upper bound of the histogram.

\(T_{l}\)::

The task completion time by local execution.

N::

The number of buckets in the histogram.

\(B_{average}{[}i{]}\)::

The mean of all the samples in the ith bucket.

\(N_{B}{[}i{]}\)::

The number of samples in the ith bucket.

\(N_{out}\)::

The number of samples whose value \({>} T_{max}^{o}\).

\(B_{out}\)::

The mean of all the samples \({>} T_{max}^{o}\).

The number of buckets N must be chosen manually. The upper bound of the histogram is determined by the execution time of one local run. The lower bound is given by the execution time of one offloading task. In the course of the experiments, shorter offloading times may occur later; they are used as the new lower bound and additional buckets are inserted. These choices are motivated by the purpose of the histogram: to determine the optimal restart timeout, the precise shape of the distribution in the tail is not needed.


Algorithm 2 shows the algorithm to record a new sample at run time. If the sample comes from local execution, \(T_{l}\) is updated by the mean of its original value and the new sample. Hence, the impact of old samples is reduced and replaced by that of new ones.

If the new sample is produced by offloading, it can be added to the histogram in three ways according to its value. Case 1: when a new sample is larger than \(T_{max}^{o}\), it is added to the out bucket. Case 2: when a shorter offloading time arrives, M additional buckets are inserted, where M is calculated by the ceiling function shown in line 9, and \(T_{min}^{o}\) moves down to include the new sample. Lines 21–24 adjust the mean and index of each original bucket accordingly. Case 3: when the sample falls into the range between \(T_{min}^{o}\) and \(T_{max}^{o}\), it is added to the corresponding bucket of the histogram. Figure 12 illustrates the three cases.
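The three cases can be sketched as follows; the class and attribute names mirror the parameters listed above, and the toy bounds in the usage example are illustrative:

```python
import math

class DynamicHistogram:
    """Sketch of the constant-width dynamic histogram (Algorithms 1-2)."""

    def __init__(self, t_min, t_max, n_buckets):
        self.t_min, self.t_max = t_min, t_max        # T_min^o, T_max^o
        self.width = (t_max - t_min) / n_buckets
        self.counts = [0] * n_buckets                # N_B[i]
        self.means = [0.0] * n_buckets               # B_average[i]
        self.n_out, self.mean_out = 0, 0.0           # N_out, B_out

    def record(self, sample):
        if sample >= self.t_max:                     # case 1: out bucket
            self.n_out += 1
            self.mean_out += (sample - self.mean_out) / self.n_out
            return
        if sample < self.t_min:                      # case 2: extend lower bound
            m = math.ceil((self.t_min - sample) / self.width)
            self.counts = [0] * m + self.counts      # insert M new buckets
            self.means = [0.0] * m + self.means
            self.t_min -= m * self.width
        i = int((sample - self.t_min) / self.width)  # case 3: regular bucket
        self.counts[i] += 1
        self.means[i] += (sample - self.means[i]) / self.counts[i]

h = DynamicHistogram(2.0, 10.0, 8)
h.record(3.1)     # case 3: falls into a regular bucket
h.record(12.0)    # case 1: beyond the upper bound
h.record(0.5)     # case 2: two buckets inserted, T_min^o drops to 0.0
assert (h.counts[0], h.counts[3], h.n_out) == (1, 1, 1)
assert h.t_min == 0.0 and len(h.counts) == 10
```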

Fig. 12

Recording a new offloading sample

The partial flush algorithm, shown as Algorithm 3, needs the two new parameters \(N_{bound}\) and p:

\(N_{bound}\)::

threshold to start the update. When the number of samples stored in the histogram exceeds this value, the update algorithm is triggered.

p::

percentage of samples to be kept. From each bucket, \((100-p)\,n_{i}/100\) samples are removed if the bucket holds a total of \(n_{i}\) samples before the partial flush.


A large threshold \(N_{bound}\) leads to a long sampling period between partial flushes. A large percentage p means that the majority of the samples are kept after updating, which leads to frequent but inexpensive partial flushes. The mechanism is related to hysteresis as used in the control of queueing systems.
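A minimal sketch of the partial flush, assuming per-bucket sample counts; since samples are removed uniformly from every bucket, the bucket means stay valid and need not be touched:

```python
def partial_flush(counts, n_bound, p):
    """Once the histogram holds more than n_bound samples, keep roughly
    p percent of the samples in every bucket (names are illustrative).
    Bucket means are unchanged, because uniform removal preserves them."""
    if sum(counts) <= n_bound:
        return counts
    return [round(c * p / 100) for c in counts]

assert partial_flush([50, 30, 20], 80, 40) == [20, 12, 8]   # flush triggered
assert partial_flush([10, 5], 100, 40) == [10, 5]           # below threshold
```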

6.2 Asymptotically unbiased ratio estimator

The estimate for the optimal restart timeout is based on the asymptotically unbiased ratio estimator [7]. Using the dynamic histogram proposed in Sect. 6.1, an estimator for \(g_{n}(\tau )\) in Eq. (22) is:

$$\begin{aligned} \hat{g}_{n}\left( \tau _{i}\right)= & {} \dfrac{\sum _{j=i}^{N}N_{B}[j] \cdot B_{average}[j] + N_{out} \cdot B_{out}}{\left( \sum _{k=i}^{N}N_{B}[k] + N_{out}\right) \left( 1 - \hat{F_{o}}'\left( \tau _{i}\right) \right) ^{n}} \nonumber \\&- \sum _{l=1}^{n-1}\dfrac{\hat{M}'\left( \tau _{i}\right) +\tau _{i}}{\left( 1 - \hat{F_{o}}'\left( \tau _{i}\right) \right) ^{l}} - \tau _{i} \nonumber \\&- \,T_{min}^{o}\left( 1\!-\!\dfrac{1}{\hat{F_{o}}'\left( \tau _{i}\!\right) }\right) \left( 1\!-\! \dfrac{1}{\left( 1\!-\!\hat{F_{o}}'\left( \tau _{i}\!\right) \!\right) ^{n-1}}\!\right) \nonumber \\ \end{aligned}$$
(33)

We assume that the optimal timeout \(\tau \) only takes on values \(\tau _{i}= i \times \bigtriangleup _{B}\), \(i=1,2,\ldots ,N\), where \(\bigtriangleup _{B}\) denotes the bucket width. The cumulative distribution function \(\hat{F_{o}}'(\tau _{i})\) is estimated as:

$$\begin{aligned} \hat{F_{o}}'(\tau _{i}) = \dfrac{\sum _{j=1}^{i}N_{B}[j]}{\sum _{k=1}^{N}N_{B}[k] + N_{out}} \end{aligned}$$
(34)

The partial moment \(\hat{M}'(\tau _{i})\) is estimated as:

$$\begin{aligned} \hat{M}'(\tau _{i}) = \dfrac{\sum _{j=1}^{i}N_{B}[j] \cdot B_{average}[j]}{\sum _{k=1}^{i}N_{B}[k]} \end{aligned}$$
(35)

If the maximum estimate \(\max _{i}\hat{g}_{n}(\tau _{i}) > T_{l}\), the restart condition (17) is fulfilled. Then, an estimate of E[T] provides the optimal timeout.

$$\begin{aligned} \hat{E}[T]_{\tau _{i}}= & {} \hat{M}'(\tau _{i}) + \sum _{k=1}^{n-1}(1-\hat{F_{o}}'(\tau _{i}))^k(\hat{M}'(\tau _{i})+\tau _{i})\nonumber \\&+\,(1- \hat{F_{o}}'(\tau _{i}))^n(\tau _{i} + T_{l}) + \dfrac{1-(1-\hat{F_{o}}'(\tau _{i}))^n}{\hat{F_{o}}'(\tau _{i})}T_{min}^{o} \nonumber \\ \end{aligned}$$
(36)

Recall that we have shifted all data, and hence the histogram, to the origin. Therefore, the lower bound \(T_{min}^{o}\) of the histogram has to be added to the expectation. The optimal restart timeout is identified by selecting the value of \(\tau _{i}\) that minimizes \(\hat{E}[T]_{\tau _{i}}\); the optimal timeout is then \(\tau = \tau _{i} + T_{min}^{o}\). At run time, the restart condition is evaluated first, and \(\hat{E}[T]_{\tau _{i}}\) is only determined if the condition is satisfied.
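The selection procedure can be sketched on a synthetic histogram as follows. The estimate of E[T] follows the structure of Eq. (14), evaluated with the empirical \(\hat{F_{o}}'\) and \(\hat{M}'\) of (34) and (35) plus the \(T_{min}^{o}\) shift correction; the restart-condition check via (33) is omitted for brevity, and all names and toy data are illustrative:

```python
# counts/means play the roles of N_B and B_average from Sect. 6.1
def f_hat(counts, n_out, i):
    # Eq. (34): empirical CDF at tau_i (i buckets below the candidate timeout)
    return sum(counts[:i]) / (sum(counts) + n_out)

def m_hat(counts, means, i):
    # Eq. (35): empirical partial moment below tau_i
    n = sum(counts[:i])
    return sum(c * m for c, m in zip(counts[:i], means[:i])) / n if n else 0.0

def best_timeout(counts, means, n_out, width, t_min, t_l, n=2):
    # Estimate E[T] on the bucket grid; the winning tau is shifted back by t_min
    best_e, best_tau = float("inf"), None
    for i in range(1, len(counts) + 1):
        f = f_hat(counts, n_out, i)
        if f == 0.0:
            continue                         # no mass below this candidate yet
        tau, m, q = i * width, m_hat(counts, means, i), 1.0 - f
        e = (m + sum(q**k * (m + tau) for k in range(1, n))
             + q**n * (tau + t_l) + (1.0 - q**n) / f * t_min)
        if e < best_e:
            best_e, best_tau = e, tau + t_min
    return best_tau

# bimodal toy histogram: most offloads finish fast, a few are very slow
tau_opt = best_timeout([80, 0, 0, 0, 20], [0.5, 0.0, 0.0, 0.0, 4.5],
                       n_out=0, width=1.0, t_min=1.0, t_l=3.0)
assert tau_opt == 2.0   # cut right after the fast cluster
```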

7 Comparison in practical application

In order to observe and analyse the performance of the adaptive restart scheme in a real application, we designed an experiment that compares five restart modes under different thresholds in the same scenario. Figure 13 shows the user interface of the offloading engine, which implements the dynamic adaptive restart scheme.

Fig. 13

User interface of the offloading engine implemented with the dynamic adaptive restart scheme

7.1 Experiment environment

In order to imitate a scenario where the network connection is unreliable, we walk around the main building on our campus carrying the mobile device. The blue line in Fig. 14 shows the route of the device. As shown by the scale in the upper right corner of Fig. 14, the area of this teaching building is about \(300 \times 200\, \hbox {m}^{2}\) and seamlessly covered by a wireless network. When the mobile device is moving, it has to hand off frequently between different Wi-fi access points (APs). The time interval for switching from one AP to the next depends on several factors, e.g. the moving velocity of the device, the idle capacity of the next AP and the number of available APs near the device. In our experiment, however, we assume that this interval is a random variable. During the interval, the wireless network is unavailable for the mobile device. Thus, the wireless network connection between the mobile device and the server is unreliable while the device is moving.

Fig. 14

Plan of the teaching building

The same OCR offloading task introduced in Sect. 3 has been repeated successively while the mobile device was moving. The results were stored in a text file on the mobile device. The memory of the mobile phone used for caching was cleared after each task completion and reused in the next task.

7.2 Experimental results

In our experiment, we started from point A as shown in Fig. 14. For initializing the dynamic histogram, we stayed there for 5 min. Then, we walked 10 min to point B. A chronograph was used to measure the time in the experiment. Although we could not reach point B exactly on time, we guaranteed that the deviation was no more than 15 s. We then rested at point B for 5 min. During the break, the mobile device stayed connected to the same Wi-fi AP, which covers the area around point B. Even when we moved within the \(5\times 5 \,\hbox {m}^{2}\) area around B, the mobile device did not hand off to another AP. Thus, during this time, the mobile device kept a reliable network connection to the server. The reason we stayed was to observe the impact of restart on a good network connection in the offloading system.

After the break, we spent 10 min again walking along another route back to point A as shown by the arrow in Fig. 14. The total time of one walk was 30 min. We compared five different restart modes:

A::

No restart.

B::

Infinite offloading restart, \(n\rightarrow \infty \).

C::

Exclusively local restart, \(n = 1\).

D::

One offloading restart \(+\) local restart, \(n=2\).

E::

Two offloading restarts \(+\) local restart, \(n=3\).

We walked along the same route six times for each scheme and added up the results. The throughput of the five schemes over periods of 5 min is shown in Fig. 15. We define the throughput as the number of tasks completed in each period. For each scheme, the numbers of tasks completed by the original offloading (oof), the restarted offloading and the local execution are marked individually in Fig. 15. Surprisingly, the experimental result shows that infinite offloading restart is not optimal, as its throughput is less than that of the other three restart schemes C, D and E. The explanation for this phenomenon is that when the mobile device is changing the Wi-fi AP, the hand-off process sometimes requires tens of seconds. In particular, if the next access point already serves a large number of users, the newly arriving mobile device is hardly assigned sufficient resources to build a reliable connection. Connecting to a heavily loaded AP means that the hand-off time may extend to several minutes. During this time, repeatedly restarting the offloading cannot speed up the task completion. Under a slow hand-off, local restart can at least guarantee task completion.

Fig. 15

Throughput of different times in a day

Figure 15 also demonstrates that the throughput of scheme C is lower than that of schemes D and E. This observation follows Theorem 2 in Sect. 5. However, the throughput of scheme D is higher than that of scheme E, because in practical applications successive offloading tries are in many cases not independent. Generally, the failure of the first offloading restart indicates a high probability of failure of successive offloading restarts. Our experiments give a strong indication of the correlation between successive restarts, at least when network quality is poor. Due to its complexity, we did not consider this correlation in our theoretical analysis. In conclusion, the adaptive restart with one offloading retry and one local restart is in this scenario the optimal scheme to increase system performance.

8 Conclusion

In this paper we have extended our previous local restart scheme introduced in [46]. We have proposed a new adaptive restart scheme to improve the performance of mobile offloading systems. Before a task is terminated by local restart on the mobile device, the adaptive scheme first restarts it with several offloading attempts. When the number of offloading restart tries exceeds a given threshold, the local restart is launched. Restarting the task at the appropriate moment can reduce its overall completion time.

First, we introduced an experiment to illustrate the impact of network delays on mobile offloading. Then, we mathematically derived the optimal threshold and timeout in order to reduce the task completion time. We proposed a dynamic scheme to implement the adaptive restart for the mobile offloading system. In this scheme, a dynamic histogram is used to track the variation of the network quality, and the restart condition and the optimal time are estimated using the histogram.

By theoretically comparing the performance of different numbers of offloading retries, infinite offloading restart is proved to perform best, given that successive tries are independent and follow the same probability distribution. However, in practice a limited number of offloading restarts is preferred. Therefore, we have replaced infinite tries with a final local execution. We experimentally explore how many offloading restarts should precede the final local execution to optimize the system throughput (and hence the job completion time). We find that after one failed offload attempt, local execution leads to higher throughput than another offload try. We assume that this effect indicates correlation of successive tries, since most network problems have longer duration. Besides the correlation of successive delays, accelerating the adaptation of the restart configuration by quickly updating threshold and timeout also improves the system performance, and the frequency of tuning the histogram has an impact on this update process. In future work we will include correlation, the speed of adaptation and the histogram update frequency in our theoretical study to better predict the expected benefit of repeated offloading tries.