Large amounts of real-time data are generated in networks with the development of smart homes, smart transportation, social media, the Internet of Things and Internet applications. These data must be collected, processed and analysed in real time so that the processing results can be delivered with sub-second delay. The requirement to process such real-time data spawned the concept of stream computing. Stream computing is an in-memory computing paradigm that runs in a distributed environment and is therefore highly susceptible to system failures [1].

Once a stream processing system fails, it must be brought back to normal operation as soon as possible; otherwise, the final results may be worthless because they arrive too late. The fault-tolerance mechanism of stream computing systems is therefore an important research field that has received considerable attention from both academia and industry. So far, mainstream stream processing systems generally use active or passive backup to enhance their reliability. Active backup can achieve seamless failure recovery by switching among active standby tasks, but it doubles resource consumption, so the cost becomes very high in large-scale applications. Because of this high cost, checkpoint-based passive backup mechanisms, which are far less expensive, have become very popular.

Flink [2] is an open-source framework for distributed, high-performance stream processing applications. Compared with other stream processing engines such as Storm [3] and Spark Streaming [4], Flink supports both stream and batch processing, offers real-time data processing with better throughput and exactly-once processing semantics, and allows flexible deployment and scaling. Therefore, it is widely used in production environments. A comparative study with existing distributed stream processing engines is given in [5].

Flink implements a lightweight asynchronous checkpoint based on the barrier mechanism to ensure high availability and efficiency. Choosing an optimal checkpoint interval is critical for checkpoint-based stream processing systems to keep streaming applications efficient: a shorter checkpoint interval increases the runtime cost of the streaming application, whereas a longer checkpoint interval increases the failure recovery time. Therefore, users need to balance the additional cost incurred during failure-free operation against the cost of failure recovery to obtain the best quality of service.

The frequent alignment of barrier markers affects the data processing delay, especially under a heavy workload with exactly-once processing semantics. However, Flink requires users to fix the checkpoint interval before the deployment phase, so a user may set an inappropriate interval because of an incorrect estimate of the incoming load characteristics, which adversely affects the quality of service of the streaming application. To solve the above problems, we propose a new model whose main steps are as follows:

  1. A performance model based on Jackson's open queuing network is used to estimate the tuple processing latency under different workloads and checkpoint intervals.

  2. A recovery model is used to estimate the fault recovery time under different workloads and checkpoint intervals.

  3. A checkpoint interval optimisation method based on the above models is used to calculate an optimised checkpoint interval from the system failure rate.

  4. Finally, experiments are conducted on Flink to verify the effectiveness and efficiency of our models and the checkpoint interval optimisation method.

1 Related work

Active backup and checkpoint-based passive backup are two fault-tolerance techniques widely used in distributed stream processing systems [6]. Active backup requires at least one standby instance of each task so that, when a failure occurs, the system can switch from the primary task to its backup task. This ensures the shortest fault recovery time, but it also incurs high resource consumption. Therefore, in the past, active backup was only suitable for stream processing systems deployed on a small number of servers [7].

As the volume of stream data increases, the stream processing system is getting more and more complex, and the fault-tolerance mechanism based on active backup will become inefficient and may even be unavailable [8]. In contrast, resource efficiency can be significantly increased by using checkpoint-based fault-tolerance mechanisms. Therefore, in recent years, a large number of research studies analysed checkpoint-based fault-tolerance mechanisms used in distributed stream processing platforms. Sebepou et al. [9] introduced a state partitioning mechanism that supported incremental state checkpoints with minimal pausing of operator processing, thereby reducing the update cost of checkpoints. Zaharia et al. [10] proposed a new method to selectively write memory checkpoints on stable storage, thereby reducing the cost of saving checkpoints. Castro et al. [11] and Heinze et al. [12] proposed a checkpoint fault-tolerance method combined with upstream backup to reduce the recovery cost of global checkpoints. Su et al. [13] proposed a checkpoint fault-tolerance method that combined partial active backup with passive backup, thereby taking advantage of active backup to reduce rollback-recovery time. In summary, we can say that checkpoint-based fault-tolerance mechanisms are getting more and more attention in distributed stream processing systems.

When checkpoint-based fault-tolerance mechanisms are used, an optimal checkpoint interval is the key to ensuring high efficiency of the submitted stream processing applications. The optimisation of the checkpoint interval has been extensively studied in the field of high-performance computing. Young et al. [14] used a first-order approximation to calculate an optimised checkpoint interval. In 2006, Daly et al. [15] extended the first-order approximation and proposed a higher-order approximation scheme that takes the cost of failure recovery into account. Fernandez et al. [11] demonstrated the effect of the checkpoint interval on delay and fault recovery time through experiments and proposed a method to set the checkpoint interval based on the estimated fault frequency and performance constraints. Liu et al. [16] and Jin et al. [17] further optimised the checkpoint interval setting for different fault distributions and execution scales based on classical checkpoint interval optimisation methods. Only a few studies have explored and analysed optimal checkpoint interval settings in the field of stream computing. Liu Zhiliang et al. [18] designed a protocol for dynamic adjustment of checkpoint intervals in sudden-flow scenarios through coordination between upstream and downstream nodes. Zhuang et al. [19] proposed the Optimal Checkpoint Interval Model to dynamically alter the checkpoint interval based on the online workload.

Previous research assumed that the checkpoint time is a constant, so the checkpoint interval only affects the number of checkpoints. In a real system, however, the checkpoint time is directly related to the checkpoint interval. To obtain a more optimal checkpoint interval under different workload intensities, we propose a performance model to estimate the tuple processing latency and a recovery model to estimate the fault recovery time.

2 Modelling and analysis

2.1 System model

In a stream processing system, each stream processing job can be abstracted as a directed acyclic graph (loops are not considered), denoted G = <V, E>, where V is the set of operators and E describes the direction of data flow between operators. Since each operator in the topology may have multiple parallel instances at runtime, this paper uses the vector n = (n1, n2, …, n|V|) to represent the parallelism of the operators, i.e., there are ni instances of operator Vi.
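
To make the notation concrete, the following sketch shows one possible way to represent such a job graph and its parallelism vector in Python; the operator names, parallelism values and rates are illustrative assumptions rather than values taken from this paper.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str          # operator V_i
    parallelism: int   # n_i parallel instances
    mu: float          # per-instance processing rate (tuples/s)

@dataclass
class JobGraph:
    operators: dict = field(default_factory=dict)  # name -> Operator (set V)
    edges: list = field(default_factory=list)      # (src, dst) data-flow edges (set E)

    def add_operator(self, op: Operator) -> None:
        self.operators[op.name] = op

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

# Hypothetical three-operator pipeline: source -> map -> sink, n = (1, 4, 1).
g = JobGraph()
for name, n, mu in [("source", 1, 1000.0), ("map", 4, 300.0), ("sink", 1, 800.0)]:
    g.add_operator(Operator(name, n, mu))
g.add_edge("source", "map")
g.add_edge("map", "sink")
```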

Flink uses barrier markers to achieve consistent asynchronous checkpoints. The barrier markers, which are special signals commanding the operators to save their state, are periodically injected into the input data streams by Flink's JobManager and are pushed through the whole stream processing graph.

As shown in Fig. 1, when an operator task has received the barrier markers from all of its predecessor tasks, it takes a snapshot of its current state and broadcasts the barrier marker to all of its successor tasks. Flink supports incremental snapshots and allows asynchronous state snapshots at low cost [20]. As noted above, to guarantee correct results under exactly-once processing, the operator task must wait until the barrier markers from all of its predecessor tasks are aligned before saving the snapshot. An input connection whose barrier marker arrives earlier is blocked, and its tuples are cached. This phase is the alignment phase of the barrier markers and contributes significantly to the overhead of a checkpoint. We therefore divide the checkpoint overhead into two parts: a fixed overhead and an alignment overhead.
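
For reference, the sketch below shows how a user would fix the checkpoint interval and the exactly-once mode before deployment. It assumes the PyFlink DataStream API; import paths and method names may differ between Flink versions, so treat it as an illustration rather than authoritative configuration code.

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint interval in milliseconds; this is the value our model optimises.
env.enable_checkpointing(10_000)

# Exactly-once mode triggers the barrier alignment analysed in this section.
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
```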

Fig. 1

Checkpoint Execution Flow Diagram

2.2 Performance model based on Jackson network

When the checkpoint mechanism is not enabled, the arrival rate of tuples at operator Vi is denoted by λi and the processing rate of operator Vi is denoted by μi. We assume that both the inter-arrival times of external tuples and the service times of the operators are independently and identically distributed exponential random variables, and that load balancing is achieved in every operator. The workload is classified based on statistical features of the traffic flows [21, 22], and statistical tests are then applied to check whether the load characteristics of the application conform to these distribution assumptions [23, 24].
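
As an example of such a check, the sketch below applies a Kolmogorov-Smirnov test to recorded inter-arrival times against a fitted exponential distribution; the sample data and the use of SciPy are assumptions made for illustration, not the tests performed in our measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical inter-arrival times (seconds) sampled from the source stream.
inter_arrivals = np.random.default_rng(0).exponential(scale=1 / 500.0, size=10_000)

# Estimate the rate from the sample and test the exponential fit.
scale_hat = inter_arrivals.mean()                      # estimate of 1 / lambda
stat, p_value = stats.kstest(inter_arrivals, "expon", args=(0, scale_hat))

# At a 5% significance level, a p-value above 0.05 means the exponential
# assumption behind the M/M/n model is not rejected for this sample.
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.4f}")
```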

Since the superposition of independent Poisson streams is again a Poisson stream, the intensities of multiple inputs can simply be added. Therefore, each operator can be modelled as an M/M/n queue, and the entire stream topology is modelled as an open Jackson queuing network. When the system reaches a steady state, the average sojourn time E[Ti] of operator Vi consists of two parts: (1) the expected queuing delay, denoted by E[Qi](M/M/ni), and (2) the expected processing time, which equals 1/μi. Therefore, E[Ti] can be calculated using Eq. 1.

$$ \mathrm{E}\left[{T}_i\right]\left({n}_i\right)=\mathrm{E}\left[{Q}_i\right]\left(\mathrm{M}/\mathrm{M}/{n}_i\right)+\frac{1}{\mu_i} $$
(1)

According to the Erlang delay formula, the expected queuing delay in the M/M/n service is calculated by Eq. 2.

$$ \mathrm{E}\left[{Q}_i\right]\left(\mathrm{M}/\mathrm{M}/{n}_i\right)=\left\{\begin{array}{c}\frac{\pi_0{\left({n}_i{\rho}_i\right)}^{n_i}}{n_i!{\left(1-{\rho}_i\right)}^2{\mu}_i{n}_i},{\rho}_i<1\ \\ {}\kern3.25em +\infty, {\rho}_i\ge 1\end{array}\right. $$
(2)

In the above formula, \( {\rho}_i=\frac{\lambda_i}{n_i{\mu}_i} \) denotes the resource utilisation of operator Vi. Clearly, when ρi ≥ 1, the processing rate cannot keep up with the incoming workload and the average tuple queuing delay grows without bound. The probability that there is no tuple in the system in the steady state is denoted by π0, which can be calculated by Eq. 3.

$$ {\pi}_0={\left(\sum \limits_{l=0}^{n_i-1}\frac{{\left({n}_i{\rho}_i\right)}^l}{l!}+\frac{{\left({n}_i{\rho}_i\right)}^{n_i}}{n_i!\left(1-{\rho}_i\right)}\right)}^{-1} $$
(3)
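
The sketch below is a direct, plain-Python transcription of Eqs. 1-3 under the stated M/M/n assumptions; it is meant only to make the formulas executable, not to reproduce any Flink internals.

```python
from math import factorial, inf

def erlang_queue_delay(lam: float, mu: float, n: int) -> float:
    """Expected queuing delay E[Q_i] of an M/M/n queue (Eqs. 2 and 3)."""
    rho = lam / (n * mu)
    if rho >= 1:
        return inf  # the operator cannot keep up with the load
    # pi_0: probability of an empty system in steady state (Eq. 3)
    pi0 = 1.0 / (sum((n * rho) ** l / factorial(l) for l in range(n))
                 + (n * rho) ** n / (factorial(n) * (1 - rho)))
    # Erlang delay formula (Eq. 2)
    return pi0 * (n * rho) ** n / (factorial(n) * (1 - rho) ** 2 * mu * n)

def sojourn_time(lam: float, mu: float, n: int) -> float:
    """Average sojourn time E[T_i] = E[Q_i] + 1/mu of operator V_i (Eq. 1)."""
    return erlang_queue_delay(lam, mu, n) + 1.0 / mu
```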

Based on the theory of open Jackson queuing networks, the average processing delay E[T](n) of the entire queuing network can be obtained as the weighted sum of the average processing delays E[Ti](ni) of the individual operators, where λ0 denotes the external arrival rate of the job:

$$ \mathrm{E}\left[T\right]\left(\boldsymbol{n}\right)=\frac{1}{\lambda_0}\sum \limits_{i=1}^{\mid V\mid }{\lambda}_i\mathrm{E}\left[{T}_i\right]\left({n}_i\right) $$
(4)
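
Building on the sojourn_time helper above, the following sketch evaluates Eq. 4 for a whole topology; the arrival and service rates in the example are assumed values.

```python
def network_delay(operators, lam0: float) -> float:
    """Average end-to-end delay E[T](n) of the open Jackson network (Eq. 4).

    `operators` is a list of (lam_i, mu_i, n_i) tuples, one per operator V_i,
    and `lam0` is the external arrival rate of the job.
    """
    return sum(lam * sojourn_time(lam, mu, n) for lam, mu, n in operators) / lam0

# Hypothetical pipeline in which every tuple visits all three operators.
lam0 = 400.0
ops = [(lam0, 1000.0, 1), (lam0, 300.0, 4), (lam0, 800.0, 1)]
print(f"E[T](n) = {network_delay(ops, lam0) * 1000:.2f} ms")
```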

When checkpointing is enabled in Flink, the probability distributions of the tuple inter-arrival times and the operator processing times do not change. According to the analysis in Section 2.1, saving the state snapshot to persistent storage in Flink can be performed asynchronously, so the actual backup overhead is small; its average delay is denoted by tc. However, if an operator has multiple upstream operators, it must, to guarantee exactly-once semantics, wait for the barriers from all upstream operators before taking an asynchronous snapshot of its current state. This barrier alignment operation forces tuples that arrive earlier to wait in the queue, which has a considerable impact on the tuple latency. Therefore, the effect of the alignment operation on E[Qi](M/M/ni) is modelled below.

From Flink's checkpoint mechanism, the following can be concluded: when the system is in a steady state, if the overhead of the checkpoint alignment phase of a sub-instance Oi is Δt, then the additional time a tuple flowing into Oi waits because of the checkpoint operation does not exceed Δt [25]. In this paper, the correctness of this result is proved only for the case of two upstream instances; when there are more than two upstream instances, a similar analysis can be performed.

As shown in Fig. 2, the two upstream tuple flows of the operator instance Oi are s1 and s2, which are Poisson processes with parameters λ1 and λ2, respectively. We assume that the barrier of s1 reaches Oi before that of s2, so Oi enters the alignment phase earlier on s1. Within the alignment interval Δt, Oi caches approximately λ1 ∗ Δt tuples from s1, denoted as \( \left({e}_1,{e}_2,\dots {e}_{\left|{\lambda}_1\Delta t\right|}\right) \), where e1 is the first tuple cached in the s1 buffer queue. At the end of the alignment phase, Oi immediately passes the barrier downstream, saves the current snapshot asynchronously, and processes the tuples in the buffer queues. Clearly, e1 arrives first, and its waiting time is increased by Δt. If a subsequent tuple ei enters the buffer queue Δl later than e1, then ei actually waits only (Δt − Δl) in the alignment phase. However, when the cached tuples are processed, ei first needs to wait for the preceding (λ1 ∗ Δl) tuples, whose processing delay is approximately \( \frac{\lambda_1\ast \Delta l}{\mu_i}<\frac{\lambda_i\ast \Delta l}{\mu_i}\le \Delta l \). Therefore, ei waits approximately \( \left(\Delta t-\Delta l+\frac{\lambda_1\Delta l}{\mu_i}\right) \), which is smaller than Δt.

Fig. 2

Task level queuing delay analysis

It is assumed that the overheads of the ni instances of operator Vi for the checkpoint alignment operation are t1, t2, … \( \mathrm{and}\ {t}_{n_i} \), respectively. Fig. 3 illustrates the increase of the tuple waiting time during one alignment of the operator. Based on the above conclusion, the average additional waiting time of all tuples cached in the buffer queues of the operator does not exceed the average of the per-instance alignment overheads, \( \frac{\sum \limits_{k=1}^{n_i}{t}_k}{n_i} \).

Fig. 3

Operator level queuing delay analysis

Let \( \mathrm{E}\left[{align}_{V_i}\right] \) denote the average additional waiting time of the tuples caused by the barrier alignments; it is a function of the checkpoint interval, written \( \mathrm{E}\left[{align}_{V_i}\right]\left(\mathrm{interval}\right) \). In the M/M/n queuing model, a tuple in each operator experiences approximately \( \frac{\mathrm{E}\left[{Q}_i\right]\left(\mathrm{M}/\mathrm{M}/{n}_i\right)}{interval} \) alignments, so \( \mathrm{E}\left[{align}_{V_i}\right]\left(\mathrm{interval}\right) \) can be estimated according to Eq. 5.

$$ \mathrm{E}\left[{align}_{V_i}\right]\left(\mathrm{interval}\right)=\frac{\left(\frac{\pi_0{\left({n}_i{\rho}_i\right)}^{n_i}}{n_i!{\left(1-{\rho}_i\right)}^2{\mu}_i{n}_i}\right)\ast \frac{\sum \limits_{k=1}^{n_i}{t}_k}{n_i}}{interval} $$
(5)

In summary, once the checkpoint mechanism is enabled in Flink, E[Qi](M/M/ni) can be approximated using Eq. 6.

$$ \mathrm{E}\left[{Q}_i\right]\left(\mathrm{M}/\mathrm{M}/{n}_i\right)=\left\{\begin{array}{c}\frac{\pi_0{\left({n}_i{\rho}_i\right)}^{n_i}}{n_i!{\left(1-{\rho}_i\right)}^2{\mu}_i{n}_i}+\mathrm{E}\left[{align}_{V_i}\right]+{t}_c,{\rho}_i<1\ \\ {}\kern12em +\infty, {\rho}_i\ge 1\end{array}\right. $$
(6)
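
Continuing the earlier sketch, the function below adds the alignment overhead of Eq. 5 and the fixed backup overhead tc to the queuing delay, as in Eq. 6. The per-instance alignment overheads t1…tn are treated as inputs that would have to be measured on the cluster.

```python
# Uses erlang_queue_delay and inf from the earlier sketch.
def queue_delay_with_checkpoint(lam, mu, n, interval, align_overheads, t_c):
    """E[Q_i] with checkpointing enabled (Eqs. 5 and 6).

    `interval` is the checkpoint interval, `align_overheads` the measured
    alignment overheads t_1..t_n of the n instances, and `t_c` the average
    delay of the asynchronous backup itself.
    """
    base = erlang_queue_delay(lam, mu, n)
    if base == inf:
        return inf
    mean_align = sum(align_overheads) / n      # average per-instance overhead
    e_align = base * mean_align / interval     # Eq. 5
    return base + e_align + t_c                # Eq. 6
```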

2.3 Checkpoint recovery model

JobManager and TaskManager are the two main daemons (background processes) of the system. The JobManager is responsible for both scheduling and failure recovery; it monitors the heartbeat information of each TaskManager through the Actor System to detect whether a fault has occurred. The entire Flink fault recovery process can be inspected by tracing Flink's log information.

When the target job is running normally, the subtasks of the job are in the running state. If the job has checkpointing and a restart policy enabled and the JobManager detects a failure, the status of the job changes to Failing, and the status of each task in the job switches to cancelling or cancelled. The JobManager then restarts the job, and the Checkpoint Coordinator in the JobManager restores the state of each task from the most recently completed checkpoint. Next, each subtask of the job is rescheduled, dispatched to a new slot and switched to the running state. At this point, the JobManager has restored the failed job to the state of the latest checkpoint. Finally, the job reprocesses the records from Kafka, starting from the offsets that were stored in the checkpoint.

From the above failure recovery process, the overhead of checkpoint recovery Trecover mainly consists of the following four parts: the fault detection time Tdetect, the state recovery time Trestore, the task restart time Trestart and the tuple reprocessing time Treprocess.

$$ {T}_{recover}={T}_{detect}+{T}_{restore}+{T}_{restart}+{T}_{reprocess} $$
(7)

The fault detection time depends mainly on the heartbeat timeout period, whose value is set according to the actual needs. The state recovery time depends mainly on the size of the checkpoint state and the network bandwidth; since the checkpoint state does not exceed 10 MiB in our experiments, the state recovery time is negligible. The task restart time is related to the experimental platform and the scale of the application topology, and the tuple reprocessing time depends mainly on the number of tuples to be replayed, which is closely related to both the length of the checkpoint interval and the intensity of the workload. This paper assumes that the workload intensity is λ. In a pipeline-style job, the throughput of the job is bounded by the operator with the smallest throughput, denoted min{μi}. On average, the number of reprocessed tuples can be expressed as λ ∗ interval/2, so Treprocess can be approximated by Eq. 8.

$$ {T}_{reprocess}=\frac{\lambda \ast interval}{2\ast \min \left\{{\mu}_i\right\}} $$
(8)
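
A minimal sketch of Eqs. 7 and 8 under these assumptions; the detection, restore and restart times are treated as measured inputs, and the numeric values in the example are assumptions.

```python
def reprocess_time(lam: float, interval: float, mu_min: float) -> float:
    """Expected tuple reprocessing time after a failure (Eq. 8)."""
    return lam * interval / (2.0 * mu_min)

def recovery_time(t_detect, t_restore, t_restart, lam, interval, mu_min):
    """Total checkpoint recovery time T_recover (Eq. 7)."""
    return t_detect + t_restore + t_restart + reprocess_time(lam, interval, mu_min)

# Assumed values: ~30 s of interval-independent recovery cost plus the replay.
print(recovery_time(t_detect=10.0, t_restore=0.0, t_restart=20.0,
                    lam=400.0, interval=60.0, mu_min=800.0))
```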

2.4 Optimised checkpoint interval method

When configuring the checkpoint interval, both the overhead of completing checkpoints and the fault recovery time should be considered. To make this trade-off, the two parts are weighted and combined into the evaluation indicator F, where f_rate represents the failure rate of the cluster. The checkpoint interval that minimises F can be approximated based on Eq. 9.

$$ F=\alpha \ast \mathrm{E}\left[T\right]\left(\boldsymbol{n}\right)+\left(1-\alpha \right)\ast f\_ rate\ast {T}_{recover} $$
(9)

In the above formula, the weight α has a value range of (0, 1). When the value of the weight α tends to 1, it indicates that the user is more concerned with the application performance. When the weight α tends to 0, it indicates that the user is more concerned with the application recovery time.
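
Putting the pieces together, a simple grid search over candidate intervals can approximate the interval that minimises F in Eq. 9. The sketch below reuses the helpers from the earlier sketches; the candidate intervals, α, f_rate and all numeric parameters are illustrative assumptions.

```python
def indicator_F(interval, alpha, f_rate, operators, lam0, t_c,
                t_detect, t_restore, t_restart, mu_min):
    """Evaluation indicator F for one candidate checkpoint interval (Eq. 9).

    `operators` is a list of (lam_i, mu_i, n_i, align_overheads_i) tuples.
    """
    e_t = sum(lam * (queue_delay_with_checkpoint(lam, mu, n, interval, aligns, t_c)
                     + 1.0 / mu)
              for lam, mu, n, aligns in operators) / lam0
    t_rec = recovery_time(t_detect, t_restore, t_restart, lam0, interval, mu_min)
    return alpha * e_t + (1 - alpha) * f_rate * t_rec

# Grid search over candidate checkpoint intervals (seconds).
candidates = [5, 10, 20, 30, 60, 120, 300]
ops = [(400.0, 1000.0, 1, [0.0]),                # source: no upstream alignment
       (400.0, 300.0, 4, [0.2, 0.2, 0.3, 0.25]),
       (400.0, 800.0, 1, [0.1])]
best = min(candidates, key=lambda iv: indicator_F(
    iv, alpha=0.5, f_rate=0.01, operators=ops, lam0=400.0, t_c=0.005,
    t_detect=10.0, t_restore=0.0, t_restart=20.0, mu_min=800.0))
print(f"recommended checkpoint interval: {best} s")
```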

3 Experimental results and analysis

3.1 Experimental environment

To verify the accuracy of the checkpoint performance model and the checkpoint interval optimisation model in a real system, several experiments are performed on both a simple topology and a complex topology. We use Apache Flink as the stream processing platform, Apache Kafka as the intermediate message queue, Apache Zookeeper as the distributed coordination service, and Apache Hadoop and RocksDB as the state backend for saving the snapshot state. The hardware and software environments are shown in Tables 1 and 2, respectively, and the experimental topology is shown in Fig. 4. In our experiments, the input uses simulated data whose inter-arrival times approximately follow a negative exponential distribution with parameter λ. The operator processing logic is replaced with a processing delay whose duration approximately follows a negative exponential distribution with parameter μ.
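
The sketch below shows one way such a synthetic workload could be generated; the record format, rates and use of NumPy are assumptions for illustration and not the exact generator used in our experiments.

```python
import json
import time
import numpy as np

rng = np.random.default_rng(seed=42)

def generate_records(lam: float, mu: float, count: int):
    """Yield synthetic records with Exp(lam) inter-arrival gaps.

    Each record carries a 'service_time' drawn from Exp(mu); the operator
    sleeps for that duration instead of doing real processing.
    """
    for i in range(count):
        time.sleep(rng.exponential(1.0 / lam))   # inter-arrival gap
        yield json.dumps({"id": i,
                          "ts": time.time(),
                          "service_time": rng.exponential(1.0 / mu)})

# Example: a 400 tuples/s source whose operator work takes ~1/300 s per tuple.
for record in generate_records(lam=400.0, mu=300.0, count=5):
    print(record)
```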

Table 1 Cluster hardware configuration
Table 2 Software configuration
Fig. 4

Experimental topology

3.2 Experiments and analysis

3.2.1 Checkpoint performance model verification

The accuracy of the checkpoint performance model is verified with a set of experiments that examine the relationship between the tuple processing delay and the checkpoint interval under different resource utilisation rates. The processing time of the tuples follows a negative exponential distribution with parameter μ set to 100, 300 and 500, respectively. Each configuration was run five times on the cluster, and the average tuple latency over the five runs was recorded. Since it is difficult to accurately control parameters such as λ and μ in the Flink system, these parameters also need to be adjusted according to the actual cluster. Figures 5 and 6 show the observed and theoretical results of the simple topology and the complex topology, respectively, with parallelism 4 and 1.

Fig. 5

Relationship between latency and checkpoint interval of pipeline topology

Fig. 6

Relationship between latency and checkpoint interval of complex topology

As shown in Fig. 5, the Y-axis represents the average tuple latency under different workload intensities and operator utilisations, while the X-axis represents the checkpoint interval. The curve labelled 'Measured' shows the real average tuple latency measured during the experiment, while the curve labelled 'Estimated' shows the average tuple latency calculated by the checkpoint performance model. The experimental results show that, for every workload intensity λ and resource utilisation ρ, the average tuple latency declines gradually as the checkpoint interval increases. When the checkpoint interval is short, the latency decreases more rapidly and by a larger amount; each increase of the checkpoint interval reduces the latency by approximately 1% to 10%. As the checkpoint interval continues to increase, the decline in latency slows down significantly.

The measurements for the complex topology in Fig. 6 show similar characteristics. It should be noted that, in this model, the effect of checkpoint operations on the tuple processing delay depends not only on the checkpoint interval but also on the duration of a checkpoint alignment. In the complex topology experiment, since the parallelism of the topology is set to 1, there is only one alignment operation at the join operator, and the checkpoint interval has less influence on the average processing delay. By comparing the theoretical and experimental results, it can be concluded that the observed data conform to the estimates of our performance model.

3.2.2 Checkpoint recovery model verification

Experiments are also conducted to verify the checkpoint recovery model. In these experiments, faults are injected into the TaskManager to trigger Flink's fault-tolerance mechanism, and a large amount of data is collected under different conditions. As shown in Fig. 7, the Y-axis represents the reprocessing time under different operator utilisations, while the X-axis represents the checkpoint interval. The bars labelled 'Measured' show the real reprocessing time measured during the experiments under different workload intensities, while the bars labelled 'Estimated' show the reprocessing time calculated by the checkpoint recovery model. In these experiments, the service times of the operators are homogeneous; therefore, as can be seen from Eq. 8, the reprocessing time is only related to the checkpoint interval and the operator utilisation. The results confirm this conclusion and show that, under different checkpoint intervals, the average reprocessing time measured over multiple experiments is very close to the reprocessing time estimated by our model.

Fig. 7

Relationship between reprocessing time and checkpoint interval

3.2.3 Optimal checkpoint interval method validation

In this set of experiments, the optimal checkpoint interval recommended by the indicator F of Eq. 9 is observed while adjusting the failure rate f_rate and the weight α. According to the actual situation of the cluster, the time required for fault detection, state recovery and topology restart, excluding the reprocessing time, is about 30 s. In Fig. 8, the red arrows indicate the recommended optimal checkpoint interval.

Fig. 8

The calculation of the optimised checkpoint interval

The experimental results show that when the failure rate is small (e.g., 0.1%), the minimum of F is reached at a longer checkpoint interval; as the failure rate increases (up to 10%), the minimum of F moves to a shorter checkpoint interval. Comparing Fig. 8b and d shows that, for the same failure rate, the recommended checkpoint interval increases with the input rate λ: as λ increases, the checkpoint interval has a significant impact on the tuple processing delay, especially when the utilisation ρ is close to 1. The influence of the weight α on the recommended checkpoint interval can be observed by analysing Fig. 8d, e and f. When α tends to 0, the user is more concerned about the fault recovery time, and the recommended checkpoint interval becomes shorter to guarantee rapid recovery when a fault occurs; when α tends to 1, the user is more concerned about the impact of checkpointing on the processing delay, and the recommended checkpoint interval increases accordingly. In summary, across different application parameters, the proposed checkpoint interval optimisation model is reliable, reflects user requirements, and can be used to recommend a better checkpoint interval.

4 Conclusion

This paper analyses the fault-tolerance mechanism of Flink's lightweight asynchronous checkpoint and models its performance with a Jackson queuing network. The experimental results show that the latency estimated by our Jackson-network-based checkpoint performance model follows the processing latency of the Flink system as the checkpoint interval changes, and the deviation between the estimated and measured values does not exceed 15%. The statistical results of the checkpoint recovery model over multiple experiments are also very close to the estimated values. The F optimisation indicator proposed in this paper can recommend a more appropriate checkpoint interval depending on the system's failure rate and the application's operating parameters. Experiments show that the checkpoint interval recommended by our optimisation model is consistent with the recommendation obtained using measured values.

In the future, we plan to analyse different types of practical application topologies and examine how the checkpoint interval can be adjusted based on workloads that vary in real time.