1 Introduction and motivation

Applications spanning various domains, including social networks, e-commerce, and healthcare, now generate vast quantities of data. The growing velocity and volume of data generation requires substantial computing capacity to store and process such data effectively [1]. These large-scale computing systems, encompassing data centre clusters, comprise hundreds to thousands of interconnected machines that underpin the applications consumed by businesses and consumers alike.

A combination of increasing application demand and technological innovation has resulted in greater system scale, in the region of tens of thousands of servers within an individual cluster [2]. However, such scale has resulted in increased complexity within these systems, manifesting as emergent phenomena whereby system operation exhibits behaviour unforeseen at design time. Such emergent phenomena within large-scale cloud data centres have been observed to negatively impact application performance. One such phenomenon, known as the long-tail problem, is characterized by a minor subset of task stragglers that execute unusually slowly in comparison with normal task behaviour within a job. Task stragglers occur within any highly parallelized system and become even more apparent for jobs containing many tasks executing across a large number of machines.

Frameworks such as MapReduce, Spark, and Dryad [1, 3, 4] process vast quantities of data by parallelizing jobs into many smaller tasks, which makes such applications susceptible to stragglers. For example, within MapReduce, a job can only complete once all of its tasks have completed execution. The occurrence of stragglers therefore results in atypically long task execution durations, degrading the performance of the entire job. The challenge in effectively addressing stragglers is that their root-cause is not well understood [5]: they can result from various causes spanning daemon processes, data skew, failures, resource contention, and energy management tools [6, 7], manifesting within the application, the operating system (OS), or the physical hardware. Downstream applications that depend on job outputs may in turn be delayed or fail while pending their completion [8, 9].

This has resulted in a growing body of straggler research pertaining to analysing their underlying causes [9, 10], straggler forecasting [11, 12], and straggler mitigation techniques [13–16] including speculative execution [17], replication, load balancing, and scheduling [18]. Each of these works predominantly focuses on a particular subset of the phenomenon within a specific context of system operation or application framework. Straggler research has thus reached a sufficient level of maturity that it is worthwhile to appraise the landscape of research within the field, identify cross-cutting challenges, and evaluate the challenges on the horizon for future generation computing systems.

1.1 Motivation

The core motivation behind this survey is to conduct a systematic review of straggler research within large-scale cloud data centres. This systematic review encompasses clearly defining and analysing the impact of stragglers, presenting a taxonomy of straggler management techniques for forecasting and mitigation, and identifying future directions within the field.

1.2 Article organization

The rest of the article is structured as follows: Sect. 2 presents background information on the definition of stragglers and straggler management within large-scale systems. Section 3 presents the taxonomy of straggler causes. Section 4 explores the existing literature on straggler management techniques. Section 5 compares straggler management techniques based on the taxonomy of straggler causes and outlines observations, trend analysis, and future research directions. Finally, Sect. 6 summarizes the article.

2 Background

2.1 Straggler definition and impact

Applications execute within large-scale computing systems such as data centres and clusters by submitting jobs via a resource manager (YARN, Mesos, Borg, etc.). In this context, a job is composed of multiple smaller tasks (defined as the smallest unit of computation observable by the resource manager) [19]. Jobs and their tasks are scheduled onto different machines in a parallelized manner to accelerate job completion and are often divided into phases, creating a directed acyclic graph (DAG) [20]. Application frameworks (such as MapReduce) attempt to sub-divide jobs so that tasks within each phase complete within approximately the same timeframe [21]. This is achieved by providing a subset of data (known as a shard) to each task, and by allocating the appropriate resources (CPU, memory, etc.) to tasks, calculated via the resource requirement module of the resource manager [22].

However, even with such measures in place, within large-scale cloud data centres a subset of tasks within a job will manifest as stragglers [23, 24]. In this context, a straggler is defined as a task which executes abnormally slowly in comparison with the average task duration within a job [2]. The phrase ‘abnormally slow’ typically denotes any task with a completion time 50% greater than the (average) task completion time for a job phase [25, 26]. Slowly executing tasks (stragglers) affect the performance and completion time of the entire job [14], increasing resource utilization and degrading application performance at scale [27, 28], thus reducing system availability and incurring additional operational costs [29]. Analysis of production systems at scale [28] has identified that the approximately 4–6% of tasks that are stragglers negatively affect over 50% of the overall jobs within the greater system.
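To make this definition concrete, the following minimal Python sketch flags any task whose completion time exceeds the phase mean by more than 50%; the 1.5 factor follows the threshold above, and the task durations are purely illustrative.

```python
# Minimal sketch of the straggler definition above: flag any task whose
# completion time exceeds the (average) phase completion time by more than 50%.
# The durations are illustrative, not taken from any production trace.

def find_stragglers(durations, factor=1.5):
    """Return indices of tasks whose duration exceeds factor * mean duration."""
    mean = sum(durations) / len(durations)
    return [i for i, d in enumerate(durations) if d > factor * mean]

phase_durations = [42, 40, 45, 41, 43, 44, 118]  # seconds; the last task lags
print(find_stragglers(phase_durations))          # -> [6]
```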

2.2 Straggler management

Due to the impact of the long-tail problem within distributed computing systems, there have been concentrated efforts to mitigate its effects. The research community has tackled this through the creation of various straggler management techniques. In this context, straggler management comprises all mechanisms that have been created to mitigate the effects and impact of straggler manifestation. Figure 1 depicts straggler tasks and non-straggler tasks.

Fig. 1 Depiction of straggler tasks and non-straggler tasks

Such straggler management techniques can be divided into two main classes: detection and mitigation [30, 31]. Detection focuses on approaches to identify straggler manifestation prior to or after job execution within the cloud data centre, such as offline analytics and online monitoring mechanisms [32, 33]; an example of straggler detection is NearestFit [1]. Mitigation approaches focus on avoiding [34] or tolerating (detected) straggler manifestation during job execution, for example via scheduling, load balancing, and replication [26, 35, 36]. Examples of straggler mitigation include Dolly [13], GRASS [14], LATE [16], and Wrangler [15].

2.3 Related surveys and our contributions

To date, to the best of our knowledge, only two works have surveyed straggler research. Umesh and Jitendar [37] discussed an overview of straggler handling algorithms for the MapReduce framework, while Ashwin et al. [38] reviewed several straggler handling techniques. While these reviews cover specific cases of stragglers related to specific frameworks and installations, they do not provide a comprehensive survey of the straggler causes and straggler management techniques which exist within the research community. Furthermore, these works do not discuss in detail the precise root-causes and analysis of straggler behaviour, which underpin the design of straggler management techniques. Therefore, this paper provides a systematic review and taxonomy of straggler causes and maps them directly to straggler management techniques, together with a trend analysis.

3 Taxonomy of straggler causes

As mentioned in Sect. 1, the challenge within this research area is the myriad of potential causes of straggler manifestation. According to our comprehensive appraisal of the literature, we have identified eight key causes for straggler occurrence that manifest within large-scale cloud data centres. Figure 2 shows the taxonomy of straggler causes.

1. Data abstraction: Stragglers can occur due to information obfuscation at different levels of the system. The literature [39–42] has identified that information can be hidden at two levels: (i) at the OS level, the master node (controller) hides information from the workers (cluster nodes) during execution; and (ii) at the application level, information regarding platform and infrastructure services is kept hidden from the software services.

2. CPU utilization: It has been identified that there is a strong correlation between high system CPU utilization and straggler occurrence [7, 12, 43, 44], the underlying reason being resource contention. This is further compounded by Head-of-Line (HOL) blocking, task interference during execution, busy locks, queueing issues, the hazard rate of task execution, and the launching of additional speculative replicas, which require additional time to execute.

3. Scheduling: Scheduling and resource allocation decisions also influence straggler manifestation [45–48]. For job scheduling, stragglers can occur due to a large number of jobs enqueued within a machine or master scheduler waiting for resources to become available (i.e. only a portion of the tasks within a job are able to acquire the resources necessary to commence execution). Furthermore, stragglers may occur due to poor admission control mechanisms used to submit jobs for execution [49]: a poor admission control mechanism launches multiple tasks together, resulting in resource exhaustion and slowdown. Lastly, the dynamicity of QoS requirements at runtime results in an inability to manage resources effectively, which further leads to straggler occurrence. In terms of resource scheduling, stragglers can occur in the following situations [49–52]: (1) when resources are allocated to jobs inefficiently, without optimizing over the available resources, leading to ineffective scheduling of resources for job execution, and (2) when resources remain active even though they are not being utilized for job execution, consuming more energy and affecting the performance of other resources, since some resources need more power to run continuously.

4. Inaccessible local disk: Stragglers may occur when a machine's hard disk is not accessible to the tasks residing on it. Such inaccessibility is predominantly caused by [9, 53–59]: (i) an increasing number of backup tasks and (ii) failure to store output. Stragglers can occur when it is difficult to find the required task due to a large backlog of tasks waiting for execution. An error can also occur while storing output on the disk, causing problems when other tasks need to access that data during execution.

5. Data skew: Stragglers can occur due to data skew, caused by differing data sizes and variation in the time needed to access required data [56, 57, 60, 61]. With several tasks operating on a split version of a very large shared dataset, an uneven distribution of data amongst these tasks can cause some tasks to progress slowly in comparison with other tasks within the same phase (subsequently delaying future sub-phases and the entire job); a small illustrative sketch is given after this list. Data non-uniformity can also affect data access and processing time, directly increasing the timing delays between tasks and further increasing the probability of straggler occurrence. Moreover, data locality results in lower latencies for job execution, while distant data takes longer to access, incurring additional delays in task completion and, again, manifesting as stragglers.

6. Resource contention: Resource contention occurs when the same resource is shared by multiple tasks [9, 13, 14, 17, 53–55, 58, 59, 62–69]. It arises from conflict over task access and oversubscription of resources within multi-tenant machines, and can be exacerbated in different scenarios, including: (1) hardware heterogeneity, (2) poor user code, (3) extra cloning, (4) ineffective algorithm logic, (5) temporary slowdowns, (6) additional task clones requiring more resources, and (7) resource usage exceeding an accepted threshold value. Hardware heterogeneity is the main cause of resource contention; it occurs due to a mismatch between hardware specification and specified application constraints (e.g. budget, deadline, etc.), leading to task performance degradation. The source code of the scheduling algorithm also affects system performance through its space and time complexity: poorly written code may schedule resources inefficiently, increasing resource consumption and making required resources unavailable to specific jobs [35]. Task cloning creates a copy of a task to run in parallel on another resource for faster execution. Cloning tasks requires more resources (increasing resource usage), which can put tasks of other jobs on hold; while those tasks wait for resources, stragglers can occur. Ineffective logic in the resource scheduling algorithm can likewise lead to inefficient allocation of resources and increased resource usage, causing resource contention for future tasks. Temporary slowdowns can occur due to inefficient allocation of resources, which needs to be corrected; otherwise it will cause stragglers during execution.

7. Task execution: The successful execution of tasks is important to avoid straggler occurrence during job execution [10, 28, 70–74]. During job execution, stragglers can occur due to unhandled requests or due to ineffective management of task interference and task incompatibility. When a processing request is unhandled or only partially handled, tasks expecting the results of that request must wait until the full output is ready, manifesting as straggling tasks; this occurs due to data and task dependencies. If tasks are oblivious to the heterogeneity of the underlying platform resources, their incompatibility (lack of synchronization) arising from different workload types or requirements can manifest in slower execution and ultimately straggler occurrence.

8. Faults: Faults within software and hardware, resulting in crash-stop and late-timing failures, can cause straggler occurrence in large-scale systems [17, 18, 63, 64, 75, 76]. The main sources of software-induced faults are development, logic, or overflow errors, as well as misconfigurations. In terms of hardware, the main sources of faults are physical damage, device failures, daemon processes, or power-related issues such as energy management mechanisms. Ironically, fault tolerance and recovery mechanisms can themselves result in straggler manifestation (for example, checkpointing introduces bursts in disk access and increases resource contention, resulting in a higher system hazard rate).
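As referenced in the data skew item above, the following minimal Python sketch illustrates how an uneven shard distribution stretches a job's completion time even when the total work is unchanged; the shard sizes and processing rate are illustrative assumptions.

```python
# Illustrative sketch of the data-skew cause: task time is assumed to be
# proportional to shard size, so an uneven partition stretches the completion
# time of the whole job (its "tail"), even though the total work is identical.

def job_completion_time(shard_sizes, records_per_second=1000.0):
    """Tasks run in parallel, so the job finishes with its slowest task."""
    return max(size / records_per_second for size in shard_sizes)

even_shards   = [100_000] * 8               # balanced partition, 800k records
skewed_shards = [25_000] * 7 + [625_000]    # same 800k records, one huge shard

print(job_completion_time(even_shards))     # 100.0 seconds
print(job_completion_time(skewed_shards))   # 625.0 seconds -- straggler dominates
```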

Fig. 2 Taxonomy of straggler causes

3.1 Relationship between straggler causes

Based on the different types of straggler causes in large-scale systems, we have identified the correlations among them, as described in Table 1. As identified in [28], stragglers do not result from a single cause; causes can be correlated. For example, data abstraction can occur due to tasks queueing while waiting for execution. Resource contention is a principal cause of stragglers because resources are shared among different applications running on different nodes, which in turn affects CPU utilization by overloading resources. Stragglers occur during the scheduling of both jobs and resources; during resource scheduling they can arise from heterogeneous resources, poor user code or logic errors, and too many copies of straggler tasks running simultaneously. Inaccessible local disks can result from large numbers of backup-task copies and failures to store required output, which arise from task interference and incompatibility with other tasks; another reason is that requirements change dynamically. Data skew arises at the application level due to data hiding or failures to write data; it can also arise from inefficient allocation of resources for data processing, which increases running time. Resource contention occurs at the OS level when the master node hides information from the workers. Over-utilization of the CPU causes resource contention through increasing numbers of speculative copies and when node performance degrades. Moreover, poor admission control can also affect resource utilization and create resource contention when the resources required exceed those available. In turn, resource contention affects task execution due to the unavailability of shared resources. Faults during job execution can arise from resource failure and resource misconfiguration [77].

Table 1 Correlation among straggler causes

4 Straggler management techniques: current status

Straggler management techniques fall into two broad categories: straggler detection and straggler mitigation [78]. Each category can be further sub-divided into specific areas as shown in Fig. 3.

Fig. 3 Taxonomy of straggler management techniques

4.1 Straggler detection techniques

Straggler detection techniques are leveraged in order to identify straggler occurrence during job execution.

4.1.1 Offline straggler detection

Offline straggler detection techniques attempt to identify straggler manifestation in order to enhance speculative execution by leveraging offline analytics (i.e. analysing and modelling task execution and progress patterns derived from empirical data prior to execution).
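As a loose illustration of this idea (not any specific published system), the following Python sketch derives per-phase duration thresholds from a historical trace and flags the recorded tasks that exceeded them; the record format, the 1.5 factor, and the durations are assumptions for the example.

```python
# Sketch of offline straggler detection: derive per-(job, phase) duration
# statistics from a historical trace and flag tasks exceeding 1.5x the mean.
from collections import defaultdict

trace = [  # (job, phase, task duration in seconds) -- illustrative records
    ("job1", "map", 40), ("job1", "map", 44), ("job1", "map", 120),
    ("job1", "reduce", 60), ("job1", "reduce", 62), ("job1", "reduce", 61),
]

def phase_thresholds(records, factor=1.5):
    sums, counts = defaultdict(float), defaultdict(int)
    for job, phase, dur in records:
        sums[(job, phase)] += dur
        counts[(job, phase)] += 1
    return {key: factor * sums[key] / counts[key] for key in sums}

def offline_stragglers(records, factor=1.5):
    thresholds = phase_thresholds(records, factor)
    return [rec for rec in records if rec[2] > thresholds[(rec[0], rec[1])]]

print(offline_stragglers(trace))  # -> [('job1', 'map', 120)]
```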

Coppa and Finocchi [1] identified three challenges, straggling tasks, load imbalance, and data skew, which affect the performance of computing systems. To overcome these challenges, the authors proposed a profile-guided progress indicator called NearestFit, which combines nearest-neighbour regression with a statistical curve-fitting approach. NearestFit is mainly suited to long-running applications and helps to identify the above challenges, thereby increasing the efficiency of computing systems. The authors implemented the NodeIterator triangle counting algorithm on homogeneous Hadoop clusters to test the capability of NearestFit dynamically in terms of run time and progress.

Ouyang et al. [70] proposed a technique for modelling and ranking node-level stragglers (MRNLS) in CDCs based on analysing the execution trace logs of parallel jobs. This is performed by a graph-based algorithm that partitions the server nodes into smaller groups so that more jobs can execute in parallel; the proposed technique improves the performance of computing systems by reducing the occurrence of task stragglers. Cong et al. [72] proposed a machine learning-based straggler detection (MLSD) technique using an unsupervised clustering method. The technique manages resources effectively while executing jobs and diagnoses stragglers at runtime. Wei et al. [10] proposed a straggler detection approach (SDA) for data-intensive computing in cloud environments to detect stragglers at an early stage and preserve the efficiency of the CDC. Further, a statistical outlier detection method based on Tukey's procedure is used to detect stragglers at runtime, as it starts speculative execution earlier than the standard deviation method.
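For reference, a minimal sketch of Tukey's fences applied to task durations is shown below: a task is flagged when its duration exceeds the third quartile by more than 1.5 times the interquartile range. The durations and the per-phase framing are illustrative assumptions rather than details of the cited approach.

```python
# Hedged sketch of Tukey's outlier rule applied to task durations: flag any
# task whose duration exceeds Q3 + 1.5 * IQR for its phase.
import statistics

def tukey_stragglers(durations, k=1.5):
    q1, _, q3 = statistics.quantiles(durations, n=4)  # quartile cut points
    upper_fence = q3 + k * (q3 - q1)
    return [i for i, d in enumerate(durations) if d > upper_fence]

durations = [40, 41, 42, 43, 44, 45, 46, 130]  # seconds; illustrative only
print(tukey_stragglers(durations))             # -> [7]
```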

4.1.2 Online straggler detection

Online straggler detection techniques detect stragglers at runtime to improve speculative execution using online monitoring tools.

Farshid [79] analysed how the map phase of the MapReduce (MR) framework takes longer as the number of servers increases, which negatively affects the execution time of a MapReduce job. The authors designed an analytical model to identify the impact of stragglers on the efficiency of the computing system during the map phase in terms of application, system, and hardware parameters. Experimental results show that the model reduces execution time for MapReduce applications. Zaharia et al. [80] proposed resilient distributed datasets (RDDs), a distributed memory abstraction which enables developers to perform fault-tolerant, in-memory computations on large clusters. RDDs use coarse-grained transformations to offer a controlled form of shared memory for performing memory-intensive computations in an iterative manner. Spark is used to implement RDDs in a controlled environment to evaluate their performance.

Wang et al. [17] proposed a heuristic algorithm (HA) to search for the best replication strategy to reduce latency in computing systems; experimental results demonstrate that it is capable of reducing latency and its impact on the cost of executing workloads. Jeffrey and Sanjay [54] explored data processing on large clusters (DPRC) addressing several aspects: (1) providing fault tolerance by distributing computations, (2) optimizing network bandwidth by decreasing the quantity of data transferred throughout the network, and (3) decreasing the impact of slow machines while improving fault tolerance. In DPRC [54], a speculative copy of a task is executed by MapReduce on another node to improve job completion time and reduce response time. It is challenging to select the task on which to speculate because it is not trivial to identify which machine or node is running slower than average; to implement DPRC effectively, stragglers are recognized at the earliest possible stage using progress scores.
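The progress-score idea can be sketched as follows: each running task reports a score in [0, 1], and tasks whose score falls sufficiently below the phase average become candidates for speculative copies. The 0.2 margin and the task values are illustrative assumptions, not parameters from the cited work.

```python
# Simplified sketch of progress-score-based selection of speculation candidates.

def speculation_candidates(progress, margin=0.2):
    """progress: dict mapping task_id -> progress score in [0, 1]."""
    average = sum(progress.values()) / len(progress)
    return [task for task, score in progress.items() if score < average - margin]

running = {"t1": 0.90, "t2": 0.85, "t3": 0.88, "t4": 0.35}
print(speculation_candidates(running))  # -> ['t4'] -- launch a speculative copy
```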

Garraghan et al. [28] explored the root-causes of stragglers (RCS) and provided a method for root-cause analysis in massive-scale virtualized CDCs to tackle the long-tail challenge effectively. The authors used online analytic agents and offline modelling of execution patterns for straggler detection while monitoring tasks dynamically. Heecheol et al. [81] proposed a secure distributed computing (SDC) approach using a recovery threshold value to deal efficiently with the impact of straggling [82]; it applies polynomial codes to the sub-tasks allocated to nodes.

4.2 Straggler mitigation techniques

Straggler mitigation techniques comprise all mechanisms and approaches to tolerate or avoid the impact of straggler manifestation. Such techniques can be further sub-divided into three sub-categories [68, 83–85]: load balancing based, replication based, and scheduling based.

4.2.1 Load balancing-based straggler mitigation

Load balancing-based straggler mitigation techniques manage the load across nodes while mitigating stragglers.

Ouyang et al. [2] proposed a method to reduce late-timing failures (LTF) and analyse the root-causes of stragglers in cloud data centres (CDC), such as server failures, task concurrency, and resource contention; the study identified high temporal resource contention as a main root-cause of stragglers. Experimental results demonstrate that the technique maintains the efficiency of computing systems while tolerating system failures effectively. Yanfei et al. [86] proposed a user-transparent task slot management approach called FlexSlot, which identifies stragglers automatically and resizes their slots to speed up task execution. The approach also balances resource usage by automatically changing the number of available slots on nodes to improve utilization. Moreover, FlexSlot uses an adaptive speculative execution approach to better mitigate data skew.

Neda et al. [71] proposed a log-assisted straggler-aware (LASA) I/O scheduler for high-end computing to mitigate the impact of storage server stragglers, together with a scheduling algorithm that makes effective decisions to manage stragglers at runtime. Experimental results demonstrate that LASA performs well at load balancing while mitigating storage server stragglers dynamically. Eman et al. [62] proposed a parallel model for straggler mitigation in distributed spatial simulation called priority asynchronous parallel (PAP), which exploits the data dependencies of parallel processes so that they are computed and synchronized according to data priority across workers. Load balancing and partitioning methods are also proposed to balance workloads among different nodes, helping to improve performance speedup by a large extent. Haozhao et al. [87] proposed a heterogeneity-aware gradient coding (HGC) scheme to execute jobs in heterogeneous environments and tolerate stragglers efficiently without degrading the effectiveness of cloud services [34]; experimental results demonstrate that the HGC scheme performs well in terms of computation time.

4.2.2 Replication-based straggler mitigation

Replication-based straggler mitigation techniques replicate an adequate number of tasks to mitigate stragglers.

Mehmet et al. [8] analysed the trade-off between latency and cost (TLC) when using simple replication or erasure coding for straggler mitigation in jobs with many tasks; experimental results show that delaying redundancy is not effective in reducing cost. Further, Mehmet et al. [55] developed a straggler mitigation (SM) technique using delayed relaunch of tasks, which helps to reduce cost and latency effectively. Wang et al. [9] proposed an efficient task replication technique (TRT) for straggler management to improve the response time of parallel computations. This technique is implemented in [88] and demonstrates empirically that replicating all operations can significantly reduce mean and tail latency in real-world systems, including domain name system (DNS) queries, database servers, and packet forwarding within networks.

Tien-Dat [11, 64] proposed an energy-efficient straggler mitigation (EESM) technique for the effective management of big-data applications in cloud computing environments, optimizing energy consumption under straggler occurrence. Firstly, the authors characterize the effect of straggler mitigation on energy efficiency. Secondly, a straggler detection framework is developed, through which they identify that only 12% of the detected tasks are real stragglers [64]; the use of a large number of speculative copies is the main reason for unnecessary energy consumption. Thirdly, a reservation-based straggler handling approach is proposed to optimize energy efficiency by allocating the required resources at runtime.

Wang et al. [89] analysed the trade-off between latency and cost to find the best replication strategy for straggler management based on the following parameters: (1) when to replicate straggling tasks, (2) how many replicas to launch, and (3) whether to kill the original copy. Further, a straggler management approach (SMA) is proposed to estimate latency based on the empirical distribution of task execution time. Experimental results demonstrate that this work performs better in terms of two performance parameters, cost and latency. Lei et al. [90] proposed a straggler management technique called Combination Re-Execution Scheduling Technology (CREST) for fast speculation of straggler tasks in the MapReduce framework, which reduces the response time of MapReduce jobs; re-executing a set of tasks on a set of computing nodes improves the speed of task execution.
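To illustrate the latency-cost trade-off these works explore, the following Python sketch simulates a job under an assumed heavy-tailed task-time distribution and reports mean job latency and a rough machine-time cost for different replica counts; the distribution, cost proxy, and parameters are illustrative assumptions only.

```python
# Hedged simulation of the replication latency-cost trade-off: launching r
# replicas per task and keeping the earliest finisher shortens the tail but
# multiplies resource usage.
import random

random.seed(0)

def sample_task_time():
    # Assumed distribution: most tasks take 10 s, 5% straggle at 100 s.
    return 10.0 if random.random() > 0.05 else 100.0

def simulate(num_tasks=100, replicas=1, trials=200):
    latencies, costs = [], []
    for _ in range(trials):
        finish = [min(sample_task_time() for _ in range(replicas))
                  for _ in range(num_tasks)]
        latencies.append(max(finish))          # job waits for its slowest task
        costs.append(replicas * sum(finish))   # rough machine-time proxy
    return sum(latencies) / trials, sum(costs) / trials

for r in (1, 2, 3):
    latency, cost = simulate(replicas=r)
    print(f"replicas={r}: mean job latency={latency:.1f} s, mean cost={cost:.0f}")
```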

Radheshyam et al. [91] proposed a job-aware scheduling (JAS) technique to optimize the running time of different jobs executing on the same cluster by maintaining harmony among them. The JAS technique is implemented for the MapReduce framework, and the proposed algorithm selects the task most compatible with the currently executing task to further reduce execution time. Moreover, a heuristic-based load balancing technique is developed to avoid under-loading and over-loading of resources. Matei et al. [16] explored the MapReduce framework for straggler management and improved its performance in heterogeneous environments. They proposed a resource scheduling algorithm, longest approximate time to end (LATE), which improves robustness to heterogeneity and improves task response time. The LATE scheduling algorithm [80] estimates each task's approximate time to end, selects the tasks with the longest estimates as stragglers, and executes speculative copies of them on other, fast nodes to speed up job completion. The SAMR scheduling technique [18] computes task completion at runtime and discovers straggler tasks based on execution time; historic information about nodes is used to detect the more reliable nodes, and the weights of the map and reduce stages are updated after the completion of every task.
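A loose sketch of the LATE heuristic (not the authors' full algorithm, which also accounts for speculation caps and node speed) is shown below: each task's time to end is estimated as (1 − progress) / progress rate, and the task with the longest estimate is speculated first. The task records are illustrative.

```python
# Simplified LATE-style ranking: estimate time to end from progress so far and
# speculate on the task with the longest estimated remaining time.

def time_to_end(progress, elapsed_seconds):
    rate = progress / elapsed_seconds if elapsed_seconds else 0.0
    return (1.0 - progress) / rate if rate > 0 else float("inf")

# (task_id, progress score in [0, 1], seconds running so far)
tasks = [("t1", 0.80, 40), ("t2", 0.75, 42), ("t3", 0.20, 45)]

ranked = sorted(tasks, key=lambda t: time_to_end(t[1], t[2]), reverse=True)
print(ranked[0][0])  # -> 't3': longest approximate time to end, speculate first
```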

Farhat et al. [63] proposed a straggler management technique for the modelling and optimization (SMMO) of straggling mappers, capturing the stochastic behaviour of mapper nodes and its negative effect on the completion time of MapReduce jobs. The authors used the inter-arrival times of tasks to map jobs to the required nodes of a heterogeneous CDC in an optimized way; experimental results demonstrate that the proposed technique reduces job execution time at runtime. Behrouzi-Far et al. [92] proposed an efficient straggler replication framework for large-scale parallel computing and analysed system performance in terms of the latency-cost trade-off. The framework identifies the best replication strategy based on different criteria: (i) the number of replicas required, (ii) when to replicate straggling tasks, and (iii) whether to kill the original task. The performance evaluation shows that latency and cost are reduced on the Google cluster trace as compared to MapReduce.

4.2.3 Scheduling-based straggler mitigation

Scheduling-based straggler mitigation techniques schedule resources for jobs so as to mitigate stragglers.

Ananthanarayanan et al. [13] explored straggler mitigation techniques and identified the impact of straggler causes on latency-sensitive jobs. The authors analysed workloads dominated by small jobs and performed proactive cloning of such jobs; cloning small jobs uses relatively few resources while improving the reliability of computing services. They developed a system named Dolly to generate multiple clones of jobs and execute them within a specified budget. Experimental results demonstrate that Dolly sped up jobs by 46% while using only 5% extra resources.
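A minimal sketch of proactive cloning in this spirit (not Dolly's actual scheduler) is shown below: every task of a small job is cloned up-front and the earliest clone's result is used, so no straggler detection or waiting is needed. The straggler probability and timings are simulated assumptions.

```python
# Illustrative sketch of proactive task cloning for small jobs: run several
# clones of each task and take the earliest finisher.
import random

random.seed(1)

def run_clone():
    # Assumed behaviour: one clone in ten straggles and runs 5x slower.
    return 25.0 if random.random() < 0.1 else 5.0

def job_latency(num_tasks, clones_per_task):
    per_task = [min(run_clone() for _ in range(clones_per_task))
                for _ in range(num_tasks)]
    return max(per_task)  # the job ends when its slowest task ends

print(job_latency(num_tasks=10, clones_per_task=1))  # often 25.0 (a straggler)
print(job_latency(num_tasks=10, clones_per_task=3))  # usually 5.0
```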

Ananthanarayanan et al. [14] proposed the greedy speculative scheduling and resource-aware speculative scheduling (GRASS) technique, which uses speculation to mitigate the impact of stragglers in approximation jobs. GRASS uses extra resources for speculation, improving accuracy for deadline-bound jobs by 47% and speeding up error-bound jobs by 38%. Aaron et al. [53] addressed the straggler problem for iterative convergent parallel (ICP) machine learning, injecting stragglers to identify the system's behaviour (in terms of delay) during job execution. Amazon EC2 and Microsoft Azure [93] are used to evaluate system performance in terms of execution time.

Ouyang et al. [25] proposed a straggler management technique (SMT) that identifies task stragglers by calculating a threshold value at runtime; the technique considers key parameters such as resource utilization, task execution, and job QoS timing constraints to manage straggler tasks effectively. Neeraja et al. [15] proposed a straggler management technique called Wrangler to proactively avoid the conditions which cause stragglers. Wrangler [15] uses an interpretable linear modelling approach to reduce resource wastage by removing the need to replicate tasks; it uses fewer resources to complete jobs faster and avoids stragglers proactively by predicting them in advance. A statistical learning technique based on cluster resource utilization provides a confidence measure, offering reliable task scheduling by predicting errors in advance. Experimental results show that Wrangler improves job completion time and resource utilization compared to speculative execution.

Quan et al. [18] proposed a self-adaptive MapReduce (SAMR) scheduling technique for straggler management, which estimates task progress automatically and adapts dynamically to changing environmental conditions. SAMR uses the MapReduce mechanism to divide jobs into tasks and execute them on the different available nodes, and it does not create backup tasks for regular tasks. SAMR reduces the execution time of MapReduce jobs when executing tasks in heterogeneous environments. Enhanced SAMR (ESAMR) [27] uses the k-means clustering algorithm to categorize the historic data of each node into k clusters and identifies straggler tasks more accurately. Furthermore, ESAMR uses the weights of the map and reduce stages to find the time to end on different nodes, which makes it easier to identify the more reliable nodes.
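As a rough, hedged sketch of this idea (not the ESAMR algorithm itself), the snippet below clusters a node's historical map sub-stage weights with a tiny one-dimensional k-means, matches a running task's observed weight to the nearest cluster centre, and uses that weight to estimate the task's remaining time. All weights, durations, and the choice of k are illustrative assumptions.

```python
# Hedged sketch: cluster historical stage weights (k-means, k=2) and use the
# nearest cluster centre to estimate a running task's remaining time.

def kmeans_1d(values, k=2, iters=20):
    centres = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres

# Historical fraction of a map task's total time spent in its first sub-stage,
# recorded on this node for two different workload types.
history = [0.62, 0.60, 0.64, 0.31, 0.29, 0.33]
centres = kmeans_1d(history, k=2)

# A running task finished its first sub-stage after 30 s; its observed weight
# so far (0.30) is matched to the nearest historical cluster centre.
elapsed, observed_weight = 30.0, 0.30
weight = min(centres, key=lambda c: abs(c - observed_weight))
estimated_total = elapsed / weight
print(round(estimated_total - elapsed, 1))  # estimated seconds remaining
```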

Ananthanarayanan et al. [27] studied straggler management in resource-aware techniques and identified the main causes of stragglers as varying bandwidth, network congestion, workload imbalance, and contention for resources (network, memory, and processor). Furthermore, Mantri [27] monitors task execution and takes proactive action to sustain the efficiency of the CDC in the case of resource contention or hardware/software failures [94–96]. It uses Bing traces to evaluate performance and improves job completion time to a large extent.

Ouyang et al. [97] proposed a straggler management mechanism (SMM) to improve the execution efficiency of Internet-ware applications by dynamically calculating the straggler threshold, considering important parameters such as optimal system resource utilization, task execution progress, and job QoS timing constraints. The YARN architecture is used to implement the dynamic straggler threshold and test the performance of the proposed mechanism, and experimental results show better outcomes in terms of response time. Yan et al. [98] developed a large-scale multimedia semantic concept (LMSC) model to improve the scalability of computing systems in heterogeneous environments. A robust subspace bagging algorithm is used to improve the learning process, and a task scheduling algorithm is proposed to improve scalability when executing heterogeneous tasks. The proposed model is tested on the MapReduce framework, and experimental results demonstrate its effectiveness.

Figure 4 presents the evolution (2008–2019) of different types of straggler management techniques along with their focus of study and QoS. Table 2 shows the comparison of different types of straggler management techniques based on different parameters.

Fig. 4 Evolution of straggler management techniques

Table 2 Comparison of straggler management techniques

5 Comparison of straggler management techniques based on taxonomy

Table 3 compares straggler management techniques based on the taxonomy of straggler causes from Fig. 2 and Table 2.

Table 3 Comparison of straggler management techniques based on taxonomy of straggler causes

5.1 Analysis of experimental results: practical use-case

Existing straggler management techniques have been categorized into two categories, i.e. straggler detection and mitigation techniques. Table 4 presents an analysis of the experimental results of straggler detection and mitigation techniques in the context of different performance parameters. Future researchers can use Table 4 to validate their work against the values of the performance parameters identified from the existing literature. The literature reports four levels of data abstraction (OS, application, server, and VM) at which stragglers can occur.

Table 4 Analysis of experimental results of straggler detection and mitigation techniques

5.2 Trend analysis

Our systematic review has identified the outcomes of the different categories of straggler management techniques developed from 2008 to 2019. Scheduling-based straggler mitigation techniques are prominent across the years except 2012. After scheduling-based techniques, researchers focused on replication-based straggler mitigation from 2013 to 2019. Offline detection, online detection, and load-balancing-based techniques received less attention from 2008 to 2019, and further research on them is needed to improve straggler management in large-scale systems. In 2018 and 2019, researchers focused on scheduling- and replication-based straggler management. Figure 5 shows the year-wise publications on straggler management techniques: research was highly active from 2008 to 2016, declined in 2017 and 2018, and picked up again in 2019.

Fig. 5 Publications of straggler management techniques

The literature reports that research related to straggler management is mostly published in journals (31%), followed by conferences (28%), transactions (21%), and book chapters (10%); the rest is published in symposiums, workshops, white papers, and PhD theses. Figure 6 shows the research conducted on straggler management at different levels, namely application, server, OS, VM, and cooling. It clearly shows that most of the research has been done at the application level (46%), followed by the VM level (21%); only 3% of the research has been done at the cooling level.

Fig. 6 Straggler-type breakdown in the literature

The literature reports that 44% of the research work considered between 0 and 100 nodes for performance evaluation, and only 7% considered more than 1000 nodes. Four different types of study were identified from the literature: real-testbed based (63%), systematic reviews (7%), conceptual models (10%), and simulation based (20%); most technical research papers thus use real testbeds for performance evaluation. Only two reviews [37, 38] have been conducted in this area. Table 5 maps the research works to the performance parameters identified in Table 4.

Table 5 Research work related to performance parameters

5.3 Observations

From the trend analysis, it is observable that current related works focus on studying and mitigating specific straggler types, ranging from resource contention to data skew, as shown in Table 2. This appears to be a necessity given the complexities and management strategies appropriate for each straggler type. The challenge is that straggler manifestation can be correlated not only across system phenomena but also across the management techniques themselves (e.g. the use of speculative copies to address data skew causes increased resource contention).

Important research challenges within large-scale cloud data centres, such as latency, scalability, energy consumption, and data processing, are contributing to the rise of research in the field of straggler management, and artificial intelligence techniques may help to address them. On the other hand, real cloud infrastructure (at least 50 physical nodes) is needed to test the performance of future straggler management techniques, which is very expensive for academic institutions to afford. To address this, industry players such as Facebook, Google, and Amazon should collaborate with academic institutions to provide the infrastructure required for real experiments.

This systematic review also identifies various research directions for prospective research scholars working in the field of straggler management for distributed systems and searching for new research challenges to improve the performance of cloud services. Straggler management is an evolving field of research for large-scale systems, and it is quite challenging to execute user workloads without the occurrence of stragglers. To solve this problem, there is a need to recognize the causes of the long-tail problem and their correlations, which can help to establish the dependencies among stragglers. The study in [1] developed a profile-guided straggler management technique, but accurate prediction is difficult if a job is too small to gather the required profiling data. Efficient data recovery is achieved in [80], but memory requirements must be kept from growing to intolerable levels as dataset sizes increase, which can otherwise cause stragglers. The number of jobs is increasing over time, and the impact of multiple jobs on the probability of stragglers needs to be analysed [13]. Existing techniques use historic data to estimate resource requirements [17]; however, there is a need for an online strategy that simultaneously learns the execution time distribution and launches replicas, instead of estimating times from historical traces. Further, replication increases the reliability of job execution, but it consumes more energy, which is a global challenge to address [8]. Infrastructure that scales up and down by switching virtual machines or nodes on and off based on the cluster's resource usage is required to save energy [91]. Dependencies among tasks during execution further cause stragglers, because some tasks need to complete before others can begin [89]. Existing straggler management techniques need improvement to reduce straggler occurrence. Using this systematic review, the causes of stragglers can be identified more easily; therefore, effective straggler management techniques can be developed to execute jobs without straggler occurrence while fulfilling dynamic job requirements, helping to increase the efficiency of large-scale cloud data centres.

5.4 Future research directions

Although substantial progress has been made in straggler management techniques for large-scale systems, there are still many pressing issues and challenges in this field that need to be addressed. Based on existing research, we have identified various open issues in this area.

5.4.1 Data processing

Data processing is an important challenge in straggler management [54]. It arises from skew in the data that the computing system has to process. Two types of problem reduce the data processing capability of systems: (1) large variation in data size and (2) non-uniformity of data, both of which degrade the performance of large-scale computing systems. To improve straggler management, there should be less variation as well as less non-uniformity in the data. Tackling this challenge can further improve the processing speed of computing systems in terms of execution time and latency.

5.4.2 Heterogeneity

Hardware heterogeneity is a main cause of resource contention; it occurs because different types of resources (with different configurations, from different providers, etc.) are used, and some resources are not compatible enough to execute jobs in a coordinated manner. There is a need for a single interface that can provide a stable platform for different types of hardware to interact collaboratively.

5.4.3 Latency

Latency is another important challenge in straggler management for large-scale systems, and it can affect the performance of computing systems. Latency has several causes: (1) non-uniformity of data, (2) resource contention, (3) poor user code, and (4) extra cloning. To improve processing, data should first be made uniform. Further, efficient resource scheduling algorithms are required that can reduce resource contention, and hence latency, at runtime [99]. Extra cloning of tasks to speed up execution can also increase latency, because more resources are required to process the additional copies. There is a need to develop an effective straggler management technique that schedules resources and reduces latency at runtime.

5.4.4 Scalability

To improve the performance of computing systems, systems must be scalable enough to serve jobs within their specified deadlines without further delay at runtime [100]. Scalability allows the capacity of the system to increase as the load increases, which can further reduce the occurrence of stragglers.

5.4.5 Resource sharing

Sharing resources among different jobs can improve resource utilization, but it leads to resource contention, which can degrade the performance of large-scale computing systems [101]. There is a need for an effective resource contention management technique that can identify the causes of contention and provide solutions that avoid additional resource over-allocation, which ultimately contributes to straggler occurrence.

5.4.6 Energy management

The literature reports [99–102] that straggler management techniques create several copies of the same task to mitigate the effects of stragglers. Copying a task reserves additional resources such as disk, memory, or CPU time, increasing the use of particular resources. As a resource is used more continuously, its energy consumption rises, and depending on the type of resource, its performance can degrade once energy consumption exceeds a certain threshold.

6 Summary and conclusions

In this paper, we have provided a comprehensive literature review of current straggler research within computer science, an important problem which directly debilitates the performance of large-scale computing systems. We proposed a taxonomy of straggler causes as identified from different types of straggler management techniques. Moreover, various straggler management techniques have been reviewed and classified into two categories: straggler detection and straggler mitigation. These techniques have been compared in detail, a taxonomy-based mapping has been described, and various result outcomes related to straggler management have been presented. Observations of interest include the focused nature of work on individual straggler causes, and the fact that mitigation solutions may potentially interfere with each other due to correlated root-causes. Hence, there is scope for designing a multi-purpose straggler management technique which profiles and acts based on the type of straggler identified.