1 Introduction

Large-scale data repositories play an important role in modern organizations and are their main source of information. Timely and accurate retrieval of information is critical for organizations to succeed. A workload can be defined as data loads, batch reports or complex requests that include insert, delete and update operations, which may demand resources such as input/output devices, storage and memory. The workload, or set of queries, is generated by either a decision support system (DSS) or online transaction processing (OLTP) for a database management system (DBMS), and by online analytical processing (OLAP) for a data warehouse (DW). The mix type of workload (a mix of OLTP and DSS) is also gaining importance. Identification of the workload type is therefore important for better workload management. In recent literature, more types of database workload are being studied, and resource allocation is performed based on the workload type. A database administrator (DBA) used to manage the system workload by tuning and configuring it manually. Due to the increase in the volume of data and changes in workload behavior, DBAs can no longer manage it efficiently, and DBMS performance decreases as a result. When a DBMS becomes overloaded, resources cannot be allocated properly. DBMS performance suffers when the DBA or workload manager is unable to handle the workload; some requests are then rejected or delayed. Due to low performance, users lose interest, and organizational resources such as money and time are wasted.

The term autonomic computing (AC) was introduced by IBM in 2001 [151], and since then, research communities and database and data warehouse vendors have been working on incorporating autonomic technology. AC has been incorporated in various areas of study, and it is equally important in large-scale data repositories. AC has a number of characteristics, and one or more of them are incorporated into a system to make it autonomic. The study [49] highlighted a few shortcomings of DBMSs from the autonomic point of view. The well-known commercial DBMSs, including Oracle, DB2 and SQL Server, are investigated in [89, 90, 123] to see how these shortcomings have been addressed in recent DBMSs. Autonomicity has been incorporated into different components, utilities and tools, and certain levels have been achieved by DBMSs [122]. Database researchers continue to focus on proper management of the database workload due to its heterogeneity, growth in data size and complexity. Autonomic workload management is related to AC characteristics such as self-prediction, self-healing, self-inspection, self-adaptation, self-configuration and self-optimization [66, 151]. Work is being carried out on classification, performance prediction and adaptation of the workload. AC technology has the potential to enhance the performance of large-scale data repositories. However, challenges remain in workload management, including predicting the execution time of the workload, handling resource contention among workloads and dealing with problematic workloads [150]. A number of tools, models and algorithms developed for managing the system workload with respect to AC are presented in [121]. The literature reveals that two types of workload, i.e., DSS and OLTP, are considered for workload classification; predictions are made to forecast workload performance, and adaptation is performed when workload behavior changes. The tasks that were performed by a DBA are performed autonomically with little or no human intervention, freeing the DBA to do other useful tasks that improve performance and reduce the time and cost of ownership.

In the literature, there exist a few relevant surveys [18, 87, 91, 158]. The survey [18] only presents workload characterization and does not describe the AC perspective. The survey [87] discusses AC only and is not relevant to workload management. The survey presented in [91] reviews the literature on database workload management with respect to AC for mainly three commercial databases: Oracle, DB2 and SQL Server. The recent survey [158] presents a taxonomy of workload management in DBMSs, covering workload management techniques for characterization, admission control, scheduling and execution control. In our survey, by contrast, we explore all the techniques used for workload management through workload classification, workload performance prediction and workload adaptation. It provides detailed analyses of workload management techniques for large-scale data repositories such as DBMSs and DWs from 2001 to 2017. We categorize these techniques under three autonomic characteristics, i.e., self-inspection, self-prediction and self-adaptation. The main objective of this study is to conduct a state-of-the-art survey of workload performance tuning and to provide insight into how management is done intelligently by incorporating AC characteristics. The paper focuses on the management of two types of database workload, OLTP and DSS, in databases and DWs related to large-scale data repositories. The contributions of this paper include detailed analyses of workload management with respect to AC. Three main AC characteristics (self-inspection, self-prediction and self-adaptation) are selected in this work, as they are broadly used in workload management by well-known vendors of large-scale repositories. A taxonomy of autonomic workload performance tuning (AWPT) is presented, and the approaches for performance modeling in DWs are explored. State-of-the-art research issues are also described, and research directions are provided.

The organization of the paper is as follows. Autonomic workload management is presented in Sect. 2. The research methodology followed in this work is provided in Sect. 3. The taxonomy for workload management created in this research is presented in Sect. 4. Issues in large-scale data repositories are discussed in Sect. 5. Finally, the paper is concluded with future work in Sect. 6.

2 Autonomic computing and autonomic workload management

Early unreliable and static systems are now being converted to reliable and dynamic systems, and their dynamic behavior continues to evolve. The increase in computer usage has created complexity, and as a result manageability has become very difficult for current and future information technology. These manageability issues motivate intelligent or autonomic computing, where computations are performed autonomically while DBAs or IT professionals attend to other useful DBMS activities rather than operational tasks, with less cost and less human involvement. Systems that have the ability to manage all their activities with little or no human involvement are called autonomic systems [77, 151]. The idea of AC is taken from the human nervous system, which performs the required tasks in the human body without conscious mental effort about what is needed and when it is needed.

The evolutionary process of AC consists of five levels, from basic to autonomic: basic, managed, predictive, adaptive and autonomic [87, 151]. Figure 1 presents the MAPEK (Monitor, Analyze, Plan, Execute and Knowledge) model of an autonomic system. The workload acts as the managed element. The managed element, the main component of an autonomic system, is handled through a manageability interface consisting of sensors and effectors.

Fig. 1
figure 1

MAPEK model

Sensors collect the necessary information from the managed element, and effectors adjust it toward the intended behavior.

The Monitor component examines the activities of the sensors and stores the data they collect in the knowledge base. The Analyze component compares the parameters of the data with the intended parameters and stores the analyzed data in the knowledge base. Using the analyzed data, the Plan component finds trends in the data using different methods. The Execute component adjusts the parameters of the managed element through the effectors. The knowledge base is the repository where all collected and analyzed data are stored. For workload management, the workload maps to the managed element. Sensors and effectors detect and identify the workload type or workload performance. The Monitor and Analyze components observe and classify the workload, respectively. The Plan component examines the trend of the incoming data, and after the Execute component acts on it, the resulting data are stored in the knowledge base.
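To make the control loop concrete, the following is a minimal, illustrative sketch of a MAPEK cycle for workload management. All names (the sensor reading fields, the response-time target and the `mpl` knob) are hypothetical and not taken from any cited system.

```python
# A minimal, illustrative MAPEK loop for workload management.
# Sensor fields, the 200 ms target and the "mpl" knob are assumptions for this sketch.

class KnowledgeBase:
    """Shared repository for collected and analyzed data."""
    def __init__(self):
        self.history = []

    def store(self, record):
        self.history.append(record)

def monitor(sensor_reading, kb):
    """Monitor: collect sensor data and store it in the knowledge base."""
    kb.store(sensor_reading)
    return sensor_reading

def analyze(reading, kb, target_response_ms=200):
    """Analyze: compare observed parameters with the intended ones."""
    deviation = reading["response_ms"] - target_response_ms
    kb.store({"deviation": deviation})
    return deviation

def plan(deviation):
    """Plan: decide on an adjustment based on the observed trend."""
    if deviation > 0:
        return {"action": "reduce_mpl", "amount": 1}   # shed load
    return {"action": "none"}

def execute(plan_step, effector):
    """Execute: apply the adjustment through the effector."""
    if plan_step["action"] == "reduce_mpl":
        effector["mpl"] = max(1, effector["mpl"] - plan_step["amount"])

kb = KnowledgeBase()
effector = {"mpl": 10}                 # managed element's tunable knob
reading = {"response_ms": 350}         # hypothetical sensor reading
execute(plan(analyze(monitor(reading, kb), kb)), effector)
print(effector)                        # {'mpl': 9}
```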

To be autonomic, a system must hold some AC characteristics [151], which include self-configuration, self-inspection, self-prediction, self-adaptation, self-healing and self-optimization. An AC system is capable of monitoring itself and can perform self-optimization of its resources. Self-optimization is achieved by performing activities efficiently on the basis of the environmental setting and the available resources. An AC system configures itself to meet the desired objectives, and the configurations are predicted and adapted dynamically; all reconfigurations are done autonomically with little or no human intervention. The self-healing capability of an AC system recovers the system from an attack using log statistics and backup files and brings the system back to a consistent state after every attack. The self-adaptation ability of an AC system recognizes changes occurring in the system and adapts to them, including changes in indexes, data layout and the data structure of the database, to improve system performance. In this study, we have focused on three AC characteristics, i.e., self-prediction, self-inspection and self-adaptation.

DBAs face a number of problems while managing a database. The reasons include the versatility of data, the growth of data volume, database functionality enhancement, homogeneous and heterogeneous data, database maintenance and e-services [70]. In large-scale data repositories, including DBMSs and DWs, AC is becoming a necessity to increase efficiency and reduce human intervention. DBMSs that have autonomic capabilities are called autonomic database management systems (ADBMSs), and DBMSs are evolving toward ADBMSs [89, 90, 123, 124]. The vendors of well-known DBMSs, including SQL Server, DB2 and Oracle, are making their DBMS products autonomic, and DB2 is the leading DBMS in this regard. DBMSs have been evaluated [89, 90, 123] and optimizers compared [124] from the autonomic perspective along with the degree of human intervention, and the workloads of DBMSs have been observed from the autonomic point of view [121]. Workload management with respect to AC (self-inspection, self-prediction, self-adaptation) needs thorough investigation in large-scale data repositories. Self-inspection supports monitoring and controlling the workflow, self-prediction supports forecasting the performance aspects of the workload, and self-adaptation supports managing the evolving behavior of the workload.

3 Methodology

We developed a methodology for this survey to find literature on workload management in large-scale data repositories such as databases and data warehouses. Journal and conference papers were searched using Google Scholar and DBLP with queries such as "workload monitoring and classification," "workload performance prediction" and "workload adaptation." This survey focuses on AC used in large-scale data repositories: which AC characteristics are incorporated and which techniques are used. The timeline for this study is set from 2001 to 2017, since the term AC was introduced by IBM in 2001. Workload management was categorized into three types: workload classification, performance prediction and workload adaptation. We created a taxonomy of AWPT that shows these three categories and their subcategories, and our work follows this taxonomy throughout the paper. For the inclusion and exclusion of papers, we manually reviewed the literature against a defined criterion.

The papers found in the literature that present autonomic or self-managing models, frameworks and architectures for large-scale data repositories are included in this survey. To limit the scope, we only considered workload management in databases and data warehouses, especially in large-scale data repositories. All the papers related to workload management that fall under the three selected AC characteristics (self-inspection, self-prediction and self-adaptation) and contribute toward performance tuning are included. Papers that present techniques, models, architectures and frameworks developed for workload management and focus on monitoring and controlling the workload are put into the self-inspection category. Papers related to workload prediction are put into the self-prediction category. Similarly, papers related to workload adaptation are put into the self-adaptation category.

Studies related to workload management in big data, cloud and distributed databases are excluded. Papers that discuss workload types other than OLTP and DSS, such as mobile, cloud, web and social network workloads, are also excluded, as are short notes, technical papers and organizational white papers. Papers that are not in the English language or cannot be translated well are excluded, along with papers on workload management that exhibit other autonomic characteristics. The 158 manually selected papers fall into the three categories and possess autonomic characteristics. In this study, 89 journal papers, 65 conference papers, 1 book chapter and 3 reports are used. The journals and conferences belong to different publishers: IEEE (75), ACM (32), VLDB (13), Springer (13), Wiley (3) and Elsevier (14).

The publications by year range in the three categories, i.e., classification, prediction and adaptation of workload, are presented in Fig. 2.

Fig. 2
figure 2

AWPT publications

The work done with respect to autonomic characteristics such as self-inspection, self-prediction and self-adaptation is shown in Fig. 3.

Fig. 3
figure 3

AWPT publications per subcategory

4 Autonomic workload performance tuning in large-scale data repositories

To describe and classify the autonomic characteristics needed for large-scale data repositories, we have created the taxonomy of autonomic workload performance tuning (AWPT). The three aspects of large-scale data repositories considered with respect to autonomic characteristics are workload classification, performance prediction and adaptation. The taxonomy is shown in Fig. 4. In the context of large-scale data repositories, the workload is the job (in the form of queries) assigned to a system to be executed in a given time duration, whereas in computer science generally it refers to every input or request received by a specified technological infrastructure [106]. Workload management is performed automatically or through DBAs [80]. With the development of autonomic technology, the trend has shifted from automatic to autonomic [37, 99]. Following this trend, we consider it highly relevant to map AC characteristics to workload management. The works on large-scale data repositories related to workload classification, prediction and adaptation are mapped to the AC characteristics self-inspection, self-prediction and self-adaptation, respectively. For all the works, the techniques and approaches used are discussed, and their limitations are described.

Fig. 4
figure 4

Autonomic workload performance tuning (AWPT) taxonomy

Many studies [6, 13, 18, 81, 116, 125, 134, 138, 147] on workload characterization are found in the literature, with applications in various domains (see Fig. 5): database management systems, online transaction processing systems, microservices, web browsers, video services, online social networks and mobile devices. This paper focuses only on the workload management of large-scale data repositories, including DBMSs and DWs, through workflow monitoring and performance tuning; other types of workload are outside the scope of the paper.

Fig. 5
figure 5

Workload characterization in various applications and domains [6, 13, 18, 81, 116, 125, 134, 138, 147]

Workloads differ across application areas and include DBMS workload [2, 9, 65, 67, 72, 157], web workload [17, 19, 20, 71, 112, 116, 120, 137], OSN workload [12, 22, 29, 116, 152], video service workload [28, 52, 129, 134, 138], mobile app workload [21, 40, 79, 155] and cloud workload [6, 63, 125, 140, 147]. Every type of workload has its own features and is handled in different ways using a variety of techniques. The data analysis approaches used for the workloads of different domains and applications, shown in Fig. 6, are described as follows. A database workload is a set of requests made upon the DBMS. The database workload can be classified as OLTP, DSS, a mix of OLTP and DSS, commands, sessions and further operations that run in a time interval [69]. Several papers [15, 51, 65, 69, 105, 134] characterize the database workload using different statistical, data mining and machine learning techniques. In today's enterprises, OLTP systems are popular data processing systems; examples include retail sales, order entry and financial transaction systems. In OLTP systems, insert, update and delete queries are used. Other types of workload can be system logs and variable snapshots [6], and techniques like fuzzy logic and neural networks are used for their characterization. The nature and the characteristics of web workload have been studied since the web's commencement. Calzarossa [18] discussed the characteristics of different facets of conventional web workload in detail. Techniques such as machine learning, fitting, statistics, clustering and time series analysis are used for the analysis. Most conventional web workload consists of hypertext transfer protocol (HTTP) requests issued by clients; in contrast, recent technologies allow users to share their own multimedia and textual content and to search for information. The aspects of conventional web workload discussed in the literature include web content [89], shopping services [122] and web robot traffic [66, 150], which have been analyzed from different perspectives. The characterization of microservices is also included in web workload [121].

Fig. 6
figure 6

Data analysis approaches used for different domain/applications workload

The online social network (OSN) workload refers to the load produced by users who exploit the sophisticated information services offered by these technologies. In the literature, the most relevant research questions concern user behavior, content propagation, and network structure and evolution [106]. Crawling is the method used by most researchers [70, 77, 124] to gather information through application programming interfaces (APIs). The video service type of workload refers to the load produced by users who access and share self-generated content or content supplied by media producers. Understanding this content helps in storage and content management, the design of content distribution systems, capacity provisioning and resource allocation [106]. In the literature, researchers [13, 134, 138] use techniques like statistics, fitting, graphs and clustering for video service workload characterization. Mobile devices like tablets and smartphones deploy many diverse mobile apps that allow users to share and access resources and content. In the literature, two perspectives on app usage patterns are considered: the smartphone and the app store [106]. Several papers [81, 116, 125, 147] characterize the mobile device workload according to these perspectives using machine learning, Markovian, statistical or clustering techniques. The large variety of services and applications deployed on cloud infrastructure produces the load referred to as cloud workload. Most studies [18, 109, 110, 157] focus on characterizing the workload processed by public or private cloud infrastructures, focusing on aspects like user behavior, application characterization and virtual machine behavior.

4.1 Workload classification: techniques to support self-inspection

Workload classification plays a key role in performance tuning and is important in performance engineering [18]. This section presents the techniques that support self-inspection in workload classification. Workload classification is performed on the basis of workflow properties such as resource demands and query cost. It involves two functional components: performance monitoring and characterization [110]. Monitoring the performance is a reactive approach that acts when performance degrades, whereas characterization is a proactive approach that keeps track of changes in the workload. Self-inspection is the AC characteristic that provides the capability of monitoring the workload. Many techniques, tools and algorithms have been developed for monitoring the workload [101] and are used for controlling it. The GATEKEEPER tool was developed for scheduling and admission control of e-commerce workload to enhance stability and response time [50]. For managing the workload efficiently through suspension and resumption, the PREDATOR tool was developed [23, 25]. An online tuning framework was proposed for examining changes in the physical design and continuously monitoring the workload [16]: the physical design is altered based on information obtained from the query execution plan (QEP), its cost is calculated, and the best QEP is selected. The Transaction Processing Performance Council (TPC) [145], a nonprofit organization, standardized workload characterization for measuring system performance by developing benchmarks for experimental purposes. In workload management, classification of the workload has become very important; through self-inspection, the workload can be managed better once its type is known [33, 113]. In the literature, many studies describe workload classification through adaptation [81, 83, 105, 141]. The study [2] presents workload classification and scheduling in DBMSs, and [63] discusses prediction based on time series analysis. Similarly, the work in [9] applied automated techniques to system logs to produce a customer behavior model graph (CBMG). The study [74] provides the workload characterization of data centers through fingerprints, and in [31], the TPC-H workload is characterized on Apache Spark for optimization.

In current large-scale data repositories, two classes of workload are mainly used: DSS and OLTP. DSS workloads consist of complex queries for business analysis and require large storage space, whereas OLTP workloads consist of daily transactional activities. A mix type of workflow, a mixture of DSS and OLTP, is also encountered in the system. In the literature, many techniques and approaches have been developed and implemented for workload classification that support self-inspection; the details are discussed as follows.

4.1.1 Statistics

For monitoring and characterizing the workload, many statistics are used in the literature, including correlation, mean (harmonic, geometric, arithmetic) or average, mode, variance, range (max or min) and standard deviation. Principal component analysis (PCA) was used in [114, 129] to find principal components by mapping correlated parameters to linearly uncorrelated parameters through an orthogonal transformation, for dimensionality reduction and simplification. Visualization methods such as box plots, histograms and scatter plots are used for inspecting the workload parameters; single-parameter and multi-parameter histograms are used for presenting the correlation among workload parameters. The study [67] characterized the workload through parameters such as DB transactions and SQL statements using the standard TPC benchmarks. A number of experiments were performed by applying different methods, such as a nonhomogeneous Poisson process, histograms and summaries (averages, distributions), for concurrency control, buffer management and memory management.
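As a small illustration of the PCA step described above, the following sketch reduces correlated workload parameters to uncorrelated components. The feature semantics and the synthetic data are assumptions for this example, not from the cited studies.

```python
# Illustrative PCA-based dimensionality reduction of workload parameters,
# in the spirit of [114, 129]; feature meanings and data are invented.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic workload snapshots: 4 correlated parameters
# (e.g., CPU time, logical reads, physical reads, lock waits).
base = rng.normal(size=(200, 1))
X = np.hstack([base + rng.normal(scale=0.1, size=(200, 1)) for _ in range(4)])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)   # orthogonal transform to uncorrelated components
print(pca.explained_variance_ratio_)  # most variance lies on the first component
```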

4.1.2 Numerical fitting

Numerical fitting techniques have been used for studying the patterns and behaviors of the workload. Methods such as least squares and maximum likelihood estimation are used to estimate the parameters of a function that best fits the experimental data. The goodness of fit is evaluated through statistics such as the Anderson–Darling, R-squared, Kolmogorov–Smirnov and F tests. For instance, in the study [140], the prediction accuracy is evaluated through the R-squared method, and the shape of the experimental data is determined through probability distributions such as the exponential, lognormal, binomial, Weibull and gamma distributions. The work [84] shows that the power law distribution is used for presenting the behavior of widely distributed values of workload properties. The concept of chains of sequential requests is used for presenting the spatial locality of the workload in characterizing the Netflix workload [138]. The DSS and OLTP workloads have different characteristics and have been investigated in [67, 72]: the arrival patterns of the transactions are modeled using a combination of statistical techniques, a numerical fitting technique such as Levenberg–Marquardt is used for nonlinear regression, and DB transaction arrival time is used for the interactive type of workload.
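A minimal sketch of the fitting workflow just described: Levenberg–Marquardt parameter estimation with an R-squared goodness-of-fit check. The arrival-rate model and the data are synthetic assumptions for illustration only.

```python
# Hedged sketch: Levenberg-Marquardt fitting plus R-squared evaluation.
# The exponential arrival-rate model and the noisy data are invented.
import numpy as np
from scipy.optimize import curve_fit

def arrival_rate(t, a, b):
    """Hypothetical model of transaction arrival intensity over time."""
    return a * np.exp(b * t)

t = np.linspace(0, 1, 50)
observed = arrival_rate(t, 2.0, 1.5) \
    + np.random.default_rng(1).normal(scale=0.1, size=t.size)

params, _ = curve_fit(arrival_rate, t, observed, p0=(1.0, 1.0), method="lm")
fitted = arrival_rate(t, *params)
ss_res = np.sum((observed - fitted) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
print("params:", params, "R-squared:", 1 - ss_res / ss_tot)
```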

4.1.3 Prediction model

Prediction models have been used for the classification of workload. Predictions are performed to enhance the buffer hit ratio in the database cache, to model user access behavior and to characterize the access pattern for caching in multi-dimensional OLAP systems [128]. Time and sequence attributes are used by applying a Markov chain model, and combinations with other statistical techniques are also used for the prediction of OLAP queries.
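To make the Markov chain idea concrete, the sketch below estimates first-order transition counts over a sequence of query types and predicts the most likely next query. The query labels and trace are invented for illustration, not taken from [128].

```python
# Minimal first-order Markov chain over query types, in the spirit of
# OLAP access-pattern prediction [128]; labels and trace are invented.
from collections import defaultdict

history = ["rollup", "drilldown", "drilldown", "slice", "rollup", "drilldown"]

# Estimate transition counts from the observed query sequence.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(history, history[1:]):
    counts[prev][nxt] += 1

def predict_next(current):
    """Return the most likely next query type given the current one."""
    nexts = counts[current]
    return max(nexts, key=nexts.get) if nexts else None

print(predict_next("rollup"))  # 'drilldown' in this toy trace
```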

4.1.4 Clustering techniques

Clustering techniques are widely used for the classification of workload [51, 71, 132, 157] and can be categorized into two types: hierarchical and nonhierarchical clustering. On the one hand, hierarchical clustering techniques group similar classes of the same nature; clusters are stored in a hierarchy from less similar to most similar parameters through techniques such as hierarchical agglomerative clustering (HAC) and divisive (monothetic or polythetic) clustering. On the other hand, nonhierarchical clustering techniques create nonoverlapping clusters of data; the number of clusters is known in advance, so independent clusters are formed with no hierarchical relationship among them. The approaches of nonhierarchical clustering are nearest neighbor (k-means) relocation and single pass. The measures used for assessing the quality of a clustering include purity, the confusion matrix, the F-measure, the Fowlkes–Mallows index, the Jaccard index and the Rand measure.
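The following is a small sketch of nonhierarchical (k-means) clustering of workload snapshots into a fixed number of groups. The two feature axes and the two synthetic populations are assumptions made only to illustrate the technique.

```python
# Illustrative k-means clustering of workload snapshots; features
# (cpu seconds, rows/ms) and the two populations are invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical populations: short OLTP-like and long DSS-like queries.
oltp = rng.normal(loc=[0.1, 5.0], scale=0.05, size=(100, 2))
dss = rng.normal(loc=[4.0, 0.2], scale=0.50, size=(100, 2))
X = np.vstack([oltp, dss])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centers land near the two populations
```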

A relational database workload analyzer (REDWAR) model was developed for characterizing the workload; it analyzes SQL statements with respect to WHERE or GROUP BY clauses. For clustering, two algorithms were developed, i.e., CLUE and HALC: CLUE characterizes OLTP workload, and HALC, a heuristic clustering algorithm, is used to manage a large amount of data. The study [157] presented classification for two types of workload (DSS and OLTP) in the MySQL database using data mining and statistical techniques such as classification and regression trees (CRT) and hierarchical clustering.

4.1.5 Decision tree induction

The literature reveals that many studies have classified the workload of a relational database in IBM's DB2 environment [47, 48] using the decision tree technique. The workload management studies [44, 47] presented a classifier that classifies the workload into DSS and OLTP types using tree induction. A classification model [46] was developed to classify the workload based on its characteristics and to detect changes in the type of workload. The Psychic Skeptic Prediction (PSP) framework was developed to predict shifts of the workload from DSS to OLTP and from OLTP to DSS [46].

4.1.6 Classification and regression tree (CRT)

Workload classification for the two types of workload (DSS and OLTP) can be performed using classification and regression trees [157]. The study developed a classification model and identified a few features that support classification. The features found to be effective include the number of log writes (Innodb_log_writes), the ratio of Select queries to Update/Insert/Delete queries (Com_ratio), the number of statements executed (Questions) and the number of query cache hits (Qcache_hits). Table 1 presents a summary of workload classification: the techniques used, the attributes selected as input and the type of workload predicted.
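A hedged sketch of this CRT-style classification follows, using the four features named in [157]. The training data here are entirely synthetic (the invented profiles assume OLTP issues many log writes and few cache hits, and DSS the reverse), so the example shows only the shape of the approach, not the results of the cited study.

```python
# Hedged CRT-style OLTP/DSS classifier on the features from [157];
# the value ranges and labels below are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
features = ["Innodb_log_writes", "Com_ratio", "Questions", "Qcache_hits"]

# Invented profiles: OLTP = many log writes, low select ratio; DSS = reverse.
oltp = np.column_stack([rng.poisson(500, 100), rng.uniform(0.2, 0.5, 100),
                        rng.poisson(2000, 100), rng.poisson(50, 100)])
dss = np.column_stack([rng.poisson(20, 100), rng.uniform(5.0, 20.0, 100),
                       rng.poisson(200, 100), rng.poisson(400, 100)])
X = np.vstack([oltp, dss])
y = ["OLTP"] * 100 + ["DSS"] * 100

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[450, 0.3, 1800, 60]]))  # -> ['OLTP'] on this toy model
```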

Table 1 Classification techniques for workload management

4.2 Workload performance prediction: techniques to support self-prediction

In workload management, performance prediction plays an important role in performance tuning. This section presents workload performance prediction techniques that support self-prediction. Self-prediction is the AC characteristic that forecasts future trends based on historical data. Many studies [54, 59] have been carried out on workload performance prediction in DBMSs. The study [98] predicts performance and resources for cloud databases through the DBSeer framework. A combination of machine learning and analytical modeling is used for performance enhancement in [38]. Distribution-based predictions are preferred over single-point predictions [76]: distribution-based prediction models the workload over a real-valued distribution of features using a probability distribution function, observing the actual values of the training data and predicting the distribution parameters, such as the mean and variance, over the target values, whereas single-point prediction aims at an exact performance value, whose accuracy is challenging to achieve and often inadequate. In [5], learning-based models are applied for the prediction of different workloads. In the literature, prediction is performed for time ranges [59], throughput and response time [4], multiple performance metrics [54] and other attributes in large-scale data repositories. The details of the techniques and approaches of workload management that support self-prediction are discussed below.
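The contrast between distribution-based and single-point prediction can be illustrated with a minimal sketch: instead of returning one runtime value, we estimate distribution parameters and report an interval. The lognormal runtime assumption and the synthetic data are choices made for this example only.

```python
# Minimal sketch of distribution-based prediction: estimate the parameters
# of a runtime distribution rather than a single point. Data are synthetic,
# and the lognormal model is an assumption for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
observed_runtimes = rng.lognormal(mean=1.0, sigma=0.3, size=500)  # seconds

mu, sigma = stats.norm.fit(np.log(observed_runtimes))  # fit in log-space
dist = stats.lognorm(s=sigma, scale=np.exp(mu))
low, high = dist.ppf(0.05), dist.ppf(0.95)
print(f"90% of runtimes expected in [{low:.2f}s, {high:.2f}s]")
```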

4.2.1 Binary tree

The study [46] proposed a framework called PSP to predict workload shifts from the DSS to the OLTP type. PSP has two modules: Psychic is offline, and Skeptic is online. Normally, PSP works in offline fashion; however, when a shift is due, it works online. The offline prediction model is built using a polynomial regression technique, and self-healing and self-optimizing characteristics make the framework autonomic. PSP cannot manage drastic changes in the workload and is used for scheduled tasks. The work [59] provides a prediction model for query execution time in a warehouse environment. It uses a binary tree algorithm to build a Predicting Query Runtime (PQR) tree that predicts the query execution time using the execution plan and the system load. PQR works in two steps: the first is to obtain a PQR tree, and the second is to predict the time range for a new query; the tree is updated periodically.

4.2.2 What-if model

The studies presented in [142, 143] developed a testbed named "Ursa Minor" for workload prediction. It is a cluster-based storage system with two components, Observer and Stardust, which are developed using the what-if model. The system is designed so that it can be used for existing systems as well as new ones through a modeling infrastructure. The Observer is an expectation-based model for performing predictions, and Stardust is developed for shared or distributed systems. Ursa Minor is scalable due to its cluster-based technique and provides adaptive behavior through online choice. The system performance is much higher; however, the re-encoding process takes some extra time. To predict the response time and throughput dynamically, a DB Resource Advisor [4, 101] has been developed in SQL Server. It predicts the status of resources using a what-if model. The Resource Advisor was evaluated on OLTP workload, and changes in the workload are predicted accurately. Resource allocation by the advisor is done through continuous monitoring. Its limitation is that when the size of the buffer pool is low, the advisor incurs more overhead per transaction.

4.2.3 Exploratory model and confirmatory model

The study [88] presents the workload models required for ADBMSs, which include exploratory and confirmatory models. Workload monitoring is done in the exploratory model, and the confirmatory model is used for workload analysis. Many data mining and machine learning techniques are applied within this exploratory and confirmatory modeling framework for the ADBMS. Some open issues of the proposed work are presented, such as efficient monitoring, storage of the workload model and maintenance of the models. The proposed study provides no generic solution and was performed in a test scenario.

4.2.4 Machine learning

A number of machine learning techniques have been applied to perform prediction. Principal component analysis (PCA) [54] is applied for prediction; it is based on the singular value decomposition or eigendecomposition of a data matrix and captures the variance of the data in the best way. PCA performs well when it is applied independently to two datasets, but it has the limitation that it cannot identify the relationships between them. Another approach, canonical correlation analysis (CCA), can find the correlation between the sets using Euclidean vector spaces: similar queries are found based on a high Euclidean dot product. Its limitation is that queries may be textually the same but have different performance. The study [39] applied fuzzy logic to develop fuzzy rules by assuming the parameter values on the TPC-C dataset, and the study [10] applied kernel canonical correlation analysis (KCCA), which is based on CCA.
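The following sketch illustrates plain (linear) CCA, the basis of the KCCA approach in [10]: query-plan features and performance metrics are projected into a shared space where correlated queries lie close together. Both feature sets here are synthetic and generated from a common latent factor purely for demonstration.

```python
# Illustrative CCA linking plan features and performance metrics, in the
# spirit of the KCCA work [10]; all data are synthetic.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 2))
plan_features = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))
perf_metrics = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(300, 4))

cca = CCA(n_components=2).fit(plan_features, perf_metrics)
X_c, Y_c = cca.transform(plan_features, perf_metrics)
# Correlated projections: nearby points in X_c correspond to queries
# with similar predicted performance in Y_c.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])  # close to 1
```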

4.2.5 Statistical and cost-based models

The study [136] used a modular approach for estimating the execution time of a SQL query, mapping its execution plan onto elementary steps. The execution time of DB accesses and IO accesses is evaluated through this approach, and real applications are evaluated through the TPC-H benchmark for transactional and analytical workloads. Cost-based analyzer elements [154] are used to predict the execution time of a query in Postgres through the query execution plan, and associated queries are used to estimate the optimizer's cost from the query parameters. The study [153] predicts the number of CPU-disk visits and CPU-benefit time for large data sizes, and the study [43] measures the idle time of concurrent queries using the I/O access time spent. The query execution time was predicted for simultaneous workloads using query fingerprints in [42].

4.2.6 Performance model

The study [30] presents the performance modeling of complex queries of various sizes using three benchmarks: TPC-W, RUBBoS and RUBiS. The work shows that it is not important to predict the execution time of all queries: only about 20% of the queries are performance sensitive. The study [78] describes performance modeling for nine different applications by developing COMPASS, an automatic framework for CPU performance prediction. Analytical performance modeling was performed by PEMOGEN in [139] to relieve the burden of performance modeling during program execution. For predicting query response time, a neural network-based tool was developed; query predictions are done on the TPC-DS benchmark dataset to evaluate long- and short-running queries [156]. In OLTP systems, the association between system throughput and query response time is analyzed by a performance prediction approach for concurrent database workloads that uses machine learning for query scheduling, resource allocation and QoS management [95].

The study [69] presented an approach using a feature space with kernel properties to predict response time using machine learning in a grid environment. The work [131] described a taxonomy of performance prediction systems and provided many configuration parameters that affect and control the performance of DBMSs and DWs. A layered AC framework was provided for performance prediction [75]. The resource allocation issue is handled by a framework that performs performance analysis and prediction of OLTP workloads using parameters like resource consumption, resource bottlenecks and throughput [97]. Reputation-based scheduling was provided for distributed systems to predict future behavior [35]. Performance prediction of SPARQL queries over decentralized linked data was presented in [61], where the execution time is predicted through machine learning with 84% accuracy. Similarly, run-time behavior [141] was predicted through regression modeling, which removes the manual observation of runtime information and helps in developing complex structures. The study [83] reduced the query response time by scheduling the queries through an efficient scheduler technique.

Many studies are available in the literature regarding performance modeling in data warehouses. Performance modeling is done in various ways, which include business intelligence [41], map-reduce [144], materialized views [8], indexing [92], ontology-based approaches [104], autonomic systems [110], cache managers [128] and hybrid approaches [73]. Figure 7 shows the variety of approaches that have been used for modeling performance in DWs. Two types of indexing are performed on DWs: main memory indexes such as B+-Trees and T-Trees, and cache-conscious indexes such as CSS-Trees and CSB+-Trees [14, 92]. The literature reveals that cache-conscious indexes perform better than main memory indexes. DW performance was improved by developing a cache manager called WATCHMAN [57, 100] using well-known cache replacement techniques, including least recently used (LRU), most recently used (MRU), Q-Pipe and CLOCK (a minimal LRU sketch is given after Fig. 7).

Fig. 7
figure 7

Approaches to performance modeling in data warehouses
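As referenced above, the sketch below shows the LRU replacement policy used by DW cache managers such as WATCHMAN. It is a generic illustration of the policy, not the tool's actual code; the query-result caching scenario is an assumption.

```python
# Minimal LRU replacement policy for a query-result cache; generic
# illustration of the technique named above, not WATCHMAN's code.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # insertion order tracks recency

    def get(self, query_key):
        if query_key not in self.store:
            return None
        self.store.move_to_end(query_key)   # mark as most recently used
        return self.store[query_key]

    def put(self, query_key, result):
        if query_key in self.store:
            self.store.move_to_end(query_key)
        self.store[query_key] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("q1", "result1"); cache.put("q2", "result2")
cache.get("q1")                 # q1 becomes most recent
cache.put("q3", "result3")      # evicts q2, the LRU entry
print(list(cache.store))        # ['q1', 'q3']
```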

In a DW, the selection of materialized views is important, and a number of studies examined evolutionary algorithms for materialized view selection, including data cubes, simulated annealing and DynaMat [8, 36]. Semantic web and AC technologies are used for DW cache allocation via a decision support system [104], and the use of ontologies in designing DWs is discussed in [24, 103]. Business intelligence (BI) is incorporated in data warehousing [41] and is also used for visualization, data exploration [27] and the legal context [56]. Many systems have incorporated AC characteristics to improve performance, including self-configuration [151], self-optimization [124], self-inspection [16, 23, 25, 50, 101], self-prediction [59] and self-adaptation [32]. For handling OLTP and OLAP, a number of hybrid approaches have been proposed, including Hekaton, HYRISE and HyPer [58, 73], for concurrent processing in the DW environment. Map-reduce [144] is a leading approach for data processing and analysis in large-scale data repositories and has been implemented in Hive, HadoopDB and Pig [3, 55]. The study [86] provided a customer behavior model for e-commerce that uses a state transition graph to manage customer activities; a hill climbing algorithm is used to achieve a suboptimal solution at peak time. The study [76] predicts the runtime of Hive query workloads through a mixture density networks model.

Table 2 shows the techniques of performance prediction and the predicted attributes for managing the workload.

Table 2 Performance prediction techniques and the predicted attributes in workload management

4.3 Workload adaptation: techniques to support self-adaptation

This section presents the techniques that support self-adaptation in workload management. Changes in the workload are handled dynamically through the self-adaptation ability of the system. In a real environment, the workload behavior can evolve at any time due to workload versatility in large-scale data repositories. The behavior of the workflow entering the system is nondeterministic, and its trend may change with time. The workflow of data repositories can be OLTP or DSS, and systems also experience a mix of OLTP and DSS. Many studies have been done to create adaptive systems, and many areas of computer science have explored applications of workload adaptation, such as web services [93, 94] and DBMSs [66, 118, 119, 130, 153]. Legacy systems can be enhanced for improved efficiency using self-adaptation, which has the ability to retain modifications in the system according to changes in the environment. Many adaptation approaches have been proposed in workload management, such as the threshold approach, the heuristic approach and the performance model approach; the performance modeling approach has proved to be the best adaptation approach. The details of these approaches are discussed below.

4.3.1 Heuristic approach

Stochastic search techniques incorporating the concept of randomness were applied in earlier research [32]. The limitations of the heuristic approach are reduced through the randomness of stochastic search, which is used for self-adaptive systems. A prototype was developed for stochastic search, and search-based software engineering was performed to verify the claim.

4.3.2 Threshold control approach

DB2 Query Patroller (QP) [34] is used to control requests dynamically according to the available resources. Small and high-priority queries are executed with no delay, and QP reports on query completion and trends. Different system- and user-level privileges are assigned using QP by the DBA. QP acts as a gatekeeper and monitors the workload for optimal resource utilization. Based on control theory principles, query prioritization is performed in [102]. The study [11] presented Teradata's Active System Management (ASM), which manages the workload by allocating resources to different groups using a preventive approach.
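A hedged sketch of the gatekeeper idea follows: queries above a cost threshold, or arriving when the concurrency limit is reached, are held rather than run immediately. The threshold, concurrency limit and queue policy are invented for illustration and do not reflect DB2 QP's actual internals.

```python
# Hedged sketch of threshold-style admission control in the spirit of
# DB2 Query Patroller [34]; thresholds and policy are assumptions.
import heapq

class QueryGatekeeper:
    def __init__(self, cost_threshold, max_concurrent):
        self.cost_threshold = cost_threshold
        self.max_concurrent = max_concurrent
        self.running = 0
        self.held = []                  # costly queries held for later

    def submit(self, query_id, estimated_cost):
        if estimated_cost <= self.cost_threshold and self.running < self.max_concurrent:
            self.running += 1
            return "run"                # cheap queries run with no delay
        heapq.heappush(self.held, (estimated_cost, query_id))
        return "queued"

gk = QueryGatekeeper(cost_threshold=100, max_concurrent=2)
print(gk.submit("q1", 40), gk.submit("q2", 500), gk.submit("q3", 60))
# -> run queued run
```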

4.3.3 Performance model approach

A QoS controller was developed for managing the workload of e-commerce applications [93, 94]. Performance goals such as the probability of rejection, average throughput and average response time are achieved by the QoS controller through parameter adjustments: it predicts performance using a queuing model, and adaptation is performed through performance modeling. The study [112] presents an architecture of cluster-based web services in which the workload is divided into service classes for each gateway. Services acquire resources according to their multi-programming levels (MPLs), and the service classes are maintained in the performance model through a feedback control component. The study [130] proposed a framework for scheduling concurrent requests to achieve QoS with response time as the service level agreement (SLA). For concurrent requests, an external queue management system (EQMS) is used to limit the MPL, providing an adaptive response to the MPL advisor and scheduler for dynamically managing the workload; it can handle all types of workload by imposing a limit on the MPL and thereby reducing contention. The study [64] presented a self-tuning system called Starfish that is used for big data analytics. Another self-tuning system, with auto-tuning and self-assessment capabilities, is proposed in [26]. The study [63] discussed the characteristics of workload intensity behavior (WIB) by presenting a self-adaptive approach comprising autonomic classification and prediction methods; workload classification and forecasting (WCF) was developed to connect the WIB groups with existing prediction methods. Similarly, the studies [62, 96, 117] provide runtime models that support adaptation in a self-managed way. A self-adaptive approach has also been developed to maintain performance requirements and resource efficiency automatically by using the Descartes Modeling Language (DML) to model runtime adaptation. The study [53] provides self-adaptation approaches for distributed systems, and [107] provides a solution for performance tuning through self-adaptation using a context-aware model.
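To illustrate the queuing-model step of such controllers, the following sketch uses the classic M/M/1 mean response time formula R = 1/(μ − λ) to decide whether an SLA would be met before admitting load. The service rate and SLA value are assumptions for this example, not values from the cited studies.

```python
# Illustrative M/M/1 queueing estimate of the kind a QoS controller
# [93, 94] might use before adjusting parameters; numbers are invented.
def predicted_response_time(arrival_rate, service_rate):
    """M/M/1 mean response time; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        return float("inf")            # saturated: queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

sla_seconds = 0.5
service_rate = 20.0                    # requests/second one worker can serve
for arrival_rate in (10, 15, 19):
    r = predicted_response_time(arrival_rate, service_rate)
    action = "admit" if r <= sla_seconds else "shed load"
    print(f"lambda={arrival_rate}: R={r:.2f}s -> {action}")
```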

4.3.4 Machine learning approach

Machine learning approaches are widely used for performance tuning. The study [66] compares artificial neural networks (ANNs) with natural neural networks (NNNs); the ANNs perform better than the NNNs. The adaptation process has two steps: the first is the analysis of training data, and the second is the development of the network topology and the computation of weights. The study [82] presents AdaptDB, which adaptively partitions distributed joins and provides good performance; a hyper-join algorithm is developed that avoids data shuffling on the join attribute. The adaptation performs well in experiments on the TPC-H and CMT datasets. Similarly, the study [60] presented AdPart, which is used for data partitioning through hashing. The study [106] presents a context-aware system with a self-adaptation ability that performs well for unknown contexts. KNN is used in [7] to develop AQWA, an adaptive workload-aware approach for partitioning huge data; the partitions are updated in AQWA as workloads execute, and no prior knowledge about the workload is needed. A fuzzy inference system (FIS) is also used for self-tuning in database management systems [126]; the performance parameters include database size, the number of users and the buffer hit ratio, and fuzzy rules are developed and evaluated on TPC-E and TPC-C benchmark data. The study [133] developed a technique for increasing the robustness and accuracy of a prediction algorithm. The study [81] presented an adaptive approach for workload prediction that categorizes the workload into classes based on workload features; evaluations are performed on Google clusters and compared with machine learning techniques such as support vector machines, ARIMA and linear regression.

As the workload varies dynamically with time, an improperly managed or uncontrolled change causes overload. The study [111] provides a mechanism for controlling the admission of database workload by proposing a predictive and reactive model. The study [127] provides a methodology based on data partitions for improving the performance of large data, with automation of the partitioning process [146]. The study [85] provides the WiSeDB framework for managing the workload: a decision tree is used for scheduling, resource provisioning and query placement decisions. This approach performs well with little re-training and achieves performance goals by adapting offline models, which provides better cost trade-offs.

Recently, deep learning has been used in different areas of research and produces better results compared to shallow learning. A few studies [1, 106, 115, 135, 149] found in the literature have investigated and applied deep learning in large-scale data repositories. Peloton [115] is the first self-driving DBMS; it operates autonomously and uses a deep learning framework for workload forecasting. Due to advances in deep learning, adaptiveness can also be achieved. The paper [149] describes how database applications can take advantage of deep learning to improve system performance, discussing possible improvements from jointly applying database techniques and deep learning techniques. Table 3 presents the different workload adaptation techniques used in DBMS workload management.
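As a minimal sketch of deep-learning-based workload forecasting of the kind Peloton uses [115], the example below trains a small LSTM to predict the next query-arrival count from a sliding window. This is not Peloton's code: the architecture, window size and synthetic arrival series are all assumptions made for illustration.

```python
# Hedged sketch of LSTM workload forecasting; architecture, window size
# and the synthetic arrival series are assumptions, not from [115].
import torch
import torch.nn as nn

class WorkloadForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # predict the next arrival count

# Synthetic arrival counts with a cyclic pattern.
series = torch.sin(torch.linspace(0, 20, 500)).unsqueeze(-1) + 1.0
window = 24
X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = WorkloadForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```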

Table 3 Workload adaptation techniques used in workload management

4.4 Comparative analysis

Workload management in large-scale repositories concerns effective monitoring of the workflow and efficient control of queries to achieve the desired performance. The literature reveals that a number of researchers in industry and academia have designed and developed advanced technologies for workload management, and these developments have also been incorporated into commercial products for large-scale data repositories. As autonomic workload characterization is a requirement for automatic configuration and performance tuning, several studies have been conducted to autonomically characterize the DBMS workload [2, 44, 46, 47, 48, 157]. These studies have characterized workloads and queries autonomically for the OLTP and DSS types. However, DBAs also need to manage the mix type of workload (a mix of OLTP and DSS) experienced by DBMSs [2, 35, 45, 147], so the autonomic detection and classification of the mix type of workload are significant tasks. As the behavior of the workload is nondeterministic [2], the same configuration setting cannot be used while tuning the DBMS for a mix type of workload to achieve optimal performance [144].

Adaptation is also an important aspect of workload management that has been investigated by many researchers, who have developed adaptive frameworks and models using queuing theory and machine learning approaches. The adaptive frameworks and models are good; however, they could be further investigated to improve their accuracy, reliability and robustness. A variety of techniques are applied to achieve AC capabilities for workload management, and most workload management problems are solved using the techniques discussed in Tables 1, 2 and 3. Based on the literature, little work was found on deep learning in large-scale data repositories. Deep learning can enhance the learning abilities of data repositories, making them more intelligent, and performance tuning can thereby be achieved. Combining deep learning techniques with large-scale data repository techniques can result in applications that are more autonomous, with predictive and adaptive capabilities.

The guidelines for future research on workload classification, prediction and adaptation in large-scale data repositories are as follows. There is a need to develop frameworks, models and architectures that have the ability to solve all of these workload management problems. The existing solutions can handle the changing behavior of the workload to some extent; however, future solutions should cater for the sudden changes in workload experienced in real applications. Adaptation to changes in small-scale repositories is manageable; however, large-scale repositories are difficult to manage, so solutions are required for large-scale repositories whose data size and space complexity grow beyond a certain limit. The calibration of the learnt models can play an important role in improving the proficiency of the system; it could be performed by applying machine learning techniques that learn from existing solutions and provide more realistic solutions to the workload management problems. Due to the large volume of data repositories, deep learning can be applied to analyze the data in more depth. Similarly, existing machine learning techniques need to be further investigated for better workload management and performance tuning.

This study has focused on three autonomic characteristics of workload management, i.e., self-inspection, self-prediction and self-adaptation. Further studies could be performed on other autonomic characteristics, such as self-protection, self-configuration, self-healing and self-optimization, to find their corresponding solutions. Moreover, studies can be conducted to gain insight into other autonomic characteristics that could be incorporated to enhance workload management. Accuracy and efficiency are important aspects of workload management, and algorithms need to be designed that improve data storage and retrieval from the data repositories. More solutions are required in workload characterization for OLTP, DSS and mix-type workloads, through data mining techniques, for efficient retrieval. Predictive and adaptive solutions will be helpful for resource utilization in large-scale data repositories, and scheduling and resource allocation and de-allocation algorithms could use such predictions to improve the performance of DBMSs and DWs. Workload management solutions are also required for the Internet of things (IoT), by providing adaptive frameworks for its dynamic and heterogeneous environment.

A number of DBMSs, such as Oracle, SQL Server, MySQL and DB2, and DWs need to be examined to gain insight into the overall and component-wise functionality that supports workload management; for example, storage optimization could be done for efficient retrieval. The vendors of large-scale data repositories could incorporate autonomic workload management to improve the performance of their products. NoSQL databases are gaining attention and provide solutions for structured, unstructured and semistructured data; further work could investigate the workload management perspective in complex and unstructured databases. In a distributed database environment, workload classification, prediction and adaptation could be of further interest, considering factors like complex workloads, the number of users and network congestion. Many of the proposed workload management studies were evaluated through testing and simulations; these need to be tested in large-scale real environments. In summary, for workload classification, the techniques that support self-inspection include statistics, numerical fitting, prediction models, clustering, decision tree induction and CRT. For workload performance prediction, the techniques that support self-prediction are the binary tree, the what-if model, the exploratory and confirmatory models, machine learning, statistical and cost-based models, and the performance model. Similarly, for workload adaptation, the techniques that support self-adaptation are the heuristic, threshold control, performance model and machine learning approaches.

Workload management problems also exist in other emerging domains, such as web browsers, mobile devices, online social networks, microservices, online transaction processing systems, video services and the cloud environment. The nature of the workload differs across domains; therefore, its management should also be performed accordingly. These domains need to be explored by investigating their workload management issues and challenges, and solutions are needed for autonomic workload management with respect to classification, prediction, adaptation and other AC characteristics in the corresponding domains. Other research areas, such as distributed databases, cloud computing, IoT and big data analytics, could be explored with regard to AC characteristics. Moreover, applications of machine learning, including deep learning techniques, need to be examined for workload management problems in large-scale data repositories.

5 Research issues in large-scale data repositories

The literature reveals that various models, architectures and frameworks have been developed in large-scale data repositories for the classification of workload, its performance prediction and its adaptation. AC technology supports making large-scale data repositories autonomic and has the potential to be incorporated into them. The literature shows that systems are made partially autonomic by incorporating one or more self-* properties (self-inspection, self-configuration, self-healing, self-prediction, self-adaptation, etc.); incorporating all self-* properties to make a system fully autonomic remains challenging. In this section, the issues and limitations of the existing techniques are discussed. Workload classification in the literature characterizes the workload with limited accuracy because important attributes are missed, which ultimately leads to less effective and inaccurate classification. While managing the workload, it is important to know what type of workload is entering the system; the type of workload can be predicted on the basis of workload characteristics, so existing classification approaches can be further improved.

A number of questions about workload types arise in workload management. For instance, which workload types need characterization? How should the workload be classified based on its type? What are the potential techniques for improving workload classification, and how can the results be optimized? Since the mix-type workload is essential to handle, what are the proportions of the different workload types in a mix-type workload? If this is known in advance, the distribution of execution can be handled efficiently. The performance of queries depends on a number of metrics, some of which are revealed in the literature (see Table 2). In large-scale data repositories, performance metrics such as memory, communication cost and locks need to be predicted for performance optimization, scheduling, resource allocation and adaptation. Before workload execution, many questions arise for managing the workload in large-scale data repositories: When should a query be executed? How long can query execution be delayed? In the case of a problematic query, should the workload be killed? How will the query perform?

For performance tuning, workload prediction and adaptation are required for capacity planning, system sizing and resource allocation. For the system sizing problem under a time constraint, questions arise such as: How much memory, network bandwidth and processing power are required to execute the workload? For the capacity planning problem, questions arise about when to upgrade or downgrade the system. A number of questions also arise regarding resource allocation. For instance, what is the workload execution time for resource contention queries? How large is the total workload, and how many concurrent requests exist? What is the number of bytes sent and received for workload execution? What is the number of disk input/output requests, and what is the cost of communication? How much buffer size is required, and how many disk writes are performed for workload execution? Since the behavior of the workload is nondeterministic, workload adaptation can enhance system performance; however, the effectiveness and accuracy of workload adaptation have also become challenges in large-scale data repositories. As the workload is nondeterministic and its behavior may change with time, the following questions arise with respect to workload adaptation: How can the system adapt to the changing behavior of the workload? How and when does a change in the workload behavior occur? What techniques achieve optimal results in workload adaptation?

6 Conclusion and future work

This study has provided a state-of-the-art literature survey on autonomic workload management in large-scale data repositories. The survey organizes the literature in a way that provides guidelines for researchers working on workload management in large-scale data repositories with respect to classification, performance prediction and adaptation. Limitations found in the literature are highlighted across the different domains of workload management. Early studies applied traditional techniques for the classification, performance prediction and adaptation of workload in large-scale data repositories; current studies focus on an autonomic approach to manage the workload intelligently without human intervention. However, the literature reveals that only limited solutions exist. The survey also highlights the need to develop and build new autonomic models, frameworks and architectures; machine learning techniques can be used for developing them. It was also found that deep learning, an emerging learning technology, has been applied in workload management.

Our future work includes developing a framework that characterizes the workload, predicts its performance and adapts the workload autonomically. We will explore the machine learning techniques that can best solve the workload management problems. Autonomic characteristics will be investigated to make the system more autonomic and alleviate the burden on the DBA.