1 Introduction

The rapid growth of social media over the last decade has changed electronic data storage. Most of this data is stored on disk drives, and disk drive technology has evolved rapidly to meet the needs of Big Data, data retrieval, data storage and data mining. The protection of the data depends essentially on the reliability of the disk drive. Because they combine adequate speed and performance with minimum cost, disk drives still play a vital role in the data storage industry compared with faster storage devices such as NVRAM and SSD. A disk drive performance model is therefore critical for sizing an application so that its performance meets the business need.

This paper applies and compares different mathematical models for predicting disk drive performance from real-time disk drive performance data. It compares the measured performance of the disk drive under different kinds of workload and workload attributes, and predicts the performance for a user-required workload using the proposed model. The goal is to size the application according to the disk drive performance so that the application meets the performance the business requires. The model can be used by hard disk pre-sales or marketing teams to predict the IO performance of storage systems running different applications.

The proposed disk drive performance model predicts how well an application will perform on the selected disk drive in terms of performance indices such as response time, MBPS and IOPS when the disk performs the intended workload. The experimental results are used to validate the accuracy of the proposed models, with an error bound of 5 %, against performance data collected in real time. The paper compares the predictions of two different models and suggests that the linear polynomial method is the better one, as it shows the least deviation from the actual performance data.

2 Background and Related Work

The data storage industry uses different storage technologies such as DAS (Direct Attached Storage), NAS (Network Attached Storage) and SAN (Storage Area Network). These technologies are used in modern datacenters and essentially rely on disk drives. The performance of the storage system depends on the performance of the hard disk, which therefore plays a major role in meeting users' needs across their usage types and applications. Different kinds of storage systems exist to cater to applications that need high performance, based on technologies such as NVRAM, SSD, FC, SCSI and SATA [13].

Many attempts have been made to compute or analyze the performance of disk drives. When setting up a storage system, performance prediction for different kinds of workload plays a major role, and several attempts have been made to predict the performance of different kinds of hard disk drives [14]. Much research has been done on hard disk performance models, both analytical and simulation-based. However, deploying such models is a big challenge in terms of the time, expertise, complexity and resources that the predictive model requires. The authors in Ref. [2] proposed approaches to hard disk performance prediction using machine learning tools such as the CART model and an artificial neural network model. Similarly, the authors in [5] proposed an approach based on Adaptive Neuro Fuzzy Inference Systems.

In this situation, it is highly desirable to have a black box model for disk drive performance prediction with a simple and accurate algorithm. Although research has been done on different black box models for the disk drive [68], an efficient, simple and improved model is still highly desirable. The goal is to devise a method that finds the performance model without any prior detail of the complex design and algorithms of the disk drive. Obtaining the performance model requires a working storage setup with access to the disk drive. The model is then trained, using an efficient mathematical equation, on the different kinds of workload inputs that potentially affect disk drive performance, so that it learns the behavior of the disk drive across different scenarios.

2.1 Polynomial Model for Prediction

The polynomial model is considered the simplest one that analyzes the data effectively. Polynomials can be of different orders, but the order should be kept as low as possible; high-order polynomials should be avoided unless they can be justified for reasons outside the data. Therefore, to predict the performance of the hard disk, we consider a linear polynomial model that analyzes the data and predicts the performance of the disk drive. A linear polynomial is any polynomial defined by an equation of the form

$$ p\left( x \right) = ax + b $$
(1)

where a and b are real numbers and \( a \ne 0 \).
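As a minimal sketch of how the coefficients of Eq. (1) could be obtained from sample points, the snippet below fits \( p(x) = ax + b \) by ordinary least squares. The choice of ordinary least squares and the function name `fit_linear` are assumptions made for illustration, and the data points are synthetic rather than measured values.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of p(x) = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic points lying exactly on p(x) = 2x + 1 (not measured data).
a, b = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

With noisy measurements the recovered a and b would only approximate the underlying trend, which is where the relative error of the fit becomes relevant.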

2.2 Radical Model for Prediction

To analyze the data and predict the performance of the disk drives, a radical equation derived from the pattern of the experimental results can also be used.

To show the superiority of the linear polynomial model, we compare its disk performance predictions with those of the radical equation model.

A radical equation is one that includes a radical sign, such as a square root \( \sqrt x \), a cube root \( \sqrt[3]{x} \), or an nth root \( \sqrt[n]{x} \). A common form of a radical equation is

$$ a = \sqrt[n]{{x^{m} }} $$
(2)

(equivalent to \( a = x^{\frac{m}{n}} \)), where m and n are integers.

3 Approach and Uniqueness

The main goal, and the primary difference from other approaches to performance prediction, is to design a disk drive performance model that requires no prior knowledge of the disk drive's functional design or implementation details. This allows the methodology to be implemented with mathematical tools and applied to a wide variety of storage systems with minimal (ideally no) additional effort.

3.1 Workload Representation in the Model

As already mentioned, our performance model uses real performance data and validates the proposed models against it. We have identified the parameters whose variation influences the performance of the disk drive. In any disk drive based storage device, the major factors that change its performance are the modes and types of IO performed by applications on the host. The typical working model is shown in Fig. 1.

Fig. 1 Working model

To arrive at the best performance model, the following parameters are considered in the study.

  • Data access pattern (random, sequential or segmented; encoded here as 1, 2 and 3 respectively).

  • Types of IO requests (reads, writes and mix of reads and writes).

  • Data transfer size (KB).

  • Q-depth: number of processes simultaneously used to issue IO requests on to the disk drive.

The expected output performance of the disk drive that has to be predicted is

  1. IOs per second.

  2. MBs per second.

  3. Response time in milliseconds.

The input parameters listed above are denoted Int (input data), and the measured outputs that have to be predicted are denoted Out (actual output).
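The workload parameters and predicted outputs listed above can be sketched as two small record types. This is only an illustrative representation: the class and field names are hypothetical, and encoding the IO request type as a read fraction is an assumption, since the text does not say how reads, writes and mixes are encoded. Only the access-pattern codes 1, 2 and 3 come from the text.

```python
from dataclasses import dataclass

# Numeric codes for the data access pattern, as given in the parameter list.
ACCESS_PATTERN_CODES = {"random": 1, "sequential": 2, "segmented": 3}


@dataclass(frozen=True)
class Workload:
    """Model inputs (hypothetical field names)."""
    access_pattern: str      # "random", "sequential" or "segmented"
    read_fraction: float     # assumed encoding: 1.0 = reads only, 0.0 = writes only
    transfer_size_kb: float  # data transfer size in KB
    queue_depth: int         # processes simultaneously issuing IO requests

    def as_inputs(self):
        """Return the workload as a flat numeric input vector."""
        return [
            ACCESS_PATTERN_CODES[self.access_pattern],
            self.read_fraction,
            self.transfer_size_kb,
            self.queue_depth,
        ]


@dataclass(frozen=True)
class Performance:
    """Model outputs to be predicted."""
    iops: float              # IOs per second
    mbps: float              # MBs per second
    response_time_ms: float  # response time in milliseconds


# Example: random 8 KB reads at queue depth 16 (illustrative values).
w = Workload("random", 1.0, 8, 16)
inputs = w.as_inputs()
print(inputs)
```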

4 Results and Contributions

The sample data is collected from the test setup shown in Fig. 2. The setup has a Windows host server connected to a storage system through an FC SAN switched infrastructure. At the back end of the storage system, a series of daisy-chained disk enclosures containing 72 GB disks is used to pool the disk space.

Fig. 2 Test setup

To observe the behavior and results of the different approaches, we present a set of comparison graphs. The percentage relative error is defined as

$$ R_{e} = \left| {T_{op} (Model) - T_{op} (Real)} \right| * 100/T_{op} (Real) $$
(3)

where op is read or write, \( T_{op} (Real) \) is the real throughput, and \( T_{op} (Model) \) is the throughput predicted by the model.
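The relative error of Eq. (3) is straightforward to compute. The sketch below is a direct transcription; the function name and the sample throughput values are illustrative assumptions, not measurements from the test setup.

```python
def relative_error_percent(t_model, t_real):
    """Percentage relative error of Eq. (3): |T_model - T_real| * 100 / T_real."""
    if t_real == 0:
        raise ValueError("real throughput must be non-zero")
    return abs(t_model - t_real) * 100 / t_real


# Illustrative values: a prediction of 95 against a measured 100
# sits exactly at a 5 % error bound.
err = relative_error_percent(95.0, 100.0)
print(err)  # 5.0
```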

Fig. 3 compares the linear polynomial method, the radical method and the actual measurements for 5K reads.

Fig. 3 Comparison between various performance models

The polynomial derived from the sample data points is

$$ f\left( x \right) = \frac{{\left( {b * x} \right) + a}}{(x + c)} $$
(4)

where a = 40.58, b = 2.12 and c = 7.024 are constants derived from regression analysis of the sample data.

The estimated response time (RT) is calculated as

$$ RT = f(x) * x $$
(5)

where x is the queue depth, i.e. the number of outstanding commands in the queue.

The estimated IOPs is calculated as

$$ IOPs = 1000/f(x) $$
(6)
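Eqs. (4) to (6) can be evaluated directly for any queue depth. The following is a minimal sketch using the constants quoted above; the function names are hypothetical, and treating \( f(x) \) as a per-IO time in milliseconds is an assumption inferred from the factor 1000 in Eq. (6) and the response time being reported in milliseconds.

```python
# Constants from the regression reported with Eq. (4).
A, B, C = 40.58, 2.12, 7.024


def f_poly(x):
    """Eq. (4): f(x) = (b*x + a) / (x + c), with x the queue depth."""
    return (B * x + A) / (x + C)


def response_time(x):
    """Eq. (5): RT = f(x) * x."""
    return f_poly(x) * x


def iops(x):
    """Eq. (6): IOPs = 1000 / f(x)."""
    return 1000 / f_poly(x)


# Tabulate the model for a few queue depths (illustrative choice of x).
for depth in (1, 2, 4, 8, 16, 32):
    print(depth, round(response_time(depth), 2), round(iops(depth), 1))
```

Note that Eqs. (5) and (6) together imply RT × IOPs = 1000 · x, so under the millisecond assumption the queue depth equals the product of throughput and response time in seconds.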

The radical equation derived from the sample data points is

$$ f\left( x \right) = ( a + b)/\sqrt {(1 + x)} $$
(7)

where a = 2 and b = 4.75.

The estimated response time (RT) is calculated as

$$ RT = f(x) * x $$
(8)

where x is the queue depth, i.e. the number of outstanding commands in the queue.

The estimated IOPs is calculated as

$$ IOPs = 1000/f(x) $$
(9)
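The radical model of Eqs. (7) to (9) can be evaluated in the same way. This is a sketch under the same assumption that \( f(x) \) is a per-IO time in milliseconds; the function names are hypothetical and the constants are those quoted with Eq. (7).

```python
import math

# Constants reported with Eq. (7).
A, B = 2.0, 4.75


def f_radical(x):
    """Eq. (7): f(x) = (a + b) / sqrt(1 + x), with x the queue depth."""
    return (A + B) / math.sqrt(1 + x)


def response_time(x):
    """Eq. (8): RT = f(x) * x."""
    return f_radical(x) * x


def iops(x):
    """Eq. (9): IOPs = 1000 / f(x)."""
    return 1000 / f_radical(x)


# At queue depth 3 the square root is exactly 2, so f(3) = 6.75 / 2.
print(f_radical(3))      # 3.375
print(response_time(3))  # 10.125
```

Evaluating both models at the same queue depths and passing each prediction through the relative error of Eq. (3) gives the kind of comparison plotted in Fig. 3.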

4.1 Comparison Between Actual, Radical, and Linear Polynomial Model for Different IO Rate

The graph shows that the polynomial curve is very close to the actual IOPS for 5K read performance. Hence the application can predict the IOPS achievable for a given input x simply by applying the linear polynomial model.

4.2 Comparison for Random Reads and Random Writes

Fig. 4 shows similar evidence when IO throughput versus response time is plotted using the linear polynomial method for different samples of 8K reads and 8K writes. The actual graph is very close to the modeled graph.

Fig. 4 Random writes measured versus estimated

5 Conclusions

This paper presents an approach to performance modeling of disk drives using the linear polynomial model. The essential objective of the model is to support the design of a self-managed storage system for different kinds of applications, taking the disk drive performance into account. The model can also be used to characterize the performance of a given storage device. If manufacturers published this model for their storage devices, as part of the specification or as a generic tool, potential buyers could trace their applications' performance and feed the traces into the models of different storage devices to see which one suits them best before buying. The results obtained with the linear polynomial model show that it performs better than its counterpart, the radical model. Future research includes an advanced model that accommodates different disk sizes and vendor requirements, and a performance model for disk drive based storage arrays that considers the array firmware overhead on top of the disk drive firmware.