Introduction

Modern data center

Data centers are used to compute, process, and store data for end users in sectors such as insurance, banking, universities and colleges, government offices, and private offices. The volume of data processed in data centers is extremely large, and high latency or service disruption is not acceptable to customers.

Internet usage, network bandwidth, the number of users, and workloads increase day by day, and the introduction of IoT adds to this. All of these factors affect the data center when everything has to be computed there. In an IoT environment, any device can be connected to the internet and managed from anywhere, which may affect speed and latency and overload the data center when all transactions are stored in it.

For example, the Indian Government is gradually moving citizens' data to digital form. The AADHAAR number is the digital representation of each citizen. Data such as AADHAAR must be highly available because it is accessed by organisations such as banks and by various government departments for welfare schemes. During the COVID pandemic, the vaccination drive is tracked by AADHAAR; if AADHAAR data were unavailable during the vaccination program, vaccination services would be disrupted all over India.

The data center is connected to the core network, and end users are connected to access devices, which in turn connect to the distribution side of the network. The distribution network consists of distribution or aggregation switches and routers (Dally and Towles 2004). The switches on the distribution side are connected to core switches. The core switches have uplinks that connect to the cloud or WAN. High availability and backup are achieved at all three levels: at the core, distribution, and access levels there is a backup device to ensure availability and avoid disruption to end users (https://it.gwu.edu/sites/g/files/zaxdzs1131/f/image/Data%20Center%20Hosting%20Description.pdf). This is depicted in Fig. 1.

Fig. 1
figure 1

Architecture of data center

There is a backup path at all levels. A disaster recovery mechanism is important for any data center in order to protect customers' highly valued data and the communication between them.

Challenges in Data Center

The challenges faced in data centers are identified and explained in detail in this section (Liu et al. 2013; Wu et al. 2012), (IEEE Standards Association, “Media access control (mac) bridges and virtual bridge local area networks,” 2011), (A scalable, commodity data center network architecture, SIGCOMM’08 Proceedings of the ACM SIGCOMM 2008 conference on Data communication, Pages 63–74, Seattle, WA, USA—August 17-22, 2008).

Network maintenance

Networking is an important element of a data center; it includes network devices such as routers and switches, storage devices, and the transfer of data over the network. Maintaining it is a major challenge, because traffic must be transmitted seamlessly over this network. Data center interconnection is made possible through metro networking, backhaul, wireless, etc. Network maintenance has to account for the life expectancy of network elements and for service disruptions.

A data center consists of interconnected blade servers, top-of-rack switches, routers, servers for computation, and high-capacity data storage devices. This network path is used to deploy the business requirements. TOR (Top of Rack) switches are connected to switch blades and routers.

Ethernet is used as the transport medium over the deployed network. The devices can be Ethernet switches or IP-routed devices. The network connects a very large number of servers and network devices, so scalability becomes a challenge. The network processors in these devices need high-end CPU cores, which increase speed and reduce the latency of data responses (Greenberg et al. 2009a).

Server maintenance

Data centers handle huge amounts of data, which requires computation on a large, highly scalable pool of servers (https://searchdatacenter.techtarget.com/tip/A-simple-server-maintenance-checklist-for-modern-data-centers). Large businesses need access to more than one data center. Link aggregation is implemented throughout, which also increases bandwidth. Any business requirement involves large numbers of servers and data storage units working together. For example, accessing a website requires communication between a set of servers whose content is stored in different data stores; by accessing both servers and storage, a response is provided within seconds (Pinheiro et al. 2007a).

Uptime

A data center operates all the time without a break. Interruptions may occur during planned maintenance, software or hardware failures, planned shutdowns, etc. Unplanned maintenance happens when unexpected failures have to be addressed.

In both planned and unplanned maintenance (https://it.gwu.edu/sites/g/files/zaxdzs1131/f/image/Data%20Center%20Hosting%20Description.pdf), service disruptions are clearly experienced by customers. The only difference is that in planned maintenance the customer is aware of the downtime well in advance, whereas in unplanned maintenance the customer is not, which causes chaos. Unscheduled maintenance is due to the failure of critical equipment in the network; this is the downtime of the network. The deployed network path should be maintained, along with the power supply systems and network devices deployed in both the active and redundant network paths. All these maintenance activities should be performed late at night in the local time of the geographical location so that end customers experience less disruption.

Cost

Cost is managed based on the size of the data center deployed for the business. It involves parameters such as the number of servers, storage units, and racks, power device capacity, software licenses, network requirements, etc. Operational and maintenance costs must also be considered (Greenberg et al. 2009b).

Energy efficiency

Energy is also a significant factor for data centers: high energy consumption directly increases cost. When power is overused, the administrator should identify the cause and optimise power usage.

Data center inspection

The data center administrator regularly monitors the network elements and data storage in order to avoid or reduce potential downtime. Tools are available to alert the operator to any issue in the data center. If the operator overlooks an alert, downtime will follow; to avoid it, proactive measurement and prediction are required. Disk storage failure can be predicted with applications built on machine learning models and data collected from the disks over a period of time. Data loss and latency can be reduced by using such a prediction model.

Security

When the devices in the data center are not secure, service can be disrupted. The traffic flowing in the data center network should be monitored thoroughly to detect malicious traffic (Harsh et al. 2018). Such monitoring can also identify suspected attacks on the data center network.

High availability

All devices in the network should be highly available. If any device restarts or is switched off, a recovery mechanism or backup communication path should be defined and operational.

Speed

Speed should be high and latency low. High speed should also be achieved in data processing and computation (Harsh et al. 2018).

Data storage

The data storage provided should satisfy the business requirements, because it must handle huge amounts of data. Any disk repository failure (Patterson et al. 1988b) should be identified well in advance; currently RAID (Redundant Array of Inexpensive Disks) is used (Schulze 1988; Pinheiro et al. 2007b; Dally and Towles 2004). To make systems more effective, machine learning models can be used to predict failures well in advance and reduce data center downtime. The types of storage used in data centers are hard disk drives, solid state drives, and storage blades (Fig. 2).

Fig. 2
figure 2

Data center challenges

Previous study

Wang classified failures by failure mechanism and by mode and cause; Murray (Patterson et al. 1988a) classified failures into bad sector failures, read/write failures, and logical failures. They made more differentiated predictions, but experimental failure predictions were few and the method of raising an alert was not shown. We included more relevant SMART parameters in our experiments. Lu et al. (Lu et al. 2020a, b) predicted disk failures based on SMART, performance, and location of the disk in the data center with an F-ratio of 95%; however, they used a complex CNN model. We have used a boosting technique to obtain 99.99% accuracy.

Current limitations

  • Devices are embedded with predictive models that use SMART monitoring metrics

  • Models are proprietary and use simple, threshold-based normalizations

  • Models raise very many false alarms, leading to very weak predictive power

Analysis on selection of smart parameters

SMART stands for Self-Monitoring, Analysis and Reporting Technology (Seagate statement on enhanced smart attributes), [http://smartlinux.sourceforge.net/smart/faq.php?#2 (“How does S.M.A.R.T. work?”)]. It measures attributes that are used to protect and secure data and to reduce system downtime by predicting faults, failures, and performance degradation of the device. SMART parameters (Pinheiro et al. 2007b) are mainly used to reduce hard disk failures and avoid loss of data.

SMART parameters generally have threshold values. For some parameters a value of zero means the disk is in good condition, and a value above zero means the disk is likely to fail. For other parameters, steadily increasing values indicate that the disk is likely to fail.

HDD and SSD manufacturers design their devices with reliability in mind. Reliability means that storage devices operate without failures, avoiding downtime. SMART parameters are used to predict the reliability of storage devices. Data loss from a disk is unacceptable in both business and personal use.

Statistical methods (Patterson et al. 1988b), namely the rank-sum test and the Z-score test, were applied to the SMART parameter values over time, and they identified that most failures are associated with the SMART parameters discussed below, using the data sets shared by Backblaze (https://www.backblaze.com/b2/hard-drive-test-data.html#how-you-can-use-the-data). The Z-score method compares a data point with the population mean. Although there are more than 100 SMART parameters, the following were selected on the basis of these tests, and their values are used for training; please refer to Table 1. The SMART parameters show values and their impact on drive failure. SMART can be used with SCSI, PCI, ATA, etc.
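As a minimal sketch of the Z-score screening mentioned above, the snippet below standardises one SMART attribute against a reference population of healthy drives and reports how far each drive under test deviates from that mean. The function name `z_scores` and the sample values are illustrative assumptions and are not taken from the Backblaze data.

```python
import statistics

def z_scores(values, reference):
    """Return (v - mean(reference)) / pstdev(reference) for each v in values."""
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)
    if std == 0:
        raise ValueError("reference population has zero spread")
    return [(v - mean) / std for v in values]

# Hypothetical SMART 5 raw counts: a healthy population and two drives under test.
healthy = [0, 0, 1, 0, 2, 0, 1, 0]
candidates = [0, 24]
scores = z_scores(candidates, healthy)
```

A drive whose score lies far above the healthy population (here the second candidate) would be a natural candidate for closer inspection; the cut-off to use is a design choice that the text does not specify.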

Table 1 SMART parameters

Analysis indicates that some signals are used for identifying drive failures.

  1. A.

    Count on reallocated sectors (SMART 5)

    A reallocated sector is a bad sector of the disk drive that is not safe for storing data. When the hard disk experiences a read/write error, the sector is marked as “reallocated” and its data are copied to another sector to prevent data loss or corruption (Murray et al. 2005).

    The reallocated sector count is the number of sectors marked as reallocated due to read or write errors; a growing count is to be considered as a prediction parameter for hard disk failure. This SMART parameter is supported by Seagate, IBM, Samsung, Fujitsu, HP, etc.

    Following are the main aspects of this failure:

    1. Reallocated sector count

    2. When the hard disk identifies an error in a read or write operation, it flags the sector as “reallocated” and moves the data to a reserved area. This is called remapping, and the moved sectors are also called remaps.

    3. This parameter holds the count of bad sectors that have been identified and remapped.

    4. When this parameter has a high value, it is advisable to replace the drive.

    5. The lifetime of the drive can be estimated with this parameter.

    6. This parameter is very critical.

    7. A count that crosses its limit indicates degradation and may signal imminent drive failure.

    8. Immediate backup of data and replacement of the hardware are recommended.

    9. There is no fix for this failure.

  2. B.

    Power Cycle ON/OFF (SMART 12)

    Number of times the hard disk drive has been completely powered on and off. A higher count indicates that this parameter is degrading.

  3. C.

    Reported uncorrectable errors (SMART 187)

    Reported uncorrectable errors refers to the number of errors that cannot be corrected using error-correcting code. It is referred to as SMART 187 (0xBB). This SMART parameter is supported by Seagate, IBM, Samsung, Fujitsu, HP, etc.

    Following are the main aspects of this failure:

    1. Count of reported uncorrectable errors

    2. This parameter reports the number of reads that could not be corrected using hardware ECC (error-correcting code, used for sector recovery).

    3. This parameter indicates electromechanical failures of the drive.

    4. Hard disks with a zero value for this parameter are unlikely to fail.

    5. When this SMART parameter rises above zero, the disk should be replaced without delay.

  4. D.

    Command timeout (SMART 188)

    Command timeout is the number of times an operation is aborted because of a hard disk timeout. It is referred to as SMART 188 (0xBC).

    1. Number of times operations are halted or aborted due to an HDD timeout

    2. In normal conditions, the value of this parameter should be zero.

    3. This is a critical parameter.

    4. This parameter indicates problems with the data cable or power supply.

    5. If the value is above zero, the disk should be replaced.

  5. E.

    Temperature (SMART 194)

    This parameter holds the current temperature of the hard disk, as measured by the heat sensor built into the disk (Murray et al. 2005). This parameter is also informational.

  6. F.

    Remap operations (SMART 196)

    The remap operation count is the number of times data are transferred from reallocated sectors to other sectors of the disk. Both successful and unsuccessful attempts are counted in this parameter. This SMART parameter is a bad sector indicator.

  7. G.

    Uncorrectable errors while reading/writing sector (SMART 198)

    This SMART parameter stores the total number of uncorrectable errors encountered while reading or writing a sector. An increase in this count indicates a defect on the disk surface or a failure in the mechanical system.

Classification of smart parameters based on good and bad disks

Although classifying the data set poses many challenges, network and lab administrators who track the operation and maintenance of the data center can predict disk failures from SMART parameters using troubleshooting tools and diagnostic software.

We computed a correlation matrix from the correlation factors between SMART attributes, ignored the uncorrelated parameters, and retained the correlated ones. Fig. 3 depicts the correlation matrix and the SMART parameters selected.
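The correlation step can be sketched as follows, assuming the Pearson coefficient is the correlation factor used (the text does not name the coefficient). The helper names `pearson` and `correlation_matrix` and the sample readings are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical readings of three attributes across five drives.
columns = {
    "smart_5":   [0, 2, 4, 6, 8],
    "smart_187": [0, 1, 2, 3, 4],
    "smart_194": [30, 28, 31, 29, 30],
}
matrix = correlation_matrix(columns)
```

In this toy data `smart_5` and `smart_187` move together while `smart_194` does not, which is the kind of pattern a heat map such as Fig. 3 makes visible.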

Fig. 3
figure 3

Correlation matrix—SMART parameter selection

For the top SMART parameters, the percentage of failed disks in which the parameter deviated from its threshold was measured on the data sets; the result is shown in Fig. 4.

Fig. 4
figure 4

Percentage based—SMART parameter selection

Backblaze has recorded data from 2013 through Q3 of 2020. We used these data to identify the SMART parameters. From Fig. 3 we can clearly see that SMART_187, SMART_196, and SMART_5 can be used as the primary indicators of a disk that is likely to fail. SMART 197 and 198 account for approximately 46% and 44%, respectively. SMART_188 accounts for only 3% of failed disks, and in those disks the other parameters had also failed. We can therefore ignore SMART_188 and build applications or models based on SMART_187, SMART_196, and SMART_5. The accuracy of disk failure identification can be improved further by also using SMART 197 and 198.

Data set description

The data set contains 50,174 records; the SMART parameters considered are SMART 5, SMART 12, SMART 187, SMART 194, SMART 196, SMART 197, and SMART 198. The data set contains 49,920 passed disks and 254 failed drives. Please refer to Fig. 5 and Table 2.

Fig. 5
figure 5

Summary of Dataset

Table 2 Dataset description

Existing machine learning models

  1. A.

    K-nearest neighbour algorithm

    It is a supervised learning model: when new data are given as input, the algorithm assigns them to one of the existing classified groups by looking for similarity with the available data. It is mainly used in applications where classification is required (https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning). During the training phase, the complete data set is read and stored as training data; when new test data are fed in, they are classified into one of the categories of the training data. The algorithm always compares the new test data with the features of the existing data set.

    Two classes are defined in the data set: Group A and Group B. The number of nearest neighbours, K, must be chosen, and there is no specific method for defining it. When K is small, such as 1 or 2, the result is more sensitive to noise in the data. A value such as 5 is often a better choice. K can be larger still, but the computation cost increases.

    In Fig. 6, the blue data point is the new input; most of its neighbours are black data points, so the new blue point is classified into the black class.

  2. B.

    Support vector machine learning model

    Support vector machine (SVM) is a supervised learning model used for classification of inputs. It can also detect outliers. In the given dimension, a hyperplane is constructed between classes and used for classification (Cristianini and Shawe-Taylor 2000). Margin lines are constructed between the classes, and the data points nearest to these margin lines are called support vectors. Please refer to Fig. 7.

    Equation of a line is y = mx + c; this is same as y – mx − c = 0.

    Two vectors, \(t = \left( {\begin{array}{*{20}c} { - c} \\ { - m} \\ 1 \\ \end{array} } \right)\)

    and \(x = \left( {\begin{array}{*{20}c} 1 \\ X \\ Y \\ \end{array} } \right)\)

    $${t}^{T}x = (-c) \cdot 1 + (-m) \cdot X + 1 \cdot Y$$
    $${t}^{T}x = -c - mX + Y$$

    Maximum margin can be calculated using the above graph; maximum margin = XP − XN.

  3. C.

    Random forest classifier

    Random forest classifier is a supervised learning method that constructs several decision trees by selecting various features of the data set and then merges the predictions of the individual trees to produce a more accurate prediction (Anantharaman et al. 2018).

Fig. 6
figure 6

K-nearest neighbor with k = 3 and k = 5
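The neighbour vote described for K-nearest neighbour can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the sample readings are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """train: list of (features, label); return the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (smart_5, smart_187) readings with pass/fail labels.
train = [
    ((0, 0), "pass"), ((1, 0), "pass"), ((0, 1), "pass"), ((2, 0), "pass"),
    ((40, 12), "fail"), ((35, 9), "fail"), ((50, 20), "fail"),
]
prediction = knn_predict(train, (38, 10), k=3)
```

With `k=3` the three closest training points to `(38, 10)` are all labelled fail, so the query is classified as fail; a query near the origin with `k=5` is outvoted four to one in favour of pass.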

Fig. 7
figure 7

Support vector machine. XP is a positive point, XN is a negative point, t is a weight vector, c is the bias
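To make the vector form of the line equation concrete, the sketch below evaluates \(t^{T}x\) for t = (−c, −m, 1) and x = (1, X, Y). The sign of the result tells on which side of the line y = mx + c a point lies. The function name `decision_value` and the example line are assumptions chosen for illustration; a trained SVM would learn the weights rather than take them as input.

```python
def decision_value(m, c, point):
    """Evaluate t^T x for t = (-c, -m, 1) and x = (1, X, Y)."""
    t = (-c, -m, 1.0)
    x = (1.0, point[0], point[1])
    return sum(ti * xi for ti, xi in zip(t, x))

# Hypothetical separating line y = 2x + 1.
above = decision_value(2.0, 1.0, (1.0, 5.0))    # point above the line
on_line = decision_value(2.0, 1.0, (1.0, 3.0))  # point exactly on the line
below = decision_value(2.0, 1.0, (1.0, 0.0))    # point below the line
```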

Please refer to Fig. 8. Here, two subsets of features built from smart_198 and smart_197 are taken, a prediction is made with each of the two trees, and the results are summed or averaged to give the final prediction.

Fig. 8
figure 8

Random forest classification

Solution approach

The cost of data center downtime increased significantly between 2010 and 2016, according to a US case study of 63 data center organizations.

Machine learning is a process in which a model is built from past data and then used to react and predict. It relies on the historical data of the application, and the model should be used efficiently.

Storage failures are predictable: a machine learning approach can predict disk failures from disk SMART parameters by ensembling decision trees, a random forest classifier, and a recursive boosting approach. This solution helps predict disk failures early, thereby reducing downtime and the related cost for data centers. Please refer to the sequence of blocks in Fig. 9.

Fig. 9
figure 9

Prediction sequence—software blocks

The data center monitors the disks and stores the data in CSV or spreadsheet format. The data set is extracted and transformed by filling in missing values and removing samples that are exceptional or out of range. A feature set is then identified from the cleaned data set. A supervised learning model is trained on the transformed data set, and predictions are made with the trained model. If the prediction indicates that a disk is going to fail, an alert message is raised to the administrator.
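The transformation step can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the column names (`serial_number`, `failure`, `smart_5`, `smart_187`), filling missing values with 0, and treating negative or implausibly large values as out of range.

```python
import csv
import io

def clean_rows(csv_text, features, fill_value=0.0, max_value=1e9):
    """Parse CSV text, fill missing feature values, and drop out-of-range samples."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        values = []
        for name in features:
            raw = (record.get(name) or "").strip()
            values.append(float(raw) if raw else fill_value)  # fill non-filled values
        if any(v < 0 or v > max_value for v in values):
            continue  # drop samples outside the plausible range
        rows.append((record["serial_number"], values, int(record["failure"])))
    return rows

# Hypothetical monitoring export: B2 has a missing value, C3 has an invalid one.
sample = """serial_number,smart_5,smart_187,failure
A1,0,0,0
B2,,3,1
C3,-5,0,0
"""
cleaned = clean_rows(sample, ["smart_5", "smart_187"])
```

The resulting list of (disk ID, feature vector, label) tuples is the kind of input a supervised model would then be trained on.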

Pseudo code 1

Train the machine learning model with 11 Months’ data and the current month data.

figure a

The disk prediction system makes a prediction and sends an alert message to the administrator. The alert message consists of the disk ID, the failure probability, and the number of days in advance of the expected failure. Please refer below.

figure b
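One possible shape for such an alert is sketched below. The class `DiskAlert`, the helper `maybe_alert`, the 0.5 probability threshold, and the message wording are all hypothetical; only the three fields (disk ID, failure probability, days in advance) come from the description above.

```python
from dataclasses import dataclass

@dataclass
class DiskAlert:
    disk_id: str
    failure_probability: float
    days_in_advance: int

    def message(self) -> str:
        return (f"Disk {self.disk_id}: failure probability "
                f"{self.failure_probability:.2%}, expected within "
                f"{self.days_in_advance} days")

def maybe_alert(disk_id, probability, days, threshold=0.5):
    """Return a DiskAlert when the predicted probability reaches the threshold, else None."""
    if probability >= threshold:
        return DiskAlert(disk_id, probability, days)
    return None

alert = maybe_alert("ZA1234", 0.93, 7)
```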

Flow chart: RandomTreeRecursiveBoosting

The sequence of the solution is depicted in Fig. 10 using flowchart for RandomTreeRecursiveBoosting.

Fig. 10
figure 10

Flowchart of the solution

Logarithmic loss

Boosting identifies the weak learner from the previous, (i − 1)th, stage of the model and corrects its error in the ith stage; this incremental learning is repeated until the logarithmic loss is reduced.

Logarithmic loss is calculated from the predicted probability of each sample in the training set: the terms for the actual outcome and for the other possible outcome are summed over all samples and averaged.

$${\text{LogLoss Factor}} = - \frac{1}{Y}\mathop \sum \limits_{i = 1}^{i = Y} \left[ {x_{i} \log p_{i} + \left( {1 - x_{i} } \right)\log \left( {1 - p_{i} } \right)} \right]$$

Y—Number of samples in training data set.

\({x}_{i}\)—actual outcome of the “i” th sample, \({1-x}_{i}\)– second classification of the ith sample.

If \({x}_{i}\) is 0 (True), (1 − \({x}_{i})\) is 1 (False).

\({p}_{i}\)—Probability with respect to \({x}_{i}\) for the ith sample.
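The formula above can be evaluated directly. The sketch below is a plain transcription of it, with one added assumption: probabilities are clipped by a small `eps` so that log(0) never occurs. The function name `log_loss` and the sample values are illustrative.

```python
import math

def log_loss(actual, predicted, eps=1e-15):
    """Mean binary log loss; actual holds 0/1 outcomes x_i, predicted holds probabilities p_i."""
    total = 0.0
    for x, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += x * math.log(p) + (1 - x) * math.log(1 - p)
    return -total / len(actual)

loss = log_loss([1, 0, 1], [0.9, 0.2, 0.6])
uninformed = log_loss([1, 0], [0.5, 0.5])
```

Confident and correct predictions drive the loss towards zero, while a model that always predicts 0.5 scores log 2 ≈ 0.693 regardless of the labels.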

Bias is generally defined as the difference between the average prediction of the model and the correct value it is trying to predict. A model with high bias is oversimplified and pays very little attention to the training data, which leads to high error on both training and test data (https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229).

Variance measures the spread of the model's prediction for a given data point (https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229), that is, how far the predictions lie from their mean. When the variance is high the model does not generalize; when it is small, the predictions lie very close to the mean. A high-variance model pays a lot of attention to the training data, which makes it perform well on the training data but gives a high error rate on the held-out test data.

Here both bias and variance are considered, which helps to reduce the mean squared error (MSE).

$${\text{Error}}\left( x \right) = \left( {E\left[ {q\prime \left( x \right)} \right] - q\left( x \right)} \right)^{2} + E\left[ {\left( {q\prime \left( x \right) - E\left[ {q\prime \left( x \right)} \right]} \right)^{2} } \right] + \sigma^{2}$$

Error(x) is the sum of Bias\(^{2}\), Variance, and the remaining (irreducible) error \(\sigma^{2}\).

The boosting method predicts from multiple decision trees. Each decision tree has nodes drawn from a different subset of the features in the data set, so each tree is unique. The decisions from these trees are then collected and combined into the final prediction.

In this method, the trees are built sequentially, and each tree corrects the errors of the previous trees. Trees that do not have many levels are easy to understand, which keeps decision making simple compared with iterating through deep trees. For larger trees, the boosting algorithm helps by splitting the trees on various parameters. Prediction with the number of estimators set to 3 is depicted in Fig. 11.

Fig. 11
figure 11

Recursive boosting learning with estimator = 3
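The idea of sequential trees that each correct the previous stage can be illustrated with a generic gradient-boosting sketch. This is not the RandomTreeRecursiveBoosting pseudo code of the paper: it assumes depth-1 trees (stumps), log loss as the objective, and mean-residual leaf values, and all function names and sample data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Best single-feature threshold split minimising squared error on the residuals."""
    best = None
    for f in range(len(xs[0])):
        for thr in sorted({row[f] for row in xs}):
            left = [r for row, r in zip(xs, residuals) if row[f] <= thr]
            right = [r for row, r in zip(xs, residuals) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]  # (feature, threshold, left_value, right_value)

def boost(xs, ys, n_estimators=3):
    """Fit n_estimators stumps; each one is fitted to the errors of the stages before it."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of log loss w.r.t. the score: label minus predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        f, thr, lv, rv = fit_stump(xs, residuals)
        stumps.append((f, thr, lv, rv))
        scores = [s + (lv if row[f] <= thr else rv) for s, row in zip(scores, xs)]
    return stumps

def predict_proba(stumps, row):
    """Probability of failure: sigmoid of the summed stump outputs."""
    return sigmoid(sum(lv if row[f] <= thr else rv for f, thr, lv, rv in stumps))

# Hypothetical (smart_5, smart_187) readings; label 1 means the disk failed.
xs = [(0, 0), (1, 0), (0, 2), (30, 5), (45, 9), (60, 4)]
ys = [0, 0, 0, 1, 1, 1]
stumps = boost(xs, ys, n_estimators=3)
```

Each additional stump pushes the predicted probability for the failed disks further above 0.5 and for the healthy disks further below it, which mirrors the three-estimator sequence of Fig. 11.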

Parameters to be considered for efficient learning:

  1. Number of iterations to continue when there is no improvement in loss or error reduction

  2. Minimum number of samples used to split

  3. Maximum depth of the tree

  4. Maximum number of terminal nodes

  5. Number of features to be considered
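As an illustration, the five parameters above can be collected into a configuration mapping. The key names below match keyword arguments of scikit-learn's `GradientBoostingClassifier`; whether the authors used that library is an assumption, and the values are placeholders because the text does not state the settings used.

```python
# Illustrative values only; the paper does not state the settings used.
boosting_params = {
    "n_iter_no_change": 10,   # stop after this many rounds without loss improvement
    "min_samples_split": 2,   # minimum number of samples required to split a node
    "max_depth": 3,           # maximum depth of each tree
    "max_leaf_nodes": 8,      # maximum number of terminal nodes
    "max_features": "sqrt",   # number of features considered at each split
}
```

If scikit-learn is installed, such a mapping could be passed as `GradientBoostingClassifier(**boosting_params)`.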

Pseudo code 2: RandTreeRecursiveBoosting

figure c

Metrics—performance evaluation

The efficiency of the machine learning models can be evaluated based on the various performance evaluation methods, they are discussed as follows:

True positive: a disk that has failed is correctly predicted as failed.

True negative: a disk that is free from failures is correctly predicted as healthy.

False positive: a disk in good condition is incorrectly predicted as failed.

False negative: a disk with a failure is incorrectly predicted as healthy and free from failures.

Accuracy

The accuracy metric defines the overall accuracy in predicting whether a disk passes or fails.

$${\text{Accuracy}} = \frac{{{\text{TruePositive}} + {\text{TrueNegative}}}}{{{\text{TruePositive}} + {\text{True Negative}} + {\text{FalsePositive}} + {\text{FalseNegative}}}} \times 100\%$$

Error

The error estimate gives the proportion of incorrect disk predictions.

$${\text{Error}} = \frac{{{\text{FalsePositive}} + {\text{FalseNegative}}}}{{{\text{TruePositive}} + {\text{True Negative}} + {\text{FalsePositive}} + {\text{FalseNegative}}}} \times 100\%$$

Confusion matrix

Confusion matrix is a matrix used to visualize the number of disks that passed and failed with respect to the predicted and actual status of the disk. For pass/fail classification it is a 2 × 2 matrix with four entries, which are explained below. Please refer to Fig. 12.

  • True Positive “P” represents the number of disks that actually failed and are classified as failed by the algorithm. This is a correct classification.

  • False Negative “R” represents the number of disks that are misclassified as healthy but have actually failed.

  • False Positive “Q” represents the number of disks that are misclassified as failed but are actually in good health.

  • True Negative “S” represents the number of disks that are classified as healthy and are actually healthy. This is a correct classification.

Fig. 12
figure 12

Confusion matrix
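The four entries can be counted from paired actual and predicted labels as sketched below, treating "fail" as the positive class. The function name `confusion_counts` and the label strings are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="fail"):
    """Return (P, Q, R, S) = (true positives, false positives, false negatives, true negatives)."""
    p = q = r = s = 0
    for a, pred in zip(actual, predicted):
        if pred == positive and a == positive:
            p += 1   # failed disk predicted as failed
        elif pred == positive:
            q += 1   # healthy disk predicted as failed
        elif a == positive:
            r += 1   # failed disk predicted as healthy
        else:
            s += 1   # healthy disk predicted as healthy
    return p, q, r, s

actual    = ["fail", "pass", "pass", "fail", "pass"]
predicted = ["fail", "pass", "fail", "pass", "pass"]
counts = confusion_counts(actual, predicted)
```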

Recall

Sensitivity, or recall, compares the number of disks correctly predicted to fail with the total number of actual disk failures. The recall metric measures how many of the positive instances present in the data set are detected. Sensitivity is calculated as follows:

$${\text{Sensitivity}} = \frac{{{\text{TruePositive}}}}{{{\text{TruePositive}} + {\text{FalseNegative}}}} \times 100\%$$

Specificity or true negative rate

Specificity is the proportion of disks free from failures for which the prediction output is also zero (healthy).

$${\text{Specificity}} = \frac{{{\text{TrueNegative}}}}{{{\text{TrueNegative}} + {\text{FalsePositive}}}} \times 100\%$$

Precision

The precision value is the ratio, expressed as a percentage, of true positives to the sum of true and false positives. The following equation is used to calculate the precision value:

$${\text{Precision}} = \frac{{{\text{TruePositive}}}}{{{\text{TruePositive}} + {\text{FalsePositive}}}} \times 100\%$$

Correlation coefficient

The correlation coefficient is a statistical performance evaluation metric. It measures the relationship between the predicted disk status obtained from the experiment and the actual disk failure data. Its value lies between −1 and +1. If the correlation coefficient is −1, the model makes incorrect predictions and is unreliable. When it is +1, the model makes correct predictions and is reliable.

$${\text{Correlation}}\,{\text{Co-efficient}} = \frac{{\text{TruePos}} \times {\text{TrueNeg}} - {\text{FalsePos}} \times {\text{FalseNeg}}}{\sqrt{\left( {\text{TruePos}} + {\text{FalsePos}} \right)\left( {\text{TruePos}} + {\text{FalseNeg}} \right)\left( {\text{TrueNeg}} + {\text{FalsePos}} \right)\left( {\text{TrueNeg}} + {\text{FalseNeg}} \right)}} \times 100\%$$

F-Measure

The accuracy of the prediction model can be evaluated using this metric. F-Measure is calculated from the precision and recall values measured on the predictions of the various classification and learning models. The recall value is the same as the sensitivity value. F-Measure is calculated as follows:

$${\text{F - Measure}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
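The metrics defined in this section can all be derived from the four confusion matrix counts, as in the sketch below. Two conventions here are assumptions: results are returned as fractions (multiply by 100 for percentages), and a ratio with a zero denominator is reported as 0.0. The function name `evaluate` and the sample counts are illustrative.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute the evaluation metrics of this section from confusion matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention: undefined ratios become 0.0

    total = tp + tn + fp + fn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, total),
        "error": ratio(fp + fn, total),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "correlation": ratio(tp * tn - fp * fn, mcc_den),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical counts for 100 test disks.
metrics = evaluate(tp=8, tn=85, fp=5, fn=2)
```

For these counts the accuracy is 0.93 while the F-Measure is about 0.70, which shows why accuracy alone can look flattering when failed disks are rare.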

Results

Various machine learning models were implemented along with the boosting techniques whose results are depicted in this section.

KNN

The confusion matrix (Fig. 13) of the K-nearest neighbour machine learning model is as follows. Of the 10,035 test records, 9975 are correctly classified as disk pass and 54 as disk failed; 6 records are misclassified.

Fig. 13
figure 13

Confusion Matrix—KNN

In the graph below (Fig. 14), blue represents disk pass and orange represents disk failed. Disks in a healthy state are mostly predicted correctly: the precision, recall, and F1 score for healthy disks are 1, and they are lower for unhealthy disks.

Fig. 14
figure 14

Precision, Recall, F1-Score—KNN

In the given test data, 9978 records are labelled Disk Pass and 57 are labelled Disk Fail, whereas the predicted data have 9975 passed and 54 failed. This is shown in Fig. 15.

Fig. 15
figure 15

Count of actual and predicted test data

Random forest classifier

The confusion matrix (Fig. 16) of the Random Forest Classifier machine learning model is as follows. Of the 10,035 test records, 9975 are correctly classified as disk pass and 54 as disk failed; 6 records are misclassified.

Fig. 16
figure 16

Confusion matrix—Random Forest Classifier

In the graph below (Fig. 17), blue represents disk pass and orange represents disk failed. Disks in a healthy state are mostly predicted correctly: the precision, recall, and F1 score for healthy disks are 1, and they are nearly as high for unhealthy disks.

Fig. 17

Performance metrics—Random Forest Classifier

Using this learning model, 99.98% of the test data is correctly classified and 0.02% is incorrectly classified, so its predictions are very close to the actual labels.

In the given test data, 9978 disk data instances are labelled Disk Pass and 57 are labelled Disk Fail. The predicted output shows 9977 disk data instances in Disk Pass status and 56 in Disk Fail status, as shown in Fig. 18.

Fig. 18

Count of actual and predicted test data—Random Forest Classifier

SVM

Fig. 19 shows the confusion matrix of the Support Vector Machine (SVM) model. Of the 10,035 test instances, 9978 are correctly classified as disk pass and none are classified as disk failed. The model predicts 57 instances incorrectly: all 57 failed (unhealthy) disks are classified as passed, that is, as not going to fail. This is the major disadvantage of using SVM for disk failure prediction.

Fig. 19

Confusion matrix—SVM

In the graph in Fig. 20, blue represents disk pass and orange represents disk failed. Disks with status Pass are predicted correctly. The precision, recall and F1 score for disks with status Pass are one, but for disks with status Fail these values are zero, because every disk with status Fail is predicted as Pass.

Fig. 20

Performance metrics—SVM

Using this learning model, 99.43% of the test data is correctly classified and 0.57% is incorrectly classified. However, all of the misclassified 0.57% are disks that actually failed but are classified as healthy.
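The SVM figures illustrate why overall accuracy alone can be misleading when failed disks are rare. The sketch below treats Fail as the positive class and recomputes the metrics from the counts quoted above. The helper `safe_div` and its convention of returning 0 for an empty denominator are assumptions made for this example.

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is 0 (a sketch convention)."""
    return numerator / denominator if denominator else 0.0


# SVM results with "fail" treated as the positive class.
tp = 0      # failing disks predicted as fail
fn = 57     # failing disks predicted as pass
fp = 0      # healthy disks predicted as fail
tn = 9978   # healthy disks predicted as pass

accuracy = (tp + tn) / (tp + tn + fp + fn)
fail_precision = safe_div(tp, tp + fp)
fail_recall = safe_div(tp, tp + fn)
fail_f1 = safe_div(2 * fail_precision * fail_recall, fail_precision + fail_recall)

print(f"accuracy={accuracy:.2%}, fail recall={fail_recall:.2f}, fail F1={fail_f1:.2f}")
```

A model that labels every disk as Pass reaches the same accuracy, which is why the recall and F-Measure of the Fail class are the more informative figures here.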

In the given test data, 9978 disk data instances are labelled Disk Pass and 57 are labelled Disk Fail. The predicted output shows 9978 disk data instances in Disk Pass status and zero in Disk Fail status. Predicting no disk failures at all is a major prediction error. Therefore, SVM is not suitable for disk failure prediction. This is shown in Fig. 21.

Fig. 21

Count of actual and predicted test data—SVM

Logistic regression machine learning algorithm

Fig. 22 shows the confusion matrix of the Logistic Regression model. Of the 10,035 test instances, 9902 are correctly classified as Pass and 22 are correctly classified as Fail. The logistic regression model produces many misclassifications: 111 disk data instances are not predicted correctly. Of these 111 instances, 35 failed (unhealthy) disks are incorrectly classified as passed, and 76 disks that actually passed are predicted as failed.

Fig. 22

Confusion matrix—logistic regression
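The logistic regression counts are detailed enough to recompute precision, recall and F1 score for each label. The following is a minimal sketch that stores the four cells of the confusion matrix in a dictionary keyed by (actual, predicted). The dictionary layout and the function name `class_metrics` are assumptions for illustration and do not reflect the code used in this study.

```python
from typing import Dict, Tuple

# confusion[(actual, predicted)] = number of test instances, taken from the
# logistic regression counts quoted in the text.
confusion: Dict[Tuple[str, str], int] = {
    ("pass", "pass"): 9902,
    ("pass", "fail"): 76,
    ("fail", "pass"): 35,
    ("fail", "fail"): 22,
}


def class_metrics(label: str) -> Dict[str, float]:
    """Precision, recall and F1 for one label of the confusion matrix."""
    tp = confusion[(label, label)]
    predicted = sum(n for (_, pred), n in confusion.items() if pred == label)
    actual = sum(n for (act, _), n in confusion.items() if act == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


total = sum(confusion.values())
error_rate = (confusion[("pass", "fail")] + confusion[("fail", "pass")]) / total
metrics = {label: class_metrics(label) for label in ("pass", "fail")}

for label, values in metrics.items():
    print(label, {name: round(v, 3) for name, v in values.items()})
print(f"error rate = {error_rate:.2%}")
```

With only 22 of the 57 failed disks detected and 76 false alarms, both precision and recall for the Fail label stay well below those of the Pass label, even though the overall error rate is only about 1.1%.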

In the graph in Fig. 23, blue represents disk pass and orange represents disk failed. Disks in a healthy state are mostly predicted correctly, with precision, recall and F1 score of 1 for healthy disks. Most failed disks, however, are incorrectly predicted as disk pass, so the performance metrics for the Fail-labelled data are lower; Recall and F1 score in particular are very low.

Fig. 23

Performance metrics—logistic regression

Using this learning model, 98.9% of the test data is correctly classified and 1.1% is incorrectly classified. Most of the failed disks (35 of 57) are predicted as pass, and 76 disks that passed are misclassified as failed.

In the given test data, 9978 disk instances are labelled Disk Pass and 57 are labelled Disk Fail. The model correctly predicts 9902 passes and 22 failures, and misclassifies 111 instances. The logistic regression algorithm therefore produces more misclassifications. This is shown in Fig. 24.

Fig. 24

Count of actual and predicted test data—logistic regression

Naïve Bayes machine learning algorithm

Fig. 25 shows the confusion matrix of the Naïve Bayes learning model. Of the 10,035 test instances, 9978 are correctly classified as disk pass and none are classified as Fail. The remaining 57 instances are misclassified: they are predicted as Pass but are actually Fail.

Fig. 25

Confusion matrix—Naïve Bayes

In the graph in Fig. 26, blue represents disk pass and orange represents disk failed. Disks in a healthy state are mostly predicted correctly, with precision, recall and F1 score of 1 for healthy disks. Most failed disks, however, are incorrectly predicted as disk pass, so the performance metrics for the Fail-labelled data are lower; Recall and F1 score in particular are very low.

Fig. 26

Performance metrics—Naïve Bayes

Using this learning model, 99.43% of the test data is correctly classified and 0.54% is incorrectly classified. The misclassified disks are predicted to be in Pass status although they are actually in Fail status.

In the given test data, 9978 disk data instances are labelled Disk Pass and 57 are labelled Disk Fail. The predicted output shows 9978 disk data instances in Disk Pass status and 56 in Disk Fail status, and 54 disk instances are misclassified as Pass although they are actually Fail. The misclassification rate is higher for the Naïve Bayes algorithm, so it is not suitable for disk failure prediction. This is shown in Fig. 27.

Fig. 27

Count of actual and predicted test data—Naïve Bayes

RandTreeRecursiveBoosting

Figure 28 shows the confusion matrix of the RandTreeRecursiveBoosting model. Of the 10,035 test instances, 9978 are correctly classified as disk pass and 56 as disk failed. One disk instance is misclassified.

Fig. 28

Confusion matrix—RandTreeRecursiveBoosting

In the graph in Fig. 29, blue represents disk pass and orange represents disk failed. Disks in a healthy state are mostly predicted correctly: the precision, recall and F1 score for healthy disks are 1, and the values for disks whose health is not good are close to 1.

Fig. 29

Performance metrics—RandTreeRecursiveBoosting

Using this learning model, 99.99% of the test data is correctly classified and 0.01% is incorrectly classified, so its predictions are very close to the actual labels.

In the given test data, 9978 disk data instances are labelled Disk Pass and 57 are labelled Disk Fail. The predicted output shows 9978 disk data instances in Disk Pass status and 56 in Disk Fail status, as shown in Fig. 30.

Fig. 30

Count of actual and predicted test data—RandTreeRecursiveBoosting

Comparison of identified machine learning algorithms

As shown in Fig. 31, the Random classifier method classified 10,033 disk instances correctly and misclassified 2.

Fig. 31

Comparison of correctly classified and misclassified—various ML algorithms

In Fig. 32, the precision and recall values of the Random classifier are 1, higher than those of the other learning models. SVM, Logistic regression and Naïve Bayes are not suitable for disk failure prediction since they have lower precision and recall values.

Fig. 32

Comparison of precision and recall—various ML algorithms

The F-Measure value is higher for RandTreeRecursiveBoosting and the Random Forest classifier than for the KNN algorithm, whereas it is very low for the SVM, Logistic regression and Naïve Bayes algorithms. Because their F-Measure on the test data is very low, SVM, Logistic Regression and Naïve Bayes are not suitable for disk failure prediction (see Fig. 33).

Fig. 33

Comparison of F-measure—Various ML algorithms
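To illustrate how such a comparison can be computed, the sketch below ranks three of the models by the F-Measure of the Fail class, using the equivalent count-based form 2TP / (2TP + FP + FN). The (TP, FP, FN) triples are derived from the counts quoted in the KNN, SVM and logistic regression subsections; for KNN, the split of the 6 errors into 3 and 3 follows from the 9978 and 57 actual labels. The function name `f_measure_from_counts` is an assumption for illustration.

```python
# (TP, FP, FN) for the "fail" class, derived from the counts quoted in the
# KNN, SVM and logistic regression subsections.
fail_counts = {
    "KNN": (54, 3, 3),
    "SVM": (0, 0, 57),
    "Logistic regression": (22, 76, 35),
}


def f_measure_from_counts(tp: int, fp: int, fn: int) -> float:
    """F-Measure written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


ranking = sorted(
    ((name, f_measure_from_counts(*counts)) for name, counts in fail_counts.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:<20} F-Measure (fail class) = {score:.3f}")
```

The count-based form avoids computing precision and recall separately and gives 0 directly for a model that never predicts a failure.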

The accuracy score and error of disk failure prediction for the various machine learning models are given in Fig. 34.

Fig. 34

Comparison of accuracy score and error—various ML algorithms

Earliest detection

Our approach enables early detection of a failure from 3 days up to 30 days in advance with approximately 75% probability. In Fig. 35, the X-axis represents the number of days in advance that the failure is predicted, and the Y-axis represents the percentage of predictions.

Fig. 35

Number of days for earliest detection
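One way to tabulate a curve of this kind is to record, for each failed disk, the number of days between the first warning and the actual failure, and then compute the share of failures flagged at least d days early. The sketch below is only an illustration of that bookkeeping: the function name `share_detected_in_advance` and the lead times are hypothetical and are not taken from the data set used in this paper.

```python
from typing import Dict, Iterable, List


def share_detected_in_advance(
    lead_days: Iterable[int], horizons: Iterable[int]
) -> Dict[int, float]:
    """For each horizon d, the share of failures flagged at least d days early."""
    leads: List[int] = list(lead_days)
    if not leads:
        return {d: 0.0 for d in horizons}
    return {d: sum(1 for x in leads if x >= d) / len(leads) for d in horizons}


# Hypothetical lead times (days between first warning and actual failure).
# These values are invented for illustration and are not from the paper's data.
example_leads = [30, 21, 14, 10, 8, 5, 3, 3, 1, 0]

shares = share_detected_in_advance(example_leads, horizons=[3, 8, 30])
for days, share in shares.items():
    print(f">= {days:2d} days in advance: {share:.0%}")
```

Plotting the horizons on the X-axis and the resulting shares on the Y-axis gives a chart with the same axes as Fig. 35.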

Conclusion

In this paper, we analysed the selection of SMART parameters and identified the SMART parameters required as features for the training models. The SMART parameters were identified using techniques such as the correlation matrix and the rank sum test. We also compared our solution with various traditional machine learning models. Our solution uses a combination of an ensemble of decision trees, built as a random forest, and boosting techniques. The random forest is built with a subset of the features. The learning model is boosted for a number of iterations defined by the value of the estimator. The estimator takes the value that reduces the error (defined by bias and variance) or the logarithmic loss. When there is no further improvement in the model, boosting is stopped and the prediction is made. The results of various existing models and of the ensemble approach, RandTreeRecursiveBoosting, are presented in this paper. They show that boosting combined with Random forest can predict disk failure with an accuracy of 99.99%, and that disk failure can be predicted 3 to 30 days in advance with 75% probability.

Future work

The machine learning model should be fine-tuned with all the identified parameters so that it learns more efficiently from the features, and the dataset should be increased to more than 1 lakh (100,000) data samples. Predicting a failure at least 8 days in advance would help the administrator to perform a proper data backup and replacement, whereas 3 days leaves too little time to replace the disk. RandTreeRecursiveBoosting should be improved with more weight and fewer logarithmic steps so that the complexity of the algorithm is also reduced.