1 Introduction

Nowadays, the amount of data generated by all sectors of the economy and society is growing exponentially. Information and communication technologies, together with the automation of industrial processes, are currently among the most important generators of data. For instance, the internet has become the biggest producer of data in the history of humanity.

The comprehension of data is an activity that has always accompanied human beings. However, in this “Information Age”, the capacity and the necessity of acquiring, producing and generating data have reached unimaginable dimensions. As a result, conventional data processing methodologies have become obsolete. This is the reason why the new concept of “Big Data” is emerging. Big Data can be considered a modern socio-technical phenomenon [1] that appears as a consequence of the current massive data generation. Some well-known companies that employ Big Data to obtain useful information are Google, eBay, Amazon, Facebook, Twitter, IBM, LinkedIn, AOL, etc. [2]. For example, it is estimated that Google processes more than 25 petabytes (\(25 \times 10^{15}\) bytes) every day.

Big Data has been defined in the industrial field by six dimensions that can be called “The 6 Vs” (see Fig. 1). The term covers the following dimensions [3, 4]: Volume (the amount of data), Velocity (the speed at which data are created), Variety (the different natures of the data), Veracity (the certainty of the data meaning), Validity (the accuracy of the data) and Volatility (how long the data need to be stored).

Fig. 1
figure 1

Dimensions of Big Data

These dimensions determine the type of Big Data being considered. The complexity of a Big Data analysis is mainly determined by the volume, velocity, variety and volatility, whereas the usefulness of the analysis usually depends on the validity and veracity of the data.

Besides the communication systems, social networks and companies that operate online, there is a significant source of Big Data related to the digital sensors deployed worldwide in industrial equipment, automobiles, electrical meters and shipping crates. These sensors are capable of measuring locations, movements, voltages, vibrations, magnetic fields and countless other variables of a given system. This concerns not only the volume of data that is generated but also the variety of such data.

This chapter is focused on the particular case of the Big Data generated by the equipment of the wind farms.

2 Big Data and Wind Turbines

The large amount of data generated by monitoring systems results in complex scenarios when the data need to be treated. Data can come from different sources and their content can be completely heterogeneous. Even so, the information can be correlated and its sorting can be useful for decision making. This situation is common to almost all industrial sectors, where the incorporation of new technologies and the emergence of Condition Monitoring Systems (CMS) supported by Supervisory Control and Data Acquisition (SCADA) systems make data processing a critical factor [5].

The field of renewable energies is one of the sectors where the previous issue arises. The high volume of data used in operations and maintenance (O&M) tasks makes the introduction of Big Data a key factor. Wind farms usually divide the data analysis for decision making into three categories: descriptive analysis, post-event diagnostics and prognostics. The first category identifies features through statistical calculations and graphics. The second analyses the cause-effect relationships behind any deviation from a threshold. Finally, prognostics predict system changes [6].

Descriptive analysis is the basis of the following steps. Data collection must be as wide as possible to obtain a first approach. One of the first relationships that wind farms consider is the connection between wind speed and power output. This is because different wind farms can have wind turbines with similar specifications, and their comparison can reveal the most efficient conditions.

Prognostic analysis is based on predictive modelling, where several techniques such as regression trees or neural networks can be introduced to obtain an accurate model. Diverse inputs can be considered to develop the model, e.g. speeds, electromagnetic data or vibration. The application of these techniques enables the detection of degraded performance at earlier stages [7].
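As a simplified illustration of this idea, a normal-behaviour model can be fitted on healthy history and degraded performance flagged when measured power falls well below the prediction. The sketch below uses a plain polynomial model and synthetic data; the model form, thresholds and figures are assumptions for illustration, not the chapter's actual method.

```python
import numpy as np

def fit_normal_model(speed, power, degree=3):
    """Fit a polynomial normal-behaviour model: power = f(speed)."""
    return np.polyfit(speed, power, degree)

def flag_degraded(model, speed, power, threshold=0.15):
    """Flag samples whose relative power deficit exceeds `threshold`."""
    expected = np.polyval(model, speed)
    deficit = (expected - power) / np.maximum(expected, 1e-9)
    return deficit > threshold

# Synthetic healthy data: power grows roughly with the cube of speed
rng = np.random.default_rng(0)
speed = rng.uniform(4.0, 12.0, 500)
power = 2.0 * speed**3 + rng.normal(0.0, 20.0, 500)

model = fit_normal_model(speed, power)
# A degraded sample producing only 60% of the expected power at 10 m/s
degraded = flag_degraded(model, np.array([10.0]), np.array([1200.0]))
print(bool(degraded[0]))  # a 40% deficit exceeds the 15% threshold
```

More sophisticated learners (regression trees, neural networks) would replace the polynomial, but the detection logic, comparing measurements against a model of healthy behaviour, stays the same.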

2.1 Condition Monitoring Approaches for Wind Turbines

Most wind turbines (WTs) are three-blade units [8, 9]. The energy captured by the blades is transmitted from the main shaft to the generator through the gearbox. The nacelle is found at the top of the tower, which is assembled on the foundation. A yaw system controls the alignment of the nacelle with the direction of the wind. A pitch system is mounted in each blade to position it depending on the wind; it also acts as an aerodynamic brake when needed. Finally, a meteorological unit provides information about the wind (speed and direction) to the control system.

Condition monitoring (CM) is implemented from the basic operation of the equipment under study [10]. The system provides the “condition”, i.e. the state of a characteristic parameter that represents the health of the component(s) being monitored. CM relies on different sensors and signal processing equipment installed in WTs. The main purpose is to monitor components ranging from blades, gearboxes and generators to bearings or towers.

CM reduces interferences during the transport of the measured features. Data processing, sorting and manipulation according to the objectives pursued are usually performed by a digital signal processor. The result can then be displayed, stored or transmitted to another system. One of the advantages of these systems is, therefore, that monitoring can be processed online or at certain time intervals. Thus, it is possible to maximise productivity, minimise downtime and increase the Reliability, Availability, Maintainability and Safety (RAMS) levels [11].

Different techniques are available for CM:

  • Vibration analysis [12].

  • Acoustic emission [13].

  • Ultrasonic testing techniques [14].

  • Oil analysis [15].

  • Thermography [16].

  • Other methods.

Accurate data acquisition is critical to determine the occurrence of a failure and the subsequent solution. This can be achieved with the optimal type, number and placement of sensors. Data acquisition is always the first step of the CM process and includes the measurement of the required conditions (e.g. sound, vibration, voltage, temperature or speed), turning them into electronic signals. Then, signal processing introduces the handling (e.g. fast Fourier transforms, wavelet transforms, hidden Markov models, statistical methods and trend analysis) and storage of the data.
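A minimal sketch of one of the signal-processing steps named above, a fast Fourier transform used to locate the dominant frequency component of a sampled vibration signal (the sampling rate and the synthetic 50 Hz signal are illustrative assumptions):

```python
import numpy as np

fs = 1000                       # sampling rate in samples/s
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic vibration: a 50 Hz component plus measurement noise
signal = np.sin(2 * np.pi * 50 * t) \
    + 0.1 * np.random.default_rng(1).normal(size=t.size)

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs) # frequency axis in Hz
dominant = freqs[np.argmax(spectrum[1:]) + 1]    # skip the DC bin
print(dominant)  # → 50.0
```

In a real CMS, peaks such as this one (and its harmonics) would be tracked over time as fault indicators for bearings or gears.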

2.2 Supervisory Control and Data Acquisition Systems for Wind Turbines

SCADA systems are currently being introduced in WTs because their effectiveness for the detection and diagnosis of failures has been proven in other industries [17]. They are presented as an inexpensive and optimal solution [18] that provides control feedback for health monitoring while reducing the O&M costs [19]. Nevertheless, they also present some minor disadvantages due to operational or reliability conditions [20].

The SCADA system considers a large amount of measurements such as temperatures or wind and energy conversion parameters [21]. These data have raised considerable interest in different areas, e.g. wind power forecasting [22], production assessment [23] and of course, for fault detection [24].

In the case of WTs, the introduction of SCADA systems allows the efficiency to be verified when components deteriorate. This degradation can indicate problems of different natures, such as misalignments in the drive-train or friction caused by bearing or gear faults. The basic elements of performance monitoring consist of a first collection of raw values by the sensors. After the application of the appropriate filters, anomalies are detected. Finally, a diagnosis is provided. Anomaly detection includes a series of techniques that range from simple threshold checks to statistical analyses [25].

3 Data Reduction Techniques

As aforementioned, wind farms are becoming a source of massive data. The purpose of these data is to describe the condition of the systems. However, the data are useless by themselves; they are only valuable when information can be gathered from them. It is necessary to process the data in order to extract useful information, but this is an arduous task when the amount of data is very large. For this reason, it is essential to employ techniques that reduce the amount of data without losing the main information that they can provide. With this purpose, two procedures are proposed in the following sections. The first one analyses a continuous signal coming from a CMS by extracting feature parameters, and the second one provides a reduction for SCADA systems by filtering out the unnecessary data.

3.1 Feature Parameters for CMSs Signals

The CMSs installed in WTs are employed to evaluate variables such as vibration, lubrication oil or the generator current signal. These systems usually provide continuous monitoring of the variables. For this reason, it is important to develop algorithms capable of detecting possible abnormal behaviours of the variables over time [26].

The main goal of this section is to perform a statistical study of the historical data of a CMS in order to obtain some feature parameters. These parameters make it possible to focus the analysis on the information that is really significant. Consequently, an important reduction of the amount of data is obtained. The feature parameters that will be used in this chapter are explained below [27–32]:

  • Average: the average can be useful for those signals without abrupt changes, i.e. signals that are almost constant. For example, it could be useful for humidity or temperature signals.

  • Peaks: The most representative peaks are usually those that correspond to a maximum value of the signal within a certain time interval. These peaks can refer to the time domain or to the different harmonics in the frequency domain. Another feature parameter related to the peaks is the peak-to-peak value, defined as the distance between the maximum and the minimum amplitude of the signal.

  • Correlation coefficient (r): This coefficient is a statistical procedure used to determine the relationship between several signals. This parameter can be used to identify important changes between a received signal and the historical data. It can run from −1 (perfect negative correlation) to 1 (perfect positive correlation). It is 0 when the signals are totally independent. The correlation coefficient between two signals x and y can be obtained as follows:

$$r = \frac{{N\left( {\mathop \sum \nolimits_{n = 1}^{N} xy} \right) - \left( {\mathop \sum \nolimits_{n = 1}^{N} x} \right)\left( {\mathop \sum \nolimits_{n = 1}^{N} y} \right)}}{{\sqrt {\left( {N\mathop \sum \nolimits_{n = 1}^{N} x^{2} - \left( {\mathop \sum \nolimits_{n = 1}^{N} x} \right)^{2} } \right)\left( {N\mathop \sum \nolimits_{n = 1}^{N} y^{2} - \left( {\mathop \sum \nolimits_{n = 1}^{N} y} \right)^{2} } \right)} }}$$
  • Root Mean Square (RMS): This is a time analysis feature that corresponds to the measure of the signal power. It can be useful for detecting some out-of-balance in rotating systems. It can be calculated by:

$$RMS = \sqrt {\frac{{\mathop \sum \nolimits_{n = 1}^{N} \left( {y\left( n \right)} \right)^{2} }}{N}}$$

where N is the total number of discrete values of the signal y. Another common parameter is the Delta RMS, which is the difference between the current RMS value and the previous one.

  • Standard Deviation: This parameter is used to obtain the dispersion of a data set. It can be calculated by:

$$SD = \sqrt {\frac{{\mathop \sum \nolimits_{n = 1}^{N} \left( {y\left( n \right) - Mean} \right)^{2} }}{N - 1}}$$
  • Skewness: This parameter is an indicator of the signal symmetry. It is defined by:

$$Skewness = \frac{{\mathop \sum \nolimits_{n = 1}^{N} \left( {y\left( n \right) - Mean} \right)^{3} }}{{\left( {N - 1} \right)S^{3} }}$$
  • Kurtosis: This parameter corresponds to the scaled fourth moment of the signal. It is a measure of how concentrated the data are around a central zone of the distribution. It is calculated by:

$$Kurtosis = \frac{{\mathop \sum \nolimits_{n = 1}^{N} \left( {y\left( n \right) - Mean} \right)^{4} }}{{\left( {N - 1} \right)S^{4} }}$$
  • Crest Factor: This parameter is capable of detecting abnormal behaviours in an early stage. It is defined by:

$$Crest\,Factor = \frac{Peak}{RMS}$$
  • Shape Indicator: This factor is affected by the shape of the signal but it is independent of its dimensions. It is obtained as follows:

$$Shape\,Indicator = \frac{RMS}{{\frac{1}{N}\mathop \sum \nolimits_{n = 1}^{N} \left| {y(n)} \right|}}$$
  • Other parameters: Other parameters are widely used such as enveloping, demodulation, FM0, NA4, FM4, M6A, M8A, NB4, sideband level factor, sideband index, zero-order figure of merit, impulse indicator, clearance factor etc.

These parameters can only be evaluated on finite signals. For this reason, it is necessary to choose some pieces of the continuous signal. The goal is to obtain the main features of the entire signal by analysing only some pieces. Therefore, the data are reduced for two reasons: firstly, a continuous signal is converted into several finite signals and, secondly, only some parameters of these finite signals are saved.
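The extraction step can be sketched as follows: each finite piece of the continuous signal is condensed into the scalar parameters defined above. The formulas follow the definitions in this section; the one-second piece is a synthetic sine used only for illustration.

```python
import numpy as np

def feature_parameters(y):
    """Condense one finite signal piece into its feature parameters."""
    n = y.size
    mean = y.mean()
    rms = np.sqrt(np.sum(y**2) / n)
    sd = np.sqrt(np.sum((y - mean)**2) / (n - 1))
    peak = np.max(np.abs(y))
    return {
        "average": mean,
        "rms": rms,
        "standard_deviation": sd,
        "peak": peak,
        "skewness": np.sum((y - mean)**3) / ((n - 1) * sd**3),
        "kurtosis": np.sum((y - mean)**4) / ((n - 1) * sd**4),
        "crest_factor": peak / rms,
        "shape_indicator": rms / np.mean(np.abs(y)),
    }

# One-second piece of a 1000 samples/s vibration signal (synthetic)
t = np.arange(0, 1.0, 0.001)
piece = np.sin(2 * np.pi * 25 * t)
params = feature_parameters(piece)
print(round(params["crest_factor"], 2))  # → 1.41 (sqrt(2) for a pure sine)
```

A pure sine has a known crest factor of \(\sqrt{2}\), which provides a quick sanity check of the implementation.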

Table 1 shows a general structure of the data using the method proposed.

Table 1 Association of data of the CMS and the condition of the WT

The element \(e_{ij}^{k}\) corresponds to the j-th parameter of the k-th piece collected at the time (date) i.

The main objective of this method is to determine the condition of the WT by comparing the historical data with the data being received. With this purpose, the historical data are subjected to a pattern recognition analysis to determine which features are significant. There are many models for pattern recognition analysis, e.g. statistical models, structural models, template matching models, neural network based models, fuzzy based models, hybrid models, etc. [33, 34].

A neural network (NN) based model will be implemented to analyse the data in this chapter. NNs are complex structures inspired by biological neurons. These structures provide a good solution for problems that cannot be analytically defined. Basically, the NN receives a dataset that is used in a training process to recognise the patterns. In this process, some weights are adapted to provide an adequate output. The different parameters of the signals will be considered as inputs, whereas the condition of the WT will correspond to the desired output of the NN. Further information about NNs can be found in Refs. [35, 36]. A case study is developed in Sect. 4.1 in order to clarify the procedure hereby explained.

3.2 Data Analysis for the SCADA System

Besides the variables cited in the previous section, other signals can be collected to complete the data acquisition of a CMS, such as power, pressures, speeds and temperatures, among others. With all these data, it is possible to track and analyse the system from the emergence of incipient failures. A SCADA system consisting of different processing tools that transform the received data into real-time analysable information is involved. The displays that comprise the system are configurable to obtain the information when and where it is needed (see Fig. 2).

Fig. 2
figure 2

SCADA system

One of the main advantages of the SCADA system that will be presented in the case studies is that it allows almost unlimited data storage at the original resolution. The software included can create and analyse process flow diagrams and graphics. The settings can be adapted to any operating system through menus and toolbars. In addition, the information can be exported to other formats, such as spreadsheets.

The second purpose of this research is to identify alarms from their location on a power curve. Likewise, it is interesting to know how many of those alarms go unnoticed by the system because they lie within the prediction bounds. The main problem associated with this task is the definition of the curve. Due to the high number of data, a pre-processing step is first carried out to remove non-significant data. This approach could also be extended to other stored signals besides the wind speed and the power.

4 Case Studies

In the former section, two methodologies for processing the Big Data coming from WTs have been proposed and explained. Both methodologies are aimed to reduce the amount of data without losing the main information. In order to clarify how these procedures have to be applied, this section presents two case studies.

4.1 Case Study for CMSs Signals

A drive-train CMS is considered for this case study. This system provides a continuous vibration signal at 8 different points of the drive-train. The sampling rate of the CMS is 1000 samples/s per point; therefore, a total of 8000 samples are received per second. The data have been collected over two years; therefore, more than \(5 \times 10^{11}\) samples have been generated by this CMS along that period of time.

In order to apply the methodology explained in Sect. 3.1, pieces of one second every three hours have been considered. Considering the sampling rate of the CMS, a total of \(4.6 \times 10^{7}\) samples are selected. As can be observed, this is the first reduction of the amount of data and it corresponds to a reduction of 99.99%. Therefore, the computational costs will be drastically reduced.
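This first reduction can be checked with simple arithmetic; all the figures below come from the case study itself (8 points at 1000 samples/s, two years of data, one-second pieces every three hours):

```python
samples_per_second = 8 * 1000              # 8 points at 1000 samples/s
seconds_in_two_years = 2 * 365 * 24 * 3600
total = samples_per_second * seconds_in_two_years  # ≈ 5 × 10^11 samples

pieces = 2 * 365 * (24 // 3)               # one piece every three hours
kept = pieces * samples_per_second         # one-second pieces, all 8 points
print(total, kept)                         # ≈ 5.0e11 and ≈ 4.7e7
print(round(100 * (1 - kept / total), 2))  # → 99.99
```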

Once the set of pieces has been chosen, the following parameters are calculated according to the definitions in Sect. 3.1: RMS, average, standard deviation, maximum peak, kurtosis, crest factor, shape factor and impulse indicator. The evaluation of these parameters allows for a further reduction of the amount of data to analyse. Concretely, a total of 46,720 data points will be used to determine the patterns in the CMS data.

The different conditions of the WT are defined in an alarm report where the state of the WT has been collected over the last two years. In this case study, the NN designed is able to differentiate between 4 possible states: “Alarm 1”, “Alarm 2”, “Alarm 3” and “No Alarms”. Each set of inputs is associated with a specific condition of the WT and the relationships are established by the NN. Therefore, the purpose of the NN is to determine the state of the WT when a new set of data is available, i.e. to predict the condition of the WT from a new set of inputs. Figure 3 shows the NN designed for this case study.

Fig. 3
figure 3

Neural network designed for the case study

The NN is formed by three layers. The input layer has 64 neurons, corresponding to the number of inputs (8 signals by 8 parameters). The output layer is composed of 4 neurons, according to the possible outputs considered for this case study. Finally, the hidden layer is composed of 16 neurons because the pyramid rule has been applied [37]. The pyramid rule suggests that the number of neurons in the hidden layer must be equal to the square root of the product of the number of input neurons and the number of output neurons.
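Applied to this case study's dimensions, the pyramid rule gives:

```python
import math

n_inputs = 8 * 8    # 8 signals × 8 feature parameters
n_outputs = 4       # the four possible WT states
n_hidden = round(math.sqrt(n_inputs * n_outputs))
print(n_hidden)  # → 16
```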

Figure 4 shows the outcomes of the NN through a confusion matrix. The confusion matrix compares the output provided by the NN (output class) with the real condition of the system (target class). The diagonal cells indicate those cases in which the outcomes of the NN are right (green cells). The values placed in the grey cells provide the percentages of success and error for each type of output. The percentages in the fifth row provide information about how many conditions of each type the NN is detecting, whereas the percentages in the fifth column express the degree of success when a certain condition has been detected. Finally, the blue cell shows a summary of the results that determines the goodness of the NN.

Fig. 4
figure 4

Confusion matrix. Results of the neural network

Figure 4 shows that the real condition of the WT can be successfully determined by this method in 71.7% of the cases. This is a very good result considering that only 0.00001% of the total available data have been employed.

Once the patterns have been recognised by the NN, the new data from the CMS can be pre-processed in order to obtain the mentioned parameters. These new data are then introduced in the NN, whose output provides information about the state of the WT. In this process, the amount of data is reduced from 8000 samples/s to only 64 samples/s, i.e. by 99.2%. Therefore, this method can be very useful to treat Big Data.

4.2 Case Study for the SCADA Systems Using Wind Speed-Power Curves

This second case study focuses on the information related to the wind speed and the power. Both features are connected through the power curve. The power curve of a wind turbine indicates the electrical power that is available depending on the wind speed. It is usually close to zero for low speeds. Then, it quickly increases until reaching 10–15 m/s. From those speeds, the curve remains constant as a result of the limitation devices attached to the turbine. This maximum power is often referred to as the nominal power. Once speeds of 20–25 m/s are reached, the wind turbine operation is cancelled due to the activation of protection mechanisms. Therefore, power curves are often not represented at speeds exceeding these limits. In short, the power curve is a useful indicator to evaluate the efficiency of a wind turbine.

Power curves are obtained from actual measurements on a wind turbine where an anemometer is strategically positioned. It must be located at a certain distance from the rotor to avoid turbulence and, therefore, a loss of reliability of the stored speed. One of the main constraints of any wind-power curve is that, in practice, the speed fluctuates, so it is important to work with mean values to represent the curve effectively. An improperly designed curve may show errors of up to 10% between the wind-power ratios.
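Working with mean values is commonly done by averaging the samples inside fixed wind-speed bins. The sketch below illustrates this approach; the 0.5 m/s bin width and the synthetic speed-power data are assumptions, not values from the study.

```python
import numpy as np

def binned_power_curve(speed, power, bin_width=0.5):
    """Average the power samples inside fixed wind-speed bins."""
    edges = np.arange(speed.min(), speed.max() + bin_width, bin_width)
    idx = np.digitize(speed, edges)
    centres, means = [], []
    for b in np.unique(idx):
        mask = idx == b
        centres.append(speed[mask].mean())
        means.append(power[mask].mean())
    return np.array(centres), np.array(means)

# Synthetic 10-min samples: rising power, capped at rated power, plus noise
rng = np.random.default_rng(3)
speed = rng.uniform(3.0, 14.0, 2000)
power = np.clip(150 * (speed - 3.0) ** 2, 0, 2000) + rng.normal(0, 50, 2000)
centres, means = binned_power_curve(speed, power)
# `means` traces a smooth curve rising towards the rated power
```

Averaging per bin suppresses the speed fluctuations that would otherwise scatter the raw samples around the curve.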

Regarding the study, the SCADA system stores signals of wind speed and power every ten minutes, i.e. 52,560 samples per year, and subdivides them into sampled, maximum, minimum and average collections, as well as the standard deviation. Once the data are extracted and converted into a software-readable format, they are reordered, from lowest to highest, to get the curve (see Fig. 5). The first representation should fit the theoretical model expected with minor exceptions (high wind speeds and power outputs).

Fig. 5
figure 5

Initial scenario

Figure 6 (left) is the result of introducing Big Data in the case study. This task has been carried out with a curve fitting tool, after a prior data selection in which the appropriate samples are identified through statistical calculations. An exploratory data analysis is used to remove outliers (alarms in some cases) as well as redundant information. In this way, the initial 52,560 samples are reduced to 841, representing a decrease in the processed data of up to 80% of the total amount (Table 2). Figure 6 also represents the data resulting from the descriptive analysis (left) versus the 904 samples indicating the occurrence of an alarm (right). It can be noted that the sum of both graphics still gives an accurate insight into the data registered by the sensors.

Fig. 6
figure 6

Post-processed curve (left) versus alarms (right)

Table 2 Descriptive analysis

A second regression analysis is conducted to finally obtain Fig. 7. Once the curve that best describes the data series is selected, a post-processing analysis can be performed. This enables the creation of a graphic with prediction bounds and the calculation of the 95% confidence intervals for the coefficient estimates.
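A simplified stand-in for this regression step is sketched below: a curve is fitted to the filtered speed-power samples and approximate 95% prediction bounds are derived from the residual spread. The study uses a dedicated curve-fitting tool; the cubic model and the synthetic data here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
speed = rng.uniform(4.0, 12.0, 800)
power = 1.2 * speed**3 + rng.normal(0.0, 30.0, 800)

coeffs = np.polyfit(speed, power, 3)       # least-squares curve fit
fitted = np.polyval(coeffs, speed)
sigma = np.std(power - fitted)             # residual spread

lower = fitted - 1.96 * sigma              # ≈95% prediction bounds
upper = fitted + 1.96 * sigma
inside = np.mean((power >= lower) & (power <= upper))
print(round(inside, 2))                    # close to 0.95 by construction
```

Samples falling outside these bounds are candidates for alarms, while the fraction inside illustrates how many events could go unnoticed by a bounds-based check.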

Fig. 7
figure 7

Wind-power curve

The prior step is critical for the development of further analyses in which alarms and operating states are linked to the power curve. The importance of this research is that some of the considered alarms were found when the drive-train was monitored. The idea, still in development, is to create a pattern recognition scheme in which alarms can be identified from their location. Something similar could happen with the information that is not detectable because it lies within the prediction bounds.

In a first approach, some unusual performances have been found, such as data positioned above the power curve. This behaviour corresponds to alarms in which currents and temperatures are involved, and it results in an uncommon speed-power ratio. However, this situation occurs in only 2% of the cases studied. The general trend is to locate the failures at up to 500 kW and from 8 to 15 m/s, but usually below the curve. In quantitative terms, this can be translated into up to 58% of the failures detected in terms of wind speed, and up to 35% in terms of power. Moreover, approximately 53% of the failures are within the prediction bounds and may go unnoticed if detection is based on this technique alone.

5 Conclusions

This chapter has deepened the analysis of the Big Data generated by the systems associated with the maintenance of wind farms. An introduction about the current importance of Big Data has been included, and an analysis of the data coming from Condition Monitoring and Supervisory Control and Data Acquisition systems has been carried out. Two methods have been proposed in order to facilitate the analysis.

The first one is based on the extraction of feature parameters from the signals provided by the Condition Monitoring System. Once the feature parameters have been obtained, a neural network is designed for pattern recognition. It has been demonstrated that, using less than 1% of the data, it is possible to determine the condition of the WT with 70% accuracy.

The second methodology analyses the data coming from a SCADA system by prior filtering and selection of the adequate data. An analysis of the wind-power curve has been performed using data from a real SCADA system. The data have been filtered and divided into two groups. The first group corresponds to points fitting the normal levels of wind-power. These points can be used to obtain statistical information about the adequate performance of the wind turbine. The second group can be used for detecting failures of certain components. This methodology allows a data reduction of up to 98% of the total for further analysis without losing precision.