
1 Introduction

Artificial neural network (ANN) models have been used for rainfall prediction [1, 2] and have been found suitable for handling large, complex datasets, particularly those of a nonlinear nature. Several methods apart from artificial neural networks have been used for forecasting rainfall [3]; however, ANNs have proved useful in identifying complex nonlinear relationships between input and output variables. As the number of hidden layers is increased for better performance [4], the model enters the domain of deep learning, which is well suited to rainfall prediction. Understanding the mathematics behind deep neural network models enables the selection of a suitable architecture, the fine-tuning of the model, the setting of hyperparameter values and the application of appropriate optimization. However, the success of a model for prediction or classification is directly determined by the data used to train it. Real-world data in its raw form may not be suitable for training, which underlines the importance of data pre-processing.

Data pre-processing refers to improving the quality of the data and involves data cleaning, data reduction, data transformation and data encoding. In data cleaning, missing values, duplicate values and outliers are dealt with. Data reduction reduces the number of features, chiefly to mitigate the curse of dimensionality. Data transformation scales the data using either normalization or standardization. Data encoding converts categorical features given as text into numbers.

This paper presents the steps involved in pre-processing a raw labelled dataset (Seattle weather) with 25,551 records to make it suitable as input to a deep neural network model. Insight into the data is then gained to identify an architecture suitable for the chosen problem. A deep learning model, a multilayer perceptron built with the sequential model API and dense layers, is compiled with the Adam optimizer to achieve the desired accuracy.

2 Methodology

The raw data is suitably pre-processed so that all feature variables and the target variable fit the DNN model. A quantitative approach is used to identify the relationships between variables. Pre-processing determines the number of input neurons. After pre-processing, the data is split into training and test sets and scaling is applied. The DNN-sequential approach is adopted and dense layers are added, with an activation function chosen for the hidden layers. As the problem is one of binary classification, the sigmoid function is applied at the output layer. The number of epochs is identified by observing the model loss and model accuracy curves. Test accuracy is then computed from the predictions of the trained model.

3 Data and Data Pre-processing

3.1 Data Set

The present study uses the Seattle, US weather dataset [5], which contains records of daily rainfall from January 1, 1948 to December 12, 2017. The dataset consists of five columns, described in Table 1.

Table 1 Data description for Seattle weather dataset

The dataset contains 25,551 records in total, with a memory footprint of approximately 998 KB. Here DATE, PRCP, TMAX and TMIN are the features (X), and RAIN is the target variable (Y). RAIN is categorical with two possible values, True (rain) or False (no rain), so the problem is one of binary classification. PRCP, TMAX and TMIN are continuous numerical values, and DATE is in the format YYYY-MM-DD. Table 2 shows the first five records of the dataset, and Table 3 gives its statistical description.

Table 2 First five records of Seattle weather dataset
Table 3 Statistical description of Seattle weather dataset
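As an illustrative sketch of loading and inspecting the data (the file name seattle_weather.csv is hypothetical; the column layout is assumed to match Table 1):

```python
import pandas as pd

# Hypothetical file name; columns assumed to match Table 1:
# DATE, PRCP, TMAX, TMIN, RAIN.
df = pd.read_csv("seattle_weather.csv")

df.info()             # 25,551 records, dtypes, ~998 KB memory usage
print(df.head())      # first five records (Table 2)
print(df.describe())  # statistical description (Table 3)
```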

3.2 Data Pre-processing

Deep neural network (DNN) models, like any other machine learning model, require the data to be pre-processed before it is passed to the input neurons. An important step is to identify missing values and, if found, treat them appropriately: they can be dropped, or substituted by the mean, the median or another relevant value such as 0 or 1. It is also necessary to check for duplicate data so that the model is not biased by repeated records. The raw dataset may contain columns of little importance, identified during pre-processing, which can be excluded as input to the model. Conversely, new columns may need to be derived from existing ones to extract more feature value for the model. Since a deep learning model takes real-valued input at its input layer neurons, text-to-number encoding must be done beforehand.

As shown in Table 4, the dataset contains three null values in the PRCP and RAIN columns. Before the data is sent to the model, these null values must be treated appropriately. Since three records are negligible compared to the total number of records, dropping them is recommended, and accordingly they are removed, leaving 25,548 records for further processing. Next, the dataset is checked for duplicates, and no duplicate records are found. A sketch of this cleaning step is given after Table 4.

Table 4 NaN values in dataset
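A minimal sketch of the cleaning step, continuing with the DataFrame df loaded above:

```python
# Count missing values per column; Table 4 reports three,
# in the PRCP and RAIN columns.
print(df.isnull().sum())

# Three rows are negligible against 25,551 records, so drop them.
df = df.dropna(subset=["PRCP", "RAIN"])

# Confirm there are no duplicate records.
print(df.duplicated().sum())  # expected: 0
print(len(df))                # expected: 25,548
```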

The DATE field is split into ‘YEAR’, ‘MON’ and ‘DAY’. In the ‘RAIN’ field, True is replaced by 1 and False by 0; that is, the Boolean text data is converted to numeric. The feature set X thus consists of ‘PRCP’, ‘TMAX’, ‘TMIN’, ‘YEAR’, ‘MON’ and ‘DAY’, with ‘RAIN’ as the target column Y. The resulting data is shown in Table 5, and a sketch of this step follows the table.

Table 5 Sample data after splitting DATE column
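A minimal sketch of the split and encoding, assuming the RAIN column was parsed as Boolean values:

```python
# Split DATE (YYYY-MM-DD) into YEAR, MON and DAY columns.
df["DATE"] = pd.to_datetime(df["DATE"])
df["YEAR"] = df["DATE"].dt.year
df["MON"] = df["DATE"].dt.month
df["DAY"] = df["DATE"].dt.day

# Encode RAIN as 1 (True) / 0 (False).
df["RAIN"] = df["RAIN"].astype(int)

# Feature matrix X and target y, as listed in Table 5.
X = df[["PRCP", "TMAX", "TMIN", "YEAR", "MON", "DAY"]]
y = df["RAIN"]
```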

3.3 Data Insight

The dataset spans the years 1948 to 2017, and the month-wise distribution of records over this period is shown in Table 6.

Table 6 Month-wise total records, 1948–2017

The rainfall experienced in each month over the period 1948 to 2017 is shown in Table 7. The lowest rainfall is observed in July and the highest in December.

Table 7 Month-wise rainfall records, 1948–2017

The histogram for the column RAIN is shown in Fig. 1. There are 14,648 records with no rainfall and 10,900 records with rainfall; a sketch reproducing these counts follows the figure.

Fig. 1 Histogram for column RAIN
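A minimal sketch of the class counts shown in Fig. 1, computed from the target column:

```python
import matplotlib.pyplot as plt

# Class balance of the target: 0 = no rain, 1 = rain.
print(y.value_counts())  # expected: 14,648 zeros, 10,900 ones

y.value_counts().plot(kind="bar")
plt.show()
```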

To obtain a scatter plot of precipitation against temperature, an additional column AVGTEMP is created from TMAX and TMIN. The resulting scatter plot is shown in Fig. 2. Referring to Table 3, the maximum value of PRCP is 5.02, which Fig. 2 shows to be an outlier. PRCP values above 2.0 occur only in a few cases, all with an average temperature between 48 and 60 °F. A sketch of this step follows the figure.

Fig. 2 Scatter plot
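A minimal sketch of this step; the paper does not state how AVGTEMP is computed, so the simple mean of TMAX and TMIN is an assumption:

```python
import matplotlib.pyplot as plt

# Assumed definition: average of the daily maximum and minimum.
df["AVGTEMP"] = (df["TMAX"] + df["TMIN"]) / 2

# Scatter plot of precipitation against average temperature (Fig. 2);
# the PRCP maximum of 5.02 stands out as an outlier.
plt.scatter(df["AVGTEMP"], df["PRCP"], s=4)
plt.xlabel("AVGTEMP (°F)")
plt.ylabel("PRCP")
plt.show()
```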

Precipitation is the water released from clouds in the form of rainfall [6]. Referring to Table 8 [7] and observing Fig. 3, it is seen that Seattle experiences moderate rainfall from October to April and light rainfall from May to September each year.

Table 8 Precipitation intensity
Fig. 3 Precipitation trend

3.4 Train–Test Split

To estimate the performance of machine learning algorithms, the data is split into training and testing sets. Usually, the data is split three ways into train, validation and test sets: the training data is used to train the model, the validation data to validate it and the test data to test it. In the present experiment, the data is split only into training and test sets, in an 80:20 ratio, that is, 80% training data and 20% test data.
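A minimal sketch with scikit-learn; the random seed is our addition for reproducibility and is not stated in the paper:

```python
from sklearn.model_selection import train_test_split

# 80:20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```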

3.5 Feature Scaling

Machine learning algorithms such as linear regression, logistic regression and neural networks that use gradient descent as an optimization technique require the data to be scaled [8]. The columns of the Seattle weather dataset span widely varying ranges of values, so scaling brings all values onto the same scale before they are given to the deep neural network model. Scaling can be done using either normalization, which maps values to the range 0 to 1, or standardization, which centres values around the mean. Since the data contains outliers in PRCP and AVGTEMP, standardization is applied to the training and test data.
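A minimal sketch of standardization with scikit-learn; fitting the scaler on the training split only, to avoid leaking test statistics, is assumed standard practice, as the paper does not specify:

```python
from sklearn.preprocessing import StandardScaler

# Fit on training data only, then apply the same transform to test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```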

4 Deep Neural Network Model

A neural network model can be described in simple terms as a mathematical function that maps the input to the desired output. It comprises an input layer, an output layer, an arbitrary number of hidden layers, a set of weights and biases between each layer, and a choice of activation function and loss function.

Once the data transformation is complete, a DNN-sequential model is applied to the pre-processed data, as it allows layer-by-layer model building, forming a network of dense layers. Since there are six input features (all columns of Table 5 except ‘RAIN’, the target variable), the first dense layer is set to accept six features. The rectified linear activation function ‘ReLU’ is used at the hidden layers, and a further dense layer of four neurons with ‘ReLU’ activation is added. As the problem is one of binary classification, the ‘sigmoid’ function is used at the output layer, producing an output between 0 and 1 that is thresholded to 1 or 0, for rainfall or no rainfall, respectively. The model thus has six inputs, two hidden layers of six and four neurons, and an output layer with a single output. The model is implemented using keras.io [9]. The model summary is shown in Table 9; a sketch of the model construction follows the table.

Table 9 Model summary
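A minimal Keras sketch of this architecture; the paper states only the layer sizes and activations, so the import style and layer ordering here are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 6 inputs -> dense(6, ReLU) -> dense(4, ReLU) -> dense(1, sigmoid),
# matching Table 9 (42 + 28 + 5 = 75 parameters).
model = keras.Sequential([
    keras.Input(shape=(6,)),
    layers.Dense(6, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.summary()
```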

Deep neural network models are trained with adaptive optimization algorithms such as Adam [10], Adagrad [11] and RMSprop [12]; ‘adaptive’ here means that an individual learning rate is computed for each parameter. The model is compiled with the Adam optimizer, which can be seen as a combination of RMSprop and stochastic gradient descent with momentum [13], with a few distinctions. As the problem is binary classification with target values in {0, 1}, the loss is computed using cross-entropy [14]. The model is fitted to the training data over 10 epochs with a batch size of 64. One epoch is a complete pass through the training data; the number of epochs is a hyperparameter for which there is no rule of thumb. The batch size is the number of samples processed in a single mini-batch.
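A minimal sketch of compilation and training; the validation split is an assumption, since the paper reports validation metrics (Table 11) without stating how the validation set was obtained:

```python
# Adam optimizer and binary cross-entropy, as described above.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# 10 epochs, batch size 64; validation_split=0.2 is an assumed value.
history = model.fit(X_train_scaled, y_train,
                    epochs=10, batch_size=64,
                    validation_split=0.2)
```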

5 Results

In this section, the results generated at various stages are discussed.

5.1 Weights and Bias

The first layer dense_1, with output shape (None, 6), has 42 parameters, that is, 36 weights and 6 biases; similarly, dense_2 has 24 weights and 4 biases, and dense_3 has 4 weights and 1 bias, for a total of 75 parameters. The values learned for each weight and bias are shown in Table 10; a sketch of how they are read from the model follows the table.

Table 10 Weight and bias at hidden layers
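A minimal sketch of reading the learned parameters back from the trained model:

```python
# Each dense layer holds a weight matrix and a bias vector (Table 10).
for layer in model.layers:
    weights, biases = layer.get_weights()
    print(layer.name, weights.shape, biases.shape)
    print(weights, biases)
```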

5.2 Training and Validation: Loss and Accuracy

After the model is compiled, it is fitted to the training data over 10 epochs with a batch size of 64. As the number of epochs increases, more is learned, and both training and validation accuracy rise. The training and validation loss and accuracy at each epoch are shown in Table 11. The model took 2.48 s to train.

Table 11 Training and validation loss and accuracy

Figure 4 shows that with each epoch the model loss decreases and the model accuracy increases. From the 8th epoch onwards the curves start to flatten, and by the 10th epoch they become stagnant. Here, the batch size is 64. Training is therefore stopped at this point, and the model is used on the test data. A sketch reproducing these curves follows the figure.

Fig. 4 Model accuracy and model loss
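A minimal sketch of reproducing the curves of Fig. 4 from the history object returned by model.fit:

```python
import matplotlib.pyplot as plt

# Training and validation loss/accuracy per epoch (Table 11, Fig. 4).
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="val accuracy")
plt.xlabel("Epoch")
plt.legend()
plt.show()
```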

5.3 Test Loss and Test Accuracy

The model is applied to the test data, and the resulting test loss and test accuracy are shown in Table 12; a sketch of this evaluation follows the table.

Table 12 Test loss and test accuracy
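A minimal sketch of the evaluation step:

```python
# Test loss and test accuracy on the held-out 20% split (Table 12).
test_loss, test_acc = model.evaluate(X_test_scaled, y_test)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")
```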

5.4 Comparative Analysis

The target variable in the present problem is binary: 1 for rain and 0 for no rain. It is therefore a classification problem, and with appropriate pre-processing several machine learning classification algorithms can be fitted to this dataset. Logistic regression [15] is one such model well suited to binary classification. Fitting the training data to a logistic regression model yields a test accuracy of approximately 0.933.
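A minimal sketch of this baseline with scikit-learn, reusing the same scaled splits (the solver settings are assumptions):

```python
from sklearn.linear_model import LogisticRegression

# Logistic regression baseline on the same train-test split.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)
print(logreg.score(X_test_scaled, y_test))  # ~0.933 reported above
```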

6 Conclusion

This paper presents the steps involved in pre-processing raw data before passing it to a deep neural network model. The architecture of the model is shaped by the feature vectors given as input and by the target variable expected as output, while the choice of activation function and optimizer affects the loss. Here, the Seattle weather data, available for the period 1948 to 2017, is used for rainfall prediction. Predicting rainfall on a particular day is a binary classification task, for which a deep neural network is trained. A sequential model with dense layers, ReLU activation at the hidden layers, a sigmoid function at the output layer and the Adam optimizer, trained for 10 epochs with batch size 64, achieves a test accuracy of 97.33%. When the training data is fitted to a logistic regression model instead, the accuracy is 93.3%. The DNN model is therefore recommended: logistic regression gives a linear decision boundary, whereas the DNN model is useful for more complex, nonlinear data.