1 Introduction

Researchers have used a variety of methods to define climatic types and delineate zones of similar climate. One popular method for defining regions is the combined use of principal component analysis (PCA) and cluster analysis. PCA is a data reduction technique that reorients the data so that the first few dimensions account for as much of the available information as possible. Working with fewer dimensions makes it easier to visualize the data and identify interesting patterns [1]. In defining regions, the fundamental modes of the PCA are used in the clustering process [2]. Typically, one of three rules is taken as a guideline for extracting components: the scree plot, Kaiser’s rule, or the proportion of explained variance [3, 4]. The scree plot proposed by [5] is a graphical approach that plots the variance accounted for by each principal component in descending order of eigenvalue. For a high-dimensional data set, especially a rainfall data set, this approach is unsuitable because the steep curve followed by a bend, which marks the cutoff for the number of principal components, is not clearly visible. When the scree plot is not diagnostic, Kaiser’s rule may come in handy. This rule retains components according to the amount of variance they account for: components whose eigenvalues exceed the average eigenvalue (i.e. \( \lambda > 1 \) when the PCA is based on the correlation matrix) are retained because these axes summarize more information than any single original variable [17]. The number of components with \( \lambda > 1 \) therefore gives the number of principal components to retain. In some data sets, however, several eigenvalues are close to 1, so the corresponding components might also be considered significant and the cutoff becomes ambiguous. For this reason, the method has been criticized by [18, 19].
For high-dimensional data sets, [5] recommended using a 70% cumulative percentage of variance as a rough guide for cutting off the number of principal components. To test whether 70% is appropriate for defining regions, we examine a range of cumulative percentages (65% to 90%) to choose a suitable cutoff for defining climate regions. Extracting the correct number of components is crucial in defining regions because it dictates the true regional boundaries. As far as we know, no literature shows how to choose the appropriate number of components based on the breakdown point of the number of clusters.
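As an illustration of how Kaiser’s rule and the cumulative percentage rule can disagree, the following minimal sketch applies both to a list of eigenvalues. The function names and the eigenvalues are hypothetical and chosen only for demonstration; they are not taken from the rainfall data in this study.

```python
def kaiser_count(eigenvalues):
    """Number of components whose eigenvalue exceeds the average eigenvalue."""
    mean_ev = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for ev in eigenvalues if ev > mean_ev)


def cumulative_count(eigenvalues, threshold):
    """Smallest number of leading components whose share of variance reaches threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to 6,
# so the average eigenvalue is 1). The third value sits close to 1.
eigenvalues = [2.5, 1.5, 1.02, 0.5, 0.3, 0.18]

print("Kaiser's rule:", kaiser_count(eigenvalues))
for pct in (65, 70, 75, 80, 85, 90):
    print(pct, "% ->", cumulative_count(eigenvalues, pct / 100), "components")
```

With these illustrative values, a component whose eigenvalue is only marginally above 1 decides the Kaiser count, while the cumulative rule changes its answer only when the threshold crosses one of the partial sums.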

The cutoff for the number of components depends on the structure of the data set. Climate data, especially rainfall data in Peninsular Malaysia, contain many zero-bound values, which signify that the observed rainfall is less than 1.0 mm [7]. These zero-bound data might influence the choice of cumulative percentage of variance. This is seen clearly when cluster analysis is applied to standardized and unstandardized principal component scores: the number of clusters is sensitive to whether the scores are standardized.

In this study, we establish a procedure for choosing the best cumulative percentage of variance to retain when defining regions. We also investigate the effect of standardized and unstandardized principal component scores in order to mitigate the effect of zero-bound data.

2 Data

Daily rainfall totals for the 33-year period 1975–2007 were obtained from 75 stations across Peninsular Malaysia. The rainfall data set considered in this study is a matrix comprising data from 75 stations and 365 days, which constitutes enough data to allow regions to be defined. In this study, a wet day is defined as a day with at least 1 mm of rainfall [7]. Figure 1 shows the geographical locations of the stations used in this study.

Fig. 1. The location of 75 rainfall stations in Peninsular Malaysia

Before the clustering process was applied, the need for standardization of the daily rainfall data was examined. Standardization is an important part of this analysis because the mean and variance are likely to be small as a consequence of the zero-bound data. Standardization affects the result of the cluster analysis, in which rainfall stations are likely to be clustered together even if they are poorly correlated. Because of the zero-bound data, the usual standardization method requires some adjustment; here the data were standardized by dividing by the station mean, as given by

$$ x_{ij}^{*} = \frac{{x_{ij} }}{{\frac{365}{p}\sum\nolimits_{j = 1}^{p} {x_{ij} } }} $$
(1)

where the denominator is the daily mean rainfall at station i, calculated from the p = 365 × 33 daily observations, multiplied by 365. Daily rainfall is then expressed as a proportion of the mean total over 365 days [6].
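Equation (1) can be sketched for a single station as follows. This is a minimal illustration that follows the formula as written, with the denominator (365/p) Σ x_ij; the function name and the two-year record are hypothetical and are not data from this study.

```python
def standardize_station(daily_rain, days_per_year=365):
    """Apply Eq. (1) to one station's record of p daily rainfall totals (mm)."""
    p = len(daily_rain)
    # Denominator of Eq. (1): (365 / p) * sum of all daily totals at the station.
    denominator = (days_per_year / p) * sum(daily_rain)
    if denominator == 0:
        raise ValueError("station has no recorded rainfall")
    return [x / denominator for x in daily_rain]


# Hypothetical two-year record (p = 730) dominated by zero-bound days.
record = [0.0] * 700 + [10.0] * 30
scaled = standardize_station(record)

print(scaled[0], scaled[-1])
print(sum(scaled))  # equals p / 365, the number of years in the record
```

Because the scaling uses only the station total, zero-bound days stay at zero and no variance estimate is needed, which is the property that matters for data with many dry days.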

To validate the results of this study, we analyze a rainfall data set from another country whose rainfall has the same characteristics as that of Peninsular Malaysia. Daily rainfall data from 11 rainfall stations in Indonesia, recorded from 2003 to 2005, were obtained. The data set was assembled in the same way as the data matrix for Peninsular Malaysia: the rows represent the rainfall observations (365 rainfall days) and the columns represent the 11 stations. As with the Peninsular Malaysia data, standardization is necessary to overcome the problem of zero-bound data.

3 Methods

3.1 Principal Component Analysis

Principal components of the scaled rainfall data were computed from the correlation matrix in order to extract the main modes of variation and to reduce the data from a high to a low dimension. This procedure requires several decisions in order to obtain the best cumulative percentage of variance or, in other words, the best number of components to retain. As mentioned in the introduction, several methods are available for choosing the number of components to retain. In this study, we used the explained variance. With this method, the challenge lies in selecting an appropriate threshold percentage. If we choose a high percentage, such as 90% or above, we risk inflating the importance of noise, which results in poorly defined regions. Conversely, if we choose a low cumulative percentage, observations that are not well represented will be clustered together because they have low scores on all of the components. We therefore designed this study to determine the best range of cumulative percentage of variance for defining regions.

3.2 Calinski and Harabasz Index

Cluster analysis using the k-means method was then performed on the principal component score matrix. The drawback of the k-means method is that the number of clusters must be specified before the algorithm is applied. To counter this issue, we use the Calinski and Harabasz index as a guide to the best number of clusters for our data set. The Calinski and Harabasz index is computed as

$$ \frac{{\text{trace}}\,B/\left( {k - 1} \right)}{{\text{trace}}\,W/\left( {n - k} \right)} $$
(2)

where:

n = total number of items

k = number of clusters

B = between-cluster sum of squares and cross-products matrix

W = pooled within-cluster sum of squares and cross-products matrix

The maximum value of the index was used to indicate the correct number of partitions in the data set.
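A minimal sketch of Eq. (2) is given below. It assumes the cluster labels have already been produced by some clustering step (k-means in this study, which is not reimplemented here), and it uses the fact that trace B and trace W are the between-cluster and pooled within-cluster sums of squared distances. The function name and the six two-dimensional points are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Eq. (2): [trace B / (k - 1)] / [trace W / (n - k)] for a given partition."""
    n = len(points)
    dims = range(len(points[0]))
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need between 2 and n - 1 clusters")

    grand_mean = [sum(p[d] for p in points) / n for d in dims]
    trace_b = 0.0  # between-cluster sum of squares
    trace_w = 0.0  # pooled within-cluster sum of squares
    for members in clusters.values():
        centre = [sum(p[d] for p in members) / len(members) for d in dims]
        trace_b += len(members) * sum((centre[d] - grand_mean[d]) ** 2 for d in dims)
        trace_w += sum((p[d] - centre[d]) ** 2 for p in members for d in dims)
    # Note: a partition with zero within-cluster scatter would divide by zero here.
    return (trace_b / (k - 1)) / (trace_w / (n - k))


# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])
bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

In practice the index would be evaluated once per candidate number of clusters, and the partition with the largest value would be kept.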

4 Results and Discussion

In this section, we discuss the choice of cumulative percentage used to cut off the number of principal components and the sensitivity of the number of clusters to that choice. We also show how the clustering result differs between standardized and unstandardized principal component scores. To validate the regions defined for Peninsular Malaysia, we compare the results with those for rainfall data from Indonesia, which has similar characteristics.

The choice of cumulative percentage of variance determines the number of components to retain. For example, Table 1 shows that nine components are retained at a 65% cumulative percentage, whereas 13 are retained at 70%. The most significant effect of this choice is on the number of clusters obtained, which is sensitive to it. For instance, in Fig. 2, a 65% cumulative percentage gave three clusters; when an additional 5% of variance was retained, the number of clusters changed from three to five. Fig. 3 shows the same behaviour: a 65% cumulative percentage of variance gave two clusters, while 70% gave six. However, for defining regions, a cumulative percentage above 70% was not a good cutoff for the number of principal components. As Figs. 2 and 3 show, the resulting number of clusters then remained the same even with an additional 5% of variance at each step. Moreover, the number of clusters obtained there is too small, because defining regions requires more clusters so that regions can benchmark their cluster against other regions [20]. This is supported by [14], which stated that a small number of clusters, e.g. two, is insufficient for defining regions when a considerable extent of regions is analysed. It is also consistent with [11], and the sensitivity of clustering results to the number of principal components retained has been noted elsewhere [21, 22].

Table 1. Results of standardized principal component score and number of clusters obtained using Calinski and Harabasz Index for Peninsular Malaysia
Table 2. Results of standardized principal component score and number of clusters obtained using Calinski and Harabasz Index for Indonesia
Fig. 2. Determined number of clusters for standardized principal component score for Peninsular Malaysia

Fig. 3. Determined number of clusters for standardized principal component score for Indonesia

Because the clustering results are sensitive to the number of retained principal components, the correct number of components must be identified. It is important that the variation between the clusters is represented in the direction of at least one of the principal components [12]. Accordingly, it is better to err towards retaining more principal components rather than too few [13]. If too few components are retained, observations that are not well represented will cluster together because they have low scores on all the components, whereas including too many principal components inflates the importance of noise and results in poorly defined regions [6]. Clustering results are less sensitive to the choice of cumulative percentage of variance when the component scores are left unstandardized than when they are standardized. In Tables 3 and 4, the number of clusters remains the same even though the cumulative percentage of variance is increased by 5% at each step. This happens because the lowest-order modes, which represent the noise in the data, are given minimal weighting. Standardized principal component scores are therefore needed to ensure that all temporal modes are given equal weight, so that frequently occurring rainfall distribution patterns are treated as equal to unusual patterns and to noise components.
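The difference in weighting between unstandardized and standardized scores can be sketched with NumPy as follows. This is an illustrative sketch on synthetic data, not the procedure or data of this study: the function name, the random data, and the choice of three components are all assumptions. Unstandardized scores have variance equal to the eigenvalue of their component, so low-variance modes contribute little to Euclidean distances; dividing by the square root of the eigenvalue gives every retained mode unit variance.

```python
import numpy as np


def pc_scores(data, n_components):
    """Return (unstandardized, standardized) scores from a correlation-matrix PCA.

    data has shape (observations, variables).
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    raw = z @ eigvecs                                  # column variance = eigenvalue
    standardized = raw / np.sqrt(eigvals)              # column variance = 1
    return raw, standardized


# Synthetic data: 200 observations of 5 correlated variables plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = base @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(200, 5))

raw, standardized = pc_scores(data, n_components=3)
print(raw.var(axis=0, ddof=1))           # decreasing: later modes carry less weight
print(standardized.var(axis=0, ddof=1))  # all ones: every mode weighted equally
```

Clustering on `raw` therefore changes little when a further low-variance component is added, whereas clustering on `standardized` treats each added component as a full extra dimension.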

Table 3. Results of unstandardized principal component score and number of clusters obtained using Calinski and Harabasz Index for Peninsular Malaysia
Table 4. Results of unstandardized principal component score and number of clusters obtained using Calinski and Harabasz Index for Indonesia

To obtain the best number of clusters, the Calinski and Harabasz index was applied to the principal component score matrix. In Table 1 for Peninsular Malaysia, the number of clusters ranges from two to five, while in Table 2 for Indonesia it ranges from two to six. The optimum number of clusters was established as three for Peninsular Malaysia and four for Indonesia, where the index reached its maximum value.

5 Conclusion and Recommendations

This study has shown that the PCA method is particularly well suited to the regionalization of rainfall. It allows the grouping of stations with similar characteristics and the recognition of climatic regions in the alpine domain [21]. Typically, defining climate regions requires retaining a larger number of clusters. With too few groups, it is difficult to differentiate the newly defined regions and to analyze them. Hence, the following recommendations are made for using cluster analysis with PCA to define new regions:

  1. If too few components are retained, observations that are not well represented will be clustered together because of low scores on all of the components; if more components, or a higher cumulative percentage, are retained, the defined regions become poor because the importance of noise is inflated. Therefore, the most suitable cumulative percentage for defining regions is between 65% and 70%.

  2. The principal component scores should be standardized, which makes the clustering result sensitive to the number of components retained.

  3. A validity index is recommended for determining the best number of clusters to define regions.

In general, many methods are available for defining regions, such as modeling and regression methods. Our proposed method may also be used by researchers to define climate regions in their own countries, and the recommendations above can serve as a guideline for other researchers working on similar topics. That said, we do not claim that the results are accurate for all cases, as they are based on rainfall data from Peninsular Malaysia and Indonesia. Both countries are in Asia, so their weather and seasons differ from those of other zones.