Introduction

Landslides are a geo-environmental hazard occurring mainly in hilly and mountainous regions. They are often triggered by natural causes such as heavy and prolonged rainfall, snowmelt, earthquakes or volcanic eruptions, and by anthropogenic factors such as ground excavation, deforestation and land use changes (Guzzetti et al. 2005). Of these, rainfall is considered one of the most prominent triggering factors of landslides in Viet Nam (Kjekstad and Highland 2009). Landslides in this country often cause hundreds of fatalities and losses of millions of US dollars every year (Pham et al. 2016a). Landslide studies, and landslide hazard assessment in particular, have therefore become an urgent task.

Landslide hazard is defined as the probability of occurrence of a potential landslide within a specified period of time (Varnes 1984). Landslide hazard assessment therefore considers both the spatial prediction and the temporal probability of landslide occurrences. Most published studies deal with the spatial prediction of landslides (Alkhasawneh et al. 2014; Pham et al. 2016a). Only a few attempts have been made to establish the temporal probability of landslides (Guzzetti et al. 2005; Tien Bui et al. 2013), mainly because site-specific data such as the exact time, magnitude and velocity of mass movements are rarely available.

For spatial prediction of landslides, three approaches are generally considered, namely analytical methods, expert opinion based methods, and machine learning methods (Pradhan 2013). Of these, machine learning methods, which are based on statistical analysis of the spatial relationship between a set of geo-environmental factors and landslide occurrences, are known to be more effective for spatial prediction over large regions (Pradhan 2013). Kavzoglu et al. (2014) and Pradhan (2013) used support vector machines as an effective machine learning method for landslide prediction. Pham et al. (2015) and Choi et al. (2012) successfully applied artificial neural networks to landslide susceptibility assessment. Other machine learning methods such as decision trees (Alkhasawneh et al. 2014) and logistic regression (Akgun 2012) have also been widely applied for spatial prediction of landslides. In general, the performance of these methods is good; however, it can be further improved with ensemble techniques, which combine multiple classifiers (Pham et al. 2016b).

Regarding the temporal probability of landslides, two approaches can be applied: analysis of potential slope instability and statistical analysis of past landslide events (Saez et al. 2012). The first approach evaluates the impact of current slope conditions on the potential instability of slopes; however, it is not applicable to large regions (Tien Bui et al. 2013). The second approach is a probability analysis of past landslide events based on historical landslide records, especially records of rainfall-induced landslides (Corominas and Moya 2008). Of the two, the statistical approach based on rainfall-induced landslide events is considered more suitable for estimating the temporal probability of landslides (Tien Bui et al. 2013).

In terms of landslide hazard assessment, Guzzetti et al. (2005) employed a probabilistic model, whereas Terlien et al. (1995) used a deterministic model. In another study, Tien Bui et al. (2013) utilized machine learning methods such as support vector machines, logistic regression, evidential belief functions, Bayesian neural networks, and neuro-fuzzy models, integrated with probability analysis of rainfall data, to assess landslide hazard in Hoa Binh province (Viet Nam). They stated that machine learning methods integrated with probability analysis of rainfall data are a promising approach for landslide hazard assessment in landslide prone areas. Therefore, the main objective of the present study is to assess landslide hazard in the Mu Cang Chai district, Yen Bai province (Viet Nam) using a novel machine learning method, the Random SubSpace fuzzy rules based Classifier Ensemble (RSSCE), in conjunction with probability analysis of rainfall data. RSSCE, which is based on the Fuzzy Unordered Rules Induction Algorithm (FURIA) classifier and the Random SubSpace (RSS) ensemble, has been proposed for the spatial prediction of landslide occurrences. Probability analysis of the available rainfall data for the period 2008–2013 has been used for the temporal prediction of landslide occurrences.

Random SubSpace Fuzzy Rules Based Classifier Ensemble (RSSCE) Method for Landslide Spatial Prediction

Spatial prediction of landslide hazards in the present study has been carried out by the RSSCE method, a combination of the Fuzzy Unordered Rules Induction Algorithm (FURIA) classifier and the Random SubSpace (RSS) ensemble. FURIA, first introduced by Hühn and Hüllermeier (2009), is an extension of the well-known RIPPER algorithm (Cohen 1995). It learns fuzzy rules and unordered rule sets for classification. In addition, FURIA uses a rule stretching method to handle uncovered cases (Hühn and Hüllermeier 2009). As a result, it usually achieves higher accuracy than the RIPPER algorithm and the C4.5 classifier (Hühn and Hüllermeier 2009). RSS, in turn, is known as one of the most efficient ensemble techniques for improving the performance of individual classifiers (Onan 2015). It was first proposed by Ho (1998) as a way to combine multiple classifiers, each trained in a modified feature space. It also generates the training subsets that are employed to train the base classifiers (Ho 1998). RSS is therefore known as an efficient ensemble method for datasets with many redundant features and for reducing over-fitting problems (Onan 2015). The proposed RSSCE method takes advantage of both techniques to achieve the desired outcomes for spatial prediction of landslide hazards. It involves three main steps: (1) initiation, (2) optimization, and (3) classification.

  • Initiation The initial step generates the input data from data collected in the Mu Cang Chai district. The input data comprise a training dataset and a testing dataset. The training dataset has been created using 174 landslide pixels (70 % of the historical events) and 174 non-landslide pixels, whereas the testing dataset has been generated using 74 landslide pixels (the remaining 30 % of the historical events) and 74 non-landslide pixels. All fifteen landslide influencing factors have been overlaid with these landslide and non-landslide pixels to create the final datasets. The training dataset is used to construct the RSSCE model, whereas the testing dataset is employed to validate its predictive capability.

  • Optimization In this step, the RSS method has been applied to divide the training dataset into optimal sub-training datasets that are then used to train the base classifier. The RSS method is based on the theory of stochastic discrimination (Kleinberg 1990), which partitions the feature space and then constructs a classifier by combining many components that have weak discriminative capability but good generalization (Gao and Wang 2006). Finally, the RSS method combines the results of all classifiers trained on the sub-training datasets to give the final outcome of the RSSCE model.

  • Classification This step classifies pixels as landslide or non-landslide for spatial prediction of landslide hazards. It uses the FURIA method to analyze the spatial relationship between landslide occurrences and a set of geo-environmental factors using the optimal sub-training datasets obtained from the optimization step. In this step, fuzzy rules have been applied using the trapezoidal membership function given by the following equation (Hühn and Hüllermeier 2009):

$$ I^{F}(\mu) \overset{\text{df}}{=} \begin{cases} 1 & \Phi^{c,L} \le \mu \le \Phi^{c,U} \\ \dfrac{\mu - \Phi^{s,L}}{\Phi^{c,L} - \Phi^{s,L}} & \Phi^{s,L} < \mu < \Phi^{c,L} \\ \dfrac{\Phi^{s,U} - \mu}{\Phi^{s,U} - \Phi^{c,U}} & \Phi^{c,U} < \mu < \Phi^{s,U} \\ 0 & \text{else} \end{cases} $$
(1)

where \( \Phi ^{c,L} \) and \( \Phi ^{c,U} \) are the lower and upper bounds of the core of the fuzzy set (elements with membership equal to one), and \( \Phi ^{s,L} \) and \( \Phi ^{s,U} \) are the lower and upper bounds of its support (elements with membership greater than zero).
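As an illustration, the trapezoidal membership function of Eq. (1) can be sketched in a few lines of Python. The function name and the slope-angle bounds in the usage lines are hypothetical and chosen only for this example; they are not values from the study.

```python
def trapezoidal_membership(mu, s_lower, c_lower, c_upper, s_upper):
    """Membership degree of value mu in a trapezoidal fuzzy set.

    s_lower, s_upper: bounds of the support (membership > 0).
    c_lower, c_upper: bounds of the core (membership == 1).
    """
    if not (s_lower <= c_lower <= c_upper <= s_upper):
        raise ValueError("expected s_lower <= c_lower <= c_upper <= s_upper")
    if c_lower <= mu <= c_upper:
        return 1.0  # inside the core
    if s_lower < mu < c_lower:
        return (mu - s_lower) / (c_lower - s_lower)  # rising edge
    if c_upper < mu < s_upper:
        return (s_upper - mu) / (s_upper - c_upper)  # falling edge
    return 0.0  # outside the support


# Hypothetical fuzzy set "steep slope": support 20-50 degrees, core 30-40 degrees.
for slope in (25, 35, 45, 55):
    print(slope, trapezoidal_membership(slope, 20, 30, 40, 50))
```

A crisp rule interval corresponds to the special case in which the support and core bounds coincide, so the function returns only 0 or 1.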

Notably, the performance of the RSSCE model depends significantly on the selection of learning parameters, namely the number of folds, which determines the amount of data used for reduced-error pruning in the FURIA classifier, and the number of iterations used to learn the RSSCE model. These parameters have therefore been optimized by a trial-and-error process to obtain the best performance of the RSSCE model (Pham et al. 2015). As a result, in the present study the number of folds is set to 5 and the number of iterations is set to 13.
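The random subspace idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RSSCE implementation: a simple nearest-centroid learner stands in for FURIA (which is not implemented here), each ensemble member is trained on a random half of the features, the 13 members mirror the number of iterations on the assumption that one iteration yields one member, and the final class is chosen by majority vote. All function names and the toy data are hypothetical.

```python
import random


def train_centroid(rows, labels, features):
    """Base learner (stand-in for FURIA): per-class mean of the selected features."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(r[f] for r in members) / len(members) for f in features]
    return features, centroids


def predict_centroid(model, row):
    """Assign the class whose centroid is closest in the model's feature subspace."""
    features, centroids = model

    def sq_dist(centroid):
        return sum((row[f] - v) ** 2 for f, v in zip(features, centroid))

    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def train_random_subspace(rows, labels, n_models=13, subspace_size=None, seed=0):
    """Train n_models base learners, each on a random subset of the features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = subspace_size or max(1, n_features // 2)  # assumption: half of the features
    models = []
    for _ in range(n_models):
        features = sorted(rng.sample(range(n_features), k))
        models.append(train_centroid(rows, labels, features))
    return models


def predict_ensemble(models, row):
    """Combine the base learners by majority vote."""
    votes = [predict_centroid(m, row) for m in models]
    return max(set(votes), key=votes.count)


# Toy data: four normalized factors per pixel, label 1 = landslide, 0 = non-landslide.
rows = [[0.1, 0.2, 0.1, 0.0], [0.2, 0.1, 0.0, 0.2],
        [0.9, 0.8, 1.0, 0.9], [0.8, 0.9, 0.9, 1.0]]
labels = [0, 0, 1, 1]
models = train_random_subspace(rows, labels)
print(predict_ensemble(models, [0.85, 0.9, 0.95, 0.9]))
```

Because every member sees only part of the feature space, redundant factors affect only some of the members, which is the property the RSS method relies on.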

Study Area

The Mu Cang Chai district (Lat. 21°39′00″N–21°50′00″N; Long. 103°56′00″E–104°23′00″E), which is affected by landslides, has been selected as the study area. It is located in the northwest of Yen Bai province of Viet Nam (Fig. 1) and covers an area of about 1196.47 km². The area is situated in a tropical monsoon region with heavy rainfall (average 3700–5490 mm) during the months May–October. The annual mean humidity in this area is about 81 % and the mean temperature is 14.3 °C. Most of the study region is covered by forests (61.76 %).

Fig. 1 Location map and landslide photos of the study area (courtesy: Yen Bai and Dan Tri newspapers)

The topography of the area is hilly with intervening deep valleys. Mountain slopes in the region are steep, reaching up to 88 degrees in places, and the elevation varies from 280 to 2820 m, with an average of about 1515 m above sea level. The area is underlain by igneous, sedimentary and metamorphic rocks. The volcanic extrusive rocks of the Tu Le and Ngoi Thia complexes and the intrusive igneous rocks of the Phu Sa Phin complex and Tram Tau formation are predominant. The area is tectonically active and dissected by three main faults, namely Phong Tho–Van Yen, Nam Co–Minh An, and Nghia Lo.

Landslide Inventory

Landslide inventory data are essential in landslide hazard assessment (Tien Bui et al. 2013). In this study, landslide data have been obtained from the Vietnam Institute of Geosciences and Mineral Resources under the national project “Survey, assessment and zoning of landslide warning in the mountainous region of Vietnam”. In all, 248 landslide locations have been identified by interpreting air photos (year 2013) at a scale of 1:33,000. These landslides have subsequently been validated by field investigations. Landslides in the study area vary in size: small (<200 m³), average (200–1000 m³), large (1000–20,000 m³), and very large (20,000–100,000 m³). The biggest landslide event occurred at the Che Cu Na commune in February 2011, with a volume of 10,000 m³. Landslide volumes have been determined through field investigation and spatial analysis. Translational (35 locations), rotational (124), toppling (45), and debris and mixed (36) types of landslides have been observed in the study area, with the rotational type being the most frequent. Analysis of the landslide inventory shows that most landslides in the study region occurred during the rainy season (May–October). The specific date of occurrence is available for only 42 landslides in the record. Therefore, these 42 events have been employed for temporal prediction of landslides, whereas all 248 landslide events have been utilized for spatial prediction.

Spatial Prediction of Landslides Hazards

Geo-environmental Factors in Relation with Landslide Occurrences

In spatial prediction of landslides, the spatial relationship between geo-environmental factors and landslide occurrences is usually analyzed on the assumption that future landslides will occur under the same conditions as past landslides (Pham et al. 2015). Determining the geo-environmental factors that affected past landslide occurrences is therefore very important. Based on an analysis of the mechanism of landslide occurrences and the geo-environmental characteristics of the study region, a total of fifteen geo-environmental factors (slope, aspect, curvature, plan curvature, profile curvature, elevation, land use, lithology, rainfall, distance to faults, fault density, distance to roads, road density, distance to rivers, and river density) have been considered as landslide affecting factors in the present study. Maps of these factors have been constructed as raster data (pixel size of 20 × 20 m) and divided into classes (Table 1; Fig. 2) according to the degree of susceptibility of each class to landslide occurrences. The classes are based on the experience gained from analysis of an adjacent area by the authors (Pham et al. 2016a) and by other workers (Cevik and Topal 2003; Dai and Lee 2002).

Table 1 Geo-environmental factors and their classes
Fig. 2 Thematic maps in the study area: a slope map, b distance to roads map, c lithology map, and d land use map

Evaluation of Predictive Capability of the RSSCE Model

In the literature, the Receiver Operating Characteristic (ROC) curve has been utilized as a standard quantitative method to evaluate the predictive capability of landslide models (Pham et al. 2016a). Therefore, in the present study, the ROC curve has been selected to validate the performance of the RSSCE model. The ROC curve is generated by plotting pairs of two statistical indexes, “sensitivity” and “100-specificity” (Tien Bui et al. 2016). The area under the ROC curve (AUC) is employed to evaluate the performance of landslide models quantitatively (Pham et al. 2015). Additionally, the statistical indexes accuracy (ACC), kappa (k), and root mean squared error (RMSE) have been used to evaluate the performance of landslide models (Bennett et al. 2013).
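The four indexes can be computed from a set of true labels and predicted scores as sketched below. This is a plain-Python illustration rather than the software used in the study; the 0.5 cut-off for turning scores into classes and the toy data are assumptions. The AUC is computed through its rank interpretation, that is, the probability that a randomly chosen landslide pixel receives a higher score than a randomly chosen non-landslide pixel.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance between observed and predicted classes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


def rmse(y_true, scores):
    """Root mean squared error between labels and predicted scores."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(y_true, scores)) / len(y_true))


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy validation set: 1 = landslide, 0 = non-landslide.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off
print(auc(y_true, scores), accuracy(y_true, y_pred),
      cohen_kappa(y_true, y_pred), rmse(y_true, scores))
```

AUC and RMSE use the continuous scores, whereas ACC and kappa depend on the chosen cut-off, which is one reason the indexes are reported together.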

Furthermore, other benchmark landslide models, namely Support Vector Machines (SVM) (Vapnik 1995), Multilayer Perceptron Neural Networks (MLPN Nets) (Zare et al. 2013), and Logistic Regression (LR) (Akgun 2012), have been utilized for comparison with the RSSCE model. SVM is known as one of the most efficient machine learning methods for landslide prediction. It is based on the statistical approach of finding an optimal hyper-plane that separates two classes (landslide and non-landslide) (Pourghasemi et al. 2013). MLPN Nets is a type of artificial neural network, a branch of artificial intelligence that has been applied efficiently to landslide problems (Pham et al. 2015). LR is known as a more accurate machine learning method than conventional methods (Akgun 2012; Choi et al. 2012).

The performance of the RSSCE model and the benchmark landslide models is shown in Fig. 3 and Table 2. It can be observed that the RSSCE model has the highest predictive capability for spatial prediction of landslides compared with the LR, SVM, and MLPN Nets models. This indicates that the RSSCE model is the best choice among the tested models for spatial prediction of landslides in the present study. Therefore, the results of spatial prediction from the RSSCE model have been used for landslide hazard assessment.

Fig. 3 Analysis of the ROC curve of different landslide models

Table 2 Statistical index values of different landslide models

Temporal Prediction of Landslide Hazards

Rainfall Data Analysis

In the present study, rainfall data have been taken into account as a time-related factor to analyze landslide occurrences temporally. Rainfall data have been collected from the rainfall gauge located in the Mu Cang Chai district, Yen Bai province (Viet Nam). Available rainfall data for the period 2008–2013, obtained from Global Weather Data for SWAT (NCEP 2014), have been analysed. The daily rainfall in the study area is shown in Fig. 4a. It can be observed that the most intense rainfall usually occurs over short periods of a few days (Fig. 4a). The analysis also shows that the highest annual rainfall occurred in 2008 (4362 mm), followed by 2009 (3522 mm), 2010 (3493 mm), 2012 (2160 mm), 2013 (1950 mm), and 2011 (1748 mm). Of the 248 landslide locations, 42 experienced intense rainfall (more than 100 mm) in a single day (Table 3).

Fig. 4 Rainfall analysis at Mu Cang Chai district: a daily rainfall for the period of 2008–2013, b the rainfall threshold, c validation of the rainfall threshold (the red mark indicates threshold exceeded rainfall) (color figure online)

Table 3 Temporal occurrence of rainfall triggered landslides in the Mu Cang Chai district from 2008 to 2013

Determination of Rainfall Threshold

In general, determination of a rainfall threshold is required for the temporal prediction of landslides (Tien Bui et al. 2013). The rainfall threshold is the minimum rainfall at which a landslide might happen in a certain region (Guzzetti et al. 2007). In recent decades, many methods have been developed to calculate the rainfall threshold for landslide studies. These methods can be grouped into five approaches: (1) physical-based, (2) empirical based, (3) intensity–duration based, (4) normalized intensity–duration based, and (5) antecedent rainfall based (Guzzetti et al. 2008). Of these, the intensity–duration based approach is the most widely used (Larsen and Simon 1993; Aleotti 2004). However, it requires rainfall intensity data for the day on which a landslide occurred (Larsen and Simon 1993), and such data are not available in most cases. Antecedent rainfall plays an important role in the initiation of landslides because it increases the pore-water pressure in the slope-forming materials (Tien Bui et al. 2013). Therefore, the antecedent rainfall based approach has been utilized to determine the rainfall threshold for temporal prediction of landslides in the present study.

The rainfall threshold in the study area has been determined using the experience from adjacent areas, because the correlation between the number of days of antecedent rainfall and the triggering of a landslide is relatively complex (Guzzetti et al. 2007). Aleotti (2004), for example, considered antecedent rainfall of 10 and 15 days. In general, no agreement has been reached on the exact number of days of antecedent rainfall to use in determining the rainfall threshold. The number of days is therefore often selected by analyzing the rainfall data at the time of landslide occurrence for different numbers of days (Tien Bui et al. 2013). In the present study, following the results of the study carried out in the Hoa Binh province, which is an adjacent region of the study area (the Mu Cang Chai district), the number of days has been set to 15 for analyzing the rainfall threshold (Tien Bui et al. 2013). Data on rainfall-induced landslides from 5 years (2008, 2010, 2011, 2012, and 2013) have been utilized to determine the rainfall threshold, whereas the rainfall-induced landslides of 2009 have been utilized for its validation. Finally, the envelope line for landslide occurrences (Fig. 4b) has been determined using the two lowest points in the scatter plot (Tien Bui et al. 2013) and is expressed by the following equation:

$$ {\text{R}}_{\text{TH}} = 117.52 - 0.024R_{{15{\text{d}}}} $$
(2)

where \( R_{TH} \) is the rainfall threshold and \( R_{15d} \) is the antecedent rainfall of 15 days.
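A minimal sketch of how Eq. (2) could be applied to a daily rainfall series is given below. It assumes that rainfall is in millimetres, that the antecedent rainfall of a day is the sum over the 15 preceding days (the day itself excluded), and that days without a full 15-day history are skipped; these conventions, the function names, and the toy series are illustrative assumptions rather than details given in the text.

```python
def rainfall_threshold(r15d):
    """Eq. (2): threshold daily rainfall (mm) for a given 15-day antecedent rainfall (mm)."""
    return 117.52 - 0.024 * r15d


def exceedance_days(daily_rain, window=15):
    """Return indices of days whose rainfall exceeds the threshold of Eq. (2).

    Assumption: antecedent rainfall is the sum over the `window` days before
    the day in question, so the first `window` days cannot be evaluated.
    """
    flagged = []
    for day in range(window, len(daily_rain)):
        antecedent = sum(daily_rain[day - window:day])
        if daily_rain[day] > rainfall_threshold(antecedent):
            flagged.append(day)
    return flagged


# Toy series: 15 days of 10 mm, then one very wet day, a dry day, and another wet day.
daily_rain = [10.0] * 15 + [120.0, 5.0, 115.0]
print(exceedance_days(daily_rain))
```

Because the coefficient of the antecedent term is negative, a wetter preceding fortnight lowers the daily rainfall needed to exceed the threshold.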

Evaluation of the Rainfall Threshold

Evaluating the performance of landslide models requires dividing the data into two subsets, one for training and one for validation (Chung and Fabbri 2003). Therefore, the recorded rainfall-induced landslide events in the present study have been divided into two parts. One part includes the landslide events of the years 2008, 2010, 2011, 2012, and 2013, which have been utilized to determine the rainfall threshold. The other part includes the landslide events that occurred in 2009, which have been used for the validation of the rainfall threshold.

Analysis of the results shows that rainfall exceeded the threshold value on only one day, July 5, 2009 (Fig. 4c). This result agrees with the landslide events recorded during the year 2009 (Table 3). Thus the rainfall threshold obtained in the present study can be used for the temporal prediction of landslides in landslide hazard assessment of this area.

Temporal Probability of Landslide Occurrences for Landslide Hazard Assessment

The temporal probability of landslide occurrences is determined on the assumption that past landslide events can be considered as independent random point-events in time (Guzzetti et al. 2005). The exceedance probability of landslide occurrences during time “t” is therefore expressed as follows (Guzzetti et al. 2005):

$$ P_{L} = \, P[L\left( t \right) \ge 1] $$
(3)

where \( L\left( t \right) \) is the number of landslide events that occur during time “t” in the study region.

To determine the exceedance probability of landslide occurrences during time “t”, two methods are commonly used, namely the Poisson and binomial methods (Crovelli 2000). The Poisson method is a continuous-time model in which random point-events occur independently in time (Coe et al. 2000), whereas the binomial method is a discrete-time model that considers the occurrence of random point-events within discrete time intervals (Coe et al. 2000). Comparing the two, Crovelli (2000) stated that they differ considerably for short periods (small t) but nearly coincide for long periods (large t). Moreover, the Poisson method is more commonly used for landslide hazard assessment (Guzzetti et al. 2005; Coe et al. 2000). Thus, in the present study the Poisson method has been adopted to analyze the temporal probability of landslide occurrences.

In the Poisson method (Guzzetti et al. 2005), the probability of “m” landslides during time “t” can be estimated as below:

$$ P[L(t) = m] = \exp(-\beta t)\frac{(\beta t)^{m}}{m!};\quad m = 0, 1, 2, \ldots $$
(4)

where \( \beta \) is the estimated rate of occurrence of landslides, which can be obtained from the catalogue of historical landslide events (Guzzetti et al. 2005).

Based on Eq. (4), the exceedance probability (the probability of one or more landslides occurring during time “t”) can be estimated by the following equation:

$$ P[L(t) \ge 1] = 1 - \exp\left(-\frac{t}{\eta}\right) $$
(5)

where \( \eta = 1/\beta \) is the future average recurrence interval and t is the future period of time for which the exceedance probability is calculated.
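The step from Eq. (4) to Eq. (5) can be written out explicitly. The derivation takes the recurrence interval as the reciprocal of the occurrence rate, which is the usual convention for a Poisson process, and uses the fact that "one or more landslides" is the complement of "no landslide" (m = 0 in Eq. 4):

$$ P[L(t) \ge 1] = 1 - P[L(t) = 0] = 1 - \exp(-\beta t)\frac{(\beta t)^{0}}{0!} = 1 - \exp(-\beta t) = 1 - \exp\left(-\frac{t}{\eta}\right),\quad \eta = \frac{1}{\beta} $$

The exponent is negative, so the probability rises from 0 at t = 0 towards 1 as t increases.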

Using the Poisson method, the temporal probability of landslide hazards for the Mu Cang Chai district has been calculated for return periods of 1, 3, and 5 years. The results show that rainfall exceeded the rainfall threshold 10 times, and that the probability of landslide hazards increases with the return period: 0.865 in 1 year, 0.998 in 3 years, and 1.000 in 5 years.
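The reported probabilities can be checked numerically. The sketch below assumes that the 10 threshold exceedances are counted over an observation period of 5 years, which gives a rate of 2 events per year and a recurrence interval of 0.5 years; the text does not state the observation length, and this value is chosen because it reproduces the reported figures. The function name is illustrative.

```python
import math


def exceedance_probability(t_years, recurrence_interval_years):
    """Poisson probability of one or more events within t_years."""
    return 1.0 - math.exp(-t_years / recurrence_interval_years)


n_exceedances = 10     # number of threshold exceedances reported in the text
observation_years = 5  # assumption: not stated explicitly in the text
eta = observation_years / n_exceedances  # average recurrence interval (years)

for t in (1, 3, 5):
    print(t, round(exceedance_probability(t, eta), 3))
# prints 0.865, 0.998 and 1.0 for 1, 3 and 5 years
```

With a rate this high the probability saturates quickly, which explains why the 3-year and 5-year values are almost indistinguishable.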

Landslide Hazard Assessment

Landslide hazard assessment has been carried out by considering both the spatial and the temporal prediction of landslides in the study area. On this basis, landslide hazard maps have been constructed in three main steps: (1) generating landslide susceptibility indexes, (2) calculating landslide hazard indexes by multiplying the landslide susceptibility indexes by the temporal probability of landslides for different periods, and (3) reclassifying the landslide hazard indexes. In the first step, landslide susceptibility indexes have been generated for all pixels of the study area from the spatial prediction results of the RSSCE model. In the next step, the temporal probabilities for the different return periods (1, 3, and 5 years) obtained from the temporal prediction have each been multiplied by the landslide susceptibility indexes to obtain landslide hazard indexes for all pixels of the study area. In the final step, the landslide hazard indexes have been reclassified into five intervals using the natural breaks method, which is widely applied in landslide studies (Pham et al. 2015). Based on these intervals, hazard has been divided into classes ranging from very low to very high. For the landslide hazard map of the study area, the highest temporal probability, observed for the 5-year period, has been used (Fig. 5).
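The three steps can be sketched for a small raster as follows. The susceptibility values and the break values are hypothetical, and fixed break values stand in for the natural breaks method, which is not implemented here; in practice the breaks would be derived from the distribution of the hazard indexes.

```python
from bisect import bisect_right


def hazard_index(susceptibility, temporal_probability):
    """Step 2: hazard index of one pixel."""
    return susceptibility * temporal_probability


def hazard_class(index, breaks):
    """Step 3: class number 1 (lowest) to len(breaks) + 1 (highest).

    `breaks` holds ascending class boundaries; a value equal to a boundary
    is assigned to the higher class.
    """
    return bisect_right(breaks, index) + 1


# Step 1 (assumed output of a susceptibility model): a 2 x 2 grid of indexes in [0, 1].
susceptibility_grid = [[0.10, 0.55],
                       [0.80, 0.95]]
p_one_year = 0.865                # temporal probability for 1 year, from the text
breaks = [0.2, 0.4, 0.6, 0.8]     # hypothetical boundaries giving five classes

hazard_grid = [[hazard_index(s, p_one_year) for s in row] for row in susceptibility_grid]
class_grid = [[hazard_class(h, breaks) for h in row] for row in hazard_grid]
print(class_grid)
```

Since the temporal probability is a single value for the whole area, it rescales the susceptibility indexes uniformly, so the spatial pattern of the hazard map follows that of the susceptibility map.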

Fig. 5 Landslide hazard map of the study area for the return period of 5 years

Discussions

Landslide hazard assessment has been carried out in the present study using the Random SubSpace fuzzy rules based Classifier Ensemble (RSSCE) method and probability analysis of rainfall data. RSSCE is a novel hybrid approach combining the Fuzzy Unordered Rules Induction Algorithm (FURIA) classifier and the Random SubSpace (RSS) ensemble, and it has been proposed for the spatial prediction of landslide occurrences. Probability analysis of rainfall data has been utilized for the temporal prediction of landslide occurrences in the Mu Cang Chai district, Yen Bai province (Viet Nam), which is a highly landslide prone area.

For the spatial prediction of landslide occurrences, the RSSCE model performed well in the present study (AUC = 0.840) in comparison with the LR model (AUC = 0.810), the SVM model (AUC = 0.804) and the MLPN Nets model (AUC = 0.804). This result is expected, as RSSCE uses the RSS ensemble, which is capable of improving the performance of individual classifiers (Onan 2015). Moreover, RSSCE uses learning datasets that are optimized during the training process, which improves its predictive capability in comparison with the other landslide models.

For the temporal probability of landslide occurrences, time factors such as rainfall and landslide frequency are usually taken into account (Tien Bui et al. 2013). In the present study, landslide frequency data are not available in the landslide inventory. Therefore, rainfall has been utilized for the temporal prediction of landslides in the study area. Based on an analysis of the temporal relationship between the rainfall data and the historical landslide events during the period 2008–2013, the temporal probability of landslide occurrences has been analyzed for different periods (1, 3, and 5 years). The temporal prediction of landslide occurrences has been done in three main steps: (1) determining the rainfall threshold by analyzing the temporal relationship between the antecedent rainfall of 15 days and past landslide events, (2) evaluating the rainfall threshold, and (3) calculating the temporal probability of landslide occurrences using the Poisson method. This methodology can also be applied in other areas where a multi-temporal landslide inventory is not available.

Landslide hazard assessment has been accomplished in the study area by considering both the spatial and the temporal prediction of landslide occurrences. Past and present landslide events and geo-environmental conditions have been considered in the study. In general, changes in the geo-environmental influencing factors over a short period of 5 years may not be significant. However, it is desirable to consider dynamic factors, such as slope changes due to road cutting and land use changes due to development, even in short-period analysis.

Conclusions

Landslides are a common geo-environmental hazard in the hilly and mountainous areas of Viet Nam, especially during the rainy season. Rainfall is therefore considered the main triggering factor of landslide occurrences in the study area. Landslide hazard assessment for the Mu Cang Chai district, Viet Nam has been carried out by considering both the spatial and the temporal prediction of landslide occurrences. In the present study, a novel classifier ensemble method called the Random SubSpace fuzzy rules based Classifier Ensemble (RSSCE) has been applied for the spatial prediction of landslide occurrences, and probability analysis of rainfall data for the period 2008–2013 has been used for the temporal prediction of landslides in the area.

The results of the present study show that a hybrid approach combining the RSSCE method with probability analysis of rainfall data is a promising approach for landslide hazard assessment, and that it can be applied in other landslide prone areas where a multi-temporal landslide inventory is not adequately available. The landslide hazard map developed for the study area would be of use for land use planning and landslide hazard management.