Abstract
The main objective of this study is to propose and verify a novel ensemble methodology that could improve the prediction performance of landslide susceptibility models. The proposed methodology is based on the functional tree classifier and three current state-of-the-art machine learning ensemble frameworks: Bagging, AdaBoost, and MultiBoost. According to the current literature, these methods have rarely been used for modeling rainfall-induced landslides. The corridor of the National Road 32 (Vietnam) was selected as a case study. In the first stage, a landslide inventory map with 262 landslide polygons that occurred during the last 20 years was constructed and then randomly partitioned in a ratio of 70/30 for training and validating the models. Second, ten landslide conditioning factors were prepared: slope, aspect, relief amplitude, topographic wetness index, topographic shape, distance to roads, distance to rivers, distance to faults, lithology, and rainfall. Model performance was assessed and compared using the receiver operating characteristic and statistical evaluation measures. Overall, the FT with Bagging model has the highest prediction capability (AUC = 0.917), followed by the FT with MultiBoost model (AUC = 0.910), the FT model (AUC = 0.898), and the FT with AdaBoost model (AUC = 0.882). The FT with Bagging model also performs better than models derived from popular methods such as J48 decision trees and artificial neural networks. Therefore, it can be concluded that the FT with Bagging model is promising and could be used as an alternative in landslide susceptibility assessment. The results of this study are useful for land use planning and decision making in landslide-prone areas.
Introduction
In recent decades, the assessment of landslide-susceptible zones has become one of the most discussed topics in the literature because the prediction of landslide events is particularly difficult owing to the complex nature of landslides (Tien Bui et al. 2016e). Consequently, various methods and techniques have been proposed for landslide modeling, and they can be classified into three main groups: physical-based, statistical, and soft computing methods. Since physical-based methods are not suitable for large areas, statistical and soft computing methods have received considerable attention. Among the statistical methods, bivariate analysis (Suzen and Doyuran 2004; Yalcin et al. 2011), multivariate analysis (Chung et al. 1995; Suzen and Doyuran 2004) and logistic regression (Costanzo et al. 2014; Felicisimo et al. 2013; Kavzoglu et al. 2015; Lee et al. 2014; Pradhan and Lee 2010; Tien Bui et al. 2011) are considered to be the most suitable methods for landslide susceptibility assessment on medium and regional scales. However, the prediction capability of these landslide models is still not satisfactory; therefore, data mining methods have been proposed (Tien Bui et al. 2016e).
Data mining, which is a branch of applied artificial intelligence, is defined as the exploration of observational datasets to find internal relationships and represent the data in understandable ways (Mennis and Guo 2009). It includes multiple steps such as data selection and preprocessing, transformation, incorporation of prior knowledge, analysis with computational algorithms, and interpretation and evaluation of the results (Fayyad et al. 1996). The literature shows that data mining can handle nonlinear real-world problems, including landslide modeling, with high accuracy (Hoang and Tien Bui 2016; Hoang et al. 2016; Tien Bui et al. 2016a; Were et al. 2015).
Among data mining methods and techniques, neuro-fuzzy (Pradhan et al. 2010; Tien Bui et al. 2012d), artificial neural networks (Gomez and Kavzoglu 2005; Hong et al. 2015b; Lee et al. 2003; Tien Bui et al. 2012c; Yilmaz 2009), and support vector machines (Kavzoglu et al. 2014; Yao et al. 2008) may be the most widely used. Several studies have compared the prediction performance of these methods with conventional methods and concluded that the performance of data mining models is better than that of conventional methods (Cheng and Hoang 2015; Pham et al. 2015, 2016a; Pradhan 2013; Tien Bui et al. 2012a, 2013a; Were et al. 2015; Yilmaz 2009).
The recent development of geographic information systems (GIS) technology, in combination with soft computing tools (such as Weka, R, and MATLAB), has provided new and powerful techniques for landslide modeling (Tien Bui et al. 2016e), such as rule-based systems, probabilistic reasoning, decision tables, J48 decision trees, logistic model trees, and functional trees (Kumar et al. 2012). The main advantage of these methods is that they provide not only a more transparent calculation in the modeling process but also better accuracy (Hong et al. 2015a; Park and Lee 2014; Pham et al. 2016b; Tien Bui et al. 2014; Tsangaratos and Ilia 2015). Therefore, the exploration of new methods and techniques for landslide modeling is highly necessary (Tien Bui et al. 2012e), because an increase of even a few percent in accuracy can change the spatial distribution of landslide-susceptible areas (Jebur et al. 2014; Kavzoglu et al. 2014; Tien Bui et al. 2012b, 2013a, 2014).
More recently, ensemble frameworks have received much attention in many fields because of their ability to improve the prediction performance of models and to deal with complex and high-dimensional data (Lee et al. 2012; Rokach 2010). Various ensemble frameworks have been proposed, such as Stacking, Random subspace, Random forests, and Rotation forests (Rodriguez et al. 2006), Bagging (Breiman 1996), AdaBoost (Freund and Schapire 1997), and MultiBoost (Webb 2000), and they can be grouped into two main categories: heterogeneous and homogeneous (Shun and Wenjia 2006). The first category incorporates models from different algorithms to form the final ensemble classifier, for example in Lee et al. (2012), whereas in the second, only one algorithm is used but the original training data are split into several subsets to build classifiers, and a committee is then constructed (Maudes et al. 2012). Nevertheless, ensemble frameworks have seldom been explored for landslide susceptibility modeling.
This study fills this gap in the literature by proposing and verifying a novel ensemble methodology for landslide susceptibility modeling. In the proposed approach, functional trees (Gama 2004) and three ensemble techniques, AdaBoost, Bagging, and MultiBoost, were used. Functional trees (FT) are classification trees that use linear functions at the leaves, whereas AdaBoost, Bagging, and MultiBoost are homogeneous ensemble frameworks that have the ability to improve the performance of prediction models significantly (Pham et al. 2016b; Tien Bui et al. 2013a, 2014). The prediction performance of the ensemble models was assessed using the training and validation datasets, statistical evaluation measures, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). In addition, landslide models derived from J48 decision trees and artificial neural networks were included for comparison, and finally, concluding remarks are given. Data processing was carried out using Microsoft Excel 2013, ArcGIS 10.2, and IDRISI Selva 17.01, and the modeling process was carried out using the R programming environment and Weka 3.7.
Study area and data used
Geographic setting of the study area
The section of the National Road No. 32 corridor between the Yen Bai and Lao Cai provinces (Fig. 1) was selected as the study area. The area is located in the northwestern region of Vietnam and covers around 3164 km², between longitudes 103°33′23″E and 104°52′58″E and between latitudes 21°19′53″N and 22°20′18″N. The total length of the road section is about 250 km.
The altitude of the study area ranges from 120 to 3140 m a.s.l., with an average of 1078 m and a standard deviation (SD) of 555.9 m. Areas in the slope group 0°–15° account for 22.3 % of the total area, and areas in the slope category 15°–25° account for 24.8 %, whereas about 52.9 % of the study area has slopes greater than 25°. Topographically, around 30.6 % of the total area is saddle hillside, ridge areas account for 18.2 %, and approximately 17.8 % is ravine. Convex and concave areas account for 13.1 and 12.0 % of the total area, respectively.
The climate in the area is a tropical monsoon climate with hot, rainy, and dry seasons. The average temperature is 22–23 °C and the average humidity is 83–87 %. Rainfall is mainly concentrated in the rainy season from March to November, with an annual average of around 1500–2200 mm. Rainfall is generally low from December to February. The highest temperature can reach 41 °C, whereas the lowest is around 0 °C (Ho et al. 2010).
Three main fault zones pass through the study area and cause weakness in the rock mass: Fansipan, Tu Le, and Song Da. Thirty-four lithological formations crop out in the study area, and among them, 10 formations (Fig. 1) are dominant and account for 88.9 % of the total area. They are Sinh Quyen (2.2 %), Bac Son (1.5 %), Suoi Bang (6.6 %), Muong Trai (2.6 %), Nam Mu (6.1 %), Tu Le complex (22.4 %), Phu Sa Phin complex (9.0 %), Ngoi Thia (12.0 %), Tram Tau (13.2 %) and Phu San Cap complex (13.2 %). Our analysis of these formations shows that tuff, sandstone, clay shale, clayey limestone, siltstone, limestone, trachyte porphyry, rhyolite, and granite are the main lithologies. Landslides are highly concentrated in the Tu Le complex and the Tram Tau formation (Ho et al. 2010).
Data collection and processing
Landslide inventory map
In this study, data collection and processing were carried out by means of a geographic information system. Landslide modeling relies on the statistical hypothesis that landslides will occur in the future under the same conditions that produced them in the past and present (Guzzetti et al. 1999); therefore, a landslide inventory map is highly necessary to understand the conditioning factors that trigger slope failures and their mechanisms (Dai et al. 2002). A landslide inventory map (Fig. 1) with 262 landslide locations that have occurred during the last 20 years was used. These landslides were collected and interpreted using aerial photographs with a resolution of 1 m, as part of a national project by Ho et al. (2010). The landslides, comprising 16 translational slides and 246 soil-mixed-boulder slides, are depicted by polygons with a maximum size of 37,326 m² and a minimum size of about 476 m².
Around 14.5 % of the landslides are larger than 10,000 m², whereas only 1.5 % are smaller than 1000 m². Landslides between 1000 and 5000 m² account for 56.5 % of the total, and the remaining landslides (27.5 %) range from 5000 to 10,000 m². It is important to note that some types of failures, such as rock falls and topples, were excluded because their failure mechanisms are different. Our extensive field work showed that the landslides were mainly triggered by heavy rainfall that caused saturation of soils. Photographs of some landslides in the study area are shown in Fig. 2, and detailed explanations of these landslides can be found in Ho et al. (2010).
Landslide conditioning factors
Since landslide susceptibility assessments employing soft computing techniques are considered indirect approaches, a large number of input parameters should be considered (Tien Bui et al. 2016b), although a model with too many factors does not necessarily result in higher prediction capability (Floris et al. 2011). Lithology, slope, and aspect are the most widely used conditioning factors (Tien Bui et al. 2015, 2016c), whereas the effectiveness of other factors such as soil type, land use, and road and river networks is still debated among landslide researchers. Conditioning factors should be selected based on the landslide typology and failure mechanism, the characteristics of the study area, the scale of analysis, the available data sets, and the methodology used (Ercanoglu 2005; Manzo et al. 2013).
The relationships between the landslide inventory map and related conditioning factors for this study area were investigated by Ho et al. (2010), and based on their findings, a total of ten conditioning factors were selected, constructed, and converted to a raster format with a resolution of 20 m. They are lithology, distance to faults, slope, aspect, relief amplitude, toposhape, topographic wetness index (TWI), distance to roads, distance to rivers, and rainfall. The detailed classes of these factors are shown in Table 1.
Lithology is considered one of the most important factors (Ilia and Tsangaratos 2016) because it influences the geomechanical and hydraulic characteristics of terrain and therefore controls the types and mechanisms of landslides (Dai et al. 2001; Ercanoglu 2005). Faults are considered a critical factor that influences the distribution of landslides (Dou et al. 2015; Hong et al. 2016); therefore, distance to faults was also selected. In this study, the lithology and faults were extracted from the Geological and Mineral Resources Map of Vietnam at a scale of 1:200,000. The lithologic map with 12 groups (Fig. 3) compiled by Ho et al. (2010) was used, and a distance to faults map with four classes (Fig. 4a) was constructed.
It is well known that slope failures are directly linked to types of terrain; therefore, a digital elevation model (DEM) with a resolution of 20 m was constructed for the study area using national topographic maps at the scale of 1:50,000. Based on the DEM, five geomorphometric factors were extracted: slope, aspect, relief amplitude, toposhape, and topographic wetness index (TWI). Slope is selected for instability analysis because it governs the shear stresses that drive the displacement of hill slopes (Dai et al. 2001). Aspect is a factor that indirectly influences slope failure because slope direction relates to the exposure of the terrain to solar radiation and rainfall, which control the concentration of soil moisture (Magliulo et al. 2008) and thereby influence landslides. In this study, the slope map (Fig. 4b) was constructed with six classes, whereas the aspect map (Fig. 4c) was built with nine classes.
Relief amplitude, which represents the difference between the highest and lowest points in the terrain, is considered a factor to which landslide occurrence is highly sensitive (Tang et al. 2010; Vergari et al. 2011). The relief amplitude map with six classes (Fig. 4d) was compiled for the study area. Since landslide occurrences are closely related to topographic attributes (Lineback Gritzner et al. 2001; Zhang et al. 2014), topographic shape is used in landslide susceptibility assessment (Caniani et al. 2008; Ercanoglu 2005). The toposhape map in this study (Fig. 5a) was constructed with ten classes. TWI, which was developed by Beven and Kirkby (1979), combines the local upslope contributing area and the local slope. TWI can quantify the effect of topography on hydrological processes and characterize the distribution of soil moisture and surface saturation (Sørensen et al. 2006); therefore, it is used in landslide susceptibility analysis. In this study, the TWI map (Fig. 5b) was constructed with five classes.
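As an illustration of how TWI is commonly computed, the sketch below uses the usual definition TWI = ln(a / tan β), where a is the upslope contributing area per unit contour length and β is the local slope. The function name, the parameter names, and the small minimum slope used to avoid division by zero on flat cells are assumptions made here for illustration; the GIS workflow actually used to derive Fig. 5b is not described at this level of detail.

```python
import math

def topographic_wetness_index(specific_catchment_area_m, slope_deg, min_slope_deg=0.1):
    """Return TWI = ln(a / tan(beta)).

    Slopes below min_slope_deg are clamped to it (an illustrative choice)
    so that perfectly flat cells do not cause a division by zero.
    """
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area_m / math.tan(beta))

# Same contributing area, different slopes: gentler terrain retains more water.
twi_steep = topographic_wetness_index(200.0, 35.0)
twi_gentle = topographic_wetness_index(200.0, 5.0)
```

For a fixed contributing area, the index rises as the slope flattens, which is why high TWI values tend to mark cells prone to saturation.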
An anthropogenic factor, distance to roads, is used for the assessment of landslide susceptibility because excavations for road cuts may induce slope failures (Lay 2009). In the case of distance to rivers, water may influence the saturation of slopes when it undercuts the banks of streams (Highland and Bobrowsky 2008); therefore, distance to rivers should be used for landslide modeling. In this study, road and river networks were obtained from the national topographic maps at the scale of 1:50,000, and road and river sections that undercut slopes steeper than 15° were then extracted. The distance to roads map (Fig. 5c) and distance to rivers map (Fig. 5d) were constructed by buffering the road and river sections. Regarding rainfall, the rainfall map (Fig. 5e) with five classes constructed by Ho et al. (2010) was used. This map was constructed from the average rainfall for 1980–2008 using the Inverse Distance Weighted method (Tien Bui et al. 2011). The rainfall data were obtained from the Institute of Meteorology and Hydrology in Vietnam.
Methodology
Sampling strategy and preparation of training and validation data
To build the landslide models and evaluate their performance, the landslide inventory and the ten conditioning factor maps were converted to a grid cell format with a cell size of 20 m. Since the dates of these landslides are not known, the landslide polygons were randomly split into two subsets with a ratio of 70/30 (Tien Bui et al. 2012d). The first subset (2781 landslide pixels) was used for building the models, whereas the second (1011 landslide pixels) was used for model validation.
The assessment of landslide susceptibility using data mining methods can be considered a binary classification; therefore, it requires both positive data (in the current case, the presence of landslides) and negative data (the absence of landslides). Because the number of landslide pixels (3792 pixels) is much smaller than the total number of pixels in the study area (7,871,195 pixels), the under-sampling method (Pradhan 2013; Tien Bui et al. 2016d) was used in this study. Accordingly, the same number of non-landslide pixels was randomly sampled from the landslide-free area. The landslide pixels were assigned a value of “1”, whereas the non-landslide pixels were assigned a value of “0”. Finally, values of the ten landslide conditioning factors were extracted to build the training and validation datasets.
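The sampling strategy above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name, the random seed, and the use of generic sample identifiers (rather than landslide polygons converted to 20 m pixels) are assumptions.

```python
import random

def split_and_balance(landslide_ids, non_landslide_ids, train_fraction=0.7, seed=42):
    """Split landslide samples 70/30 and pair them with an equal number of
    randomly drawn non-landslide samples (under-sampling)."""
    rng = random.Random(seed)
    positives = list(landslide_ids)
    rng.shuffle(positives)
    n_train = round(train_fraction * len(positives))
    # Under-sampling: draw only as many negatives as there are positives.
    negatives = rng.sample(list(non_landslide_ids), len(positives))
    # Label landslide samples 1 and non-landslide samples 0.
    train = [(i, 1) for i in positives[:n_train]] + [(i, 0) for i in negatives[:n_train]]
    valid = [(i, 1) for i in positives[n_train:]] + [(i, 0) for i in negatives[n_train:]]
    return train, valid

# Toy identifiers: 100 landslide samples and 5000 candidate non-landslide samples.
train, valid = split_and_balance(range(100), range(1000, 6000))
```

Both subsets end up with equal numbers of the two classes, which keeps the classifier from being dominated by the far more numerous non-landslide cells.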
Feature selection and correlation analysis
The overall performance of landslide models built with soft computing methods may be improved by feature selection (Doshi and Chaturvedi 2014), because the training dataset may contain noisy features that confuse the models; therefore, feature selection was used in this study. Various feature selection methods have been proposed, such as Information Gain (Quinlan 1993), Symmetrical uncertainty (Senthamarai Kannan and Ramaraj 2010), fuzzy rough set (Dai and Xu 2013), and PSO-based feature selection (Ajit Krisshna et al. 2014). In this study, Information Gain was used because it is one of the most widely used feature selection techniques in soft computing (Martínez-Álvarez et al. 2013; Witten et al. 2011), including landslide modeling (Tien Bui et al. 2016e). In addition, Information Gain helps to identify the importance of the input variables (Yang et al. 2011).
The Information Gain of landslide conditioning factor \(L_{i}\) with respect to the output class Y (landslide and non-landslide) is measured (Eq. 1) as the reduction in information (entropy), in bits:
\[IG(Y, L_{i}) = H(Y) - H(Y \mid L_{i}) \qquad (1)\]
where H(Y) is the entropy of Y, calculated using Eq. (2), and \(H(Y \mid L_{i})\) is the entropy of Y after conditioning on the values of landslide conditioning factor \(L_{i}\), estimated using Eq. (3):
\[H(Y) = -\sum_{j} P(Y_{j}) \log_{2} P(Y_{j}) \qquad (2)\]
\[H(Y \mid L_{i}) = -\sum_{v} P(L_{i} = v) \sum_{j} P(Y_{j} \mid L_{i} = v) \log_{2} P(Y_{j} \mid L_{i} = v) \qquad (3)\]
where \(P(Y_{j})\) is the prior probability of output class \(Y_{j}\), and \(P(Y_{j} \mid L_{i} = v)\) is the posterior probability of \(Y_{j}\) given that conditioning factor \(L_{i}\) takes the value v.
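The calculation can be illustrated with a short sketch that computes the entropy of the class labels, the conditional entropy given a categorical factor, and their difference. The function names and the toy data are assumptions for illustration only; the study itself used Weka for this step.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(factor_values, labels):
    """Entropy of the labels minus their entropy conditioned on the factor."""
    total = len(labels)
    groups = {}
    for value, label in zip(factor_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: class entropy within each factor value, weighted by its share.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy example: 1 = landslide, 0 = non-landslide, with a two-class slope factor.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
slope_class = ["steep", "steep", "steep", "flat", "flat", "flat", "flat", "steep"]
ig = information_gain(slope_class, labels)
```

In this toy case the class entropy is 1 bit and each slope class is 75 % pure, so the gain is about 0.19 bits; a factor that separated the classes perfectly would reach the full 1 bit.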
The prediction performance of landslide susceptibility models may be degraded if the conditioning factors depend on one another; therefore, the degree of correlation among these factors should be checked. In this study, Spearman’s rank correlation (Myers and Sirois 2014) was used to analyze the relationships between the conditioning factors. The main advantage of Spearman’s rank correlation is that it is not affected by the distribution of the data. In addition, it remains efficient with small sample sizes (Gautheir 2001).
The strength of correlation given the Spearman’s rank is: very strong (0.9–1.0); strong, high correlation (0.7–0.9); moderate correlation (0.4–0.7); low correlation (not very significant) (0.2–0.4); very weak to negligible correlation (0.0–0.2) (Passman et al. 2011).
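A minimal sketch of Spearman’s rank correlation and of the strength bands listed above is given below. It ranks each variable (ties share their mean rank) and then computes the Pearson correlation of the ranks. The function names are assumptions, and applying the bands to the absolute value of the coefficient is an illustrative choice.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def correlation_strength(rho):
    """Map |rho| to the bands quoted in the text."""
    r = abs(rho)
    if r >= 0.9:
        return "very strong"
    if r >= 0.7:
        return "strong"
    if r >= 0.4:
        return "moderate"
    if r >= 0.2:
        return "low"
    return "very weak to negligible"

# A monotonic but nonlinear relationship still gives a coefficient of 1.
rho = spearman_rho([1, 2, 3, 4], [1, 8, 27, 64])
```

Because only the ranks enter the calculation, the coefficient is unaffected by the shape of the underlying distributions, which is the property the text relies on.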
Functional trees classifier
A decision tree is a hierarchical model composed of decision rules that can be used for both regression and classification problems. Decision trees comprise a large number of algorithms, and some of them have been applied to landslide modeling with promising results, such as Classification and Regression Trees (Felicisimo et al. 2013), Chi-square Automatic Interaction Detector Decision Trees (Althuwaynee et al. 2014), C4.5 or J48 (Tien Bui et al. 2013a), Random forests (Trigila et al. 2015), Alternating decision tree (Hong et al. 2015a), and Logistic model trees (Tien Bui et al. 2016e). A newer algorithm, functional trees (FT) (Gama 2004), has shown promising results in other fields (Witten et al. 2011) but has seldom been explored for landslide modeling and was therefore selected in this study.
Consider a training dataset D with n samples \((X_{i}, Y_{i})\), where \(X_{i} \in R^{n}\) and \(Y_{i} \in \left\{ {\text{1,0}} \right\}\). \(X_{i}\) is an input vector comprising the ten landslide conditioning factors (slope, aspect, relief amplitude, topographic wetness index, topographic shape, distance to roads, distance to rivers, distance to faults, lithology, and rainfall), and \(Y_{i}\) is the output, which consists of two classes, landslide and no-landslide. The aim of FT is to build a decision tree that separates the two classes in this training data. The main difference between traditional decision tree algorithms and FT is that traditional algorithms divide the input data at tree nodes by comparing the value of an input attribute with a constant, whereas FT uses logistic regression functions for splitting at the inner nodes (called oblique splits) and for prediction at the leaves (Witten et al. 2011). There are three variants of FT: (1) the full FT, which uses regression models at both the inner nodes and the leaves; (2) FT inner, which uses regression models only at the inner nodes; and (3) FT leaves, which uses regression models only at the leaves. In this study, the FT leaves variant was used.
FT uses (1) the gain ratio as the splitting criterion to select the input attribute to split on; (2) standard C4.5 pruning (Quinlan 1996) to prevent over-fitting; and (3) LogitBoost (iterative reweighting) to fit the logistic regression functions at the leaves with least-squares fits (Doetsch et al. 2009) for each class \(Y_{i}\) (Eq. 4).
where \(P(x)\) is the predicted probability and \(\beta_{i}\) is the coefficient of the ith component of the input vector \(X_{i}\). The posterior probabilities at the leaves, P(X), are calculated as follows (Landwehr et al. 2005):
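For orientation, the additive logistic model that LogitBoost fits in logistic model trees (Landwehr et al. 2005) has the standard form below. It is given as a sketch under assumptions rather than as the exact expressions used in this study: the symbols \(F_{j}\), \(\beta_{j0}\), \(\beta_{ji}\), and J (the number of classes, two here) are notation introduced for illustration, and the sum runs over the ten conditioning factors.

```latex
F_{j}(x) = \beta_{j0} + \sum_{i=1}^{10} \beta_{ji}\, x_{i},
\qquad
P(Y = j \mid X = x) = \frac{e^{F_{j}(x)}}{\sum_{k=1}^{J} e^{F_{k}(x)}},
\qquad
\sum_{k=1}^{J} F_{k}(x) = 0
```

Each class receives a linear function of the inputs, and the posterior probability at a leaf is obtained by normalizing the exponentiated class functions.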
Ensemble learning algorithms
This section briefly describes the three ensemble learning algorithms, Bagging, AdaBoost, and MultiBoost, that were used to establish ensemble models for landslide susceptibility in this study.
Bagging
Bagging (bootstrap aggregation), an ensemble learning method proposed by Breiman (1996), is used in this study to obtain more robust and accurate landslide models. Bagging has been shown to be useful in landslide susceptibility modeling because it exploits classifiers that are sensitive to small changes in the training data and may therefore improve the prediction capability of the model (Tien Bui et al. 2014). The bagging procedure consists of three steps: (1) bootstrap samples are drawn by randomly resampling the training dataset to form a set of training subsets; (2) a classifier-based model is constructed on each subset; and (3) the final model is formed by aggregating all classifier-based models.
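The three steps can be sketched as follows. To keep the example self-contained, the base learner is a one-feature threshold rule (a decision stump) standing in for the functional tree; the function names, the number of models, the random seed, and the majority-vote aggregation are assumptions for illustration.

```python
import random

def fit_stump(X, y):
    """Best single-feature threshold rule (a stand-in base learner, not an FT)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) >= 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda row: 1 if sign * (row[f] - t) >= 0 else 0

def bagging_fit(X, y, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        # Step 1: bootstrap sample, n rows drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: one base model per bootstrap sample.
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    # Step 3: aggregate the base models by majority vote.
    votes = sum(model(row) for model in models)
    return 1 if 2 * votes >= len(models) else 0

# Toy data: one factor (e.g. slope in degrees); 1 = landslide, 0 = non-landslide.
X = [[5.0], [10.0], [15.0], [30.0], [35.0], [40.0]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y)
```

Individual stumps differ because each sees a different bootstrap sample; the vote averages out those differences, which is the source of the variance reduction.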
AdaBoost
AdaBoost (adaptive boosting) is a relatively new machine learning ensemble algorithm proposed by Freund and Schapire (1997). In contrast to Bagging, where training subsets are sampled independently of previous steps, the adaptive boosting ensemble obtains training subsets sequentially. Compared with Bagging, AdaBoost provides control over both bias and variance; however, Bagging achieves better variance reduction (Ganjisaffar et al. 2011). The AdaBoost procedure is as follows: (1) a subset is generated from the training dataset, with all instances assigned equal weights, and an initial classifier-based model is constructed; (2) the initial model is used to predict all instances in the training dataset, and misclassified instances are assigned higher weights, whereas the weights of correctly classified instances are left unchanged; (3) the weights of all instances in the training dataset are normalized, and a new subset is randomly sampled to build the next classifier-based model. This process continues until a termination condition is reached (Tien Bui et al. 2013a). The final model is a weighted sum of all the classifier-based models.
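A compact sketch of the boosting loop is shown below. It differs from the description above in two labeled ways: the base learner is a weighted decision stump rather than a functional tree, and instance weights are passed directly to the learner instead of being used to resample a subset. The weight update raises misclassified weights and lowers correct ones before normalizing, which after normalization has the same effect as raising only the misclassified weights. All names are assumptions for illustration.

```python
import math

def fit_weighted_stump(X, y, w):
    """Threshold rule with the lowest weighted error (stand-in base learner)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (1 if sign * (row[f] - t) >= 0 else 0) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    err, f, t, sign = best
    return err, (lambda row: 1 if sign * (row[f] - t) >= 0 else 0)

def adaboost_fit(X, y, n_rounds=10):
    n = len(X)
    w = [1.0 / n] * n  # start with equal instance weights
    ensemble = []
    for _ in range(n_rounds):
        err, model = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        if err >= 0.5:  # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this model
        ensemble.append((alpha, model))
        # Emphasize the instances this model got wrong, then normalize.
        w = [wi * math.exp(alpha if model(row) != yi else -alpha)
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    score = sum(alpha * (1 if model(row) == 1 else -1) for alpha, model in ensemble)
    return 1 if score >= 0 else 0

# Toy data that no single threshold can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, 0, 0, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, row) for row in X]
```

On this toy set a single stump classifies at most four of the six points correctly, whereas the weighted vote of three stumps reproduces all six labels.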
MultiBoost
MultiBoost is an extension of the AdaBoost algorithm that combines the strengths of Boosting and Wagging to prevent overfitting (Webb 2000). Wagging is a variant of Bagging that does not use random bootstrap samples to form the training subsets; instead, it assigns random weights to the cases in each training subset. The MultiBoost procedure is as follows: (1) training subsets are built from the training dataset by random selection with replacement and are used to build classifier-based models; (2) the instance weights are reset according to the overall accuracy of the classifier-based models; (3) new subsets are sampled repeatedly according to the instance weights to train further classifier-based models, and the result is a committee of classifiers.
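The wagging step that distinguishes MultiBoost from AdaBoost can be sketched as below. Drawing the random weights from an exponential distribution with mean 1 and rescaling them to sum to the number of instances are assumptions made here for illustration; the exact weighting scheme is specified by Webb (2000). The function name is hypothetical.

```python
import random

def wagging_weights(n_instances, rng):
    """Random positive instance weights, rescaled so that they average 1."""
    raw = [rng.expovariate(1.0) for _ in range(n_instances)]
    scale = n_instances / sum(raw)
    return [r * scale for r in raw]

rng = random.Random(7)
# In a MultiBoost-style loop, each sub-committee of boosted models would start
# from a fresh set of random weights instead of the uniform AdaBoost weights.
weights_first = wagging_weights(6, rng)
weights_second = wagging_weights(6, rng)
```

Every instance keeps a non-negative weight, so no case is dropped outright as it can be in a bootstrap sample, while the random perturbation still makes successive sub-committees differ.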
Performance assessment and comparison of landslide susceptibility models
Accuracy, Sensitivity, and Specificity are the three statistical evaluation measures generally used to assess the overall performance of the landslide susceptibility models (Tien Bui et al. 2016b). Accuracy is the proportion of pixels that are classified correctly. Sensitivity is the proportion of landslide pixels that are correctly classified whereas Specificity is the proportion of the non-landslide pixels that are correctly classified.
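\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}\]
where true positives (TP) and true negatives (TN) are the numbers of pixels that are correctly classified, and false positives (FP) and false negatives (FN) are the numbers of pixels that are erroneously classified.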
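The three measures follow directly from the four confusion-matrix counts, as in the sketch below. The function name and the toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels (1 = landslide)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of landslide pixels found
        "specificity": tn / (tn + fp),  # share of non-landslide pixels found
    }

# Toy example: 4 landslide and 6 non-landslide pixels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Here three of four landslide pixels and four of six non-landslide pixels are recovered, giving a sensitivity of 0.75, a specificity of about 0.67, and an accuracy of 0.7.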
The overall performance of the landslide susceptibility models is assessed using the receiver operating characteristic (ROC) curve. The ROC curve is constructed by plotting the true positive rate against the false positive rate in a two-dimensional space (Fawcett 2006). The ROC technique is attractive because it is insensitive to changes in class distribution: if the proportions of landslide and non-landslide pixels in the validation dataset vary, the ROC curve remains unchanged. The area under the ROC curve (AUC) is a summary measure of the ROC analysis that quantifies (1) the goodness-of-fit of the landslide models on the training dataset and (2) the prediction capability of the landslide models on the validation data. An AUC of 1 indicates a perfect model, whereas an AUC of 0.5 indicates a non-informative model. The closer the AUC is to 1, the better the landslide model.
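One way to compute the AUC without drawing the curve is through its rank interpretation: the AUC equals the probability that a randomly chosen landslide pixel receives a higher susceptibility score than a randomly chosen non-landslide pixel, with ties counted as one half. The sketch below uses this pairwise form for clarity rather than speed; the function name and the toy scores are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
# Doubling the non-landslide pixels changes the class balance but not the AUC.
auc_more_negatives = roc_auc([1, 1, 0, 0, 0, 0], [0.9, 0.4, 0.6, 0.1, 0.6, 0.1])
```

Three of the four landslide/non-landslide pairs are ordered correctly in the toy data, so the AUC is 0.75 in both cases, which illustrates the insensitivity to class distribution.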
The assessment of performance of models using only the ROC curve analysis may not be the best approach. This is because the models with a high AUC value may not be necessarily associated with a high spatial accuracy of the models in some cases (Aguirre-Gutiérrez et al. 2013). Therefore, in this study, the prediction–rate curve method (Chung and Fabbri 2003) was further used. The prediction–rate results were obtained by overlaying the landslide pixels of the validation dataset with landslide susceptibility maps, and then the prediction–rate curve was constructed by plotting the cumulative percentage of landslide susceptibility maps and the cumulative percentage of the landslide pixels. The area under the prediction–rate curve (AUC_P) was used to quantify the prediction capability of the landslide models and when the AUC_P is equal to 1, it indicates perfect prediction accuracy.
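The construction of the prediction–rate curve can be sketched as follows: cells are ranked from the most to the least susceptible, and the cumulative share of validation landslide pixels is accumulated against the cumulative share of the map. The function name, the trapezoidal integration, and the toy values are assumptions for illustration.

```python
def prediction_rate_curve(susceptibility, is_landslide):
    """Return cumulative map share, cumulative landslide share, and the area
    under the resulting curve (trapezoidal rule)."""
    order = sorted(range(len(susceptibility)),
                   key=lambda i: susceptibility[i], reverse=True)
    total_landslides = sum(is_landslide)
    xs, ys, captured = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        captured += is_landslide[i]
        xs.append(rank / len(order))          # share of the map examined so far
        ys.append(captured / total_landslides)  # share of landslides captured
    area = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2
               for k in range(1, len(xs)))
    return xs, ys, area

# Toy map of ten cells, three of which contain validation landslide pixels.
susceptibility = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_landslide = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
xs, ys, area = prediction_rate_curve(susceptibility, is_landslide)
```

In this toy case all landslide cells fall within the most susceptible 40 % of the map, and the area under the curve is about 0.82.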
Results and analysis
Feature selection and correlation analysis
Using Information Gain, the predictive ability of the ten conditioning factors was quantified, and the result is shown in Table 2, in which the average merit is the average Information Gain, reported with its SD, under ten-fold cross-validation. Distance to roads has the highest Information Gain (0.266), followed by slope (0.09), aspect (0.048), toposhape (0.045), TWI (0.043), relief amplitude (0.04), distance to rivers (0.038), rainfall (0.031), lithology (0.029), and distance to faults (0.014). Since all ten factors have positive Information Gain, all of them were included in this analysis.
The result of the Spearman correlation analysis of the ten conditioning factors is shown in Table 3. No strong correlation exists among these factors: the highest correlation value, 0.497, is between slope and relief amplitude. This value is less than the critical value of 0.7 (Martín et al. 2012); therefore, none of the ten factors was eliminated in this analysis.
Performance assessment of landslide susceptibility models
The performance of the FT model may be influenced by the minimum number of instances per leaf; therefore, a test was carried out by varying the number of instances per leaf and recording the classification accuracy on both the training and validation data (Tien Bui et al. 2012a). The result showed that 30 instances per leaf is the best setting for this study. For building the FT model, LogitBoost with 15 iterations (the default parameter) was used. Using ten-fold cross-validation, the FT model was constructed with the standard top-down approach. Accordingly, at each internal node, the splitting was carried out using the gain ratio, and logistic regression models were then constructed for the leaves of the FT model.
The resulting FT model for the assessment of landslide susceptibility is shown in Fig. 6. It can be seen that the size of the tree is 71, including (1) the root node (orange color); (2) 34 internal nodes (purple color); and (3) 36 leaves (green rectangular boxes). In the leaves, LS denotes the landslide class, No-LS denotes the non-landslide class, and FT indicates FT number. The highest number of instances in a leaf node in the FT model is 508, whereas the smallest number of instances in a leaf node is 62.
As an example, the label FT25:15/210(152) in Fig. 6 is read as follows: (1) the first number (15) is the number of LogitBoost iterations performed at this node; (2) the second number (210) is the total number of LogitBoost iterations performed, including iterations at higher levels in the tree; and (3) the number in parentheses (152) is the number of training instances at this node (Fig. 6). The functional trees for node 25 are:
Since the aim of this study is to propose and verify three novel ensemble frameworks (Bagging, AdaBoost, and MultiBoost) for landslide susceptibility modeling, three ensemble models using FT as the base classifier were constructed, and the results are shown in Table 4. All three ensemble algorithms improved the model performance and have a higher goodness-of-fit to the training data than the FT model. The best fit to the training data is achieved by the FT with AdaBoost model (96.1 %), followed by the FT with MultiBoost model (95.9 %), the FT with Bagging model (94.6 %), and the FT model (91.5 %). The FT with AdaBoost model also has the highest overall classification accuracy (90.919 %), followed by the FT with MultiBoost model (90.685 %), the FT with Bagging model (88.563 %), and the FT model (87.7 %).
The FT with AdaBoost model has the highest sensitivity (93.492 %), indicating that 93.492 % of the landslide pixels are correctly classified into the landslide class. It is closely followed by the FT with MultiBoost model (92.844 %), the FT model (90.076 %), and the FT with Bagging model (89.824 %). Regarding specificity, the three ensemble models have almost equal values, meaning that their probability of correctly classifying non-landslide pixels into the non-landslide class is almost the same. The Kappa index of the four susceptibility models varies from 0.754 (the FT model) to 0.818 (the FT with AdaBoost model), indicating good agreement between the models and the training data.
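The statistical measures reported above can all be derived from a two-class confusion matrix. The following is a minimal sketch under stated assumptions, not the code used in the study: the function name and the pixel counts in the example are hypothetical, and sensitivity and specificity are taken in their standard sense (true-positive rate and true-negative rate).

```python
def classification_measures(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Cohen's kappa from pixel counts.

    tp: landslide pixels classified as landslide
    tn: non-landslide pixels classified as non-landslide
    fp: non-landslide pixels classified as landslide
    fn: landslide pixels classified as non-landslide
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total

    # Agreement expected by chance, from the row and column marginals.
    p_landslide = ((tp + fn) / total) * ((tp + fp) / total)
    p_non_landslide = ((tn + fp) / total) * ((tn + fn) / total)
    p_expected = p_landslide + p_non_landslide

    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only (not taken from Table 4).
measures = classification_measures(tp=90, tn=85, fp=15, fn=10)
print(measures)
```

When the landslide and non-landslide classes contain the same number of pixels, the chance agreement equals 0.5 and kappa reduces to 2 × accuracy − 1. The reported training values appear consistent with this relation (for the FT model, 2 × 0.877 − 1 = 0.754).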
Once the FT and the three ensemble models were successfully built in the training phase, they were used to calculate the susceptibility index for all pixels in the study area. These indices were exported to a GIS format using an application developed in C++ and then opened in ArcGIS 10.2. For visualization, the susceptibility indices were grouped into five susceptibility levels: very high, high, moderate, low, and very low (Chung et al. 1995). Although various methods can be used to classify susceptibility indices, such as the equal interval method, the natural break method, and the standard deviation (SD) method (Ayalew and Yamagishi 2005), the classification method based on the graphical curve (Chung and Fabbri 2008; Tien Bui et al. 2012e; Van Westen et al. 2003) is considered the most widely used and was adopted in this study.
In this method, all landslide pixels were first overlaid on the four landslide susceptibility maps. Then, the cumulative percentage of landslide pixels versus the percentage of landslide susceptibility indices was calculated, and finally, the graphical curve was derived. A detailed explanation of how to build the graphical curve can be found in Chung et al. (1995) and Chung and Fabbri (2008). Based on the graphical curves (Fig. 7), five susceptibility classes were determined: very high (5 %), high (10 %), moderate (15 %), low (20 %), and very low (50 %).
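The five-class split can be illustrated with a short sketch. It assumes that the percentages are shares of the study-area pixels ranked by descending susceptibility index (the top 5 % are "very high", the next 10 % "high", and so on); this reading, the function name, and the toy data are assumptions, and the sketch does not reproduce the GIS workflow used in the study.

```python
# Class shares in percent of ranked pixels, highest susceptibility first.
DEFAULT_SHARES = (
    ("very high", 5),
    ("high", 10),
    ("moderate", 15),
    ("low", 20),
    ("very low", 50),
)


def classify_by_area_share(indices, shares=DEFAULT_SHARES):
    """Assign each pixel a class label from its rank in the sorted indices."""
    n = len(indices)
    # Pixel positions ordered from highest to lowest susceptibility index.
    order = sorted(range(n), key=lambda i: indices[i], reverse=True)

    # Cumulative upper bounds in percent: 5, 15, 30, 50, 100.
    bounds = []
    cumulative = 0
    for name, share in shares:
        cumulative += share
        bounds.append((name, cumulative))

    labels = [None] * n
    for rank, i in enumerate(order, start=1):
        for name, upper in bounds:
            # Integer comparison of rank/n against upper/100 avoids rounding.
            if 100 * rank <= upper * n:
                labels[i] = name
                break
    return labels


# Toy example with 20 pixels whose indices increase from 0.00 to 0.95.
toy_indices = [k / 20 for k in range(20)]
print(classify_by_area_share(toy_indices))
```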
Model validation and comparison
The prediction capability of the four susceptibility models was evaluated and compared using the validation dataset, which was not used in the training phase. The results are shown in Table 5 and Fig. 8. The FT with Bagging model has the highest AUC (0.917), indicating a prediction accuracy of 91.7 %, followed closely by the FT with MultiBoost model (91 %), the FT model (89.8 %), and the FT with AdaBoost model (88.2 %). The FT with AdaBoost model has the lowest Kappa index (0.604), whereas the FT with Bagging model has the highest one (0.711) (Table 5).
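For readers who want to see what the AUC measures, it can be computed directly as the probability that a randomly chosen landslide pixel receives a higher susceptibility index than a randomly chosen non-landslide pixel. The sketch below uses this rank-based formulation; the function name and the scores are hypothetical, and the quadratic loop is meant for illustration rather than for full-size raster data.

```python
def auc_from_scores(landslide_scores, non_landslide_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Each (landslide, non-landslide) pair counts 1 if the landslide pixel
    has the higher score, 0.5 for a tie, and 0 otherwise.
    """
    wins = 0.0
    for pos in landslide_scores:
        for neg in non_landslide_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(landslide_scores) * len(non_landslide_scores))


# Hypothetical susceptibility indices, for illustration only.
landslide = [0.9, 0.8, 0.4]
non_landslide = [0.7, 0.3, 0.2]
print(auc_from_scores(landslide, non_landslide))
```

In this toy case eight of the nine pairs are ordered correctly, so the AUC is 8/9.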
The detailed statistical measures of the validation results are shown in Table 5. The highest classification accuracy is for the FT with Bagging model (85.552 %), whereas the lowest is for the FT with AdaBoost model (80.208 %). The classification accuracy is almost equal for the FT with MultiBoost model (83.869 %) and the FT model (83.671 %). The FT with Bagging model has the highest sensitivity (81.998 %), indicating that the probability of correctly classifying landslide pixels into the landslide class is 81.998 %, followed by the FT model (81.503 %), the FT with MultiBoost model (76.855 %), and the FT with AdaBoost model (68.447 %). The highest specificity is for the FT with AdaBoost model (91.98 %), indicating that 91.98 % of the non-landslide pixels are correctly classified into the non-landslide class. It is closely followed by the FT with MultiBoost model (90.891 %) and the FT with Bagging model (89.109 %). The lowest specificity is for the FT model (85.842 %), indicating that the probability of correctly classifying non-landslide pixels into the non-landslide class is 85.842 %.
The prediction rate of the four susceptibility models was assessed using the spatial cross-validation procedure described in Sect. 3.5. The areas under the prediction–rate curves (AUC_P) were then estimated and are shown in Fig. 9. The FT with Bagging and the FT with MultiBoost models have the highest prediction capability (89.7 %), followed by the FT model (86.2 %) and the FT with AdaBoost model (85.6 %).
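A prediction–rate curve of this kind can be sketched as follows. The sketch assumes the common construction in which pixels are ranked by descending susceptibility index, the x-axis is the cumulative share of the study area, and the y-axis is the cumulative share of validation landslide pixels captured; the function names and the toy data are hypothetical and do not reproduce the procedure of Sect. 3.5 in detail.

```python
def prediction_rate_curve(indices, is_landslide):
    """Return (xs, ys): cumulative area share vs. cumulative landslide share.

    indices: susceptibility index of every pixel in the study area
    is_landslide: 1 if the pixel is a validation landslide pixel, else 0
    """
    n = len(indices)
    total_landslides = sum(is_landslide)
    order = sorted(range(n), key=lambda i: indices[i], reverse=True)

    xs, ys = [0.0], [0.0]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += is_landslide[i]
        xs.append(rank / n)
        ys.append(captured / total_landslides)
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal area under a piecewise-linear curve."""
    area = 0.0
    for k in range(1, len(xs)):
        area += (xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
    return area


# Toy map with 10 pixels, 3 of which are validation landslide pixels.
toy_indices = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_flags = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
xs, ys = prediction_rate_curve(toy_indices, toy_flags)
print(area_under_curve(xs, ys))
```

Unlike the ROC-based AUC, this curve uses only the landslide pixels of the validation set on the y-axis, with the whole study area on the x-axis.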
Based on the aforementioned results, it could be concluded that the FT with Bagging is the best model for landslide susceptibility mapping in this study.
Similarities and dissimilarities of the four landslide susceptibility maps and their classes
To evaluate similarities and dissimilarities of the geographic patterns in the five classes of the four landslide susceptibility maps, three Kappa statistics (Kappa index, Kappa location, and Kappa histogram) were used. This task was carried out using the Map Comparison Kit (Visser and de Nijs 2006). Kappa (Cohen 1960), which is based on the level of agreement, is widely used to measure the similarity between a pair of landslide susceptibility maps. Kappa location (Pontius 2000) and Kappa histogram (Hagen 2002) are extensions of the Kappa index. Kappa location compares the actual success rate to the success rate expected by chance, to assess the similarity of location regarding the spatial distribution of categories on the maps (Pontius 2000). Kappa histogram measures quantitative similarity (fraction of pixels) based on the histograms of the two maps (Prasad et al. 2006). The values of the Kappa statistics vary from 0 to 1. A value of 1 indicates that two classes are identical (total agreement), while a value of 0 indicates no agreement between two classes. The degree of agreement between two classes for a given Kappa is as follows: 0.8–1.0 almost perfect, 0.6–0.8 substantial, 0.4–0.6 moderate, 0.2–0.4 fair, 0–0.2 slight, and ≤0 poor (Landis and Koch 1977).
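The three statistics can be illustrated with a small sketch. It uses the commonly cited definitions in which Kappa is the product of Kappa location and Kappa histogram; whether the Map Comparison Kit computes them in exactly this way is an assumption here, and the function names, the treatment of interval boundaries, and the toy maps are hypothetical.

```python
from collections import Counter


def kappa_statistics(map_a, map_b):
    """Return (kappa, kappa_location, kappa_histogram) for two class maps.

    map_a, map_b: equal-length sequences of class labels, one per pixel.
    Assumes the two maps are not identical single-class maps, otherwise
    the denominators below become zero.
    """
    n = len(map_a)
    count_a = Counter(map_a)
    count_b = Counter(map_b)
    classes = set(count_a) | set(count_b)

    # Observed fraction of pixels with the same class on both maps.
    p_observed = sum(a == b for a, b in zip(map_a, map_b)) / n
    # Agreement expected by chance from the class histograms.
    p_expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)
    # Maximum agreement attainable given the two histograms.
    p_max = sum(min(count_a[c], count_b[c]) for c in classes) / n

    kappa = (p_observed - p_expected) / (1 - p_expected)
    kappa_location = (p_observed - p_expected) / (p_max - p_expected)
    kappa_histogram = (p_max - p_expected) / (1 - p_expected)
    return kappa, kappa_location, kappa_histogram


def landis_koch_label(kappa):
    """Verbal agreement level following the intervals quoted in the text."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"


# Two toy maps with classes L (low), M (moderate), and H (high).
map_1 = ["L", "L", "M", "M", "H", "H"]
map_2 = ["L", "M", "M", "M", "H", "L"]
k, k_loc, k_hist = kappa_statistics(map_1, map_2)
print(k, k_loc, k_hist, landis_koch_label(k))
```

For the toy maps, Kappa is 0.5, Kappa location is 2/3, and Kappa histogram is 0.75, so the maps agree more in class proportions than in where the classes are located.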
Table 6 shows the results of the comparison of the four landslide susceptibility maps in terms of Kappa statistics. The Kappa indexes for the four susceptibility maps vary from 0.246 to 0.423, indicating that the similarity between the four susceptibility maps is low. Looking at the Kappa index values for the susceptibility classes (Table 6), the highest similarity is in the very high class obtained from the FT and the FT with Bagging models (Kappa index of 0.810). The largest dissimilarity is for the low susceptibility classes produced by the FT and the FT with MultiBoost models (Kappa index of 0.057). The highest value of Kappa location is 0.482, for the two maps obtained from the FT with AdaBoost and the FT with MultiBoost models, indicating that the similarity of the spatial distributions of susceptibility indexes over the two maps is moderate, whereas the very high classes of the FT and the FT with Bagging models have the highest similarity in terms of spatial distribution. The largest dissimilarity in spatial distribution is for the low susceptibility classes obtained from the FT and the FT with AdaBoost models (Kappa location of 0.073). The values of Kappa histogram are generally high when comparing the four susceptibility maps, indicating a near-perfect quantitative similarity. An interpretation of the Kappa histogram values for the five susceptibility classes shows that the highest quantitative dissimilarity (Kappa histogram of 0.521) is for the low susceptibility classes of two pairs of models: the FT and the FT with MultiBoost models, and the FT with Bagging and the FT with MultiBoost models.
Discussion and conclusion
Landslide susceptibility maps are of great help in land use planning, hazard management, and mitigation (Burby 1998); therefore, these maps should be constructed using prediction models with high accuracy. However, a perfect landslide model with no error is almost impossible; therefore, new algorithms and frameworks that may help to increase the prediction performance of landslide models should be explored and verified. We address this issue in this paper by proposing and verifying a new ensemble methodology for landslide susceptibility modeling based on FT and three ensemble frameworks, AdaBoost, Bagging, and MultiBoost. The study focuses on three main aims: (1) feature selection and variable importance for landslide conditioning factors using the Information Gain technique; (2) exploration, for the first time, of the potential application of the FT and the three ensemble techniques for the assessment of landslide susceptibility at the corridor of the National Road No. 32 (Vietnam); and (3) assessment of similarities and dissimilarities of the landslide susceptibility maps and their susceptibility classes using Kappa index, Kappa location, and Kappa histogram.
In landslide modeling, the predictive ability of a set of widely used conditioning factors should be quantified (Tien Bui et al. 2016c). Although various techniques and methods have been proposed for feature selection, such as linear correlation (Irigaray et al. 2007), the Goodman-Kruskal and Kolmogorov–Smirnov tests (Costanzo et al. 2012; Fernández et al. 2003), and the GIS matrix combination method (Cross 2002), none of them is widely accepted as the standard guideline for the assessment of landslide susceptibility. The result in this study shows that the Information Gain technique could be used for feature selection. The main advantage of this technique is that it measures the decrease in entropy of the output (landslide and non-landslide classes) when the output is associated with a landslide conditioning factor, and uses this decrease to assess the importance of the factor. The larger the decrease in entropy, the more informative the conditioning factor. This study shows that all ten conditioning factors have significant predictive ability, indicating that the collection, processing, and coding of these factors have been carried out successfully. Distance to roads and slope are the most important factors, which is a logical and reasonable result, because this study mainly investigated landslides that occurred in the corridor of the National Road No. 32, and slope is widely accepted as the most important factor in the literature (Costanzo et al. 2012; Van Den Eeckhaut et al. 2006).
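The entropy decrease described above can be written out in a few lines. The sketch below computes the information gain of one categorical (already classified) conditioning factor with respect to the landslide and non-landslide labels; the function names and the toy factor classes are hypothetical, and the implementation used in the study may differ, for example in how continuous factors are discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(factor_classes, labels):
    """Entropy of the labels minus their entropy conditioned on the factor."""
    n = len(labels)
    groups = {}
    for factor_class, label in zip(factor_classes, labels):
        groups.setdefault(factor_class, []).append(label)

    # Weighted average entropy of the labels within each factor class.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: 1 = landslide pixel, 0 = non-landslide pixel.
labels = [1, 1, 1, 0, 0, 0]
slope_class = ["steep", "steep", "steep", "gentle", "gentle", "gentle"]
aspect_class = ["north", "south", "north", "south", "north", "south"]

print(information_gain(slope_class, labels))   # factor separates the classes
print(information_gain(aspect_class, labels))  # factor carries little information
```

In the toy data the slope classes separate landslide from non-landslide pixels completely, so the gain equals the full label entropy of 1 bit, while the aspect classes reduce the entropy only slightly.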
Using the ten conditioning factors, four landslide susceptibility maps were produced using the FT and the three ensemble techniques. The four susceptibility models performed reasonably well, with high degrees of fit and high prediction capabilities. The FT model, with its visible structure, provided useful insights into how the model works. The AUC for the FT model shows a high degree of fit on the training dataset (91.5 %). The degree of fit improved further when the FT was integrated with the three ensemble techniques: the AUC increased by 3.1 % for the FT with Bagging, 4.6 % for the FT with AdaBoost, and 4.4 % for the FT with MultiBoost. The prediction power of the FT with Bagging and the FT with MultiBoost models also improved by 1.9 and 1.2 %, respectively, compared to the FT model. In contrast, the prediction power of the FT with AdaBoost is 1.6 % lower than that of the FT model. Therefore, the Bagging and the MultiBoost ensemble frameworks should be used for landslide susceptibility modeling. In fact, Bagging and MultiBoost are well-recognized techniques in soft computing modeling that make it possible not only to improve a single classifier but also to deal with complex and high-dimensional modeling problems (Trawiński et al. 2013). In general, the findings of this study agree with Althuwaynee et al. (2014), Jebur et al. (2014), and Tien Bui et al. (2014), who state that ensemble models outperform the single model.
The prediction power of the four susceptibility models was further estimated using the prediction–rate method, which uses only the landslide pixels in the validation set. The FT with Bagging and the FT with MultiBoost models have the highest prediction power (89.7 %), followed by the FT model (86.2 %) and the FT with AdaBoost model (85.6 %). The prediction power of all the models obtained with the prediction–rate method is slightly lower than that calculated using the ROC curve method. The largest difference is for the FT model (3.6 %), followed by the FT with AdaBoost model (2.6 %), the FT with Bagging model (2.0 %), and the FT with MultiBoost model (1.3 %). These differences arise because the ROC curve analysis used the entire validation dataset (1011 landslide and 1011 non-landslide pixels), whereas the prediction–rate method used only the 1011 landslide pixels in the validation dataset to estimate the area under the curve for the four susceptibility maps. In fact, the ROC curve and AUC in landslide susceptibility models are affected by several factors: (1) the methods or techniques used; (2) the selection of conditioning factors; (3) the landslide inventory map; and (4) the characteristics of the study area. Consequently, AUC values and the prediction capability of the susceptibility models may not correspond strictly; therefore, the prediction–rate method should be considered as well.
To evaluate the geographic consistency of the susceptibility index distributions, Kappa index, Kappa location, and Kappa histogram should be used. These statistics help to reveal similarities and dissimilarities of the four landslide susceptibility maps and their classes. For example, although the performances of the FT with Bagging and the FT with MultiBoost models are almost the same, the similarity of the spatial distributions of susceptibility indexes over the two maps is only moderate. However, the degree of similarity is high for the high landslide susceptibility classes, whereas the low susceptibility classes are dissimilar.
Overall, the result from this study clearly shows that the FT with Bagging model has the highest accuracy. Compared with the susceptibility models produced by the same authors using well-known soft computing algorithms such as J48 Decision Tree (Tien Bui et al. 2013a) and artificial neural networks (Tien Bui et al. 2013b), the prediction capability of the FT with Bagging model is better. Therefore, we conclude that the FT with Bagging is a promising technique that should be considered as an alternative for the assessment of landslide susceptibility. Since these results are representative of the currently implemented versions of these techniques, the performance of the susceptibility models may improve if the implementations of the algorithms change in the future. However, these results are representative only of the current study area. Investigations for other areas with different terrain and geological contexts should be further considered. As a final conclusion, the results from this study may be useful for land use planning and decision making in areas prone to landslides.
References
Aguirre-Gutiérrez J, Carvalheiro LG, Polce C, van Loon EE, Raes N, Reemer M, Biesmeijer JC (2013) Fit-for-purpose: species distribution model performance depends on evaluation criteria—Dutch hoverflies as a case study. PLoS One 8:e63708. doi:10.1371/journal.pone.0063708
Ajit Krisshna NL, Deepak VK, Manikantan K, Ramachandran S (2014) Face recognition using transform domain feature extraction and PSO-based feature selection. Appl Soft Comput 22:141–161. doi:10.1016/j.asoc.2014.05.007
Althuwaynee OF, Pradhan B, Park H-J, Lee JH (2014) A novel ensemble decision tree-based CHi squared Automatic Interaction Detection (CHAID) and multivariate logistic regression models in landslide susceptibility mapping. Landslides 11:1063–1078
Ayalew L, Yamagishi H (2005) The application of GIS-based logistic regression for landslide susceptibility mapping in the Kakuda-Yahiko Mountains, Central Japan. Geomorphology 65:15–31. doi:10.1016/j.geomorph.2004.06.010
Beven KJ, Kirkby MJ (1979) A physically based, variable contributing area model of basin hydrology/Un modèle à base physique de zone d’appel variable de l’hydrologie du bassin versant. Hydrol Sci Bull 24:43–69. doi:10.1080/02626667909491834
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Burby RJ (1998) Cooperating with nature: confronting natural hazards with land-use planning for sustainable communities. Joseph Henry Press, Washington
Caniani D, Pascale S, Sdao F, Sole A (2008) Neural networks and landslide susceptibility: a case study of the urban area of Potenza. Nat Hazards 45:55–72. doi:10.1007/s11069-007-9169-3
Cheng M-Y, Hoang N-D (2015) A Swarm-Optimized Fuzzy Instance-based Learning approach for predicting slope collapses in mountain roads. Knowl Based Syst 76:256–263
Chung CJF, Fabbri AG (2003) Validation of spatial prediction models for landslide hazard mapping. Nat Hazards 30:451–472
Chung C-J, Fabbri AG (2008) Predicting landslides for risk analysis—spatial models tested by a cross-validation technique. Geomorphology 94:438–452. doi:10.1016/j.geomorph.2006.12.036
Chung CJF, Fabbri AG, Van Westen CJ (1995) Multivariate regression analysis for landslide hazard zonation. In: Carrara A, Guzzetti F (eds) Geographical information systems in assessing natural hazards, vol 5. Springer, New York, pp 107–133
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46. doi:10.1177/001316446002000104
Costanzo D, Rotigliano E, Irigaray C, Jiménez-Perálvarez JD, Chacón J (2012) Factors selection in landslide susceptibility modelling on large scale following the gis matrix method: application to the river Beiro basin (Spain). Nat Hazards Earth Syst Sci 12:327–340. doi:10.5194/nhess-12-327-2012
Costanzo D, Chacón J, Conoscenti C, Irigaray C, Rotigliano E (2014) Forward logistic regression for earth-flow landslide susceptibility assessment in the Platani river basin (southern Sicily, Italy). Landslides 11:639–653. doi:10.1007/s10346-013-0415-3
Cross M (2002) Landslide susceptibility mapping using the Matrix Assessment Approach: a Derbyshire case study. In: Griffiths JS (ed) Mapping in engineering geology, vol 15. The Geological society, Key Issue in Earth Sciences, London, pp 247–261
Dai J, Xu Q (2013) Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification. Appl Soft Comput 13:211–221. doi:10.1016/j.asoc.2012.07.029
Dai F, Lee C, Li J, Xu Z (2001) Assessment of landslide susceptibility on the natural terrain of Lantau Island, Hong Kong. Environ Geol 40:381–391
Dai FC, Lee CF, Ngai YY (2002) Landslide risk assessment and management: an overview. Eng Geol 64:65–87
Doetsch P et al (2009) Logistic model trees with AUC split criterion for the KDD cup 2009 small challenge. In KDD Cup, pp 77–88
Doshi M, Chaturvedi SK (2014) Correlation based feature selection (CFS) technique to predict student performance. Int J Comput Netw Commun (UCNC) 6:197–206
Dou J et al (2015) Optimization of causative factors for landslide susceptibility evaluation using remote sensing and GIS data in parts of Niigata, Japan. PLoS One 10:e0133262. doi:10.1371/journal.pone.0133262
Ercanoglu M (2005) Landslide susceptibility assessment of SE Bartin (West Black Sea region, Turkey) by artificial neural networks. Nat Hazards Earth Syst Sci 5:979–992
Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27:861–874
Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (1996) Advances in knowledge discovery and data mining. AAAI press, Menlo Park, California (USA)
Felicisimo A, Cuartero A, Remondo J, Quiros E (2013) Mapping landslide susceptibility with logistic regression, multiple adaptive regression splines, classification and regression trees, and maximum entropy methods: a comparative study. Landslides 10:175–189. doi:10.1007/s10346-012-0320-1
Fernández T, Irigaray C, El Hamdouni R, Chacón J (2003) Methodology for landslide susceptibility mapping by means of a GIS. Application to the Contraviesa Area (Granada, Spain). Nat Hazards 30:297–308. doi:10.1023/B:NHAZ.0000007092.51910.3f
Floris M, Iafelice M, Squarzoni C, Zorzi L, Agostini AD, Genevois R (2011) Using online databases for landslide susceptibility assessment: an example from the Veneto Region (northeastern Italy). Nat Hazards Earth Syst Sci 11:1915–1925
Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139. doi:10.1006/jcss.1997.1504
Gama J (2004) Functional trees. Mach Learn 55:219–250
Ganjisaffar Y, Caruana R, Lopes CV (2011) Bagging gradient-boosted trees for high precision, low variance ranking models. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 85–94
Gauthier TD (2001) Detecting trends using Spearman’s rank correlation coefficient. Environ Forensics 2:359–362. doi:10.1080/713848278
Gomez H, Kavzoglu T (2005) Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River Basin, Venezuela. Eng Geol 78:11–27. doi:10.1016/j.enggeo.2004.10.004
Guzzetti F, Carrara A, Cardinali M, Reichenbach P (1999) Landslide hazard evaluation: a review of current techniques and their application in a multi-scale study, Central Italy. Geomorphology 31:181–216
Hagen A (2002) Multi-method assessment of map similarity. In: Proceedings of the fifth AGILE conference on geographic information science, Palma, Spain, pp 171–182
Highland L, Bobrowsky PT (2008) The landslide handbook: a guide to understanding landslides. US Geological Survey Reston
Ho TC et al (2010) Combination of structural geology, remote sensing, and GIS for the study of current status and prediction of flash floods and landslides at the National Road No. 32 section from the Yen Bai to the Lai Chau Provinces. Vietnam Institute of Geosciences and Mineral Resources, Hanoi
Hoang N-D, Tien Bui D (2016) A novel relevance vector machine classifier with cuckoo search optimization for spatial prediction of landslides. J Comput Civil Eng. doi:10.1061/(ASCE)CP.1943-5487.0000557
Hoang N-D, Tien Bui D, Liao K-W (2016) Groutability estimation of grouting processes with cement grouts using Differential Flower Pollination Optimized Support Vector Machine. Appl Soft Comput 45:173–186. doi:10.1016/j.asoc.2016.04.031
Hong H, Pradhan B, Xu C, Tien Bui D (2015a) Spatial prediction of landslide hazard at the Yihuang area (China) using two-class kernel logistic regression, alternating decision tree and support vector machines. Catena 133:266–281. doi:10.1016/j.catena.2015.05.019
Hong H, Xu C, Revhaug I, Tien Bui D (2015b) Spatial prediction of landslide hazard at the Yihuang Area (China): a comparative study on the predictive ability of backpropagation multi-layer perceptron neural networks and radial basic function neural networks. In: Robbi Sluter C, Madureira Cruz CB, Leal de Menezes PM (eds) Cartography—maps connecting the world. Lecture notes in geoinformation and cartography. Springer, Cham, pp 175–188. doi:10.1007/978-3-319-17738-0_13
Hong H, Chen W, Xu C, Youssef AM, Pradhan B, Tien Bui D (2016) Rainfall-induced landslide susceptibility assessment at the Chongren area (China) using frequency ratio, certainty factor, and index of entropy. Geocarto Int. doi:10.1080/10106049.2015.1130086
Ilia I, Tsangaratos P (2016) Applying weight of evidence method and sensitivity analysis to produce a landslide susceptibility map. Landslides 13:379–397
Irigaray C, Fernández T, El Hamdouni R, Chacón J (2007) Evaluation and validation of landslide-susceptibility maps obtained by a GIS matrix method: examples from the Betic Cordillera (southern Spain). Nat Hazards 41:61–79. doi:10.1007/s11069-006-9027-8
Jebur MN, Pradhan B, Tehrany MS (2014) Optimization of landslide conditioning factors using very high-resolution airborne laser scanning (LiDAR) data at catchment scale. Remote Sens Environ 152:150–165. doi:10.1016/j.rse.2014.05.013
Kavzoglu T, Sahin E, Colkesen I (2014) Landslide susceptibility mapping using GIS-based multi-criteria decision analysis, support vector machines, and logistic regression. Landslides 11:425–439. doi:10.1007/s10346-013-0391-7
Kavzoglu T, Kutlug Sahin E, Colkesen I (2015) An assessment of multivariate and bivariate approaches in landslide susceptibility mapping: a case study of Duzkoy district. Nat Hazards 76:471–496. doi:10.1007/s11069-014-1506-8
Kumar YJ, Salim N, Raza B (2012) Cross-document structural relationship identification using supervised machine learning. Appl Soft Comput 12:3124–3131. doi:10.1016/j.asoc.2012.06.017
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174
Landwehr N, Hall M, Frank E (2005) Logistic model trees. Mach Learn 59:161–205. doi:10.1007/s10994-005-0466-3
Lay MG (2009) Handbook of road technology. CRC Press, Boca Raton
Lee S, Ryu JH, Min KD, Won JS (2003) Landslide susceptibility analysis using GIS and artificial neural network. Earth Surf Proc Land 28:1361–1376. doi:10.1002/esp.593
Lee M-J, Choi J-W, Oh H-J, Won J-S, Park I, Lee S (2012) Ensemble-based landslide susceptibility maps in Jinbu area, Korea. Environ Earth Sci 67:23–37. doi:10.1007/s12665-011-1477-y
Lee S, Won J-S, Jeon SW, Park I, Lee MJ (2014) Spatial landslide hazard prediction using rainfall probability and a logistic regression model. Math Geosci 47:565–589
Lineback Gritzner M, Marcus WA, Aspinall R, Custer SG (2001) Assessing landslide potential using GIS, soil wetness modeling and topographic attributes, Payette River, Idaho. Geomorphology 37:149–165. doi:10.1016/S0169-555X(00)00068-4
Magliulo P, Di Lisio A, Russo F, Zelano A (2008) Geomorphology and landslide susceptibility assessment using GIS and bivariate statistics: a case study in southern Italy. Nat Hazards 47:411–435
Manzo G, Tofani V, Segoni S, Battistini A, Catani F (2013) GIS techniques for regional-scale landslide susceptibility assessment: the Sicily (Italy) case study. Int J Geogr Inf Sci 27:1433–1452
Martín B, Alonso JC, Martín CA, Palacín C, Magaña M, Alonso J (2012) Influence of spatial heterogeneity and temporal variability in habitat selection: a case study on a great bustard metapopulation. Ecol Model 228:39–48
Martínez-Álvarez F, Reyes J, Morales-Esteban A, Rubio-Escudero C (2013) Determining the best set of seismicity indicators to predict earthquakes. Two case studies: Chile and the Iberian Peninsula. Knowl Based Syst 50:198–210. doi:10.1016/j.knosys.2013.06.011
Maudes J, Rodriguez JJ, Garcia-Osorio C, Garcia-Pedrajas N (2012) Random feature weights for decision tree ensemble construction. Inf Fusion 13:20–30. doi:10.1016/j.inffus.2010.11.004
Mennis J, Guo D (2009) Spatial data mining and geographic knowledge discovery—an introduction Computers. Environ Urban Syst 33:403–408. doi:10.1016/j.compenvurbsys.2009.11.001
Myers L, Sirois MJ (2014) Spearman correlation coefficients, differences between. In: Wiley StatsRef: statistics reference online. Wiley. doi:10.1002/9781118445112.stat02802
Park I, Lee S (2014) Spatial prediction of landslide susceptibility using a decision tree approach: a case study of the Pyeongchang area, Korea. Int J Remote Sens 35:6089–6112
Passman MA et al (2011) Validation of Venous Clinical Severity Score (VCSS) with other venous severity assessment tools from the American Venous Forum, National Venous Screening Program. J Vasc Surg 54:2S–9S. doi:10.1016/j.jvs.2011.05.117
Pham B, Tien Bui D, Pourghasemi H, Indra P, Dholakia MB (2015) Landslide susceptibility assessment in the Uttarakhand area (India) using GIS: a comparison study of prediction capability of naïve bayes, multilayer perceptron neural networks, and functional trees methods. Theor Appl Climatol. doi:10.1007/s00704-015-1702-9
Pham BT, Pradhan B, Tien Bui D, Prakash I, Dholakia MB (2016a) A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area (India). Environ Model Softw. doi:10.1016/j.envsoft.2016.07.005
Pham BT, Tien Bui D, Prakash I, Dholakia MB (2016b) Rotation forest fuzzy rule-based classifier ensemble for spatial prediction of landslides using GIS. Nat Hazards. doi:10.1007/s11069-016-2304-2
Pontius RG (2000) Quantification error versus location error in comparison of categorical maps. Photogramm Eng Remote Sens 66:1011–1016
Pradhan B (2013) A comparative study on the predictive ability of the decision tree, support vector machine and neuro-fuzzy models in landslide susceptibility mapping using GIS. Comput Geosci 51:350–365. doi:10.1016/j.cageo.2012.08.023
Pradhan B, Lee S (2010) Landslide susceptibility assessment and factor effect analysis: backpropagation artificial neural networks and their comparison with frequency ratio and bivariate logistic regression modelling. Environ Model Softw 25:747–759. doi:10.1016/j.envsoft.2009.10.016
Pradhan B, Sezer EA, Gokceoglu C, Buchroithner MF (2010) Landslide susceptibility mapping by neuro-fuzzy approach in a landslide-prone area (Cameron Highlands, Malaysia). IEEE Trans Geosci Remote Sens 48:4164–4177. doi:10.1109/tgrs.2010.2050328
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Quinlan JR (1996) Improved use of continuous attributes in C4.5. J Artif Intell Res 4:77–90
Rodriguez JJ, Kuncheva LI, Alonso CJ (2006) Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 28:1619–1630. doi:10.1109/TPAMI.2006.211
Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33:1–39. doi:10.1007/s10462-009-9124-7
Senthamarai Kannan S, Ramaraj N (2010) A novel hybrid feature selection via Symmetrical Uncertainty ranking based local memetic search algorithm. Knowl Based Syst 23:580–585. doi:10.1016/j.knosys.2010.03.016
Shun B, Wenjia W (2006) Investigation on diversity in homogeneous and heterogeneous ensembles. In: International joint conference on neural networks, 2006. IJCNN’06. 16–21 July 2006, pp 3078–3085. doi:10.1109/IJCNN.2006.247268
Sørensen R, Zinko U, Seibert J (2006) On the calculation of the topographic wetness index: evaluation of different methods based on field observations. Hydrol Earth Syst Sci Dis 10:101–112
Suzen ML, Doyuran V (2004) A comparison of the GIS based landslide susceptibility assessment methods: multivariate versus bivariate. Environ Geol 45:665–679. doi:10.1007/s00254-003-0917-8
Tang C, Zhu J, Qi X (2010) Landslide hazard assessment of the 2008 Wenchuan earthquake: a case study in Beichuan area. Can Geotechn J 48:128–145
Tien Bui D, Lofman O, Revhaug I, Dick O (2011) Landslide susceptibility analysis in the Hoa Binh province of Vietnam using statistical index and logistic regression. Nat Hazards 59:1413–1444. doi:10.1007/s11069-011-9844-2
Tien Bui D, Pradhan B, Lofman O, Revhaug I (2012a) Landslide susceptibility assessment in Vietnam using Support vector machines, decision tree and Naïve Bayes models. Math Prob Eng 2012:1–26
Tien Bui D, Pradhan B, Lofman O, Revhaug I, Dick OB (2012b) Application of support vector machines in landslide susceptibility assessment for the Hoa Binh province (Vietnam) with kernel functions analysis. In: Seppelt R, Voinov AA, Lange S, Bankamp D (eds) Proceedings of the iEMSs sixth biennial meeting: international congress on environmental modelling and software (iEMSs 2012). International Environmental Modelling and Software Society, Leipzig
Tien Bui D, Pradhan B, Lofman O, Revhaug I, Dick OB (2012c) Landslide susceptibility assessment in the Hoa Binh province of Vietnam: a comparison of the Levenberg–Marquardt and Bayesian regularized neural networks. Geomorphology 171–172:12–29
Tien Bui D, Pradhan B, Lofman O, Revhaug I, Dick OB (2012d) Landslide susceptibility mapping at Hoa Binh province (Vietnam) using an adaptive neuro-fuzzy inference system and GIS. Comput Geosci 45:199–211. doi:10.1016/j.cageo.2011.10.031
Tien Bui D, Pradhan B, Lofman O, Revhaug I, Dick OB (2012e) Spatial prediction of landslide hazards in Hoa Binh province (Vietnam): a comparative assessment of the efficacy of evidential belief functions and fuzzy logic models. Catena 96:28–40. doi:10.1016/j.catena.2012.04.001
Tien Bui D, Ho TC, Revhaug I, Pradhan B, Nguyen D (2013a) Landslide susceptibility mapping along the National Road 32 of Vietnam using GIS-based J48 decision tree classifier and its ensembles. In: Buchroithner M, Prechtel N, Burghardt D (eds) Cartography from pole to pole. Lecture notes in geoinformation and cartography. Springer, Berlin, pp 303–317. doi:10.1007/978-3-642-32618-9_22
Tien Bui D, Tin DQ, Ha VP, Revhaug I, Lien VN, Ha TT, Hoa LB (2013b) Spatial prediction of landslide hazard along the National Road 32 of Vietnam: a comparison between support vector machines, radial basis function neural networks, and their ensembles. In: Geohazards: impacts and challenges for society development in Asian Countries, 49th CCOP annual session, Sendai, Japan. Geological Survey of Japan, pp 161–171. doi:10.13140/RG.2.1.3073.2327
Tien Bui D, Pradhan B, Revhaug I, Trung Tran C (2014) A comparative assessment between the application of fuzzy unordered rules induction algorithm and J48 decision tree models in spatial prediction of shallow landslides at Lang Son City, Vietnam. In: Srivastava PK, Mukherjee S, Gupta M, Islam T (eds) Remote sensing applications in environmental research, society of earth scientists series. Springer, Cham, pp 87–111. doi:10.1007/978-3-319-05906-8_6
Tien Bui D, Pradhan B, Revhaug I, Nguyen DB, Pham HV, Bui QN (2015) A novel hybrid evidential belief function-based fuzzy logic model in spatial prediction of rainfall-induced shallow landslides in the Lang Son city area (Vietnam). Geomat Nat Hazards Risk 6:243–271. doi:10.1080/19475705.2013.843206
Tien Bui D, Le K-T, Nguyen V, Le H, Revhaug I (2016a) Tropical forest fire susceptibility mapping at the Cat Ba National Park Area, Hai Phong City, Vietnam, using GIS-based Kernel logistic regression. Remote Sens 8:347
Tien Bui D, Nguyen Q-P, Hoang N-D, Klempe H (2016b) A novel fuzzy K-nearest neighbor inference model with differential evolution for spatial prediction of rainfall-induced shallow landslides in a tropical hilly area using GIS. Landslides. doi:10.1007/s10346-016-0708-4
Tien Bui D, Pham TB, Nguyen Q-P, Hoang N-D (2016c) Spatial prediction of rainfall-induced shallow landslides using hybrid integration approach of least squares support vector machines and differential evolution optimization: a case study in Central Vietnam. Int J Digit Earth. doi:10.1080/17538947.2016.1169561
Tien Bui D, Pradhan B, Nampak H, Quang Bui T, Tran Q-A, Nguyen QP (2016d) Hybrid artificial intelligence approach based on neural fuzzy inference model and metaheuristic optimization for flood susceptibility modelling in a high-frequency tropical cyclone area using GIS. J Hydrol 540:317–330. doi:10.1016/j.jhydrol.2016.06.027
Tien Bui D, Tuan TA, Klempe H, Pradhan B, Revhaug I (2016e) Spatial prediction models for shallow landslide hazards: a comparative assessment of the efficacy of support vector machines, artificial neural networks, kernel logistic regression, and logistic model tree. Landslides 13:361–378. doi:10.1007/s10346-015-0557-6
Trawiński K, Cordón O, Quirin A, Sánchez L (2013) Multiobjective genetic classifier selection for random oracles fuzzy rule-based classifier ensembles: how beneficial is the additional diversity? Knowl Based Syst 54:3–21. doi:10.1016/j.knosys.2013.08.006
Trigila A, Iadanza C, Esposito C, Scarascia-Mugnozza G (2015) Comparison of logistic regression and random forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy). Geomorphology. doi:10.1016/j.geomorph.2015.06.001
Tsangaratos P, Ilia I (2015) Landslide susceptibility mapping using a modified decision tree classifier in the Xanthi Perfection, Greece. Landslides. doi:10.1007/s10346-015-0565-6
Van Den Eeckhaut M, Vanwalleghem T, Poesen J, Govers G, Verstraeten G, Vandekerckhove L (2006) Prediction of landslide susceptibility using rare events logistic regression: a case-study in the Flemish Ardennes (Belgium). Geomorphology 76:392–410. doi:10.1016/j.geomorph.2005.12.003
Van Westen CJ, Rengers N, Soeters R (2003) Use of geomorphological information in indirect landslide susceptibility assessment. Nat Hazards 30:399–419
Vergari F, Della Seta M, Del Monte M, Fredi P, Lupia Palmieri E (2011) Landslide susceptibility assessment in the Upper Orcia Valley (Southern Tuscany, Italy) through conditional analysis: a contribution to the unbiased selection of causal factors. Nat Hazards Earth Syst Sci 11:1475–1497
Visser H, de Nijs T (2006) The map comparison kit. Environ Model Softw 21:346–358. doi:10.1016/j.envsoft.2004.11.013
Webb GI (2000) MultiBoosting: a technique for combining boosting and wagging. Mach Learn 40:159–196. doi:10.1023/a:1007659514849
Were K, Tien Bui D, Dick ØB, Singh BR (2015) A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol Indic 52:394–403
Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, Burlington
Yalcin A, Reis S, Aydinoglu AC, Yomralioglu T (2011) A GIS-based comparative study of frequency ratio, analytical hierarchy process, bivariate statistics and logistics regression methods for landslide susceptibility mapping in Trabzon, NE Turkey. Catena 85:274–287. doi:10.1016/j.catena.2011.01.014
Yang Q, Shao J, Scholz M, Plant C (2011) Feature selection methods for characterizing and classifying adaptive Sustainable Flood Retention Basins. Water Res 45:993–1004. doi:10.1016/j.watres.2010.10.006
Yao X, Tham LG, Dai FC (2008) Landslide susceptibility mapping based on support vector machine: a case study on natural slopes of Hong Kong, China. Geomorphology 101:572–582. doi:10.1016/j.geomorph.2008.02.011
Yilmaz I (2009) Landslide susceptibility mapping using frequency ratio, logistic regression, artificial neural networks and their comparison: a case study from Kat landslides (Tokat-Turkey). Comput Geosci 35:1125–1138. doi:10.1016/j.cageo.2008.08.007
Zhang F, Pei X, Chen W, Liu G, Liang S (2014) Spatial variation in geotechnical properties and topographic attributes on the different types of shallow landslides in a loess catchment, China. Eur J Environ Civil Eng 18:470–488. doi:10.1080/19648189.2014.881754
Acknowledgments
This research was supported by the Geographic Information System group, University College of Southeast Norway, Bø i Telemark, Norway.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Cite this article
Tien Bui, D., Ho, TC., Pradhan, B. et al. GIS-based modeling of rainfall-induced landslides using data mining-based functional trees classifier with AdaBoost, Bagging, and MultiBoost ensemble frameworks. Environ Earth Sci 75, 1101 (2016). https://doi.org/10.1007/s12665-016-5919-4
DOI: https://doi.org/10.1007/s12665-016-5919-4