Introduction

Landslides are natural phenomena that pose a severe threat to human life, property, infrastructure, and the natural environment, mostly in mountainous and hilly regions. According to a report prepared during the FP7 SafeLand project (FP7 SafeLand 2012), vast areas of China are classified as regions of high landslide risk, and landslides cause an estimated 700 to 1000 deaths every year and damage to infrastructure and property exceeding 10 billion RMB annually. To reduce and mitigate the devastating consequences of landslides, the Chinese government has taken specific measures since 1999, such as nationwide landslide investigation and risk zoning; detailed mapping of high-risk zones of landslide hazards; stabilization and mitigation of major landslides; weather-based regional landslide hazard warning; geohazard risk assessment for infrastructure construction; and education and training for geohazard mitigation (FP7 SafeLand 2012). With regard to the first two measures, determining the spatial and temporal extent of landslide hazard requires identifying areas that are, or could be, affected by a landslide and estimating the probability of such a landslide occurring within a specified period of time. However, specifying the precise time frame of a future landslide event can be a difficult task. As a result, landslide hazard can be represented by landslide susceptibility if only the predisposing and preparatory landslide variables are considered. A landslide susceptibility map provides the spatial distribution and rating of the terrain according to its propensity to slide, which depends on the topography, geology, geotechnical properties, climate, vegetation, and anthropogenic factors (Fell et al. 2008). According to Guzzetti et al. (2000), a landslide susceptibility map is valuable when the information and data shown are useful, relevant, and fully understood by the user. In this context, the present study produces a landslide susceptibility map for Nancheng County, China, in order to provide local authorities and government agencies with vital information on landslide phenomena for appropriate decision-making and land use planning.

In the last three decades, numerous methods and techniques have been used for landslide susceptibility, hazard, and risk assessment; these methods can be classified as qualitative or quantitative, or as direct or indirect (Tien Bui et al. 2015). Qualitative methods are subjective in nature; they ascertain susceptibility heuristically and mainly involve direct field geomorphological analysis and the use of index or parameter maps (Verstappen 1983; Leroi 1996; Soeters and Van Westen 1996). Quantitative methods, on the other hand, are based on numerical estimates and involve statistical, probabilistic, and data mining methods (Carrara et al. 1991; Van Westen et al. 1997; Castellanos Abella and van Westen 2007; Chowdhury 1976; Baldelli et al. 1996; Lees 1996; Gomez and Kavzoglu 2005). Many studies use bivariate statistics (Magliulo et al. 2008; Yilmaz et al. 2012), as well as multivariate methods that implement discriminant analysis (Lee et al. 2008) or linear and logistic regression (Dai and Lee 2003; Ayalew and Yamagishi 2005; Tsangaratos and Ilia 2016a), frequency ratio (Lee and Pradhan 2006; Yilmaz 2010; Akinci et al. 2011), the certainty factor approach (Lan et al. 2004; Sujatha et al. 2012), and Dempster–Shafer and weight of evidence models (Tangestani 2009; Cervi et al. 2010; Neuhauser et al. 2012; Tien Bui et al. 2012a; Ilia and Tsangaratos 2016; Hong et al. 2016a). In addition, data mining methods have been applied to landslide susceptibility, including fuzzy logic (Ercanoglu and Gokceoglu 2004; Pradhan et al. 2010; Akgun et al. 2012), artificial neural networks (Pradhan and Lee 2009; Tien Bui et al. 2012b; Tsangaratos and Benardos 2014; Tien Bui et al. 2016), neuro–fuzzy (Vahidnia et al. 2010; Sezer et al. 2011; Pradhan 2013), and decision–tree models (Wan 2009; Yeon et al. 2010; Tien Bui et al. 2012c; Tsangaratos 2012; Pradhan 2013; Tsangaratos and Ilia 2016b). Both groups of methods have been applied worldwide, and their performance depends on the availability and quality of data and the scale of analysis.

During a landslide susceptibility assessment, three important aspects should be addressed in order to enhance the predictive power of landslide susceptibility models (Chacón et al. 2006; Irigaray et al. 2007; Costanzo et al. 2012; Tien Bui et al. 2015; Murillo-García et al. 2015): the preparation of a landslide inventory map, the identification of the variables that significantly influence ground surface stability, and the appropriate reclassification of the variables. The preparation of a landslide inventory map is based on a conceptual framework in which the past and present provide evidence for the future, failures do not occur randomly, failures share common geotechnical characteristics, and similar conditions produce similar patterns of failures (Tsangaratos and Koumantakis 2013). The second essential aspect is the identification of the contribution of each variable to the overall susceptibility, which is expressed as a weighting coefficient estimated through procedures specific to each model. Finally, the determination of the classes of each variable is as essential as the estimation of the coefficients, since it can affect the quality of the outcomes of the landslide susceptibility analysis (Chacón et al. 2006; Costanzo et al. 2012).

In this context, the present study attempts to address the above-mentioned aspects by following a novel methodology. Specifically, landslide and non-landslide areas were verified using remote sensing techniques, Google Earth®, and the analysis of high resolution digital elevation models, while the significance of each landslide-related variable was estimated by applying statistical and data mining methods that also produced a series of landslide susceptibility maps. The main novelty of the study, however, is the determination of the number of classes for each landslide-related variable by estimating the information coefficient derived from Shannon’s entropy index. The developed methodology was tested in Nancheng County, China, by applying three different methods: Logistic Regression (LR) and the Weight of Evidence method (WofE) as representatives of statistical methods, and Random Forest (RF) as a representative of data mining techniques. These three methods are considered appropriate since they are suitable for regional and semi-regional scale analysis and exploit both remote sensing datasets and field surveys. The computational process was carried out using R Studio and SPSS 16.0 (SPSS 2007), while ArcGIS 10.1 (ESRI 2013) was used for compiling the data and producing the landslide susceptibility maps.

Materials and methods

Study area

Nancheng County is located in the eastern part of Jiangxi Province and is under the jurisdiction of the prefecture-level city of Fuzhou. The study area lies between eastings 440,000 and 490,000 and northings 3,020,000 and 3,070,000 (Beijing 1954/3-Degree CM 117E as the reference coordinate system), covering an area of about 1698.3 km², with altitude ranging from 50 to 1180 m above sea level (Fig. 1).

Fig. 1 The study area (Beijing 1954/3-Degree CM 117E as the reference coordinate system, suitable for use in China between 115°30′ E and 118°30′ E)

Around 61.57 % of the study area has a slope gradient of less than 15°, whereas areas with a slope gradient larger than 45° account for only 0.39 %. About 25.38 % of the area has a slope gradient between 15° and 25°, while 10.01 % has a slope gradient between 25° and 35°. Dominant features in the area are the Hongmen reservoir and the Fuhe, Xuijian, and Latin rivers, which flow across the research area. The waters of the Fuhe River reach Poyang Lake, which is located north of the Nanchang prefecture of Jiangxi.

The climate of Nancheng County is classified as humid subtropical (Köppen Cfa), with long, humid, very hot summers and cool, drier winters with occasional cold snaps. According to the Jiangxi Province Meteorological Bureau (http://www.weather.org.cn), the mean annual rainfall for the period 1953–2015 ranged between 900.3 and 2866.4 mm. The average annual temperature is 17.8 °C, while the average annual water surface evaporation for the area is estimated to be 1546.7 mm. The rainy season lasts from April to July and accounts for 55.2 % of the yearly rainfall. In May and June, the average rainfall varies between 270 and 305 mm per month.

Concerning the geological setting, more than 22 geologic groups and units are recognized; the data were obtained from the China Geology Survey (http://www.cgs.gov.cn). In the present study, the lithology map (scale 1:200,000) was reconstructed by classifying the geological formations into nine classes, based on clay composition, degree of weathering, and physical and strength parameters (Table 1, Fig. 2). The main lithological unit, which covers approximately 37 % of the area, consists of granite porphyry of Cretaceous age, tuff, ignimbrite, and sandstone gravel (class E), followed by leptynite, schist, and marbles (class F), which cover 24 % of the area, and gray brown granulite, mica schist, and quartz schist (class G), which cover 17 % of the area. The soil profiles of the area have developed mainly through weathering.

Table 1 Types of geological formation of the study area

Fig. 2 The lithology map of the study area

The developed methodology

The methodology followed during the present study could be separated into a four-phase procedure: (a) constructing the inventory map and selecting the appropriate landslide-related variables; (b) the data pre-processing phase; (c) the phase of implementing the various techniques and methods in order to construct the landslide susceptibility map; and (d) the validation and comparison of the models. Figure 3 illustrates the flowchart of the followed methodology, while a brief description of each phase is presented in the paragraphs below.

Fig. 3 Flowchart of the developed methodology

Constructing the inventory map and selecting the landslide-related variables

The first phase of the followed methodology was to construct the landslide and non-landslide inventory database. The database included information about the location, type of failure, and other features of landslide incidence and also the locations of non-landslide areas in order to use them during the training and predictive phase.

Specifically, the landslide inventory database, which included 112 landslide locations, was provided by the Jiangxi Department of Land and Resources (http://www.jxgtt.gov.cn) and the Jiangxi Meteorological Bureau (http://www.weather.org.cn). The database comprised 70 rotational slides and 42 translational slides. The smallest landslide covers approximately 15 m², the largest around 18,000 m², and the average is estimated to be 878.7 m². The large-sized landslides (>1000 m²), which account for only 11.4 % of the total number of landslides, affected 283 people. Around 33.1 % of the landslides are medium-sized (200–1000 m²) and affected 213 people. Small-sized landslides (<200 m²), which affected 374 people, account for 55.4 % of the total and are inventoried on metamorphic rocks (schist, granulite, and marbles), with mass thickness between 3 and 8 m. According to the reports for the Nancheng area, landslides occurred during and after heavy rainfall. Moreover, around 42.7 % of the reported landslides occurred when the measured rainfall was around 95 mm per day, whereas the other landslides occurred when the daily rainfall was larger than 110 mm.

The non-landslide areas were identified using Google Earth® and the analysis of high resolution digital elevation models. Google Earth® provides worldwide coverage of high resolution and very high resolution optical satellite images. Its main advantage is that it can present the images in three dimensions (3D), which makes it an excellent tool for exploiting the satellite images and detecting the non-landslide areas. The areas that are potentially classified as non-landslide areas are characterized by gentle and unchanging morphometric characteristics. The height difference, the steepness and orientation of the slopes, and the absence of concavities and convexities are the main criteria for identifying the non-landslide areas.

Eleven (11) landslide-related variables were selected on the basis of the experience gained from studying landslide phenomena in the wider area, the local geo-environmental conditions, and the availability of sufficient data, namely, lithology, slope, aspect, altitude, topographic wetness index (TWI), sediment transport index (STI), plan curvature, profile curvature, distance to river network, distance to tectonic features, and distance to road network.

In order to extract the layers corresponding to the morphometric landslide-related variables (slope, aspect, altitude, TWI, STI, plan curvature, and profile curvature), a digital elevation model (DEM) with a grid size of 25 m was used, generated from the Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model (ASTER GDEM) Version 2 (http://gdem.ersdac.jspacesystems.or.jp). The ASTER GDEM Version 2, which was released to the public in 2011, is considered the highest resolution DEM among the freely accessible global DEMs, with a spatial resolution of 30 m (Arefi and Reinartz 2011). The product, which covers the entire land surface of the Earth, was developed jointly by the Ministry of Economy, Trade, and Industry of Japan and the United States National Aeronautics and Space Administration. The road and river networks were digitized from 1:50,000 scale topographic maps.

Constructing the training and validation datasets

As proposed by the methodology, the training and validation data sets were randomly produced from the total number of landslide and non-landslide areas. Specifically, by using the subset wizard subroutine of the Geostatistical toolbox (ESRI 2013), the first data set received approximately 70 % of the landslide and non-landslide areas, while the remaining 30 % served as validation data.
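The random 70/30 partition can be illustrated with a short sketch. It is not the ArcGIS subroutine used in the study; it assumes that landslide and non-landslide points are split separately, and the point identifiers, counts, and seeds are hypothetical.

```python
import random

def split_points(points, train_fraction=0.7, seed=42):
    """Randomly split a list of points into training and validation subsets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical point identifiers for the two groups (illustrative counts)
landslide_ids = [("landslide", i) for i in range(286)]
non_landslide_ids = [("non_landslide", i) for i in range(286)]

# Assumption: each group is split on its own so both subsets stay balanced
ls_train, ls_valid = split_points(landslide_ids, seed=1)
nl_train, nl_valid = split_points(non_landslide_ids, seed=2)

training = ls_train + nl_train
validation = ls_valid + nl_valid
print(len(training), len(validation))
```

Splitting each group separately is a design choice of the sketch; a single random draw over all points would also give an approximate 70/30 partition, but with a less controlled class balance.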

Information coefficient—number of classes

The next phase, the pre-processing phase, involves estimating the number of classes that maximizes the information coefficient of each variable, based on Shannon’s entropy index. Shannon’s entropy index is used in information theory as a measure, originally proposed by Claude Shannon, of the entropy, disorder, uncertainty, or information content in strings of text (Shannon 1948). The entropy model introduced by Shannon has been used in several landslide assessments in order to estimate weighting coefficients of landslide-related variables (Yang and Qiao 2009; Yang et al. 2010; Pourghasemi et al. 2012). The model involves the calculation of the density of landslides, as done in bivariate analysis, within each class of each variable. The value of each variable is expressed as an entropy index, which indicates the extent of disorder in the environment. According to Bednarik et al. (2010), the entropy index expresses which variables are the most influential for the evolution of instability. The main difference between the present study and the approaches described in the aforementioned publications is that the information coefficient is estimated repeatedly, each time for a different number of classes, and the number of classes that maximizes the information coefficient of the variable in question is selected. The information coefficient ranges between 0 and 1, with values closer to 0 indicating less information and values closer to 1 indicating more information. The equations used to calculate the information coefficient are given below (Bednarik et al. 2010; Constantin et al. 2011):

$$ {p}_{ij}=\frac{L_{ij}}{A_{ij}} $$
(1)
$$ {P}_{ij}=\frac{p_{ij}}{{\displaystyle \sum_{i=1}^{c_j}{p}_{ij}}} $$
(2)
$$ {H}_j=-{\displaystyle \sum_{i=1}^{c_j}{P}_{ij}{ \log}_2{P}_{ij}} $$
(3)
$$ {H}_{j \max }={ \log}_2{c}_j $$
(4)
$$ {I}_j=\frac{H_{j \max }-{H}_j}{H_{j \max }} $$
(5)

where Lij is the landslide percentage of the ith class of the jth variable, Aij is the area percentage of the ith class of the jth variable, pij is the landslide density of that class, Pij is the probability density of the ith class of the jth variable, cj is the number of classes of the jth variable, Hj is the entropy value of the jth variable, Hjmax is the maximum entropy of the jth variable with cj classes, and Ij is the information coefficient of the jth variable.
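The calculation can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it takes the maximum entropy as the positive quantity log2 of the number of classes, so that the coefficient stays between 0 and 1, and it assumes that a classification with an empty class is reported as non-calculable. The function names and the sample percentages are hypothetical.

```python
import math

def information_coefficient(landslide_pct, area_pct):
    """Information coefficient I_j of one variable from per-class percentages."""
    if len(landslide_pct) != len(area_pct) or len(area_pct) < 2:
        raise ValueError("need matching lists with at least two classes")
    # Eq. (1): landslide density of each class
    density = [l / a for l, a in zip(landslide_pct, area_pct)]
    if any(d == 0 for d in density):
        # Assumption: a class without landslides makes the result non-calculable
        return float("nan")
    total = sum(density)
    prob = [d / total for d in density]                 # Eq. (2)
    entropy = -sum(p * math.log2(p) for p in prob)      # Eq. (3)
    max_entropy = math.log2(len(prob))                  # Eq. (4)
    return (max_entropy - entropy) / max_entropy        # Eq. (5)

def best_class_count(candidates):
    """Return the class count with the highest coefficient, plus all scores.

    candidates maps a class count to (landslide_pct, area_pct) lists.
    """
    scores = {k: information_coefficient(l, a) for k, (l, a) in candidates.items()}
    valid = {k: v for k, v in scores.items() if not math.isnan(v)}
    return max(valid, key=valid.get), scores

# Hypothetical reclassifications of one variable into two and three classes
candidates = {
    2: ([90, 10], [50, 50]),
    3: ([40, 30, 30], [33, 33, 34]),
}
best, scores = best_class_count(candidates)
print(best, scores)
```

A variable whose landslide density is identical in every class yields a coefficient of 0, while a strongly uneven density pushes the coefficient toward 1.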

The reclassification process was performed with the Reclass subroutine using either the Geometrical Intervals or the Natural Breaks method (ESRI 2013). The choice between them depends on the distribution of the data. Specifically, Geometrical Intervals was used for visualizing continuous data and as an alternative to the Natural Breaks classification method. The specific benefit of the Geometrical Intervals method is that it works reasonably well on data that are not normally distributed, particularly on data that are heavily skewed. The Natural Breaks method, on the other hand, is applied to normally distributed data.

Conditional independence and multicollinearity analysis

In order to implement the WofE method, the assumption of conditional independence among the landslide-related variables must hold and the data population of each variable must follow a normal distribution. According to Bonham-Carter (1994), this rough assumption may lead to errors; to overcome this problem, non-parametric statistics can be used, since they do not rely on the assumption of normality. When non-parametric statistics are applied, independence can be tested with the χ² (chi-square) method (Ilia and Tsangaratos 2016).
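A pairwise chi-square check can be sketched as follows. The sketch assumes that each pair of variables is summarized as a contingency table of landslide counts per class combination, which is one common way to set up the test and may differ from the exact procedure of the study; the counts are hypothetical, and the critical value is the tabulated one for 1 degree of freedom at the 0.01 level.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof

# Hypothetical landslide counts for two variables with two classes each
table = [[10, 20],
         [20, 10]]
chi2, dof = chi_square_statistic(table)

critical_value = 6.635  # tabulated chi-square value, 1 degree of freedom, 0.01 level
dependent = chi2 > critical_value
print(round(chi2, 3), dof, dependent)
```

A statistic above the tabulated value indicates that the hypothesis of independence is rejected for that pair of variables.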

The next step is to implement the multicollinearity analysis in order to estimate the correlation among the predictor features (Dormann et al. 2013; Tien Bui et al. 2015). For this purpose, the proposed methodology uses the variance inflation factor (VIF) and tolerance (TOL), two important indexes for multicollinearity analysis (Marquardt 1970; Weisberg and Fox 2010). Although no strict rule exists for interpreting VIF, the most common rule of thumb uses 10 as the threshold for severe multicollinearity, whereas several authors apply stricter thresholds of 2 or 5, above which variables are considered multicollinear and are excluded from the model (O’Brien 2007; Van Den Eeckhaut et al. 2006, 2010; Guns and Vanacker 2012; Tsangaratos and Ilia 2016a). A TOL value smaller than 0.1 indicates serious multicollinearity between independent variables (Menard 2002).
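For reference, the two indexes are commonly defined from the coefficient of determination obtained when the jth variable is regressed on all the other variables. The symbol R_j^2 is introduced here for illustration only and does not appear in the original text:

$$ {\mathrm{TOL}}_j=1-{R}_j^2,\kern1em {\mathrm{VIF}}_j=\frac{1}{{\mathrm{TOL}}_j}=\frac{1}{1-{R}_j^2} $$

Under these definitions, a VIF of 10 corresponds to a TOL of 0.1, so the two thresholds quoted above express the same condition.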

Implementing logistic regression

Logistic Regression is among those statistical methods that have been proved to be highly reliable when performing a landslide susceptibility assessment (Dai et al. 2002; Ayalew and Yamagishi 2005; Yesilnacar and Topal 2005; Gorsevski et al. 2006; Yilmaz 2010; Akgun et al. 2012). The independent variables in this model are considered as predictors of the dependent variable and can be measured on a nominal, ordinal, interval, or ratio scale, while the dependent variable is in a binary format. The relationship between the dependent variable and independent variables is nonlinear (Yesilnacar and Topal 2005).

LR can be regarded as a special case of the generalized linear model; however, its assumptions about the relationship between the dependent and independent variables differ considerably from those of linear regression models. The conditional distribution is a Bernoulli distribution rather than a Gaussian distribution, since the dependent variable is binary (presence or absence of landslides).

In logistic regression analysis, the relationship between the occurrence and its dependency on several variables can be expressed by the following equation (Eq. (6)):

$$ p=\frac{1}{1+{\mathrm{e}}^{-\mathrm{z}}} $$
(6)

where p is the probability of a landslide occurrence. The probability can take values from 0 to 1 on an S-shaped curve and z is the linear combination of a set of landslide-related variables. Logistic regression involves fitting an equation of the following form to the data (Eq. (7)):

$$ z\left({x}_i\right)={b}_0+{b}_1{x}_1+{b}_2{x}_2+\dots +{b}_n{x}_n $$
(7)

where b0 is the intercept of the model, bi (i = 1, 2, ..., n) are the slope coefficients of the logistic regression model, and xi (i = 1, 2, ..., n) are the independent variables. The linear model formed is then a logistic regression of the presence or absence of landslides (present conditions) on the independent variables (pre-failure conditions).

Implementing weight of evidence

WofE is a data-driven approach that is based on the Bayes theorem and on the concepts of prior and posterior probability (Bonham-Carter 1994). There are numerous studies of landslide susceptibility analysis that utilize the WofE method (Lee et al. 2002; Lee et al. 2004; van Westen et al. 2003; Mathew et al. 2007; Bettian and Birgit 2007; Neuhäuser and Terhorst 2007; Poli and Sterlacchini 2007; Dahal et al. 2008; Sharma and Kumar 2008; Barbieri and Cambuli 2009; Ghosh et al. 2009; Tangestani 2009; Cervi et al. 2010; Ilia et al. 2010, 2013; Park 2010; Regmi et al. 2010; Armas 2012; Kayastha et al. 2012; Tien Bui et al. 2012a; Thiery et al. 2014; Kouli et al. 2014; Ilia and Tsangaratos 2016), in which the main objective is to estimate if a given set of independent variables could predict the presence of landslide incidence that is considered as the dependent variable. The method investigates the spatial relationship between the distribution of the areas affected by landslides and the distribution of the landslide-related variables (Ilia et al. 2010; Neuhauser et al. 2012; Ilia and Tsangaratos 2016). A measure of the spatial association between landslide locations and landslide-related variables is provided through the magnitude of contrast (C), which is determined by the difference of positive (W+) and negative (W−) weights. W+ and W− provide information about whether there is a positive or a negative spatial correlation between the landslide-related variables and the landslide locations. When C is positive, it implies positive correlation, and when it is negative, it implies negative spatial association (Bonham-Carter et al. 1989; Agterberg et al. 1990). The studentized value of C is calculated as the ratio of C to its standard deviation stdC, (C/stdC), and serves as a guide to the significance of the spatial association, acting as a measure of the relative certainty of the posterior probability (Bonham-Carter 1994).
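The weights, the contrast, and its studentized value can be sketched for a single class of a variable as follows. The sketch uses the usual log-ratio definitions of the weights and their asymptotic variance approximations; the function name, argument names, and counts are hypothetical and are not taken from the study.

```python
import math

def weights_of_evidence(n_class_ls, n_class_no_ls, n_other_ls, n_other_no_ls):
    """Return W+, W-, contrast C and studentized contrast for one class.

    Arguments are cell (or point) counts:
      n_class_ls     inside the class, with landslide
      n_class_no_ls  inside the class, without landslide
      n_other_ls     outside the class, with landslide
      n_other_no_ls  outside the class, without landslide
    """
    n_ls = n_class_ls + n_other_ls
    n_no_ls = n_class_no_ls + n_other_no_ls
    # W+ compares how often the class occurs with and without landslides
    w_plus = math.log((n_class_ls / n_ls) / (n_class_no_ls / n_no_ls))
    # W- does the same for the absence of the class
    w_minus = math.log((n_other_ls / n_ls) / (n_other_no_ls / n_no_ls))
    contrast = w_plus - w_minus
    # Asymptotic variance approximations of the two weights
    var_plus = 1 / n_class_ls + 1 / n_class_no_ls
    var_minus = 1 / n_other_ls + 1 / n_other_no_ls
    std_c = math.sqrt(var_plus + var_minus)
    return w_plus, w_minus, contrast, contrast / std_c

# Hypothetical counts for one class of one variable
w_plus, w_minus, contrast, studentized = weights_of_evidence(30, 70, 10, 890)
print(round(w_plus, 3), round(w_minus, 3), round(contrast, 3), round(studentized, 2))
```

In this example the class is over-represented among the landslide cells, so W+ is positive, W− is negative, and the contrast is positive.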

Implementing random forest

Random Forest (RF) is an ensemble learning method based on the generation of several classification trees, which are aggregated to produce a classification (Breiman et al. 1984; Breiman 2001). The algorithm uses random binary trees, each built on a subset of the observations obtained through bootstrapping: a random selection of training data is sampled from the original data set and used to build the model, while the data not included are referred to as out-of-bag (OOB) (Breiman 2001). According to Hansen and Salamon (1990), an ensemble method such as RF is more accurate than its individual members only if the data appear random and are diverse. In the case of RF, diversity is achieved by resampling the data with replacement and by randomly changing the predictive factors over the different tree induction processes (Youssef et al. 2015).
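The bootstrapping step and the resulting out-of-bag set can be illustrated with the following sketch. It shows only the resampling, not the tree construction, and the sample size, number of trees, and seed are hypothetical.

```python
import random

def bootstrap_sample(n_observations, rng):
    """Draw one bootstrap sample (with replacement); return in-bag and OOB indices."""
    in_bag = [rng.randrange(n_observations) for _ in range(n_observations)]
    out_of_bag = sorted(set(range(n_observations)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)
n_points = 400   # hypothetical size of the training set
n_trees = 100    # hypothetical number of trees

oob_fractions = []
for _ in range(n_trees):
    in_bag, oob = bootstrap_sample(n_points, rng)
    oob_fractions.append(len(oob) / n_points)

mean_oob_fraction = sum(oob_fractions) / n_trees
print(round(mean_oob_fraction, 3))
```

Because sampling is done with replacement, roughly one third of the observations are left out of each tree, and these observations can be used to estimate the error of that tree without a separate test set.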

One of the main advantages of RF is its resistance to over-fitting: growing a large number of trees does not create a risk of over-fitting, since each tree is a completely independent random experiment. The input data do not need to be rescaled, transformed, or modified. The algorithm is resistant to outliers in the predictors and automatically handles missing values (Breiman and Cutler 2004).

Model validation and comparison

To estimate the performance of the three methods, two statistical evaluation criteria were applied to the training and validation data. The first is the overall accuracy on the training data, which indicates the success rate of the model. The second is the overall accuracy on the validation data, which indicates the predictive power of the model. Both criteria are calculated as the ratio of the true positives plus the true negatives to the total number of data. The validation was carried out using receiver operating characteristic (ROC) curve analysis (Fawcett 2006). The success-rate results were obtained using the landslide grid cells in the training dataset, while the validation dataset was used for the construction of the prediction-rate curves (Chung and Fabbri 2003). The area under the ROC curve (AUC) has been used as a metric to assess the overall quality of the predictive models by evaluating the models' ability to correctly anticipate the occurrence or non-occurrence of predefined events (Hanley and McNeil 1982; Negnevitsky 2002; Fawcett 2006). An AUC close to 1 indicates excellent results, whereas the closer the AUC is to 0.5, the less accurate the result of the analysis.
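Both criteria can be sketched in plain Python. The overall accuracy follows the ratio described above, with a cut-off of 0.5 on the susceptibility score as an assumption of the sketch; the AUC is computed through its pairwise ranking interpretation rather than by tracing the curve. The labels and scores are hypothetical.

```python
def overall_accuracy(labels, scores, threshold=0.5):
    """(TP + TN) / total after thresholding the susceptibility scores."""
    correct = sum((score >= threshold) == bool(label)
                  for label, score in zip(labels, scores))
    return correct / len(labels)

def auc(labels, scores):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical validation points: 1 = landslide, 0 = non-landslide
labels = [1, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.1]
accuracy_value = overall_accuracy(labels, scores)
auc_value = auc(labels, scores)
print(accuracy_value, auc_value)
```

The example shows why the two measures are reported together: the accuracy depends on the chosen cut-off, whereas the AUC only depends on how the points are ranked.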

In addition, the landslide density ratio was calculated as a measure of sufficiency (Can et al. 2005; Pradhan and Lee 2010). A model is more sufficient and accurate when there is an increase in the landslide density ratio when moving from low susceptible classes to high susceptible classes and when the high susceptibility class covers small extent areas.

Results and discussion

Determining the class numbers of the landslide-related variables

Following the procedure described in the methodology, the landslide and non-landslide inventory database was constructed using Google Earth® and the analysis of the high resolution DEM. In order to capture representative information on the landslide-related variables, additional points were added to the 112 landslide locations where necessary, especially where a landslide covered a large area, giving a total of 286 points. An equal number of 286 non-landslide points was identified. By applying the subset wizard subroutine, approximately 70 % of the landslide and non-landslide points were used as training data and the remaining 30 % served as validation data. Figure 4 illustrates the spatial distribution of the landslide and non-landslide areas.

Fig. 4 The spatial distribution of non-landslide and landslide points

The next step was to estimate the number of classes of each landslide-related variable that maximizes the information coefficient based on Shannon’s entropy index. The analysis was performed for two (2) to six (6) classes for each variable, except for lithology, which is a categorical variable. Table 2 provides the information coefficient values for each number of classes.

Table 2 Information coefficients

For the first variable, altitude, the information coefficient has its highest value, 0.3543, when the variable is classified by the Geometrical Interval classification method into two (2) classes. Similarly, slope maximizes the information coefficient when classified into three (3) classes, with a value of 0.1432. Aspect, which was classified with the Natural Break method, also presents its highest information coefficient value (0.1123) when classified into three (3) classes. TWI and STI were classified with the Geometrical Interval classification method and maximize the information coefficient when classified into two (2) classes, with values of 0.0441 and 0.0466, respectively. Plan curvature, distance from river network, and distance from road network maximize their information coefficients when classified into four (4) classes, with values of 0.0115, 0.0627, and 0.0902, respectively, while profile curvature does so when classified into six (6) classes, with an information coefficient of 0.0473. Finally, distance to tectonic features has the highest information coefficient when classified into three (3) classes. Plan curvature and profile curvature were classified using the Natural Break method, while distance from river network, distance from road network, and distance to tectonic features were classified using the Geometrical Interval method. NC stands for non-calculable, meaning that for the given classification, a class of the variable does not contain any landslide incidence. Comparing the information coefficients among the variables, the most informative appears to be altitude, followed by slope and aspect, while the least informative appears to be plan curvature. Figure 5a–j shows the spatial pattern of the classes that maximize the information coefficient for each of the landslide-related variables used in the analysis.

Fig. 5 The landslide-related variables. a Altitude, b slope angle, c aspect, d TWI, e STI, f plan curvature, g profile curvature, h distance to river network, i distance to tectonic features, j distance to road network

Multicollinearity analysis

The VIF and tolerance (TOL) values were estimated by performing multicollinearity analysis (Table 3). According to the results, there was no serious multicollinearity between the independent variables. The smallest TOL was calculated for the plan curvature variable (0.405), which is nevertheless higher than 0.100, the theoretical critical value for evidence of collinearity (Menard 2002). In addition, the VIF values of all the variables are less than 5, a similar theoretical threshold of multicollinearity (O’Brien 2007; Van Den Eeckhaut et al. 2006, 2010; Guns and Vanacker 2012; Tsangaratos and Ilia 2016a).

Table 3 Multicollinearity analysis

Applying logistic regression method

The model fitted to the training dataset was evaluated using the Hosmer-Lemeshow chi-square test, the Cox and Snell R², and the Nagelkerke R², while the classification accuracy percentages for the training set were also calculated. The Hosmer-Lemeshow test showed that the goodness of fit of the equation can be accepted, since the significance of the chi-square is larger than 0.05 (Table 4).

Table 4 The overall statistics of the logistic regression

The overall precision and recall index of the classification is 82.5 %, which is quite acceptable. The logistic function was calculated for all the grid cells of Nancheng County, with zero (0) corresponding to no susceptibility and one (1) to total susceptibility. Based on the estimated coefficients, the logistic regression model is expressed by Eq. (8) as follows:

$$ \mathrm{z}=-5.0789+\left[0.3079*\left(\mathrm{Lithology}\right)\right]+\left[2.0617*\left(\mathrm{Altitude}\right)\right]+\left[0.8983*\left(\mathrm{Slope}\right)\right]+\left[0.0627*\left(\mathrm{Aspect}\right)\right]+\left[-2.0718*(TWI)\right]+\left[2.2334*(STI)\right]+\left[-0.084*\left(\mathrm{Profile}\ \mathrm{Curvature}\right)\right]+\left[-0.2199*\left(\mathrm{Plan}\ \mathrm{Curvature}\right)\right]+\left[0.4515*\left(\mathrm{Distance}\ to\ \mathrm{River}\ \mathrm{Network}\right)\right]+\left[0.2376*\left(\mathrm{Distance}\ to\ \mathrm{Tectonic}\ \mathrm{Features}\right)\right]+\left[-0.4703*\left(\mathrm{Distance}\ to\ \mathrm{Road}\ \mathrm{Network}\right)\right] $$
(8)

In order to predict the probability of landslide occurrence in each grid cell, the probability was calculated from Eq. (6) and the landslide susceptibility map was produced (Fig. 6).

Fig. 6 Landslide susceptibility map produced by the LR method
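The step from Eq. (8) to a per-cell probability can be sketched as follows. The coefficients are those of Eq. (8); the sketch assumes that Eq. (6) is the standard logistic function p = 1 / (1 + e^(-z)), which is not restated in this section. The dictionary keys and the input coding of each variable are hypothetical, since the class values assigned to the variables are defined elsewhere in the paper.

```python
import math

INTERCEPT = -5.0789
# b coefficients copied from Eq. (8); the key names are illustrative only.
COEFFICIENTS = {
    "lithology": 0.3079,
    "altitude": 2.0617,
    "slope": 0.8983,
    "aspect": 0.0627,
    "twi": -2.0718,
    "sti": 2.2334,
    "profile_curvature": -0.084,
    "plan_curvature": -0.2199,
    "distance_to_river_network": 0.4515,
    "distance_to_tectonic_features": 0.2376,
    "distance_to_road_network": -0.4703,
}


def linear_predictor(cell: dict) -> float:
    """Evaluate z of Eq. (8) for one grid cell."""
    return INTERCEPT + sum(b * cell[name] for name, b in COEFFICIENTS.items())


def susceptibility(cell: dict) -> float:
    """Map z to a probability in (0, 1), assuming a standard logistic function."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(cell)))


# Hypothetical cell whose variables all take the value 0, so z equals the intercept.
baseline_cell = {name: 0.0 for name in COEFFICIENTS}
print(susceptibility(baseline_cell))
```

With this form, a positive coefficient raises the probability as the variable value increases and a negative coefficient lowers it.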

The conditional variables lithology, altitude, slope, aspect, STI, distance to river network, and distance to tectonic features affect the LR function positively; the highest b coefficients in Eq. (8) belong to STI and altitude (2.2334 and 2.0617, respectively). TWI, profile and plan curvature, and distance to road network have a negative effect on landslide occurrence, as they have negative b coefficients. Visual analysis of the landslide susceptibility map shows that high and very high susceptibility zones are located in the western and eastern mountainous areas, while the central area is characterized by very low to low susceptibility values. The spatial pattern of landslide susceptibility clearly follows the distribution of elevation and slope in the study area, since lowlands are characterized by very low to low susceptibility values. One can also observe a strong association between the lithological coverage and the landslide susceptibility values.

Applying weight of evidence method

Conditional independence among the landslide-related variables was assessed with the chi-squared test. Table 5 shows the results of the chi-squared test on the observed and expected distributions of landslide occurrence, based on posterior probabilities calculated using the 11 variables. The theoretical χ² values are presented in brackets. Out of a total of 55 pairwise comparisons, 14 conditional dependences were identified at the 0.01 significance level and varying degrees of freedom. Specifically, TWI showed six (6) conditional dependences, while distance to road network showed four (4). Despite the observed conditional dependence among some of the variables, it was decided to proceed with the analysis in order to compare the three models under the same settings.

Table 5 Test of conditional independence
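The count of 55 comparisons follows from pairing the 11 variables, and each comparison rests on a chi-squared statistic. The sketch below illustrates both points with a generic Pearson chi-squared statistic for a contingency table. It is not the exact procedure used to build Table 5, and the example table is hypothetical.

```python
from itertools import combinations

VARIABLES = [
    "lithology", "altitude", "slope", "aspect", "TWI", "STI",
    "profile curvature", "plan curvature", "distance to river network",
    "distance to tectonic features", "distance to road network",
]

# Every unordered pair of the 11 variables is tested once.
pairs = list(combinations(VARIABLES, 2))
print(len(pairs))  # 55


def chi_square_statistic(observed):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical 2 x 2 table of landslide counts for two binary patterns.
stat, dof = chi_square_statistic([[30, 10], [10, 30]])
print(stat, dof)
```

A pair would be flagged as conditionally dependent when the statistic exceeds the theoretical χ² value for the corresponding degrees of freedom, for example 6.635 for one degree of freedom at the 0.01 level.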

The next phase was to calculate the weights of the landslide-related variables according to the WofE method. Table 6 provides the C values that are used to construct the landslide susceptibility map through an aggregated weighted method, together with the stdC and C/stdC values. Ranking the positive spatial correlation between the classes of the landslide-related variables and the landslide locations, areas with an elevation greater than 131 m exhibit the highest C value (1.6230), followed by areas with TWI values less than 5.85 (1.6179). Concerning the lithological formations of the research area, the Wan Yuan group, which consists of granulite and mica-quartz schist, has the highest C value (1.4038), while areas with an orientation between 109° and 228° have moderate C values (1.0003).

Table 6 The weights of the landslide-related variables based on the WofE method
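The quantities reported in Table 6 can be illustrated with a minimal sketch of the commonly used weight-of-evidence formulas, in which the contrast C is the difference between the positive and negative weights and stdC is derived from the variances of the two weights. The sketch assumes this standard formulation, which may differ in detail from the one given in the methodology section; the function name and all cell counts are hypothetical.

```python
import math


def wofe_contrast(ls_in, ls_out, no_ls_in, no_ls_out):
    """Return (W+, W-, C, stdC, C/stdC) for one class of one variable.

    ls_in / ls_out: landslide cells inside / outside the class.
    no_ls_in / no_ls_out: non-landslide cells inside / outside the class.
    All four counts are assumed to be strictly positive.
    """
    w_plus = math.log((ls_in / (ls_in + ls_out)) / (no_ls_in / (no_ls_in + no_ls_out)))
    w_minus = math.log((ls_out / (ls_in + ls_out)) / (no_ls_out / (no_ls_in + no_ls_out)))
    contrast = w_plus - w_minus
    # Variances of the two weights are approximated by reciprocal cell counts.
    std_c = math.sqrt(1 / ls_in + 1 / no_ls_in + 1 / ls_out + 1 / no_ls_out)
    return w_plus, w_minus, contrast, std_c, contrast / std_c


# Hypothetical counts: 60 of 100 landslide cells and 2000 of 10000 stable cells in the class.
w_plus, w_minus, c, std_c, studentized = wofe_contrast(60, 40, 2000, 8000)
print(round(c, 4), round(std_c, 4), round(studentized, 2))
```

A positive C indicates that landslides are over-represented in the class, while the ratio C/stdC gives a rough indication of how reliable that contrast is.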

Figure 7 shows the landslide susceptibility map constructed by the WofE method. The pattern of landslide susceptibility values is similar to that of the LR method, with high and very high susceptibility areas located in the western and eastern mountainous areas, while the central area is characterized by moderate to low susceptibility values.

Fig. 7 The landslide susceptibility map produced by the WofE method

Applying random forest method

To implement the RF method successfully, it is necessary to estimate the minimum number of trees required to minimize the Out-Of-Bag (OOB) error, as well as the number of variables randomly sampled as candidates at each split. As illustrated in Fig. 8a, the OOB error (black line) fluctuates less when the number of trees exceeds 800, while Fig. 8b gives the results of the tuning process concerning the number of variables used in each split. It was decided to train the RF model using two (2) random variables at each split and 1000 trees.

Fig. 8 a OOB error vs number of trees, b number of variables used in each split (mtry)

After the training phase ended, additional information was obtained about the influence each variable has on the overall landslide susceptibility analysis performed by the RF method. Specifically, Fig. 9 illustrates the 11 variables ordered by the mean decrease in accuracy and the mean decrease in Gini. The mean decrease in the Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest model, while the mean decrease in accuracy caused by a variable is determined during the OOB error calculation phase. The more the accuracy of the random forest decreases due to the exclusion of a variable, the more important that variable is assumed to be; thus, variables with a large mean decrease in accuracy are more important. According to these two metrics, the most important variable is altitude, followed by TWI and lithology.

Fig. 9 Mean decrease accuracy and mean decrease Gini
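To make the Gini-based measure more concrete, the sketch below computes the decrease in Gini impurity produced by a single binary split. In a random forest, the mean decrease in Gini for a variable aggregates such decreases over all splits that use the variable across all trees; the sketch covers only the single-split building block. The function names and example labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = landslide, 0 = non-landslide)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with two landslide and two non-landslide samples.
parent = [1, 1, 0, 0]
print(gini_decrease(parent, [1, 1], [0, 0]))  # 0.5, a perfectly separating split
print(gini_decrease(parent, [1, 0], [1, 0]))  # 0.0, an uninformative split
```

A variable that repeatedly produces large decreases yields more homogeneous nodes and therefore ranks higher on the mean decrease Gini axis.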

Table 7 reports how often each variable was used during the training phase. The most used variables were lithology (13.46 %), plan curvature (13.22 %), and distance to river network (12.93 %), followed by distance to road network (12.56 %), aspect (11.27 %), and distance to tectonic features (10.25 %). Profile curvature (9.77 %), TWI (5.61 %), slope (5.02 %), and altitude (4.91 %) were used less often, while STI accounts for only 0.89 % of the total number of times the variables were used.

Table 7 Number of times variables were used when applying RF

Figure 10 illustrates the landslide susceptibility map constructed according to the RF method. Visual analysis of the map suggests that susceptibility follows the pattern of altitude, lithology, and distance to river network. High and very high susceptibility zones are located along the road network, mainly in the western and eastern mountainous areas, while the central area is characterized by very low to low susceptibility values.

Fig. 10 The landslide susceptibility map produced by the RF method

The implementation of the three models provided insights into the influence of the landslide-related variables on predicting the stability condition of the research area. Specifically, RF and WofE considered altitude, lithology, and TWI to be the most important variables. LR identified altitude and lithology as affecting the LR function positively, while TWI affects it negatively. Altitude can be considered a variable that contributes indirectly to slope failure (Dai et al. 2002). The elevation of a surface is formed by the combined action of tectonic activity, weathering, and erosion processes and is also related to climatic conditions through a complex interactive influence. The analysis performed in our study showed that areas with elevations greater than 131 m have a considerably higher chance of landslide occurrence. Regarding lithology, a high percentage of small-sized landslides was observed in areas covered by metamorphic rocks (schists, granulite, and marbles) with a mass thickness between 3 and 8 m. However, the scale of the available lithological map makes it difficult to distinguish the overlying lithology in more detail. Quaternary deposits that may be present are not mapped, and thus the types of landslides observed are not associated with those geological formations. This issue should be addressed as a key research point in the near future. Finally, concerning the TWI, an index that describes the effect of topography on the location and size of saturated source areas of runoff generation (Moore et al. 1991), the analysis revealed that less saturated areas exhibit higher landslide susceptibility.

Validation and comparison

The next phase of the methodology was to estimate the relative distribution of the landslide susceptibility zones and the landslide density for each of the three methods. All models showed an increasing landslide density ratio when moving from low to high susceptibility classes (Fig. 11). However, in the very high susceptibility zone, the WofE method showed the highest density (0.7740), followed by the LR method (0.6739) and the RF method (0.4284). The percentage of landslides found in the very high susceptibility zone for WofE, LR, and RF was estimated to be 77.97, 67.13, and 41.25 %, respectively, while the percentage of the area classified as very high susceptibility according to WofE, LR, and RF was estimated to be 28.64, 19.80, and 18.46 % of the total research area, respectively.

Fig. 11 Bar graphs showing the relative distribution of landslide susceptibility zones and landslide density

According to the methodology, the three methods were validated by calculating their success and predictive power on the basis of the training and validation datasets, respectively. Figure 12 illustrates the ROC curves, whose area under the curve (AUC) expresses each model’s ability to correctly anticipate the occurrence or non-occurrence of landslides. The highest training AUC value was obtained by the RF method (0.9350), followed by the WofE (0.9255) and LR (0.9097) methods. The highest prediction-rate AUC value (0.9220) was again achieved by the RF method, followed by the WofE (0.9090) and LR (0.8940) methods.

Fig. 12 Success and prediction rate curves for the LR (a), RF (b), and WofE (c) methods
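The AUC values reported above can be interpreted as the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell. The sketch below computes the AUC directly from this pairwise definition; applying it to the training set would give a success-rate value and applying it to the validation set a prediction-rate value. The function name, scores, and labels are hypothetical, and the sketch is not the software used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both landslide and non-landslide samples are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical susceptibility scores for two landslide and two non-landslide cells.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for illustration; rank-based formulas give the same value more efficiently on large grids.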

Our findings are consistent with the results of similar comparative studies. More specifically, in a landslide susceptibility analysis in Rio de Janeiro, Brazil, Esposito et al. (2014) compared an RF model with an LR model and found that the RF model showed higher accuracy, with AUC values estimated to be 0.81 and 0.72, respectively. Similar results were reported in a comparative landslide susceptibility study that identified the RF model as more accurate than an LR model and a frequency ratio model (Trigila et al. 2015). Also, Goetz et al. (2015) reported that the RF model performed slightly better than LR, WofE, and other advanced data mining techniques in a landslide susceptibility assessment conducted in Austria. In contrast to the above studies, a landslide susceptibility analysis carried out in Lianhua County, China, reported poor performance of the RF model when compared with a data-driven evidential belief function, a frequency ratio, and an LR model (Hong et al. 2016b).

In any case, understanding the capabilities and limitations of each method remains critical for selecting the most accurate model (Goetz et al. 2015). The b coefficients of the LR function provide an estimate of the role each variable plays in explaining the presence of landslides; however, they do not provide information about the relative priorities or importance among the predictive variables. In the WofE method, the assumption of conditional independence among the landslide-related variables must be valid, while the data population of each variable must have a normal distribution. On the other hand, RF has several advantages: it does not require assumptions about the distribution of the explanatory variables, it allows the use of either categorical or numerical variables, it accounts for interactions and nonlinearities among variables, and it provides information about the influence of each variable on the overall result (Catani et al. 2013; Pourghasemi and Kerle 2015).

Conclusions

This study presents a novel methodology in which Shannon’s entropy model was used for classifying landslide-related variables in order to produce landslide susceptibility maps. Specifically, Shannon’s entropy model was used to determine the classes that maximize the information coefficient for each variable. The methodology was implemented in Nancheng County, China, using three quantitative methods (logistic regression, weight of evidence, and random forest) and was based on the analysis of eleven (11) conditional variables, namely, lithology, altitude, slope, aspect, topographic wetness index, sediment transport index, plan curvature, profile curvature, distance to river network, distance to tectonic features, and distance to road network.

According to the results, each model performed satisfactorily, though the RF model performed slightly better in terms of prediction-rate AUC (0.9220) than the WofE (0.9090) and LR (0.8940) models. The same pattern was observed when the success rate of the models was calculated: RF outperformed WofE and LR, with a success-rate AUC of 0.9350 compared with 0.9255 for WofE and 0.9097 for LR. Visual inspection of the landslide susceptibility maps shows that the most susceptible areas are located in the western and eastern mountainous areas, while the central area is characterized by moderate to low susceptibility values.