Keywords

1 Introduction

Landslides occur in a variety of environments, characterized by either steep or gentle slope gradients, from mountain ranges to coastal cliffs or even underwater. Gravity is the primary driving force, but other factors affecting slope stability create the specific conditions that make a slope prone to failure. In many cases the landslide is triggered by a specific event, such as heavy rainfall, an earthquake or a slope cut made to build a road. Earthquake-induced landslides are among the most catastrophic effects of earthquakes, as evidenced by many historic events over the past decades, especially in countries with high seismicity [1]. For example, the 1994 Northridge earthquake triggered more than 11,000 landslides, and in the 2008 Wenchuan earthquake in China the Tangjiashan landslide, a mass movement of over 20.37 million m3, blocked the main river channel and formed a landslide dam, putting millions of people downstream at risk. According to Jibson et al. [18], there are cases where the consequences of landslides triggered by an earthquake have a massive impact on human lives and facilities. The correlation of the pattern of coseismic landslides (COLA) with geological and topographical variables, i.e. lithology, slope angle and slope aspect, and with the volume of landslides has been investigated by several researchers. The density of the mapped landslides is normally associated with the intensity of the seismic shaking. In particular, it has been shown that landslide frequencies are higher in areas of highest Peak Ground Acceleration (PGA) and that landslide density decays with epicentral or fault distance [2, 20, 25].

Pinpointing the areas that are most vulnerable to coseismic landslides is vital in order to take action in time and reduce the risk in those areas. Realistic prediction of COLA is crucial for the design of key infrastructure and for the protection of human lives in seismically active regions. Among the many existing methods for landslide assessment, the Newmark sliding mass model has been extensively utilized to estimate earthquake-induced displacements in slopes, earth dams and landfills since the 1960s. As technology developed, new methods and techniques were proposed for assessing the degree of danger within an area. Such instruments include satellite imagery and Geographic Information System (GIS) technology, especially the statistical analysis of geo-environmental and seismologic factors within GIS software [29]. In particular, the characteristics of the landslide area are statistically related to controlling factors such as topographic, geologic and seismic parameters, e.g. slope angle, slope aspect, curvature, lithology, PGA, seismic intensity distribution and distance from the seismic fault or epicenter [27]. These correlations can provide crucial information for seismic landslide hazard analysis and for planning mitigation measures in regions prone to earthquake-induced landslides [6, 7, 18]. The coupling effect between topography and soil amplification leads to complex wave propagation patterns, due to scattering and diffraction of waves within the low-velocity near-surface layers. These ground motion effects have a significant impact on COLA assessment, but only limited efforts have considered them in empirical models. There is a clear need to develop innovative numerical schemes to address the above challenges.

In the existing literature there is a lack of models that can predict the severity of coseismic landslides using only the slope angle, slope aspect and geological form of a specific area. This work is an extended version of the authors' previous research [31]. It is very common to use more features for the forecasting, features that derive from the use of expensive equipment. Thus, it is essential to develop a model that does not depend on large financial resources yet remains equally effective. Being able to predict the severity of an upcoming landslide after an earthquake could be extremely beneficial for the effective treatment of disastrous consequences. The development of such a model could assist risk management organizations, public agencies, stakeholders and even governments to better distribute staff and financial resources to each area in order to confront the potential consequences, or even to develop appropriate mitigation plans, increasing the resilience of the community.

The statistical analysis of geo-environmental and seismologic factors is performed with bivariate and multivariate approaches. The purpose of this study is to propose a hybrid algorithm that can relate three main factors of COLA (slope, aspect and geological form) to their severity. The proposed hybrid model uses Fuzzy C-Means (FCM) clustering [3, 12], Ensemble Adaptive Boosting (ENAB) and Ensemble Subspace k-Nearest Neighbor (ES_k-NN) classifiers [10, 15]. The existing literature, e.g. [23, 32], does not exploit the combination of the above algorithms. Current methods identify COLA after an earthquake by considering optical imagery; they are too slow to effectively inform emergency response activities. There is a need for a fast and flexible model that considers more affordable factors. Moreover, all current approaches use crisp values for the determination of the involved features, which could lead to misclassification of COLA for values close to the borderline.

2 Area of Research

Lefkada is a Greek island in the Ionian Sea on the west coast of Greece, connected to the mainland by a long causeway and a floating bridge. Lefkada measures 35 km from north to south and 15 km from east to west. The area of the island is about 302 km2, while the area of the municipality (including the islands Kalamos, Kastos and several smaller islets) is 333.58 km2. The basic geological forms of the island are: a) a carbonate sequence of the Ionian zone; b) limestone of the Paxos (Apulia) zone, restricted to the SW peninsula of the island; c) a few outcrops of Ionian flysch (turbidites) and Miocene marls-sandstones, mainly in the northern part of the island [9]. The geological zones of “Ionian” and “Paxos” are separated by a boundary oriented in the NW-SE direction, which projects onshore in south-central Lefkada, near “Hortata” village, in the form of a thrust fault buried by scree and late Quaternary deposits [28]. Pleistocene and especially Holocene coastal deposits are spread over the northern edge of Lefkada, where its capital is located, in the valley of “Vassiliki” and on the coast of “Nydri”. Due to its location in the Ionian Sea and to the complex crustal deformation resulting from the subduction of the African Plate towards the NE and the continental collision with the Apulian platform further to the northwest, Lefkada is one of the most tectonically active areas in Europe [14, 16]. The principal active tectonic structure is the 140 km long dextral strike-slip Cephalonia-Lefkada transform fault (CTF) [24], with a GPS slip rate bracketed between 10 and 25 mm/yr. Most of the slope failure cases have been reported on the western part of the island, which owes its steep morphology to this offshore CTF and to its onshore sub-parallel fault, the “Athani-Dragano” fault [9]. The latter is a NNE-SSW striking fault forming a narrow elongated continental basin, precisely depicted in the region's morphology and indicated on satellite images and aerial photos.

There is a thorough and detailed record of at least 23 events with a crucial impact on the ground of Lefkada [26]. A first conclusion drawn from these events is that earthquakes occur in pairs (twin or cluster events), with the time interval between them ranging from 2 months to 5 years, e.g. 1612–1613 (16 months); 1625–1630 (5 years); 1722–1723 (10 months); 1767–1769 (2 years); 1783–1783 (2 months, possible aftershock); 1867–1869 (2 years); 1914–1915 (2 months); 1948–1948 (2 months). Therefore, it is of great importance to pinpoint the location of coseismic landslides, since this will help to reduce the hazards and increase the resilience of the island.

2.1 Coseismic Landslides at the Island of Lefkada

The most recent and best-examined earthquakes are those of 2003 and 2015. The 2003 earthquake caused massive slope failures on the western part of the island, and the amount of debris material that arose was remarkably larger than that of 2015. Numerous landslides occurred over the whole island, and especially in the northwestern and central areas, on both natural and cut slopes as well as on downstream road embankment slopes. Among the most indicative failures were rock falls with block diameters of up to 4 m, observed along the 6 km long road of “Tsoukalades-Agios Nikitas”, which lies within the epicentral area, accompanied by gravel, small rock and soil slides [26]. The frequent occurrence of these failures led to the closure of the road network for more than two years. The reported rock falls followed the trace of a 300 m high morphological scarp and, especially, a 10–40 m high artificial slope [26].

Regarding the 2015 earthquake, the dominant geological effects were related to slope failures, i.e. rock falls and slides, and shallow and deep-seated landslides on both natural and cut slopes [28]. These failures were documented on the western part of the island, while the densest concentration of these phenomena was reported in the coastal zone from “Porto Katsiki” to “Egremnoi-Gialos” beach and along the 6 km long coastal road of “Tsoukalades-Agios Nikitas” [28]. Shallow landslides and rock slides were mainly generated in areas where clastic material covered the bedrock, and particularly in places where the rock mass was heavily jointed. Deep-seated landslides were mainly documented in the area of “Egremnoi” [29], where a large amount of debris material moved downslope, inducing severe damage to the road network and to residential houses. The debris consists of coarse-grained material with large-diameter gravel and a few boulders.

In order to investigate the earthquake-induced landslide density, event-based inventories were developed using aerial and satellite imagery in Google Earth to enrich and update the existing landslide datasets previously compiled for the two earthquakes [27]. Google Earth imagery of June 12, 2003 and December 19, 2005 was used for mapping the 2003 earthquake landslides, and imagery of November 15, 2013 and April 15, 2016 for the 2015 earthquake. Landslide activity along the western part of Lefkada is considered minimal between major earthquakes, as observed on multi-date satellite imagery and confirmed by local residents. The short period between each satellite imagery pair (2–3 years) is therefore believed to include only the COLA, with very few, if any, landslides triggered by other factors. In total, 301 and 596 coseismic landslides were mapped for the 2003 and 2015 earthquakes, respectively. For the extraction of the morphological and terrain parameters of the compiled landslide datasets, a detailed Digital Elevation Model (DEM) with a spatial resolution of 5 m was used. The 5 m DEM was obtained from the Hellenic Cadastre; it was extracted from aerial imagery stereo-pairs and has a vertical accuracy of 4 m [29].

Having completed the polygon-based inventories, a statistical analysis of the landslide distribution took place. In total, 596 and 301 landslides were identified, covering a planar area of 1.29 km2 and 1.6 km2 for the 2015 and 2003 events, respectively. These planar-oriented areas are obtained as projected measurements. The minimum and maximum landslide areas were 40.4 m2 and 42,940 m2 for the 2015 earthquake, while for the 2003 event the corresponding values were 129.8 m2 and 98,300 m2 [29]. When the DEM is used for the delineation of the landslide surfaces, the minimum and maximum landslide areas for 2015 become 51.30 m2 and 58,330 m2, over a total area of 1.78 km2, while the corresponding values for the 2003 earthquake are 140.9 m2 and 148,469 m2 over a total area of 2.28 km2 [29].

3 Dataset Pre-processing

The features related to COLA were the numeric Id, Planar Area, Area, Average Slope and Average Aspect, and the nominal Geologic Form. Given that several landslides span two or more geological forms, the coseismic landslides were reassigned, which resulted in 421 instances for 2003 and 767 for 2015. The features are the same for both years, so the same data preprocessing was applied to both datasets; nonetheless, two additional geological forms were observed in the 2003 COLA compared to those of 2015. The data preprocessing for the two years was carried out independently. This allows an overall evaluation of the proposed algorithm that demonstrates its consistency and efficiency regardless of the year. Data handling followed 3 steps. The 1st was the manual processing of the Average Slope, Average Aspect and Geological Form. The 2nd and 3rd steps were performed by developing novel code in Matlab R2019a.

It was observed during the experiments that some landslides with the same severity appear to have quite different slopes. For this reason, the natural logarithm ln(x) was applied to the slope values in order to smooth out any spikes. Regarding the average aspect, the initial approach was to transform it from nominal to numeric on a scale from 1 to 8, according to Fig. 1.

Fig. 1.
figure 1

Aspect and corresponding degrees

Nonetheless, it was considered more efficient to distinguish landslides with an aspect of, e.g., 10° from those with an aspect of 350° (Fig. 1b). For this purpose, the actual aspect of each landslide, in degrees, was used.
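As an illustration, a minimal Matlab sketch of these two preprocessing operations is given below; the file, table and variable names (e.g. AverageSlope, AverageAspect) are hypothetical and do not reproduce the actual scripts of this work.

% Hypothetical input table with one row per coseismic landslide
T = readtable('cola_2015.csv');          % assumed file name

% Smooth out spikes in the slope values with the natural logarithm
T.LnAvgSlope = log(T.AverageSlope);      % slopes are positive, in degrees

% Keep the actual aspect in degrees (0-360) instead of the 1-8 scale,
% so that landslides facing e.g. 10 deg and 350 deg remain distinct
T.AspectDeg = mod(T.AverageAspect, 360);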

3.1 Labeling Geological Forms

The Geological Form feature was transformed to numeric values, on a scale from 1 to 20 for 2003 and from 1 to 18 for 2015. Table 1 presents the numeric label of each form.

Table 1. Geological form types for 2003 and 2015
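As a minimal sketch of this labeling step (the actual labels of Table 1 were assigned manually), the nominal geological forms could be converted to integer codes as follows; the variable names are assumptions.

% Convert the nominal Geologic Form into integer labels 1..K
% (K = 20 for the 2003 dataset and K = 18 for the 2015 dataset)
[geoLabel, geoNames] = grp2idx(categorical(T.GeologicForm));
T.GeoFormLabel = geoLabel;               % numeric feature used by the classifiers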

3.2 Fuzzy C-Means Clustering of Landslides

A statistical analysis of the datasets showed that the Area and Planar Area values had a high standard deviation and a non-representative mean value. FCM is the fuzzy counterpart of the “hard” k-means clustering algorithm. It was employed because it allows an instance to be partially assigned to more than one cluster, with different degrees of membership. It was performed on the features Area and Planar Area in order to provide the labels required for the development of the hybrid machine learning model [3, 4, 12]. Fuzzy partitioning is carried out through an iterative optimization of the objective function (1), with the update of the memberships uij and the cluster centers cj given by (2):

$$ J_{m} = \sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{C} {u_{ij}^{m} \left\| {x_{i} - c_{j} } \right\|^{2} ,} } 1 \le m < \infty $$
(1)
$$ u_{ij} = \frac{1}{{\sum\limits_{k = 1}^{C} {\left( {\frac{{\left\| {x_{i} - c_{j} } \right\|}}{{\left\| {x_{i} - c_{k} } \right\|}}} \right)^{{\frac{2}{m - 1}}} } }},\quad c_{j} = \frac{{\sum\limits_{i = 1}^{N} {u_{ij}^{m} \cdot x_{i} } }}{{\sum\limits_{i = 1}^{N} {u_{ij}^{m} } }} $$
(2)

where m is the fuzzifier (determining the level of cluster fuzziness), uij is the degree of membership of xi in cluster j, xi is the ith d-dimensional measured datum, cj is the d-dimensional center of the cluster and ||·|| is any norm expressing the similarity between the measured data and the center. The iterations stop when \(\max_{ij} \left\{ {\left| {u_{ij}^{(k + 1)} - u_{ij}^{(k)} } \right|} \right\} < \varepsilon\), where ε is a termination criterion between 0 and 1 and k indicates the iteration step [21, 22].

The hyperparameters were set as follows. For large values of m the corresponding membership degrees uij become smaller, implying fuzzier cluster boundaries; the value of m was chosen (after trial and error) to be equal to 2. The parameter maxIterations (assigned the default value 100) is the maximum number of optimization iterations and minImprovement (assigned the default value 0.00001) is the minimum improvement of the objective function per iteration.

The developed FCM.m script in Matlab transfers the content of the data files into Matlab tables for further processing and applies the FCM algorithm. The FCM.m script creates the clusters of the COLA according to their severity and assigns the labels. The number of clusters was chosen to be 6. The linguistic label of each cluster is a combination of the three potential states (Low, Medium, High) of the Planar Area and the Area (Tables 2 and 3).

Table 2. Clusters with their corresponding labels (planar area, area)
Table 3. Respective landslides for each cluster for 2003 and 2015

The FCM.m script is presented in the form of natural language, in Algorithm 1.

figure a
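A minimal sketch of this clustering step, using the fcm function of the Matlab Fuzzy Logic Toolbox with the hyperparameters reported above, is shown below; the variable names are assumptions and the sketch does not reproduce the full FCM.m script.

% Severity-related features used for the fuzzy clustering
X = [T.PlanarArea, T.Area];

% options = [fuzzifier m, maxIterations, minImprovement, verbosity]
opts = [2, 100, 1e-5, 0];
nClusters = 6;

[centers, U] = fcm(X, nClusters, opts);  % U: nClusters-by-N membership degrees

% Dominant cluster of each landslide and its membership value (MEV),
% which is reused by the T-norm re-clustering step of Sect. 3.3
[maxU, clusterId] = max(U, [], 1);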
Fig. 2.
figure 2

Clusters for original data for 2003 (2a) and 2015 (2b) respectively.

3.3 Fuzzy Clustering with FCM Algorithm and T-Norm

Some instances had similar membership values (MEV) for two clusters, so the creation of new clusters proved imperative. Landslides whose MEV to their dominant class was below a certain threshold were re-sorted. To this end, each factor was assigned a weight [5, 33] based on Eq. (3), which implements a fuzzy conjunction \(f(\mu_{i} ,w_{i} )\) between several fuzzy sets; each MEV μi is assigned a weight wi.

$$ \mu_{S} (x_{i} ) = Agg\left( {f\left( {\mu_{{A_{1} }} (x_{i} ),w_{1} } \right),f\left( {\mu_{{A_{2} }} (x_{i} ),w_{2} } \right), \ldots ,f\left( {\mu_{{A_{n} }} (x_{i} ),w_{n} } \right)} \right) $$
(3)

where i = 1, 2, …, k. The function f is defined as \(f(a,w) = a^{\frac{1}{w}}\), where a is the MEV [13, 33]. The Hamacher operator (4) was used as the aggregation function:

$$ A \cap B = \frac{{\mu_{A} (x) + \mu_{B} (x) - 2\mu_{A} (x)\mu_{B} (x)}}{{1 - \mu_{A} (x)\mu_{B} (x)}} $$
(4)

The script T-Norm_Clustering.m was developed in Matlab to apply the FCM with the Hamacher fuzzy conjunction. It is presented in natural language form in Algorithm 2.

figure b
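The following is a minimal sketch of the re-clustering idea of Eqs. (3) and (4), continuing from the FCM sketch above; the membership threshold and the factor weights are illustrative assumptions, and the naming of the intermediate clusters simply follows the convention described in the text (e.g. a cluster between 4 and 5 is named 4.5).

f   = @(a, w) a.^(1 ./ w);                        % weighted MEV, Eq. (3)
agg = @(a, b) (a + b - 2*a.*b) ./ (1 - a.*b);     % Hamacher aggregation, Eq. (4)

threshold = 0.6;                                  % assumed re-sorting threshold
w = [1, 1];                                       % assumed factor weights

newCluster = clusterId;                           % start from the FCM result
aggMEV = maxU;                                    % membership to the final cluster
for i = find(maxU < threshold)                    % ambiguous instances only
    [mu, idx] = maxk(U(:, i), 2);                 % two strongest memberships
    newCluster(i) = mean(idx);                    % e.g. between 4 and 5 -> 4.5
    aggMEV(i) = agg(f(mu(1), w(1)), f(mu(2), w(2)));
end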

The T-Norm_Clustering.m script created new clusters between the already existing ones for both 2003 and 2015: four new clusters were created for 2003 and five for 2015. If a cluster is created between clusters 4 and 5, its name is 4.5. The result of the clustering with FCM and the Hamacher operator is presented in Fig. 3a and b. The T-norm step refined the assignment of each instance according to its severity. Table 4 presents the exact number of instances that correspond to each cluster.

Fig. 3.
figure 3

Clusters for original data for 2003 (3a) and 2015 (3b) after applying T-Norm

Table 4. Total landslides of each cluster for 2003 and 2015 after T-Norm

4 Classification Methodology

Following the clustering of the coseismic landslides using the features Planar Area and Area, a classification based on 3 factors was performed. The three independent variables used for the classification are the Average Slope, the Average Aspect and the Geological Form of the landslides, labeled as indicated in Table 1. The cluster obtained by the FCM with the Hamacher aggregation is the target variable.

A total of 25 classification algorithms were evaluated: Fine Tree, Medium Tree, Coarse Tree, Linear Discriminant, Quadratic Discriminant, Linear SVM, Quadratic SVM, Cubic SVM, Fine Gaussian SVM, Medium Gaussian SVM, Coarse Gaussian SVM, Cosine KNN, Coarse KNN, Cubic KNN, Weighted KNN, Fine KNN, Medium KNN, Gaussian Naive Bayes, Kernel Naive Bayes, Boosted Trees, Bagged Trees, Subspace Discriminant, Subspace KNN, RUSBoost Trees and Ensemble Adaptive Boosting. The one with the highest performance was the Ensemble Adaptive Boosting algorithm (AdaBoost). AdaBoost proved to be very efficient for all classes except the first three (clusters 1, 1.5 and 2). Thus, if an observation is classified into the first 3 clusters, a second algorithm is applied in order to classify it more accurately. The best algorithm for this second stage was the Ensemble Subspace k-NN (Fig. 4).

Fig. 4.
figure 4

The hybrid classification model
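A minimal sketch of the two-stage decision of Fig. 4 is given below, assuming that mdlAda and mdlSubKnn are the two trained ensembles (see the training sketches in Sect. 5) and that the cluster labels are stored as strings; these names and the label encoding are assumptions.

% Stage 1: Ensemble AdaBoost on [AvgSlope, AvgAspect, GeoFormLabel]
yPred = predict(mdlAda, Xtest);

% Stage 2: observations falling in the first three clusters are
% re-classified by the Ensemble Subspace k-NN
firstThree = ismember(yPred, {'1', '1.5', '2'});
yPred(firstThree) = predict(mdlSubKnn, Xtest(firstThree, :));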

4.1 Ensemble AdaBoost

Ensemble AdaBoost makes predictions based on a number of different models. By combining individual models, the ensemble tends to be less biased and less sensitive to the data. AdaBoost works especially well with decision trees and is the most popular boosting technique developed for classification. It learns from previous mistakes, e.g. misclassified data points, by increasing their weights; the learner with the higher weight has more influence on the final decision.

4.2 Ensemble Subspace k- Nearest-Neighbors (Ensemble Subspace k-NN)

The k-nearest neighbors (k-NN) is a lazy and non-parametric learning algorithm [8]. It is a traditional classification rule that assigns to a test sample the majority label of its k nearest neighbors from the training set. Given a set X of n points and a distance function, the k-NN search finds the k closest points to a query point or to a set of query points [17]. Dudani [11] first introduced a weighted voting method, called the distance-weighted (DW) k-nearest neighbor rule (Wk-NN). In this approach, the closer a neighbor is, the greater the weight assigned to it by the DW function.

The farthest of the k neighbors corresponds to a weight of 0, while the one closest to the observation corresponds to a weight of 1; all neighbors in between receive values between 0 and 1. The most common and most consistent ensemble method for k-NN is the Ensemble Subspace k-NN; related work using this algorithm can be found in [10, 15, 17]. Tuning of the hyperparameters was done with a combination of 10-fold cross validation and grid search which, according to the literature, is one of the most widely used strategies in machine learning.
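As an illustration of this distance-weighted rule, a short Matlab sketch is given below (Statistics and Machine Learning Toolbox); Xtrain, Ytrain (numeric class labels), xQuery and k are assumed variables.

% k nearest neighbours of a single query point and their distances
[idx, d] = knnsearch(Xtrain, xQuery, 'K', k);

% DW weights [11]: nearest neighbour -> 1, farthest -> 0, the rest in (0, 1)
w = (d(k) - d) ./ (d(k) - d(1));

% Weighted vote: each neighbour contributes its weight to its own class
classes = unique(Ytrain);
score = arrayfun(@(c) sum(w(Ytrain(idx) == c)), classes);
[~, best] = max(score);
yHat = classes(best);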

Since this is a multiclass classification problem, the “One versus All” strategy [19, 30] was used for the evaluation. Table 5 shows the 5 performance indices that were used.

Table 5. Indices used for the evaluation of the multi-class classification
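Although the exact set of indices of Table 5 is not reproduced here, the sketch below shows how typical one-versus-all indices (per-class precision, recall/sensitivity, F1 score and overall accuracy) can be obtained from a confusion matrix in Matlab; yTrue and yPred are assumed variables.

C  = confusionmat(yTrue, yPred);       % rows: true classes, columns: predicted
tp = diag(C);                          % true positives of each class
fp = sum(C, 1)' - tp;                  % false positives (one-versus-all)
fn = sum(C, 2)  - tp;                  % false negatives (one-versus-all)

precision = tp ./ (tp + fp);
recall    = tp ./ (tp + fn);           % sensitivity
f1        = 2 * precision .* recall ./ (precision + recall);
accuracy  = sum(tp) / sum(C(:));       % overall accuracy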

5 Experimental Results

The experiments were performed in Matlab R2019a. The initial ranges of the AdaBoost hyperparameters were the following: the Maximum Number of Splits (MNS) takes values in the interval [10, 500], the Number of Learners (NLE) in [1, 800] and the Learning Rate (LR) in [0.001, 1]. The optimal hyperparameter values found were 175, 88, 1 and 10 for the MNS, the NLE, the LR and the number of grid divisions, respectively. The learner type was a Decision Tree and the optimizer employed was Grid Search. The Ensemble AdaBoost achieved an accuracy of 64% and 68% for 2003 and 2015, respectively. Tables 6 and 7 are the confusion matrices of the optimal algorithm for 2015 and 2003, whereas Fig. 5a and b show the ROC curves for the respective years. Tables 8 and 9 present the values of all the performance indices for AdaBoost.
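A minimal training sketch with the reported optimal values (MNS = 175, NLE = 88, LR = 1) is shown below; AdaBoostM2 is Matlab's multiclass variant of AdaBoost and is assumed here, as are the variable names.

tTree  = templateTree('MaxNumSplits', 175);            % MNS
mdlAda = fitcensemble(Xtrain, Ytrain, ...
                      'Method', 'AdaBoostM2', ...      % multiclass AdaBoost
                      'Learners', tTree, ...
                      'NumLearningCycles', 88, ...     % NLE
                      'LearnRate', 1);                 % LR

cvAda  = crossval(mdlAda, 'KFold', 10);                % 10-fold cross validation
accAda = 1 - kfoldLoss(cvAda);                         % overall accuracy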

It is obvious that AdaBoost successfully classifies all the landslides that belong to cluster 2.5 and above. This is significant, as the algorithm can indicate with high accuracy the COLA that would be the most disastrous. The prediction of the first 3 clusters requires more attention.

Table 6. Confusion Matrix for the Ensemble AdaBoost for 2015
Table 7. Confusion matrix for the Ensemble AdaBoost for 2003
Fig. 5.
figure 5

ROC Curves for 2015 (5a) and 2003 (5b)

Table 8. Classification performance indices for the Ensemble AdaBoost (2015)
Table 9. Classification performance indices for the Ensemble AdaBoost (2003)

The Ensemble Subspace k-NN was then employed. Tuning of its hyperparameters was again performed with the combination of 10-fold cross validation and grid search. The initial ranges for the hyperparameters of the Ensemble Subspace k-NN are: the Maximum Number of Splits takes values in [10, 500], the Number of Learners in [1, 800], the Learning Rate in [0.001, 1] and the Subspace Dimension in [2, 10]. The optimal hyperparameter values found were 20, 30, 0.1, 3 and 10 for the MNS, the NLE, the LR, the Subspace Dimension and the number of grid divisions, respectively. The distance metric was Euclidean and the optimizer employed was Grid Search. The Ensemble Subspace k-NN achieved an accuracy of 70.07% and 72.88% for 2003 and 2015, respectively. The confusion matrices for each year are presented in Tables 10 and 11 and the corresponding ROC curves in Fig. 6a and b.
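Correspondingly, a minimal sketch of the Ensemble Subspace k-NN with the reported optimal values (30 learners, subspace dimension 3, Euclidean distance) is given below; the variable names are assumptions, and hyperparameters that do not apply to the subspace method (e.g. the learning rate) are omitted.

tKnn      = templateKNN('Distance', 'euclidean');
mdlSubKnn = fitcensemble(Xtrain, Ytrain, ...
                         'Method', 'Subspace', ...
                         'Learners', tKnn, ...
                         'NumLearningCycles', 30, ...  % NLE
                         'NPredToSample', 3);          % subspace dimension

cvKnn  = crossval(mdlSubKnn, 'KFold', 10);             % 10-fold cross validation
accKnn = 1 - kfoldLoss(cvKnn);                         % overall accuracy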

Table 10. Confusion matrix for ensemble subspace k-NN algorithm for 2015
Table 11. Confusion matrix for ensemble subspace k-NN algorithm for 2003

Tables 12 and 13 present the values of the performance indices for the above optimal algorithm. After applying the Ensemble Subspace k-NN algorithm, a significant increase of the overall accuracy and of the other indices was observed; the accuracy reached approximately 70% and 72% for 2003 and 2015, respectively.

Fig. 6.
figure 6

ROC curve for ensemble subspace k-NN for 2015 (6a) and for 2003 (6b)

Table 12. Classification performance indices of ensemble subspace k-NN (2015)
Table 13. Classification performance indices of Ensemble Subspace k-NN (2003)

6 Discussion and Conclusion

At first glance, the efficiency of the model is high but not optimal. However, it should be considered that this modeling effort has managed to effectively classify the severity of the complicated coseismic landslide phenomenon using only 3 independent variables. From this point of view, the performance is reliable and it has a certain level of novelty, since, to the best of our knowledge, no other approach with similar accuracy exists in the literature. It is pioneering research, employing state-of-the-art hybrid machine learning algorithms in geomechanics. The indices, especially the accuracy and the F1 score, indicate a flexible model that can predict the most severe landslides and can also handle the landslides that are not so dangerous. The results for 2003 and 2015 are similar, which means that the algorithm is consistent and can generalize (it is not case dependent). The performance of the algorithm for 2015 is better because of the more contemporary equipment used for the extraction of the features. This research addresses one of the most crucial natural hazards and is of essential importance for urban planning and the functioning of societies. Future research will focus on predicting the timeframe and the area of a landslide after an earthquake.