1 Introduction

Landslides are among the most destructive natural disasters and occur frequently around the world, particularly in mountainous regions (Sassa et al. 2010; Juang et al. 2019). The Three Gorges Reservoir area lies in the middle and upper reaches of the Yangtze River. Since impoundment began in 2003, the reservoir banks have been subjected to long-term periodic fluctuation of the reservoir water level. As a result, the rock and soil of the bank slopes repeatedly undergo changes in dynamic seepage pressure, which strongly affects the regional geological environment, deforms and destabilizes previously stable reservoir banks, and leads to the reactivation and deformation of many ancient landslides (Tang et al. 2015, 2019). A large hydropower hub is of great political, economic, and social importance (Wu et al. 2017; Li et al. 2019a, b). Research on riverbank landslides in the Three Gorges Reservoir area is therefore of great significance.

The deformation and failure of landslides result from the coupled action of internal and external factors. Internal factors include landform, geological structure, and rock and soil properties. External factors include rainfall, reservoir water level, vibration, and human activities (Zhou et al. 2018a, b; Yao et al. 2019). For landslides in the Three Gorges Reservoir area, the reservoir water level fluctuates between 145 and 175 m every year, which causes a year-round periodic change in the dynamic seepage pressure of the rock and soil and reduces the stability of the landslide (Song et al. 2018; Huang et al. 2018). Rainfall is another important external factor causing landslide deformation and failure (Wang and Sassa 2001; Cao et al. 2020). On the one hand, rainfall infiltration increases the sliding force of the slope. On the other hand, rainfall weakens the rock and soil mass, which reduces the stability of the landslide (Miao et al. 2019; Wang et al. 2020; Wu et al. 2020). Therefore, reservoir water level and rainfall can be treated as the hydrologic triggering factors of landslide deformation and failure in the Three Gorges Reservoir area (Xiong et al. 2019).

With the improvement of monitoring technology and accuracy, real-time landslide monitoring systems now collect large amounts of monitoring data (Zhang et al. 2018). Most current research on landslide monitoring data focuses on qualitative analysis or displacement prediction (Shihabudheen et al. 2017; Miao et al. 2018; Intrieri et al. 2019; Li et al. 2019a, b). In terms of data mining analysis, Tsai et al. (2013) used data mining technology to analyze terrain and vegetation factors in order to verify landslides induced by regional heavy rainfall in Taiwan, and then used decision tree and Bayesian network algorithms to extract information from the landslide data. Huang et al. (2016) studied the correlation criteria between landslide displacement, reservoir water level, and rainfall, and determined the triggering factors of the landslide. Wu et al. (2016a, b) and Ma et al. (2017a, 2018, 2020) proposed a data mining method to investigate the hydrological causes of the Majiagou landslide, in which the Apriori algorithm was used to mine association rules and determine the contribution of each hydrological parameter to the landslide movement. Ma et al. (2017b) proposed a hybrid method based on two-step clustering and the decision tree C5.0 algorithm to establish a prediction model for step-like landslide deformation. Wang et al. (2019) proposed an improved parallel mining algorithm for cooperative frequent itemsets in multiple data streams.

In this paper, a data mining method combining two-step clustering, the Apriori algorithm, and the decision tree C5.0 is proposed, as shown in Fig. 1. The Baishuihe Landslide in the Three Gorges Reservoir area was taken as the research object. First, 6 hydrologic triggering factors were chosen for the data mining analysis: monthly cumulative rainfall (\(q^{{{\text{month}}}}\)), monthly maximum daily rainfall (\(q_{{{\text{max}}}}^{{{\text{day}}}}\)), monthly maximum continuous rainfall (\(q_{{{\text{continuous}}}}\)), monthly average water level (\(\overline{h}\)), monthly variation of water level (\(\Delta h\)), and monthly maximum daily variation of water level (\(\Delta h_{{{\text{max}}}}^{{{\text{daily}}}}\)). Then, two-step clustering was used to cluster the six triggering factors and the deformation rate of the landslide, and the Apriori algorithm was used to mine the association rules between the triggering factors and the deformation rate. A total of 173 association rules were generated, of which 20 were selected for analysis. Finally, a decision tree C5.0 model was built to carry out the threshold analysis of the landslide triggering factors. On the Baishuihe landslide monitoring data, the proposed method achieves an accuracy above 80% on both the training and the testing samples, which could provide a significant basis for the data analysis and prediction of accumulative landslides in the Three Gorges Reservoir area.

Fig. 1 Flow chart of the data mining process (+, ++, and +++ represent low, medium, and high performance)

2 Methodology

2.1 Two-step clustering

The two-step clustering method clusters data in two stages, pre-clustering and clustering (Ding et al. 2012; Wu et al. 2016a, b), as shown in Fig. 2. Its main characteristics are: (1) it can handle both numerical and categorical variables; (2) it can automatically determine the number of clusters according to a given criterion; (3) it can identify outliers and noisy data in the samples. Pre-clustering uses a “sequential” method to roughly divide the samples into several sub-clusters. At the beginning, all data samples are regarded as one large class. Each time a sample is read, its “degree of affinity” to the existing sub-clusters determines whether it is merged into an existing sub-cluster or starts a new one. This step is repeated until all samples have been assigned, dividing the data into L classes, so the number of clusters increases during pre-clustering. In the clustering stage, the sub-clusters generated by pre-clustering are merged according to the same “degree of affinity”, and the sample data are finally divided into L categories, so the number of clusters decreases during clustering. For numerical variables, Euclidean distance is usually used in two-step clustering. If the sample data contain both numerical and categorical variables, the log-likelihood distance should be used.

Fig. 2 The algorithm implementation process of two-step clustering
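The two stages described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions and not the implementation used in this study: the sequential pass uses a fixed Euclidean distance threshold (the hypothetical parameter `threshold`) as the affinity test, the merging stage stops at a user-supplied number of clusters `k` (assumed to be at least 1) instead of selecting it automatically with a criterion such as BIC, and the sample values are invented for illustration.

```python
import math

def centroid(cluster):
    """Mean vector of a sub-cluster stored as running sums."""
    return [s / cluster["n"] for s in cluster["sum"]]

def pre_cluster(samples, threshold):
    """Sequential pre-clustering: a sample joins the nearest sub-cluster if its
    Euclidean distance to that centroid is within `threshold`; otherwise it
    starts a new sub-cluster, so the number of sub-clusters only grows."""
    clusters = []
    for x in samples:
        nearest = min(clusters, key=lambda c: math.dist(x, centroid(c)), default=None)
        if nearest is not None and math.dist(x, centroid(nearest)) <= threshold:
            nearest["sum"] = [s + v for s, v in zip(nearest["sum"], x)]
            nearest["n"] += 1
        else:
            clusters.append({"sum": list(x), "n": 1})
    return clusters

def merge_clusters(clusters, k):
    """Clustering stage: repeatedly merge the two closest sub-clusters
    (by centroid distance) until only k remain, so the count only shrinks."""
    clusters = [dict(c) for c in clusters]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: math.dist(centroid(clusters[p[0]]),
                                                  centroid(clusters[p[1]])))
        merged = {"sum": [a + b for a, b in zip(clusters[i]["sum"], clusters[j]["sum"])],
                  "n": clusters[i]["n"] + clusters[j]["n"]}
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

# Hypothetical one-dimensional samples (e.g. monthly rainfall in mm), not study data.
samples = [(5.0,), (12.0,), (60.0,), (90.0,), (150.0,), (200.0,), (480.0,)]
sub_clusters = pre_cluster(samples, threshold=35.0)
final = merge_clusters(sub_clusters, k=3)
print(len(sub_clusters), sorted(c["n"] for c in final))
```

With these invented values the sequential pass produces five sub-clusters, which the merging stage then reduces to three.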

2.2 Apriori algorithm

The Apriori algorithm was first proposed by Agrawal and Srikant and has become the core algorithm of association rule mining (Agrawal et al. 1993). It can handle only categorical variables, not numeric ones (Perego et al. 2001; Guo et al. 2019).

Let T be the set of all transactions and T(a) the subset of transactions that contain the item set a. If the support of a is greater than or equal to the minimum support threshold specified by the user, that is:

$$ \frac{{\left| {T(a)} \right|}}{\left| T \right|} \ge {\text{minsupp}} $$
(1)

then a is called a frequent item set. A frequent item set containing one item (length 1) is called a frequent 1-item set, denoted L1. As shown in Fig. 3, a, b, c, and d at the bottom layer are frequent 1-item sets when they meet the minimum support. A frequent item set with k items is called a frequent k-item set, denoted Lk. The upper-level item sets ab, abc, and abcd are frequent k-item sets when they meet the minimum support.

Fig. 3 The algorithm implementation process of Apriori

The Apriori algorithm generates frequent item sets by an iterative, level-wise search in which frequent k-item sets are used to generate candidate (k + 1)-item sets. The implementation process is shown in Fig. 3. First, the frequent item sets of length 1 (L1) are found. L1 is then used to generate the frequent item sets of length 2 (L2), L2 is used to generate the frequent item sets of length 3 (L3), and so on until all frequent item sets have been found.
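The level-wise search can be sketched in a few lines of Python. This is a minimal illustration of the general technique, assuming transactions are given as sets of categorical labels. The function name `apriori` and the toy transactions over the items a, b, c, d are hypothetical and are not taken from the monitoring data.

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Return {item set: support} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole item set (Eq. 1).
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent item sets of length 1.
    level = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in level if support(s) >= min_supp}
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # Join step: unite pairs of frequent k-item sets into (k + 1) candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_supp}
        k += 1
    return frequent

# Hypothetical toy transactions over the items a, b, c, d.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
for itemset, supp in sorted(apriori(toy, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy example, d is dropped at the first level, so no candidate containing d is ever generated, and abc is generated as a candidate but rejected because its support (2 of 5 transactions) is below the threshold.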

2.3 Decision tree C5.0

The decision tree model has advantages in its estimation process and in the interpretability of its parameters (Pandya et al. 2015). Unlike other statistical methods, the decision tree model makes no statistical assumptions and can process data measured on different scales.

2.3.1 Growth of decision tree C5.0

Decision tree C5.0 is developed from the ID3 algorithm. Its splitting criterion is derived from the concept of entropy, that is, the average uncertainty of an information source before the information is sent. Consider a node whose sample set is N, let C be the target variable, and let t be the number of categories of C. The entropy is then defined as:

$$ {\text{Ent}}\left( N \right) = - \sum\limits_{i = 1}^{t} {p\left( {C_{i} \left| N \right.} \right)} \log_{2} p\left( {C_{i} \left| N \right.} \right) $$
(2)

where \(p\left( {C_{i} \left| N \right.} \right)\) is the relative frequency of category \(C_{i} \left( {i = 1,2,...,t} \right)\) in N. If an attribute variable T divides N into k subsets \(T_{j} \left( {j = 1,2,...,k} \right)\), the conditional entropy after the variable is introduced is defined as:

$$ {\text{Ent}}\left( {N\left| T \right.} \right) = \sum\limits_{j = 1}^{k} {\frac{{\left| {T_{j} } \right|}}{\left| N \right|}} \times {\text{Ent}}\left( {T_{j} } \right) $$
(3)

The entropy difference between the newly split node and the original node is the information gain, which can be expressed as:

$$ {\text{Gains}}\left( {N,T} \right) = {\text{Ent}}\left( N \right) - {\text{Ent}}\left( {N\left| T \right.} \right) $$
(4)

Normally, \({\text{Ent}}\left( N \right) > {\text{Ent}}\left( {N\left| T \right.} \right)\), so the information gain measures how much uncertainty is eliminated by the split. The decision tree is grown by selecting, at each node, the grouping variable with the maximum information gain ratio, which is the information gain divided by the entropy of the partition induced by T:

$$ {\text{GainRatio}} = \frac{{{\text{Gains}}\left( {N,T} \right)}}{{{\text{Ent}}\left( T \right)}} $$
(5)
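Equations (2) to (5) can be checked numerically with a short Python sketch. The function names and the four toy samples are hypothetical. The sketch assumes a categorical attribute and takes \({\text{Ent}}\left( T \right)\) in Eq. (5) to be the entropy of the attribute values themselves.

```python
import math
from collections import Counter

def entropy(labels):
    """Eq. (2): entropy of a list of categorical labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(attribute, labels):
    """Eqs. (3)-(5): information gain and gain ratio of one categorical attribute."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Eq. (3): entropy of each subset T_j weighted by its share of the samples.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional          # Eq. (4)
    split_entropy = entropy(attribute)            # Ent(T) in Eq. (5)
    ratio = gain / split_entropy if split_entropy > 0 else 0.0
    return gain, ratio

# Hypothetical toy samples: a rainfall category and a deformation stage per month.
rain = ["heavy", "heavy", "light", "light"]
stage = ["III", "III", "I", "I"]
print(gain_and_ratio(rain, stage))
```

Here the rainfall category separates the two stages perfectly, so the conditional entropy is zero and both the gain and the gain ratio equal 1 bit. An attribute whose groups each contain the same mix of stages as the whole sample would give a gain of zero.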

2.3.2 Pruning rules of decision tree C5.0

Decision tree C5.0 uses statistical confidence interval estimation to evaluate the error on the training set. If node n contains N samples, of which \(E_{n}\) are predicted incorrectly, the error rate of this node is:

$$ f_{n} = \frac{{E_{n} }}{N} $$
(6)

In addition, the estimation error of node n is defined as:

$$ e_{n} = f_{n} + z\sqrt {\frac{{f_{n} \left( {1 - f_{n} } \right)}}{N}} $$
(7)

where z is the confidence threshold, which is generally taken as 1.15. On this basis, when the weighted error of the leaf nodes of the subtree to be pruned is greater than the estimated error of the parent node, the leaf nodes can be pruned, which is expressed as:

$$ \sum\limits_{n = 1}^{r} {p_{n} e_{n} } > e,\quad n = 1,2, \cdots ,r $$
(8)

where r is the number of leaf nodes of the subtree before pruning, \(p_{n}\) is the ratio of the sample size of leaf node n to the sample size of the subtree, and e is the estimated error of the parent node.
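The pruning test of Eqs. (6) to (8) can be written as a small Python sketch. The function names and the sample counts are hypothetical. The sketch assumes that each node is described by its number of misclassified samples and its total number of samples, and it uses z = 1.15 as stated above.

```python
import math

def estimated_error(errors, n, z=1.15):
    """Eqs. (6)-(7): upper estimate of the error of a node with n samples."""
    f = errors / n
    return f + z * math.sqrt(f * (1 - f) / n)

def should_prune(leaves, parent, z=1.15):
    """Eq. (8): prune when the weighted leaf error exceeds the parent error.

    leaves: list of (errors, n) for the leaf nodes of the subtree
    parent: (errors, n) for the parent node treated as a single leaf
    """
    total = sum(n for _, n in leaves)
    weighted = sum((n / total) * estimated_error(e, n, z) for e, n in leaves)
    return weighted > estimated_error(parent[0], parent[1], z)

# Hypothetical counts: the split does not reduce the error rate, so it is pruned.
print(should_prune(leaves=[(1, 4), (1, 4)], parent=(2, 8)))
# Hypothetical counts: the split separates the classes perfectly, so it is kept.
print(should_prune(leaves=[(0, 20), (0, 20)], parent=(20, 40)))
```

In the first case both leaves keep the parent's 25% error rate but hold fewer samples, so their confidence term is larger and the subtree is pruned. In the second case the leaves have zero error and the subtree is retained.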

3 Case study: Baishuihe Landslide

3.1 Geological conditions

The Baishuihe Landslide is located on the right bank of the Yangtze River in Shazhenxi Town, Zigui County, Hubei Province (Fig. 4), approximately 56 km from the Three Gorges Dam. It is a large-scale ancient accumulative landslide with an average slope inclination of 30° and an average thickness of 30 m. The landslide has a volume of \(645 \times 10^{4}\) m\(^{3}\) and covers an area of \(21.5 \times 10^{4}\) m\(^{2}\). The main sliding direction is NE 15°–20°. The north–south and east–west lengths of the landslide are about 600 m and 700 m, respectively. The Baishuihe Landslide formed in a nearly north–south gully, higher in the south than in the north, and extends into the Yangtze River. The toe and rear of the landslide are steep, and the central portion is flat. Both sides of the landslide show irregular flat concave terrain that is slightly higher than the middle of the landslide. The toe of the landslide extends to the bed of the Yangtze River, and the crown is located at the rock–soil boundary at an elevation of 410 m. In plan view, the boundary of the landslide has the shape of an irregular round-backed armchair.

Fig. 4 a Location of the Baishuihe Landslide; b Topographic map of the Baishuihe Landslide; c Overall view of the Baishuihe Landslide

A schematic geological profile of the Baishuihe Landslide is shown in Fig. 5. The landslide materials are Quaternary deposits, including silty clay and fragmented rubble with a loose and disordered structure. The bedrock and strata that crop out around the landslide are mainly Jurassic siltstone, arenaceous shale, and quartz sandstone, with a dip direction of 15° and a dip angle of 36°. The physical and mechanical parameters of the landslide materials are shown in Table 1.

Fig. 5 Schematic geological profile of the Baishuihe Landslide (II–II′)

Table 1 Physical and mechanical parameters of landslide materials

Based on monitoring of surface displacement and the surface deformation characteristics, the Baishuihe Landslide was divided into two major areas in July 2004.

  1. The active area (section A, namely the main deformation zone or the warning area) is the front part of the landslide and has large deformation. Due to flooding by the reservoir water after the Three Gorges Dam was built, the landslide has obvious displacements, and multiple transverse tension cracks occur in the eastern part.

  2. The relatively stable area (section B) is the middle and rear of the landslide, where the accumulated deformation is small and the deformation rate is slow, only 1.5–4.0 mm/a.

3.2 Deformation of the landslide

As an active ancient landslide, the Baishuihe landslide has slid many times. On August 25, 1993, a slide occurred at the rear edge of the landslide, forcing 15 residents to relocate. Since 2003, warnings have been issued for the Baishuihe landslide many times because of its strong deformation. In 2003, cracks more than 300 m long were found in the eastern slip tongue, and 4 households were forced to leave. On the morning of June 30, 2007, approximately 100,000 m\(^{3}\) of landslide material piled up on the road at the rear of the active area (Fig. 6). The Baishuihe landslide showed obvious deformation during the flood seasons from 2008 to 2012. By the end of August 2012, the maximum displacement had reached 3148.3 mm. From May to August 2015, the retaining wall of Shahuang road at the rear edge of the landslide cracked, with a crack width of 1–5 cm (Fig. 6). The deformation of the retaining wall was caused by creep of the soil along the soil–rock interface, which is a local deformation related to rainfall.

Fig. 6 Macroscopic deformation of the Baishuihe Landslide

3.3 Analysis of the monitoring data

Since 2003, a total of 11 GPS displacement monitoring points have been deployed on the Baishuihe landslide. Among them, ZG93, ZG118, and XD01 are in the active area and have relatively long monitoring periods, so they reflect the deformation characteristics of the landslide well. In this study, the data of monitoring points ZG93, ZG118, and XD01 from June 2006 to December 2016 were therefore selected as the research object, as shown in Fig. 7. In June 2003, the water level of the reservoir was raised to 135 m for the first time, and it was kept in the 135–140 m range until 2006. From 2006 to 2007, the water level was raised to 155 m and then fluctuated between 145 and 155 m. After 2008, the reservoir level was raised to over 170 m, and the normal operation mode of 145–175 m started after 2010. According to the water level scheduling of the Three Gorges Reservoir, the landslide deformation is divided into 3 phases:

  1. Phase I (from June 2003 to June 2006): In this phase, the reservoir water level fluctuates between 135 and 140 m. Although the fluctuation range of the water level is small, the displacement of the monitoring points begins to increase steadily in a step-like pattern. The displacement increases are concentrated mainly during the decline of the reservoir water level and the period that follows. The periodic decline of the reservoir water level is the main factor causing the increase of displacement, and the displacement lags behind the reservoir level to some extent. A heavy rainfall event in July 2005 did not cause a sharp increase in displacement.

  2. Phase II (from July 2006 to June 2008): In this phase, the reservoir water level fluctuates between 145 and 155 m. From April to June 2007, when the reservoir level dropped from 155 to 145 m for the first time, the large drawdown increased the hydrodynamic pressure in the landslide and changed its seepage field, which caused the first sudden increase in displacement at every monitoring point. The increase at XD01 was more than 1000 mm.

  3. Phase III (from July 2008 to December 2016): In this phase, the reservoir water level fluctuates between 145 and 175 m. The landslide displacement increases in a step-like pattern, and the annual growth rate decreases before 2015. The displacement of monitoring point XD01 on the right side of the landslide is significantly larger than that of the other monitoring points. The relatively large deformation on the right side of the landslide is consistent with the field investigation, in which the cracks of the Baishuihe landslide are mostly concentrated near the right edge of the landslide mass.

Fig. 7 Long-term monitoring data of the Baishuihe Landslide (displacement, reservoir level, precipitation)

The above analysis shows that reservoir water level fluctuation and rainfall are the main factors affecting the deformation of the landslide. Therefore, in this study, a total of 6 hydrologic factors were chosen for the data mining analysis: monthly cumulative rainfall (\(q^{{{\text{month}}}}\)), monthly maximum daily rainfall (\(q_{{{\text{max}}}}^{{{\text{day}}}}\)), monthly maximum continuous rainfall (\(q_{{{\text{continuous}}}}\)), monthly average water level (\(\overline{h}\)), monthly variation of water level (\(\Delta h\)), and monthly maximum daily variation of water level (\(\Delta h_{{{\text{max}}}}^{{{\text{daily}}}}\)), as shown in Table 2.

Table 2 Division of the induced factors for the landslide
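As an illustration of how the six monthly factors might be derived from daily records, the following Python sketch computes them for one month. The precise definitions are assumptions made for this sketch and are not stated in the text: \(q_{{{\text{continuous}}}}\) is taken as the largest rainfall total over a run of consecutive rainy days, \(\Delta h\) as the last daily level minus the first, and \(\Delta h_{{{\text{max}}}}^{{{\text{daily}}}}\) as the signed day-to-day change with the largest magnitude. The function name, the dictionary keys, and the daily values are hypothetical.

```python
def monthly_factors(daily_rain, daily_level):
    """Compute the six hydrologic factors for one month.

    daily_rain:  rainfall per day (mm)
    daily_level: reservoir water level per day (m), at least two values
    """
    # Assumed definition: largest total over a run of consecutive rainy days.
    best_run = run = 0.0
    for r in daily_rain:
        run = run + r if r > 0 else 0.0
        best_run = max(best_run, run)
    # Signed day-to-day changes of the water level.
    diffs = [b - a for a, b in zip(daily_level, daily_level[1:])]
    return {
        "q_month": sum(daily_rain),                  # monthly cumulative rainfall
        "q_max_day": max(daily_rain),                # monthly maximum daily rainfall
        "q_continuous": best_run,                    # monthly maximum continuous rainfall
        "h_mean": sum(daily_level) / len(daily_level),   # monthly average water level
        "dh": daily_level[-1] - daily_level[0],      # monthly variation (assumed: last minus first)
        "dh_max_daily": max(diffs, key=abs),         # largest daily change, sign kept
    }

# Hypothetical five-day record used only to show the call.
print(monthly_factors([0, 10, 20, 0, 5], [150.0, 149.5, 148.0, 148.0, 147.5]))
```

Keeping the sign of the water level changes lets rising and dropping months be distinguished, which matches the separate rise and drop categories used for the water level factors.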

4 Results

4.1 Clustering results

Based on the two-step clustering algorithm, the 6 triggering factors were clustered. The maximum and minimum numbers of categories for each triggering factor were set to 10 and 2, respectively. The distance measure used in the two-step clustering algorithm was Euclidean distance, and the clustering criterion was the Bayesian Information Criterion (BIC). The clustering results of the hydrologic factors are shown in Tables 3 and 4. Monthly cumulative rainfall (\(q^{{{\text{month}}}}\)) was clustered into Heavy-Rainfall (183.5–517.6 mm), Moderate-Rainfall (69.9–179.8 mm), and Light-Rainfall (3.1–66.1 mm). Monthly maximum daily rainfall (\(q_{{{\text{max}}}}^{{{\text{day}}}}\)) was clustered into Heavy-Daily-Rainfall (55.9–160.7 mm), Moderate-Daily-Rainfall (26.5–55.2 mm), and Light-Daily-Rainfall (1.3–25.6 mm). Monthly maximum continuous rainfall (\(q_{{{\text{continuous}}}}\)) was clustered into Heavy-Effective-Rainfall (110.5–239.4 mm), Moderate-Effective-Rainfall (36.6–109.8 mm), and Light-Effective-Rainfall (1.5–36.1 mm).

Table 3 Clustering results of the rainfall factors
Table 4 Clustering results of the reservoir water level factors

Monthly average water level (\(\overline{h}\)) was clustered into High-Water-Level (160.14–174.74 m), Medium-Water-Level (144.21–158.47 m), and Low-Water-Level (135.13–138.95 m). Monthly variation of water level (\(\Delta h\)) was clustered into Sharply-Rise (13.26–17.35 m/month), Medium-Rise (7.23–11.36 m/month), Slowly-Rise (1.57–5.89 m/month), Smooth-Fluctuation (− 1.56 to 1.31 m/month), Medium-Drop (− 7.09 to − 3.41 m/month), and Sharply-Drop (− 13.02 to − 8.59 m/month). Monthly maximum daily variation of water level (\(\Delta h_{\max }^{{{\text{daily}}}}\)) was clustered into Sharply-Daily-Rise (1.66–3.223 m/day), Medium-Daily-Rise (0.744–1.513 m/day), Slowly-Daily-Rise (0.063–0.63 m/day), Slowly-Daily-Drop (− 0.414 to 0 m/day), and Medium-Daily-Drop (− 1.697 to − 0.49 m/day), as shown in Table 4.

The clustering results for monthly velocity (v) are shown in Table 5. The initial stage of deformation (Low I) corresponds to monitoring points deforming at a rate of − 0.195 to 0.078 mm/month and accounts for 42.3% of the total data set. The stable deformation stage (Medium II) corresponds to a rate of 0.092–0.939 mm/month and accounts for 40.5% of the total data set. The acceleration deformation stage (High III) corresponds to a rate of 1.042–10.669 mm/month and accounts for 17.2% of the total data set.

Table 5 Clustering results of the monthly velocity

4.2 Data mining and analysis

In the data mining process, the hydrologic factors of the landslide (\(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), \(q_{{{\text{continuous}}}}\), \(\overline{h}\), \(\Delta h\), \(\Delta h_{{{\text{max}}}}^{{{\text{daily}}}}\)) are set as the antecedent items of the association rules, and the deformation rate (monthly velocity v) is set as the consequent item. The support and confidence thresholds of the Apriori algorithm are set to 1.5% and 80%, respectively, to mine the association rules of the Baishuihe landslide. A total of 173 association rules are generated, most of which concern stages I and II of the landslide. In these two stages, the deformation rate of the landslide is low, only from − 0.195 to 0.939 mm/month, and the displacement of the monitoring points is almost stable. Stage III of landslide deformation deserves more attention. Therefore, only a few typical association rules are listed for the stable deformation stages. The selected association rules are shown in Table 6. Rules 1–6, 7–13, and 14–20 correspond to stages I, II, and III, respectively.

Table 6 Association rule results for the Baishuihe Landslide
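To make the two thresholds concrete, the following Python sketch evaluates a single candidate rule against a set of clustered monthly records. It assumes the common definitions in which support is the share of all records that satisfy both the antecedent and the consequent, and confidence is the share of antecedent-matching records that also satisfy the consequent. The software used in the study may define support differently. The function name, the dictionary keys, and the ten records are hypothetical and are not the monitoring data.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    records:    list of dicts mapping factor name to its cluster label
    antecedent: dict of factor name to required label
    consequent: (factor name, label) pair for the deformation stage
    """
    matches_a = [r for r in records
                 if all(r.get(k) == v for k, v in antecedent.items())]
    matches_both = [r for r in matches_a if r.get(consequent[0]) == consequent[1]]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_a) if matches_a else 0.0
    return support, confidence

# Hypothetical clustered records, invented for illustration only.
records = (
    [{"h": "High-Water-Level", "q_month": "Light-Rainfall", "v": "Low I"}] * 4
    + [{"h": "High-Water-Level", "q_month": "Light-Rainfall", "v": "Medium II"}]
    + [{"h": "Low-Water-Level", "q_month": "Heavy-Rainfall", "v": "High III"}] * 5
)
support, confidence = rule_metrics(
    records,
    antecedent={"h": "High-Water-Level", "q_month": "Light-Rainfall"},
    consequent=("v", "Low I"),
)
# Thresholds from the text: support of 1.5% and confidence of 80%.
keep = support >= 0.015 and confidence >= 0.80
print(support, confidence, keep)
```

A low support threshold such as 1.5% keeps rules that describe rare months, which matters here because the acceleration stage makes up the smallest share of the data set.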

Rules 1–6 are the association rules for landslide deformation at a low velocity of − 0.195 to 0.078 mm/month. Rules 1–3 mean that if the water level is 160.14–174.74 m (High-Water-Level) and one of the three rainfall factors (\(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), \(q_{{{\text{continuous}}}}\)) is in the light category, the landslide is likely to deform at a low rate (I stage). Rules 4–6 indicate that if the monthly variation of water level is between 1.57 and 5.89 m/month (Slowly-Rise) and the monthly maximum daily variation of water level is 0.063–0.63 m/day (Slowly-Daily-Rise), the landslide is likely to deform at a low rate (I stage).

Rules 7–13 are the association rules for landslide deformation at a medium velocity of 0.092–0.939 mm/month. The hydrologic factors in these rules are mainly light-to-moderate rainfall (Moderate-Rainfall, Light-Daily-Rainfall, Moderate-Daily-Rainfall, Light-Effective-Rainfall, Moderate-Effective-Rainfall) and low-to-medium rates of reservoir water level variation (Low-Water-Level, Slowly-Rise, Smooth-Fluctuation, Slowly-Daily-Rise, Medium-Daily-Drop). These factors do not induce large deformation of the landslide. The confidence of each rule is 100%, which means that in the monitoring data, whenever the antecedent of the rule occurred, the consequent also occurred.

Rules 14–20 are the association rules for landslide deformation at a high velocity of 1.042–10.669 mm/month. The hydrologic factors in these rules are mainly heavy rainfall and heavy effective rainfall. These rules mean that when one of the three rainfall factors (\(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), \(q_{{{\text{continuous}}}}\)) reaches the heavy category, the landslide deforms at a high velocity (III stage), which indicates that rainfall controls the deformation rate of the Baishuihe Landslide.

4.3 Threshold values of the induced factors

In the decision tree C5.0 model, the hydrologic factors of the landslide (\(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), \(q_{{{\text{continuous}}}}\), \(\overline{h}\), \(\Delta h\), \(\Delta h_{{{\text{max}}}}^{{{\text{daily}}}}\)) are set as the input parameters, and the deformation rate is set as the output parameter. A total of 80% of the data are used as training samples to build the decision tree model, and the remaining 20% are used as testing samples to check the accuracy of the model. To improve the generalization ability of the model and prevent overfitting, cross validation and boosting are combined in the construction of the decision tree C5.0 model. The number of boosting trials is set to 10, the number of cross validation folds is set to 10, and the expected noise is set to 10%. A total of 8 decision tree models are built, and the model with the highest accuracy is selected for analysis. This model contains only 3 hydrologic factors: \(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), and \(\overline{h}\). The importance of each factor is shown in Fig. 8. The monthly cumulative rainfall (\(q^{{{\text{month}}}}\)) plays a significant role in controlling landslide deformation.

Fig. 8 The importance degree of each induced factor in decision tree C5.0 model

A total of 10 threshold criteria for the deformation characteristics of the Baishuihe landslide have been established in the decision tree model, as shown in Table 7. Criteria 1–4, 5–8, and 9–10 correspond to stages I, II, and III, respectively.

Table 7 Threshold values of the induced factors based on decision tree C5.0 model

Criteria 1–4 indicate that the deformation of the landslide is in the I stage. Criterion 1 can be interpreted as: when the monthly average reservoir water level is less than 151.94 m, the monthly cumulative rainfall in the landslide area is less than 23.6 mm, and the monthly maximum daily rainfall is less than 28.1 mm, the landslide enters the initial stage of deformation. Criterion 2 can be interpreted as: the landslide enters the initial stage of deformation when the monthly cumulative rainfall in the landslide area is greater than 73.9 mm, the monthly average reservoir water level is greater than 155.55 m, and the monthly maximum daily rainfall is less than 36.1 mm. The prediction accuracy of this criterion is as high as 92.8%. Criterion 3 can be interpreted as: the landslide enters the initial stage of deformation when the monthly cumulative rainfall in the landslide area is less than 73.9 mm, the monthly average reservoir water level is less than 151.94 m, and the monthly maximum daily rainfall is greater than 28.1 mm. The accuracy of this criterion reaches 100%. Criterion 4 can be interpreted as: the landslide enters the initial stage of deformation when the monthly cumulative rainfall in the landslide area is less than 73.9 mm and the monthly average reservoir water level is more than 151.94 m. This criterion covers the largest number of samples, and its accuracy is 83.3%.

Criteria 5–8 indicate that the deformation of the landslide is in the II stage. Criterion 5 can be interpreted as: when the monthly cumulative rainfall in the landslide area is greater than 73.9 mm and the monthly average reservoir water level is less than 149.91 m, the landslide remains in the II stage. The accuracy of this criterion is 73.2%, and it covers many deformation samples. Criterion 6 can be interpreted as: when the monthly average reservoir water level is less than 151.94 m, the monthly cumulative rainfall is 23.6–73.9 mm, and the monthly maximum daily rainfall is less than 28.1 mm, the landslide remains in the II stage. The accuracy of this criterion reaches 100%. Criterion 7 can be interpreted as: when the monthly average reservoir water level is greater than 155.55 m, the monthly cumulative rainfall is greater than 73.9 mm, and the monthly maximum daily rainfall is greater than 36.1 mm, the landslide remains in the II stage. The accuracy of this criterion reaches 100%. Criterion 8 can be interpreted as: when the monthly average reservoir water level is 151.11–153.12 m and the monthly cumulative rainfall is greater than 73.9 mm, the landslide remains in the II stage.

Criteria 9–10 are the criteria for the landslide to enter the deformation acceleration stage (III). Criterion 9 can be interpreted as: when the monthly average reservoir water level is 149.91–151.11 m and the monthly cumulative rainfall in the landslide area is greater than 73.9 mm, the landslide enters the deformation acceleration stage. Criterion 10 can be interpreted as: when the monthly average reservoir water level is 153.12–155.55 m and the monthly cumulative rainfall in the landslide area is greater than 73.9 mm, the landslide enters the deformation acceleration stage. The accuracy of this criterion is as high as 100%.
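Read together, the ten criteria form a decision tree over \(q^{{{\text{month}}}}\), \(q_{{{\text{max}}}}^{{{\text{day}}}}\), and \(\overline{h}\). The following Python sketch encodes them as nested conditions to show how a month would be assigned to a stage. It is a reading of the criteria as worded in this section and not the fitted C5.0 model. The function name is hypothetical, and the treatment of values exactly equal to a threshold (assigned here to the upper branch) is an assumption, since the text does not state it.

```python
def deformation_stage(q_month, q_max_day, h_mean):
    """Return (stage, criterion number) for one month.

    q_month:   monthly cumulative rainfall (mm)
    q_max_day: monthly maximum daily rainfall (mm)
    h_mean:    monthly average reservoir water level (m)
    """
    if q_month < 73.9:                 # below the rainfall threshold
        if h_mean >= 151.94:
            return "I", 4
        if q_max_day >= 28.1:
            return "I", 3
        if q_month < 23.6:
            return "I", 1
        return "II", 6
    # Monthly cumulative rainfall at or above 73.9 mm: the water level decides.
    if h_mean < 149.91:
        return "II", 5
    if h_mean < 151.11:
        return "III", 9
    if h_mean < 153.12:
        return "II", 8
    if h_mean < 155.55:
        return "III", 10
    if q_max_day < 36.1:
        return "I", 2
    return "II", 7

# Hypothetical months used only to show the call.
print(deformation_stage(q_month=100.0, q_max_day=20.0, h_mean=150.5))
print(deformation_stage(q_month=50.0, q_max_day=10.0, h_mean=160.0))
```

Written this way, it is easy to see that stage III is reached only on the branch where the monthly cumulative rainfall is at least 73.9 mm, and only for the two water level bands given in criteria 9 and 10.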

The accuracy rates of the training and testing samples based on the decision tree C5.0 model are shown in Table 8. In the data mining process, 130 samples were selected for training, of which 106 (81.5%) were classified correctly, and 33 samples were selected for testing, of which 28 (84.9%) were classified correctly. The accuracy of both the training and the testing samples is higher than 80%, so the model can be used as a basis for judgment. The deformation and failure of a landslide are affected by many external factors. In addition to the reservoir water level and rainfall selected in this paper, the landslide is also affected by human engineering activities, which is the main reason for the training and prediction errors in this paper.

Table 8 Accuracy rates for the training and testing samples based on decision tree C5.0 model

5 Discussion

The Baishuihe landslide is a large-scale consequent slope, which is classified as a traction landslide in terms of its stress form. Its deformation and failure can be divided into the following stages: (1) the long-term effect of the reservoir water level softens the toe of the slope, and the sliding-resistant section gradually fails; (2) after the original sliding resistance at the toe is lost, the upper rock and soil mass loses its original support. At the same time, rainfall and the seepage pressure caused by the reservoir water level deform and weaken the soil mass in the fluctuation zone, and the unstable area moves upward; (3) when the deformation develops to a certain stage, the soil reaches its limit state, resulting in tension-shear cracks and stress redistribution; (4) when the cracks extend to the rear of the slope, the whole landslide is in an extremely unstable state and is likely to slide along the sliding surface under the action of external factors.

In general, the acceleration stage of landslide deformation is the main concern of researchers and engineers. According to the data mining results, monthly cumulative rainfall (\(q^{{{\text{month}}}}\)) plays an important role in controlling landslide deformation. A monthly cumulative rainfall of 73.9 mm can be regarded as the rainfall threshold (Table 7). When the rainfall in a given month does not reach this threshold, the Baishuihe landslide will not enter the deformation acceleration stage. The second factor controlling landslide deformation is the monthly average water level (\(\overline{h}\)). According to the monitoring data of the Baishuihe landslide, rainfall is generally concentrated from June to September every year, and the reservoir is drawn down from high water level to low water level before this period. In other words, there is a certain negative correlation between monthly rainfall and monthly average water level. This period is also when the landslide deformation is severe. Therefore, a monthly average water level of 155.55 m can be regarded as the threshold of \(\overline{h}\). When the monthly average water level has not dropped to this threshold, the Baishuihe landslide will not enter the deformation acceleration stage. Monthly maximum daily rainfall (\(q_{{{\text{max}}}}^{{{\text{day}}}}\)) usually affects stages I and II of landslide deformation, but has no direct control over the acceleration stage.

6 Conclusion

In this research, a data mining method combining two-step clustering, the Apriori algorithm, and the decision tree C5.0 was proposed. The Baishuihe Landslide in the Three Gorges Reservoir area was taken as the research object to analyze the relationship between the induced factors and landslide deformation. The following conclusions can be drawn:

  1. The fluctuation of reservoir water level and rainfall were the main factors affecting the deformation of the landslide, and 6 hydrologic factors were chosen for the data mining analysis, including three factors related to rainfall and three factors related to reservoir water level.

  2. A total of 173 association rules were generated by the data mining, and 20 rules were selected for analysis. The association rules showed that rainfall controls the deformation rate of the Baishuihe Landslide.

  3. Monthly cumulative rainfall played an important role in controlling landslide deformation, and 73.9 mm can be regarded as its threshold. Monthly average water level was the second factor controlling landslide deformation. The monthly maximum daily rainfall had no direct control over the acceleration stage of landslide deformation.

  4. The data mining method proposed in this paper achieved an accuracy above 80% on both the training and the testing samples of the Baishuihe landslide monitoring data. Therefore, it is of great significance for the data analysis and prediction of accumulative landslides in the Three Gorges Reservoir area.