Introduction

Taiwan is an island with one-third of its area located in mountainous zones. Because usable land is scarce, many housing units and farmhouses have been built into the hillsides. In addition, earthquakes and typhoons occur frequently because Taiwan lies on the Circum-Pacific Earthquake Belt and the Western-Pacific Typhoon Path. First, years of landslides and severe erosion on steep hills composed of relatively erodible geological materials have produced abundant colluvial accumulation, especially after the 921 Chi-Chi Earthquake. Second, the average annual rainfall exceeds 2,500 mm, with a significant contribution from typhoons. Heavy rainfall weakens the fragile geological materials and colluvium. Steep landforms combined with heavy rainfall readily produce debris flows. Consequently, the casualties, property loss, and structural damage caused by debris flows have increased dramatically in recent years.

In reality, the mechanism of debris flows is quite difficult to analyze. Efforts to develop adequate theoretical models for debris flow have been limited by a lack of understanding of how such flows occur under given site conditions such as (1) geomorphology, (2) geology, (3) hydraulic properties, and (4) soil conditions (Pierson 1994; Floris et al. 2004). However, because debris flows have led to many disasters, the literature is reviewed here in search of a possible resolution. Based on the previous literature, the analysis approaches for debris flow can be classified into three categories:

  1.

    Indicators of factors. Chang and Hsieh (1997) and Chang (1998) investigated debris flow events in several potential regions in Taiwan. Hsieh et al. (1995) used a critical precipitation line to predict the occurrence or non-occurrence of a debris flow. However, the occurrence of debris flow could not be predicted precisely from instantaneous rainfall intensity alone, i.e., rainfall intensity was not the only dominant factor causing debris flows. In fact, the main obstacle in analyzing debris flow is that it involves many environmental factors, which require a good strategy to extract their relative importance (Pachauri and Pant 1992; Donati and Turrini 2002; Carrara et al. 2003). Floris et al. (2004) reported the core triggering factors of debris flow in the Northern Apennines (Italy), where slopes consisting mainly of clayey and clayey-marly terrains have been affected by landslide-triggering phenomena. The development of a predictive analysis model can therefore help to create a warning system for landslide risk mitigation.

  2.

    Statistical factor analysis. Johnson and Rodine (1984) showed that slope stability is an important factor causing debris flow. Besides the two factors of rainfall and slope, soil weight and water level are also important (Wang 1994). More recently, Melelli and Taramelli (2004) started from an inventory of hillslope hollows, compiled from air photographs and fieldwork, to map debris-flow events down-slope from the initiation sites. Their analysis of the morphogenetic factors influencing slope instability processes was used to define a representative elementary area (REA) and to examine causal relations between the factors and debris-flow events. Several GIS-based methods for landslide susceptibility mapping have drawn considerable attention from scientists and engineers. The most widely used are the weights of evidence method (Van Westen et al. 2003), multivariate statistical analysis with logistic regression (Dai et al. 2001; Dai and Lee 2003; Wang and Sassa 2005), the frequency ratio method (Carrara et al. 1999; Lee and Sambath 2006), and discriminant analysis (Baeza and Corominas 2001; Santacana et al. 2003).

  3.

    Artificial intelligence analysis approaches. Lee and Chang (1995) presented a fuzzy model for the prediction of debris flow, but did not consider cumulative rainfall as an important factor. Moreover, after defuzzification this fuzzy model cannot provide clear range boundaries that indicate the occurrence of a debris flow. Chang and Lee (1997) analyzed the instantaneous rainfall intensity in debris flow areas using the group method of data handling (GMDH; developed by Ivakhnenko (1970)) with artificial neural networks (ANN). ANN models with physical terrain factors have been applied to the study of landslides, in particular to the indirect determination of triggering parameters and to landslide susceptibility mapping (Mayoraz et al. 1996; Fernàndez-Steeger et al. 2002; Ermini et al. 2005). ANN provides a quick and effective way to estimate the occurrence in a given zone; however, it is a black-box model, and the trained network (weights and thresholds) cannot be reused.

With the progress of spatial data survey techniques in geosciences, massive amounts of data can be easily collected and monitored. As a result, the analysis of the variables influencing debris flows and landslides becomes complicated. There are many possible techniques for data classification. However, the descriptive variables in a given watershed carry many uncertainties and may require preprocessing to enhance their accuracy. Furthermore, the characteristic features of these variables need to be sieved out. The common idea behind dimension reduction approaches is to reduce the dimensionality while, in the process, irreversibly transforming the descriptive features of the dataset. These methods address (1) hard dimension reduction problems, for which typical methods include principal component analysis (PCA) (Devijver and Kittler 1982) and rough set analysis (Nguyen and Skowron 1995; Chouchoulas and Shen 2001); (2) soft dimension reduction problems, for which the typical method is factor analysis (Friedman and Tukey 1974); and (3) visualization problems, for which typical methods include projection pursuit (Mardia et al. 1979) and multidimensional scaling (Torgerson 1952). These data reduction methods are widely applied in many fields, especially when the data contain (1) chaotic information, (2) redundant descriptive factors, and (3) incomplete measurements. Accordingly, a mining technique (dimension reduction) has to be developed to extract the factor(s) influencing debris flow occurrence. Once the factor analysis has been carried out successfully, the preliminary effort of collecting the influencing factors can be greatly reduced, and the resulting knowledge may become crucial in tackling debris flow hazards.

As part of this study, a well-known statistical method combining PCA and linear discriminant analysis (LDA) is used to study the data classification and dimension reduction problem involved in debris flow. PCA forms linear combinations of the variables (attributes) with the purpose of reducing the data dimensions. It is a dimensionality reduction tool in common use, perhaps because of its conceptual simplicity and the existence of relatively efficient algorithms for its computation. PCA aims to find a new set of dimensions (attributes) that better captures the variability of the data, i.e., the first dimension is chosen to capture as much of the variability as possible, and in this way the dimensions of the data are reduced (Tian et al. 2005; Mundt et al. 2005). However, the drawback of PCA is that all attributes influence the output decision. Therefore, data mining could be a possible solution to tackle the tedious computational work in reducing dimensionality.

Data mining (Lei et al. 2008; Wan et al. 2008, 2009) has become a new approach for analyzing landslides in the geosciences. This research uses discrete rough sets (DRS; Nguyen and Skowron 1995; Nguyen and Nguyen 1998a, b) to tackle the uncertainties arising from the materials and parameters involved in an observed landslide. The concept of DRS is derived from conventional rough sets. A conventional rough set can only handle data that are pre-classified into a certain number of groups. However, natural or environmental data are distributed either uniformly or normally. DRS overcomes this limitation by converting continuous data into an appropriate number of groups mathematically. In other words, the separation points of DRS break the real-world data into several levels and transform them into the Information Table. The method (1) extracts the core influencing factor(s) from numerous debris flow description factors, (2) searches for the segmentation points (thresholds) of the core influencing factor(s), and (3) establishes the knowledge description (interpretation rules) of debris flow occurrence. With these steps, the accuracy of debris flow prediction can be greatly improved.

The study is divided into four parts. The first part describes the study area and the geomorphology and land-cover factors used to build the database. The second part introduces the combined PCA + LDA method. The third part briefly introduces the DRS method. The fourth part presents the results of a parallel analysis of the landslide problem using (a) the PCA + LDA method and (b) the DRS method. The data analysis is carried out by the DRS method and rational results are obtained.

The geomorphology, land-cover factors and vegetation index for the debris flow problem

Varnes (1978) defined a debris avalanche as a rapid flow of predominantly coarse debris consisting of soil and/or weathered bedrock. Debris flow originates when poorly sorted debris (rock, soil, woody debris, etc.) is mobilized from hill slopes and channels by sufficient moisture in the soil. Cruden and Varnes (1996) proposed a classification in which landslide events are classified as rotational–translational movements with respect to earth slides and earth flows. They are primarily affected by their formations, often highly deformed, that widely outcrop in the mountain chain. Lin et al. (1993) characterized debris flow in gravelly deposits by stream slope, rainfall, rainfall intensity, geological condition, grain size distribution, void ratio, shear strength, vegetation condition, and channeled topography. Lin et al. (1998) also discussed the contributing factors of debris flow events for the application of spatial information techniques (Remote Sensing and GIS). In summary, past studies show that it is quite difficult to identify the inducing factors or core factors behind the mechanism of debris flow events.

Accordingly, the observed factors influencing debris flow should be discussed rationally. The most likely contributing factors are topography, geology, watershed geometry, and remote sensing data on vegetation condition (Lee and Choi 2004; Lin et al. 2007; Tian et al. 2005; Wan et al. 2008). These influencing factors can be organized into three major groups: (1) the geomorphology of the surrounding watershed, (2) the geomorphology of the surrounding stream, and (3) the land cover by vegetation. In this research, the distribution of potential debris-flow streams from the Water Conservation Bureau (WCB) database was organized by sub-watershed of the Chen-Yu-Lan stream (WCB Website 2008a). A GIS database describing debris flow in the study area was then generated using 18 factors: (1) watershed area, (2) watershed perimeter, (3) watershed of average elevation, (4) watershed of average slope, (5) watershed of primary length, (6) stream length, (7) geology index, (8) watershed width, (9) form factor, (10) stream density, (11) stream sinuosity, (12) average slope of stream, (13) total length of stream, (14) NDVI, (15) cover and management factor, (16) bare-soil land area, (17) bare-soil land evaluation rate, and (18) bare-soil land geology index. The index and symbol definitions are given in Table 1. Some of the definitions are straightforward, but others require further explanation, as follows:

Table 1 The environmental factors of study site
  1.

    Form factor. The form factor (also known as the shape factor) is the ratio of the minor axis to the major axis of the watershed area (Pareschia et al. 2002). Shrestha (2001) studied the restoration of vegetation for the conservation of degraded mountainous regions of Nepal and found that a slender watershed (form factor of 0.14) is useful in investigating watershed characteristics. That work showed that debris flow is highly correlated with the shape of the watershed area, so this study selected this attribute to evaluate the occurrence of debris flow.

  2.

    Normalized difference vegetation index (NDVI). To determine the density of vegetation on a patch of land, researchers observe the distinct colors (wavelengths) of visible and near-infrared sunlight reflected by the plants (Lin et al. 2006a). As can be seen through a prism, the spectrum of sunlight is made up of many different wavelengths. Nearly all satellite vegetation indices employ the following difference formula to quantify the density of plant growth on the earth: near-infrared radiation (NIR) minus red radiation (R), divided by near-infrared radiation plus red radiation (Bannari et al. 1995). The result of this formula is called the normalized difference vegetation index (NDVI). The NDVI values are obtained from the SPOT image and lie in the range [−1, 1].

  3.

    Cover and management factor. The cover and management factor (C value) is taken from the plant-cover condition of the universal soil loss equation (USLE) (WCB Website 2008b). The C value lies in the range 0–1. For bare land, the C value is one; for land with good vegetation cover, the C value approaches zero. The C value varies with the vegetation type, the season, and the percentage of covered land (Lin et al. 2002a, b, 2006b; Özhan et al. 2005). It is related to the NDVI by

    $$ C = {\frac{{1 - {\text{NDVI}}}}{2}}. $$
    (1)
  4.

    Bare land geology index (DE). The weight factors of the geology index were reported by CGS (CGS 2005). The bare land geology index is computed by the following equation:

    $$ {\text{DE}} = {\frac{{\sum {E_{i} \times A_{i} } }}{A}}, $$
    (2)

    where \( E_{i} \) is the value associated with the soil type, \( A_{i} \) is the observed sub-area of bare land, and \( A \) is the total watershed area. For instance, \( E = 10 \) for Schist or Slate type soil, \( E = 6 \) for old Tertiary sedimentary rock, and \( E = 3 \) for New Tertiary Pleistocene (Lin et al. 2006b).

  5.

    Bare-soil land evaluation rate. The bare-soil land evaluation rate is the ratio of the new landslide area to the water basin area. A high ratio indicates a fragile watershed. In this study, the areas of landslide change are identified from SPOT data of two different periods (1999, 2001). These data serve as material for detecting and monitoring land-cover change in areas of debris flow occurrence.

  6.

    Stream density. This is defined as the total length of all the streams and rivers in a watershed divided by the total watershed area. For a given stream density, the soil permeability and the underlying rock type affect the runoff in a watershed, which becomes a dominant factor in debris flow occurrence. Stream density can also affect the shape of a river’s hydrograph during a rain storm.

  7.

    Stream sinuosity. This is defined as the extent to which a river meanders within its valley, calculated by dividing the total stream length by the valley length. A stream with high sinuosity is more likely to flood. This value, which can be obtained from GIS and DEM data, is therefore helpful in understanding the occurrence of debris flow.
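
The ratio-type factors defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' GIS workflow: the function names and the input values are hypothetical, and only the formulas (the NDVI ratio, Eqs. 1 and 2, and the verbal definitions of form factor, stream density, and stream sinuosity) come from the text.

```python
def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R); undefined when NIR + R == 0."""
    return (nir - red) / (nir + red)


def cover_management_factor(ndvi_value):
    """Eq. 1: C = (1 - NDVI) / 2, so NDVI in [-1, 1] maps to C in [0, 1]."""
    return (1.0 - ndvi_value) / 2.0


def bare_land_geology_index(weights, sub_areas, watershed_area):
    """Eq. 2: DE = sum(E_i * A_i) / A, with all areas in the same unit."""
    return sum(e * a for e, a in zip(weights, sub_areas)) / watershed_area


def form_factor(minor_axis, major_axis):
    """Ratio of the minor axis to the major axis of the watershed."""
    return minor_axis / major_axis


def stream_density(total_stream_length, watershed_area):
    """Total stream length divided by the total watershed area."""
    return total_stream_length / watershed_area


def stream_sinuosity(stream_length, valley_length):
    """Total stream length divided by the valley length."""
    return stream_length / valley_length


# Hypothetical reflectances and areas, for illustration only.
v = ndvi(nir=0.5, red=0.1)
c = cover_management_factor(v)
# Weights E = 10, 6, 3 follow the examples quoted for Eq. 2.
de = bare_land_geology_index([10, 6, 3], [2.0, 1.0, 1.0], 100.0)
print(round(v, 3), round(c, 3), round(de, 3))
```

Densely vegetated pixels (NDVI close to 1) give a C value close to zero, while bare land (low NDVI) gives a C value close to one, which matches the behaviour described for the cover and management factor.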

Study area and material

The environmental features of debris flow events

The watershed of the Chen-Yu-Lan River, located in the central part of Taiwan, was selected as the study site, as shown in Fig. 1. The Chen-Yu-Lan River originates from the north peak of Yu Mountain at an elevation of 3,910 m. It is one of the upstream rivers of the Zhuoshui River system, which is the largest river system in Taiwan. The Chen-Yu-Lan River has a length of 42.4 km with an average slope of 5%, and its watershed area is about 45,000 ha. From 31 July through 1 August 1996, the heavy rainfall brought by Typhoon Herb induced 34 debris flows in the watershed of the Chen-Yu-Lan River (see Fig. 2a). As mentioned above, this area was also left very fragile by the strong ground motion of the Chi-Chi earthquake. Afterwards, a large precipitation of about 1,291 mm (peak discharge of 195 m³/s, with 73 mm in 1 h) fell on the Chen-Yu-Lan River watershed.

Fig. 1 The study area of Chen-Yu-Lan Stream

Fig. 2 The potential stream and geology distributions of Chen-Yu-Lan Stream. a Potential stream of debris flow. b Geological map. c River system. d Boundary line of sub-watershed

In this study, the research data consist of two formats: (1) vector and (2) raster data. The vector data include (a) the potential streams of debris flow, (b) geology, (c) the river system, and (d) the boundary lines of the sub-watersheds (see Fig. 2). The potential streams of debris flow (see Fig. 2a) are streams where debris disasters have already taken place and may occur again in the near future. These data come from WCB (2008a) and the Central Geological Survey (2005) and reflect the evolution of the environmental factors. The geological map (Fig. 2b), at a scale of 1/250,000, displays the geology distribution of the study region from CGS (2005). Geology and morphology conditions can affect the occurrence of landslides and debris flows. In this map, different colors represent the different geological conditions of the Chen-Yu-Lan River. For example, the symbol Q6 represents the Holocene epoch, with soil mainly constituted of gravel and sand. Other geological conditions are listed in Fig. 2b. Regions with complicated geology and fault crossovers provide good material for studying fractured geology and the debris flow problem. Figure 2c shows the river system and Fig. 2d illustrates the boundary lines of the sub-watersheds in the study area. The raster data consist of digital elevation model (DEM) data and remote sensing (SPOT4) data. To capture the geomorphological characteristics of the study area, DEM data were generated; these data will be used to construct knowledge rules for landslides. The DEM data (40 m × 40 m resolution) were extracted from aerial photos and processed with the HEC-GeoHMS module. The geomorphology factors of aspect, elevation, slope and river system maps are then extracted from the DEM data. The SPOT image of the Chen-Yu-Lan River covers the river from upstream to downstream, as shown in Fig. 3. These data give a good overview of the full range of the debris flow study samples. In Taiwan, the SPOT image resolution cell was 12.5 m × 12.5 m (Center for Space and Remote Sensing Research of National Central University in Taiwan, CSRSR 2008). It was decided to reduce the cell size to 12.5 m × 12.5 m to meet the standard grid size. Additionally, the extent of each debris flow area was delineated by manual interpretation (Arc-GIS file), and its boundary was generated at the same time. One advantage of using these areas is that they are easily identified in the SPOT satellite image data by means of their spectral characteristics and regular field geometry. In addition, the areas of landslide change are identified from two periods of SPOT data (1999, 2001). These images were collected after typhoons or heavy rains with cumulative precipitation of over 100 mm, in the aftermath of Typhoon Bilis (21/08/2000), Typhoon Toraji (28/07/2001) and Typhoon Nari (16/09/2001). When Typhoon Toraji struck the central part of Taiwan, it triggered extensive landslides. The typhoon delivered very heavy rain to the Chen-Yu-Lan River (1,217 mm over 3 days). The area was already very disturbed following the strong ground motion of the Chi-Chi earthquake, and this led to a major calamity. These data thus serve as material for detecting and monitoring land-cover change in areas of debris flow occurrence.

Fig. 3 The original image and classification result of SPOT satellite data. a Original image and classification result on 31/10/1999. b Original image and classification result on 05/03/2001

Land-cover data extraction from the SPOT image

In this work, a two-stage study is designed to interpret the landslide pattern of satellite images. In the first stage, the traditional supervised classification method of Maximum Likelihood Classification (MLC) is used to obtain four major land-cover categories: (1) water, (2) forest, (3) landslide and (4) bare-soil area (mostly around rivers). The detailed classification results are shown in Fig. 3. These measurements enable the monitoring of land-cover change in areas of debris flow occurrence. To obtain a complete land change distribution in the study site, a series of aerial photos is used to verify the classification of the SPOT images. These ground truth maps are raw data from the Aerial Survey Office, Provincial Department of Agriculture and Forestry of Taiwan (ASO 2008). Table 2 presents the accuracy of the image classification results (the so-called error matrix) for the four categories of satellite images. A total of 400 samples were randomly selected through ERDAS, which is sufficient to represent the classification accuracy of each category. The overall accuracy is 92.2% (1999) and 91.75% (2001). These results are satisfactory for generating the landslide data of the spatial information database. To observe the variation of vegetation on the land surface, the NDVI and the cover and management factor are required. Lin et al. (2002a, b, 2006b) carried out similar research on debris flow occurrence indices. However, such studies may involve redundant and surplus variables (Hsieh et al. 1995; Lin et al. 1998). Hence, techniques for feature extraction and feature selection of the core factors of debris flow events should be applied and discussed. The authors have experience in using feature extraction for paddy rice image classification (Lei et al. 2008) and feature selection for debris flow classification analysis (Wan et al. 2008) and landslide susceptibility mapping (Wan et al. 2009). Applying this concept, feature extraction and feature selection are proposed in the following section.

Table 2 Classification results of error matrix evaluation
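
For readers unfamiliar with error matrices, the sketch below shows how an overall accuracy is obtained from one. The 4 × 4 matrix is hypothetical (it is not the content of Table 2), and the convention that rows hold the classified categories and columns the reference categories is an assumption made for this illustration.

```python
def accuracy_report(matrix):
    """Overall, user's, and producer's accuracy from a square error matrix.

    Assumed layout: rows = classified category, columns = reference category.
    """
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(len(matrix))]
    overall = sum(diagonal) / total
    # User's accuracy: correct samples divided by the row (classified) total.
    users = [d / sum(row) for d, row in zip(diagonal, matrix)]
    # Producer's accuracy: correct samples divided by the column (reference) total.
    producers = [d / sum(col) for d, col in zip(diagonal, zip(*matrix))]
    return overall, users, producers


# Hypothetical counts for water, forest, landslide, bare soil (400 samples).
error_matrix = [
    [96, 2, 1, 1],
    [3, 93, 2, 2],
    [1, 4, 90, 5],
    [2, 3, 6, 89],
]
overall, users, producers = accuracy_report(error_matrix)
print(f"overall accuracy = {overall:.4f}")
```

The overall accuracy is simply the sum of the diagonal divided by the number of samples; the per-category figures show which classes are most often confused.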

Basic principle of classification method

Feature extraction process (PCA + LDA)

PCA

PCA is a well-known multivariate analysis technique for reducing data dimensions. It allows a multivariate data set to be described by a smaller number of variables (Fukunaga 1990; Mundt et al. 2005; Tian et al. 2005). Mathematically, PCA decomposes the covariance matrix of the data matrix into two parts: eigenvalues and column eigenvectors. The reduction is achieved by taking \( p \) variables \( X_{1}, X_{2}, \ldots, X_{p} \) and combining them linearly to produce principal components (PCs) \( \mathrm{PC}_{1}, \mathrm{PC}_{2}, \ldots, \mathrm{PC}_{p} \) that are uncorrelated. The directions of these PCs are given by the eigenvectors. The lack of correlation is a useful property, as it means that the PCs measure different “dimensions” of the data. Furthermore, the PCs are ordered so that \( \mathrm{PC}_{1} \) exhibits the largest amount of variation, \( \mathrm{PC}_{2} \) the second largest, \( \mathrm{PC}_{3} \) the third largest, and so on. When using PCA, it is hoped that the eigenvalues of most of the PCs will be so low as to be negligible. In that case, the variation in the original \( X \) variables can be described adequately by a smaller number of PCs (Fukunaga 1990).
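
The eigen-decomposition described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `pca` and the synthetic 36 × 3 data set are hypothetical and are not the debris flow data of this study.

```python
import numpy as np


def pca(X, n_components):
    """PCA by eigen-decomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder so PC1 has the largest variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components] # project onto the leading PCs
    explained = eigvals / eigvals.sum()     # share of total variance per PC
    return scores, eigvals, eigvecs, explained


# Synthetic data: two strongly correlated variables plus one independent variable.
rng = np.random.default_rng(0)
base = rng.normal(size=(36, 1))
X = np.hstack([base,
               2.0 * base + 0.1 * rng.normal(size=(36, 1)),
               rng.normal(size=(36, 1))])
scores, eigvals, eigvecs, explained = pca(X, n_components=2)
print(explained)
```

Because two of the three synthetic variables are nearly collinear, the last eigenvalue is close to zero, which is the situation in which dropping the trailing PCs loses little information.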

LDA

LDA is a classical statistical approach for classifying samples of unknown classes, based on training samples with known classes. LDA and the related Fisher’s linear discriminant (Fukunaga 1990) are used in statistics and machine learning to find the linear combination of features that best separates two or more classes of objects or events. Discriminant analysis differs from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made. The detailed procedure can be found in Fukunaga (1990).
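
For the two-class case relevant here (occurrence versus non-occurrence), Fisher's linear discriminant can be sketched with NumPy as below. The function names, the midpoint threshold, and the eight sample points are assumptions made for illustration; the within-class scatter matrix is assumed to be non-singular.

```python
import numpy as np


def fisher_lda_fit(X, y):
    """Two-class Fisher discriminant: w = Sw^{-1} (m1 - m0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix (sum of the two class scatters).
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    # Assumed decision threshold: projection of the midpoint of the class means.
    threshold = w @ (m0 + m1) / 2.0
    return w, threshold


def fisher_lda_predict(X, w, threshold):
    """Assign class 1 when the projection exceeds the threshold."""
    return (np.asarray(X, dtype=float) @ w > threshold).astype(int)


# Hypothetical, well-separated samples (d = 0: no debris flow, d = 1: debris flow).
X = [[1, 2], [2, 1], [2, 3], [3, 2], [6, 7], [7, 6], [7, 8], [8, 7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, threshold = fisher_lda_fit(X, y)
pred = fisher_lda_predict(X, w, threshold)
print(w, threshold, pred)
```

The direction `w` maximizes the separation of the projected class means relative to the within-class scatter, which is the sense in which the linear combination "best separates" the classes.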

PCA + LDA

PCA aims to project the data in the directions of maximal variance, whereas LDA is supervised and chooses the projection axes that best separate the classes. Among the extensions of these methods, PCA + LDA, a two-stage method, has received relatively more attention in decision science. For instance, it has been used successfully in face recognition to extract the features of face patterns (Sahoolizadeh et al. 2008). Moreover, the main trend in feature extraction with PCA + LDA has been to represent the data in a lower dimensional space computed through a linear transformation satisfying certain properties (Yang and Yang 2003; Sahoolizadeh et al. 2008). Few studies have used this method to solve environmental problems in the geosciences. In this study, debris flow (Baeza and Corominas 2001; Santacana et al. 2003) on the Chen-Yu-Lan River is evaluated through the PCA + LDA method, following a concept similar to face recognition and analysis. Both methods have advantages and disadvantages for debris flow recognition. This study compares the performance of two methods for recognizing debris flow factors: (1) the PCA + LDA method and (2) DRS. The steps for the construction of the DRS theory are introduced in the following section.
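
As a compact restatement of the two-stage idea, and not a formula taken from the cited works, the combined projection can be written as below. The symbols are notation assumed for this sketch: \( \mu \) is the sample mean, \( W_{\text{PCA}} \) holds the leading eigenvectors of the covariance matrix as columns, \( W_{\text{LDA}} \) holds the discriminant directions fitted on the PCA-reduced data \( z \), and \( S_{b} \) and \( S_{w} \) are the between-class and within-class scatter matrices of \( z \).

$$ z = W_{\text{PCA}}^{T} (x - \mu ), \qquad y = W_{\text{LDA}}^{T} z = \left( W_{\text{PCA}} W_{\text{LDA}} \right)^{T} (x - \mu ), \qquad W_{\text{LDA}} = \arg \max_{W} \frac{\left| W^{T} S_{b} W \right|}{\left| W^{T} S_{w} W \right|}. $$

The first stage removes low-variance directions so that \( S_{w} \) is better conditioned, and the second stage then seeks the directions that separate the classes in the reduced space.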

Feature selection process (DRS)

This section introduces the procedure of DRS. As noted earlier, the conventional rough set can only handle data that are pre-classified into a certain number of groups, whereas actual environmental data are continuous rather than pre-classified. Hence, DRS is employed as an appropriate tool to evaluate them.

Quantization problems (Nguyen and Skowron 1995)

If \( \mathcal{A} = (U, A \cup \{d\}) \) is a decision table in which some attribute \( a \in A \) takes a large number of values on the objects from \( U \), then there is a very small probability that a new object will be recognized by matching its attribute value vector with the rows of this table. Hence, for decision tables with real-valued attributes, discretization strategies are built to achieve a higher quality of classification.

Let \( \mathcal{A} = (U, A \cup \{d\}) \) be a decision table where \( U = \{x_{1}, x_{2}, \ldots, x_{n}\} \). It is assumed that \( V_{a} = [l_{a}, r_{a}) \subset \Re \) for any \( a \in A \), where \( \Re \) is the set of real numbers, and that \( \mathcal{A} \) is a consistent decision table. Let \( P_{a} \) be a partition of \( V_{a} \) (for \( a \in A \)) into subintervals, i.e.

$$ P_{a} = \left\{ {\left[ {C_{0}^{a} ,C_{1}^{a} } \right),\left[ {C_{1}^{a} ,C_{2}^{a} } \right), \ldots ,\left[ {C_{k}^{a} ,C_{k + 1}^{a} } \right)} \right\} $$
(3)

for some integer k, where \( l_{a} = C_{0}^{a} < C_{1}^{a} < C_{2}^{a} < \cdots < C_{k}^{a} < C_{k + 1}^{a} = r_{a} \) and,

$$ V_{a} = \left[ {C_{0}^{a} , C_{1}^{a} } \right) \cup \left[ {C_{1}^{a} , C_{2}^{a} } \right) \cup \cdots \cup \left[ {C_{k}^{a} , C_{k + 1}^{a} } \right). $$
(4)

Any \( P_{a} \) is uniquely defined by the set \( C^{a} = \{C_{0}^{a}, C_{1}^{a}, C_{2}^{a}, \ldots, C_{k}^{a}, C_{k+1}^{a}\} \), called the set of cuts on \( V_{a} \) [the set of cuts is empty if \( \mathrm{card}(P_{a}) = 1 \)]. In the sequel, \( P_{a} \) is identified with the set of cuts on \( V_{a} \) defined by \( C^{a} \). Any family \( P = \{P_{a} : a \in A\} \), where \( P_{a} \) is a partition of \( V_{a} \), is called a partition on \( \mathcal{A} \). Any such family can be represented by \( P = \bigcup_{a \in A} \{a\} \times C^{a} \). Any pair \( (a, c) \in P \) is called a cut on \( V_{a} \).

Any family \( P = \{P_{a} : a \in A\} \) of partitions on \( \mathcal{A} \) defines from \( \mathcal{A} = (U, A \cup \{d\}) \) a new decision table \( \mathcal{A}^{P} = (U, A^{P} \cup \{d\}) \), where \( A^{P} = \{a^{P} : a \in A\} \) and \( a^{P} (x) = i \Leftrightarrow a(x) \in [C_{i}^{a} ,C_{i + 1}^{a} ) \) for any \( x \in U \) and \( i \in \{0, \ldots, k\} \). The table \( \mathcal{A}^{P} \) is called the P-quantization of \( \mathcal{A} \).
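
The P-quantization can be illustrated with a short Python sketch. It assumes that each attribute is given by its sorted interior cuts \( C_{1}^{a} < \cdots < C_{k}^{a} \); the attribute names, cut values, and rows are hypothetical and serve only to show the mapping \( a(x) \mapsto i \).

```python
from bisect import bisect_right


def quantize(value, cuts):
    """Return i such that value lies in [C_i, C_{i+1}).

    `cuts` holds the sorted interior cuts C_1 < ... < C_k; the outer
    bounds C_0 = l_a and C_{k+1} = r_a are implicit.
    """
    return bisect_right(cuts, value)


def quantize_table(rows, cuts_by_attr):
    """Replace every real-valued attribute by the index of its interval."""
    return [
        {attr: quantize(value, cuts_by_attr.get(attr, []))
         for attr, value in row.items()}
        for row in rows
    ]


# Hypothetical attributes and cuts, for illustration only.
rows = [
    {"slope": 12.0, "ndvi": 0.10},
    {"slope": 25.0, "ndvi": 0.45},
    {"slope": 31.5, "ndvi": 0.62},
]
cuts_by_attr = {"slope": [15.0, 30.0], "ndvi": [0.3]}
quantized = quantize_table(rows, cuts_by_attr)
print(quantized)
```

A value that coincides with a cut falls into the interval that starts at that cut, in line with the half-open intervals \( [C_{i}^{a}, C_{i+1}^{a}) \); an attribute without cuts is mapped to the single level 0.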

Two families of partitions \( P^{\prime} \) and \( P \) on \( \mathcal{A} \) are equivalent, written \( P^{\prime} \equiv_{\mathcal{A}} P \), if and only if \( \mathcal{A}^{P} = \mathcal{A}^{P^{\prime}} \). The equivalence relation \( \equiv_{\mathcal{A}} \) has a finite number of equivalence classes. In the sequel, equivalent families of partitions are not distinguished.

The quantization problems of real value attributes of A can be described as decision problem:

Complexity of quantization problems (Nguyen and Skowron 1995)

Let \( \mathcal{A} = (U, A \cup \{d\}) \) be a decision table where \( U = \{x_{1}, x_{2}, \ldots, x_{n}\} \). An arbitrary attribute \( a \in A \) defines a sequence \( v_{1}^{a} < v_{2}^{a} < \cdots < v_{n_{a}}^{a} \), where \( \{v_{1}^{a}, v_{2}^{a}, \ldots, v_{n_{a}}^{a}\} = \{a(x) : x \in U\} \) and \( n_{a} \le n \).

Let \( P_{k}^{a} \) be a propositional variable corresponding to the interval \( [v_{k}^{a}, v_{k+1}^{a}) \) for any \( k \in \{1, \ldots, n_{a} - 1\} \) and \( a \in A \). The set of all propositional variables of this form is denoted by \( BV(\mathcal{A}) \).

Any partition \( P \subseteq \bigcup_{a \in A} \{a\} \times V_{a} \) defines a valuation \( \mathrm{val}_{P} \) of the propositional variables \( P_{k}^{a} \) by \( \mathrm{val}_{P}(P_{k}^{a}) = \text{true} \) iff there exists a cut \( (a, c^{a}) \in P \) satisfying \( v_{k}^{a} \le c^{a} < v_{k+1}^{a} \). Instead of \( \mathrm{val}_{P}(P_{k}^{a}) = \text{true} \), one can also write \( P \models P_{k}^{a} \).

Let \( \varphi (a, i, j) \) denote the disjunction of all Boolean variables from the set

$$ \left\{ {P_{k}^{a} :\left[ {v_{k}^{a} ,v_{k + 1}^{a} } \right) \subseteq \left[ {\min (a(x_{i} ),a(x_{j} )),\max (a(x_{i} ),a(x_{j} ))} \right)} \right\}. $$
(5)

Hence \( \mathrm{val}_{P}(\varphi (a, i, j)) = \text{true} \) iff there is a cut in \( P \) on \( V_{a} \) between \( a(x_{i}) \) and \( a(x_{j}) \).

Let \( \psi (i, j) \) denote the disjunction of all \( \varphi (a, i, j) \), where \( a \in A \) and \( a(x_{i}) \ne a(x_{j}) \). The formula \( \psi (i, j) \) is called the discernibility formula for the objects \( x_{i}, x_{j} \) (the disjunction of the empty set of variables is assumed to be equivalent to true).

The discernibility Boolean propositional formula of \( \mathcal{A} \) is defined by:

$$ \Phi^{\mathcal{A}} = \bigwedge \left\{ {\psi (i,j):d(x_{i} ) \ne d(x_{j} )} \right\} $$
(6)

Any non-empty set \( S = \{P_{k_{1}}^{a_{1}}, P_{k_{2}}^{a_{2}}, \ldots, P_{k_{r}}^{a_{r}}\} \) of Boolean propositional variables from \( BV(\mathcal{A}) \) defines a family of partitions \( P(S) \) as follows:

$$ P(S) = \left\{ {\left( {a_{1} ,{\frac{{v_{{k_{1} }}^{{a_{1} }} + v_{{k_{1} + 1}}^{{a_{1} }} }}{2}}} \right),\left( {a_{2} ,{\frac{{v_{{k_{2} }}^{{a_{2} }} + v_{{k_{2} + 1}}^{{a_{2} }} }}{2}}} \right), \ldots ,\left( {a_{r} ,{\frac{{v_{{k_{r} }}^{{a_{r} }} + v_{{k_{r} + 1}}^{{a_{r} }} }}{2}}} \right)} \right\}. $$
(7)
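
The construction of Eqs. 5 to 7 can be sketched in Python as follows. Candidate cuts are the midpoints of consecutive attribute values (Eq. 7), and a cut discerns two objects when it lies between their values (Eq. 5). The greedy selection of cuts is an assumed heuristic for satisfying the discernibility formula of Eq. 6 with few cuts; it is not claimed to be the procedure used by the authors, and the small table is hypothetical.

```python
from itertools import combinations


def candidate_cuts(table, attr):
    """Midpoints between consecutive distinct values of `attr` (as in Eq. 7)."""
    values = sorted({row[attr] for row in table})
    return [(attr, (lo + hi) / 2.0) for lo, hi in zip(values, values[1:])]


def separates(cut, row_i, row_j):
    """True if the cut lies between the two attribute values (as in Eq. 5)."""
    attr, c = cut
    lo, hi = sorted((row_i[attr], row_j[attr]))
    return lo <= c < hi


def greedy_cuts(table, decisions):
    """Greedy heuristic: repeatedly add the cut that discerns the most
    remaining pairs of objects with different decisions."""
    pairs = {(i, j) for i, j in combinations(range(len(table)), 2)
             if decisions[i] != decisions[j]}
    cuts = [c for attr in table[0] for c in candidate_cuts(table, attr)]
    chosen = []
    while pairs and cuts:
        def gain(cut):
            return sum(separates(cut, table[i], table[j]) for i, j in pairs)
        best = max(cuts, key=gain)
        covered = {(i, j) for i, j in pairs
                   if separates(best, table[i], table[j])}
        if not covered:  # remaining pairs are identical on every attribute
            break
        chosen.append(best)
        pairs -= covered
    return chosen


# Hypothetical decision table (d = 1: debris flow occurred).
table = [
    {"slope": 10.0, "ndvi": 0.6},
    {"slope": 20.0, "ndvi": 0.5},
    {"slope": 30.0, "ndvi": 0.2},
    {"slope": 40.0, "ndvi": 0.1},
]
decisions = [0, 0, 1, 1]
print(candidate_cuts(table, "slope"))
print(greedy_cuts(table, decisions))
```

In this small table a single cut on slope, halfway between 20 and 30, separates every object with d = 0 from every object with d = 1, so one threshold is enough to keep the quantized table consistent.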

To make the theory more easily understandable, an example based upon the above is demonstrated.

Stages of DRS

There are four stages in DRS analysis. In the first stage, the development of the “Information Table” is required for describing the characteristic attributes. In this table, the relation in the multi-attribute set is displayed. In the information table, each row represents a new case (or object). Each of the columns represents the respective variables (or condition attributes). In this study, the variables can be the site condition such as the geomorphology, land-cover, and river density. The outcome (also called the concept or decision attribute) of each object is either 1 or 0, indicating whether the particular case of debris-flow has occurred.

In the second stage, all the attributes must be clustered into appropriate classes to construct a “Decision Attribute”. The crucial aspect is to find the appropriate classes: once the separation points are found, the appropriate classes of each attribute are determined. Please refer to Nguyen and Skowron (1995) or Eq. 7 for the detailed process of how the separation points are calculated. The rough set provides a possible way of discretizing chaotic information. In this study, a new concept is proposed to deal with the uncertainty of classification in the debris flow problem.

The third stage obtains the cores and reducts of the data attributes, the two fundamental concepts of attribute reduction. Reducts are the minimal subsets of attributes that discern all the equivalence classes discernible by the entire set of attributes. The core is the common part of all reducts.
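The definitions of reduct and core can be illustrated with a brute-force search over attribute subsets. This is only a sketch for very small tables, not the procedure implemented in RSES; the function names, the attribute labels and the four-row Boolean-coded table are hypothetical.

```python
from itertools import combinations


def discerns(table, decisions, attrs, all_attrs):
    """True if `attrs` separates every pair that the full attribute set separates."""
    for (i, row_i), (j, row_j) in combinations(enumerate(table), 2):
        if decisions[i] == decisions[j]:
            continue  # only pairs with different decisions must be discerned
        separable = any(row_i[a] != row_j[a] for a in all_attrs)
        if separable and all(row_i[a] == row_j[a] for a in attrs):
            return False
    return True


def reducts_and_core(table, decisions):
    """Return all minimal discerning attribute subsets and their intersection."""
    all_attrs = sorted(table[0])
    reducts = []
    for size in range(1, len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            # A superset of a known reduct cannot be minimal.
            if any(set(r) <= set(subset) for r in reducts):
                continue
            if discerns(table, decisions, subset, all_attrs):
                reducts.append(subset)
    core = set.intersection(*(set(r) for r in reducts)) if reducts else set()
    return reducts, core


# Hypothetical Boolean-coded objects (1 or 2) with decisions d (1 = occurrence).
table = [
    {"F": 1, "SD": 1, "NDVI": 1},
    {"F": 1, "SD": 2, "NDVI": 2},
    {"F": 2, "SD": 1, "NDVI": 1},
    {"F": 2, "SD": 2, "NDVI": 2},
]
decisions = [1, 1, 0, 1]
print(reducts_and_core(table, decisions))
```

In this toy table two reducts exist and they share one attribute, which is therefore the core. The search is exponential in the number of attributes, so it is suitable only for illustration.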

The last stage, which is the most important application of rough sets, generates decision rules from a given Information Table in order to predict the classes of new objects that are beyond visual observation. Using a reduced Information Table, the rules are found by determining the decision attribute values from the condition attribute values. The rules are therefore presented in an “IF condition(s) THEN decision(s)” format: if the condition(s) in the IF part match the given fact(s), the decision(s) in the THEN part are applied.

Results

The results of this study are divided into five parts: (1) strategy for selecting effective samples, (2) application of multi-variable analysis, (3) results of PCA + LDA, (4) results of DRS, and (5) comparison of results for (3) and (4).

Strategy for selecting effective samples

According to the maps published by the Council of Agriculture, 1,420 creeks in Taiwan are classified as hazardous debris-flow creeks (Lin et al. 2002a, b, 2003). This research found 73 potential streams (146 recorded data from 2000 and 2001) in the study region. The maps are generated by the government of Taiwan and are generally recognized as a relatively accurate source of resource data (WCB Website 2008a); they provide the raw data (study material) collected for the spatial information database. In this study, 18 typical streams [a total of 36 samples including 9 occurrences and 9 non-occurrences of debris flow from the WCB Website (2008a)] were judged to be debris-flow hazards based on the evaluation of the aforementioned factors (see Table 1) reported by Lin et al. (2002a, b, 2003), Jan and Chen (2005), CGS (2005) and WCB Website (2008a). The most vulnerable catchments were selected as the training dataset, covering watershed areas of different sizes as suggested by Lin et al. (2002a, b, 2003) and CGS (2005). The attributes represent the in situ conditions and the decisions represent the occurrence (d = 1) or non-occurrence (d = 0) of debris flow (refer to Table 3). These selected data are regarded as the most representative training data. The rest of the data (55 streams, giving 110 testing data), taken from Lin et al. (2002a, b) and evaluated as the most fragile debris flow area, are used for verification. During the heavy rainfall of Typhoon Toraji (28/07/2001), the geological materials and colluvium were easily weakened, which often leads to debris flow. The data concerning the Chen-Yu-Lan River were collected in November 2001 through the spatial database, and through site investigation of a typhoon that occurred on 22 May 2002.

Table 3 The 36 testing samples of debris flow decision table from study site

Application of multi-variable analysis

Initially, multi-variable analysis is applied to the debris flow training data. PCA is used for data reduction and feature extraction of the training data. Table 4 shows the total variance outcomes of the dataset. After the fourth component, the trend of the cumulative variance becomes smoother; that is, the components from the fifth onward do not significantly influence the decision. The first four components govern 84.87% of the variation in outcomes, so the first four factors (PCA1–PCA4) were selected to represent the major factors of the study site. In the past literature, Pratsinis et al. (1988) reported that when the eigenvector is greater than 0.7, the related factors can be considered prominent factors of the datasets for the debris flow problem. Applying this concept, Table 5 shows the influencing factors and their contribution to the occurrence of debris flow. The dimension of the environmental factors is thereby reduced from 18 to 14: (1) stream sinuosity, (2) average slope of stream, (3) bare-soil land evaluation rate, and (4) bare-soil land geology index are removed from the dataset by the data reduction process.
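The idea of retaining the leading components that carry most of the variance can be sketched as follows. This is a simplified stand-in that selects components by a cumulative-variance target, whereas the study judged the flattening of the trend in Table 4; the function name, the target and the eigenvalues are hypothetical and are not the values reported in Table 4.

```python
def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative variance share reaches `target`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total
        if cumulative >= target:
            return count, cumulative
    return len(eigenvalues), cumulative


# Hypothetical eigenvalues for illustration only; NOT the values in Table 4.
eigenvalues = [8.0, 4.0, 2.0, 1.0, 0.5, 0.5]
count, share = components_needed(eigenvalues, target=0.80)
print(count, share)
```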

Table 4 Total variance outcomes
Table 5 Component matrix

Results of PCA + LDA (feature extraction)

The four prominent components (PCA1–PCA4) and LDA are used to generate the discrimination function, which provides information on the 14 major factors influencing debris flow occurrence (refer to Fig. 4a). Note that the cover and management factor is not used in the discrimination function: because it is derived from the NDVI, the NDVI is used as a substitute for it (see Eq. 1). Equation 8 shows the result of PCA + LDA:

$$ Z = 630.2x_{1} - 453.2x_{2} + 480.7x_{3} - 337.6x_{4} + 34.1x_{5} + 140.6x_{6} + 106.9x_{7} - 4.4x_{8} - 28.6x_{9} + 104.8x_{10} + 9.1x_{11} + 34.6x_{12} - 25.6x_{13} - 64.8, $$
(8)

where Z is the decision function of debris flow occurrence or non-occurrence, x_1 is the primary length of the watershed, x_2 the stream length, x_3 the total stream length, x_4 the watershed perimeter, x_5 the bare-soil land area, x_6 the watershed area, x_7 the form factor, x_8 the geology index, x_9 the average elevation of the watershed, x_10 the stream density, x_11 the watershed width, x_12 the average slope of the watershed, and x_13 the NDVI.

Fig. 4 Dimensional reduction approaches used to describe the debris flow factors. a PCA + LDA, b DRS

Using Eq. 8, the dataset outcomes can be divided into two categories (occurrence and non-occurrence), and the accuracy is 100% for the 36 training samples. There are 18 occurrence samples and 18 non-occurrence samples for the study site in 2000 and 2001, respectively. The discrimination rule based on Z is:

$$ {\text{IF}}\;Z \ge 0\;{\text{THEN}}\;{\text{the sample is classified as a debris flow occurrence}} $$
$$ {\text{IF}}\;Z < 0\;{\text{THEN}}\;{\text{the sample is classified as a debris flow non-occurrence}} $$
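Equation 8 and the sign rule above can be evaluated directly. The following is a minimal sketch: the coefficients are copied from Eq. 8, while the function names and the assumption that the 13 inputs are supplied in the order x_1 to x_13, on the same scale used to fit the function, are illustrative.

```python
# Coefficients of x_1..x_13 and the constant term, copied from Eq. 8.
COEFFICIENTS = [630.2, -453.2, 480.7, -337.6, 34.1, 140.6, 106.9,
                -4.4, -28.6, 104.8, 9.1, 34.6, -25.6]
INTERCEPT = -64.8


def discriminant_score(x):
    """Evaluate Z for a sequence of 13 factor values ordered x_1..x_13."""
    if len(x) != len(COEFFICIENTS):
        raise ValueError("expected 13 factor values")
    return sum(c * v for c, v in zip(COEFFICIENTS, x)) + INTERCEPT


def classify(x):
    """Apply the sign rule: Z >= 0 means occurrence, otherwise non-occurrence."""
    return "occurrence" if discriminant_score(x) >= 0 else "non-occurrence"


# Hypothetical input: all factors zero, so Z reduces to the constant term.
print(classify([0.0] * 13))
```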

The discrimination function is also applied to the testing dataset (110 testing samples) to obtain the accuracy (see Table 6). There are 24 occurrences and 86 non-occurrences. Of the occurrence samples, 13 are correctly classified by PCA + LDA, an accuracy rate of 54.2%; of the non-occurrence samples, 42 are correctly classified, an accuracy rate of 48.8%. LDA is one of the well-known linear projection techniques for feature extraction in classification problems; its basic concept is a generalized eigenvalue decomposition. The major drawback of LDA is that its performance is often degraded by the “Small Sample Size” (SSS) problem (Lu et al. 2003). One popular solution to the SSS problem is to combine LDA with PCA. However, this procedure cannot effectively eliminate the uncertain information in the debris flow dataset (Information Table). Rough set analysis, on the other hand, can provide an effective feature selection procedure that keeps useful information in the data analysis process. Thus, as a part of this study, the DRS method is used to seek a better outcome, and the performance of the two methods is compared.

Table 6 The 110 testing samples classification result by PCA + LDA method

Results of DRS (feature selection)

This study also used DRS as a parallel study for comparison. The first step is to create the Information Table and then generate the discernibility matrix through the Boolean operation. The second step is to calculate the core factors, i.e., the factors that most influence the decisions. The third step is to calculate the separate points with respect to the core factors (refer to Fig. 4b). The results of DRS can be stated as:

  1. 1.

    The core factors are form factor (F) and stream density (SD).

  2. 2.

    The cutting points for form factor and stream density are 0.04335 (the real value from the dataset is 0.3244) and 0.54645 (the real value from the dataset is 0.0018), respectively. In general, the alarm values of form factor differ among study areas around the world because debris flow occurrences are governed by environmental conditions such as geological factors, hydraulic situations and vegetation, all of which affect the form factor of a watershed. Specifically, form factors have been subjectively or statistically observed in the ranges of 0.1–0.6 (Jan and Chen 2005) and 0.2–0.5 (Chen et al. 2004). The threshold obtained here is more rational than those of some previous studies (Wu 1999; Chen et al. 2004; Jan and Chen 2005; Chen and Jan 2008) in the central part of Taiwan. However, such ranges need to be analyzed in a systematic manner, and DRS is of great help in doing so. Stream density is the total length of all the streams or rivers in a watershed divided by the total area of the watershed. Stream density can affect the erosion of soil during a rainstorm. Physically, high stream density correlates with poorly permeable soil because the water runoff in such an area is quite large.

  3. 3.

    The Information Table is transferred to the Boolean Table (see Table 3; the data must be preprocessed by normalization and then entered into the RSES program). If an attribute value is greater than the cutting point, the Boolean value is assigned as 2; otherwise, it is assigned as 1. For instance, if the form factor is greater than 0.04335, the Boolean value is 2; otherwise, it is 1 (refer to columns F and SD in Table 7). The rules are created as:

    Table 7 Debris flow decision rule recreated by DRS method
    $$ \left\{ \begin{gathered} IF\quad F < 0.04335\quad {\text{and}}\quad {\text{SD}} < 0.54645\quad {\text{or}} \hfill \\ \,\,\,\,\,\quad F < 0.04335\quad {\text{and}}\quad {\text{SD}} > 0.54645\quad {\text{or}} \hfill \\ \,\,\,\,\,\quad F < 0.04335\quad {\text{and}}\quad {\text{SD}} > 0.54645\quad {\text{then}}\;{\text{occurrence}} \hfill \\ \hfill \\ IF\quad F > 0.04335\quad {\text{and}}\quad {\text{SD}} < 0.54645\quad {\text{then}}\;{\text{non{-}occurrence}}\, \hfill \\ \end{gathered} \right. $$
    (9)
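The Boolean coding and the rules of Eq. 9 can be written as a short sketch. It follows Eq. 9 exactly as printed: because the third clause repeats the second, the combination F > 0.04335 and SD > 0.54645 is not covered by any printed clause, and the sketch returns `None` for it rather than guessing. The function names are hypothetical, and the inputs are assumed to be on the normalized scale of the cutting points.

```python
F_CUT = 0.04335    # cutting point for form factor (normalized scale)
SD_CUT = 0.54645   # cutting point for stream density (normalized scale)


def boolean_code(value, cut):
    """Code an attribute as 2 if it exceeds its cutting point, otherwise 1."""
    return 2 if value > cut else 1


def apply_rules(form_factor, stream_density):
    """Apply Eq. 9 as printed; return None where no printed clause applies."""
    f = boolean_code(form_factor, F_CUT)
    sd = boolean_code(stream_density, SD_CUT)
    if f == 1:
        # Printed clauses: F below the cut, with SD either below or above its cut.
        return "occurrence"
    if f == 2 and sd == 1:
        return "non-occurrence"
    return None  # F above and SD above their cuts: not covered as printed


# Hypothetical normalized values, for illustration only.
print(apply_rules(0.01, 0.20))
```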

Also refer to the new decision information system in Table 7. From Eq. 9, if the form factor is greater than 0.04335 (the real value from the dataset is 0.3244), then it may induce debris flow. In other words, if the shape of the watershed is nearly circular, the watershed may have a higher capability to retain water. In addition, a stream density lower than 0.54645 (the real value from the dataset is 0.0018) may indicate that rainfall has a low probability of infiltrating the zone between the soil layer and laccolith. If the stream density is lower than the threshold value, it seems that infiltration will be higher than in the usual case. Physically, higher infiltration will induce debris flow occurrence.

Table 8 shows the classification results of debris flow events using the DRS method. There are 24 occurrences and 86 non-occurrences. Of the occurrence samples, 17 are correctly classified by the DRS method, an accuracy rate of 70.8%; of the non-occurrence samples, 51 are correctly classified, an accuracy rate of 59.3%. The verification accuracy is thus higher than that of PCA + LDA. The reason for the better performance is that the core factor(s) and separate point(s) are successfully obtained and the redundant spatial data are then eliminated. The evaluation of the spatial data therefore required a mining technique to reduce the uncertainties and useless attributes.

Table 8 The 110 testing Samples classification result by DRS method
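The accuracy rates quoted for the two methods can be reproduced from the reported counts. In the sketch below the per-class counts are taken from the text; the overall rates are derived here from those counts and are not reported in the original tables, and the helper name `accuracy` is hypothetical.

```python
def accuracy(correct, total):
    """Percentage of correctly classified samples, rounded to one decimal."""
    return round(100.0 * correct / total, 1)


# (correct, total) counts reported in the text for the 110 testing samples.
results = {
    "PCA + LDA": {"occurrence": (13, 24), "non-occurrence": (42, 86)},
    "DRS": {"occurrence": (17, 24), "non-occurrence": (51, 86)},
}

for method, classes in results.items():
    per_class = {name: accuracy(c, n) for name, (c, n) in classes.items()}
    # Overall rate pools both classes; it is derived, not quoted from the paper.
    overall = accuracy(sum(c for c, _ in classes.values()),
                       sum(n for _, n in classes.values()))
    print(method, per_class, "overall:", overall)
```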

How to apply DRS to create hazard levels in debris flow

In the past, Lin et al. (2002a, b, 2003), Auer and Shakoor (1993) and Lin et al. (2006a, b) used different observations and flowcharts to distinguish various hazard levels of debris flow. Whereas the results of their approaches were obtained by statistical analysis, this study proposes a new concept that classifies three hazard levels from the DRS results. There are four types in three levels (see Table 9; Fig. 5), and the detailed outcomes can be categorized as:

Table 9 Classification on decision matrix by DRS method
Fig. 5 Reclassification of the debris flow distribution into three levels by the DRS method, from WCB data and the 2000 and 2001 records

Type A. It is classified as level 1 (red line). The observed data points were from two given typhoon events. The in situ conditions (geomorphology and land-cover) are relatively fragile and sensitive to debris flow occurrence. Monitoring devices should be installed because these are dangerous areas.

Type B. It is classified as level 2 (yellow line). According to the knowledge rules from the DRS method, the conditions (geomorphology and land-cover) are identical to those of Type A, but the outcomes are non-occurrence. This corresponds to “an error of commission”. Thus, in the next typhoon event, this area could be dangerous for human beings.

Type C. It is classified as level 2 (yellow line). According to the knowledge rules from the DRS method, the conditions (geomorphology and land-cover) indicate non-occurrence, but a debris flow occurred during Typhoon Toraji. This corresponds to “an error of omission”. Thus, in the next typhoon event, this area might also be dangerous.

To sum up, at level 2 in this study, nine streams are found to have potential debris flow occurrence and seven are not. It can be concluded that seven of these high-potential streams should be monitored in the next storm because there are many uncertainties in this analysis. Applying this method, the original number of potential streams should be revised.

Type D. It is classified as level 3 (green line). The samples in this type are those remaining after Types A–C are excluded. They are relatively safe regions. Fig. 5 plots the hazard levels of the entire river system, from which the vulnerable watersheds can be rationally identified.
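The four types can be read as the four combinations of the DRS rule outcome and the recorded outcome. The sketch below encodes that reading. It is an interpretation of the descriptions above, not the authors' procedure; in particular, treating Type A as "rule predicts occurrence and occurrence was recorded" is an assumption, and the function name is hypothetical.

```python
def hazard_type(rule_predicts_occurrence, occurrence_observed):
    """Map a DRS rule prediction and the recorded outcome to (type, level)."""
    if rule_predicts_occurrence and occurrence_observed:
        return "A", 1   # fragile conditions with recorded debris flow
    if rule_predicts_occurrence and not occurrence_observed:
        return "B", 2   # error of commission
    if not rule_predicts_occurrence and occurrence_observed:
        return "C", 2   # error of omission
    return "D", 3       # remaining samples, relatively safe


# Hypothetical stream: the rules indicate non-occurrence but a debris flow was recorded.
print(hazard_type(False, True))
```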

Summary and conclusions

Although various useful methods have been applied to debris flow events around the world, it is important to choose an effective and quick approach in advance to understand the debris flow problem. In particular, one must comprehend the landslide mechanisms spatially, especially in regions frequently affected by earthquakes and rainfall. The literature review shows that the factors influencing debris flow, and hence its occurrence, are quite difficult to understand and analyze. In this study, the statistical outcomes from PCA + LDA are insufficient because the thresholds of the variables are not evaluated. Thus, an advanced data mining approach (DRS) is used to obtain the thresholds. This approach not only gave satisfactory thresholds for the variables influencing debris flow, but also successfully generated the occurrence rules. However, the authors encountered two major problems in their debris flow investigations. First, most of the debris flows occurred in inaccessible places, making the measurement of site data difficult or impossible; the geology, geomorphology, water system and vegetation conditions were therefore obtained from GIS and remote sensing techniques. Second, conventional statistical methods such as PCA + LDA are very difficult to use when some of the acquired spatial data are surplus. An effective classifier is required to reduce the amount of useless in situ information. DRS can offer a positive knowledge description of debris flow problems. In other words, following this reduction process, the new knowledge eliminates much of the noise and chaotic information, which improves the classification result. In this study, form factor and stream density are the dominant factors affecting the occurrence of debris flow; their threshold values are 0.3244 and 0.0018, respectively. Because DRS can sieve out useless information, its classification accuracy is higher than that of PCA + LDA by about 15%. Also, four different types are classified to illustrate the hazard levels of debris flow. Applying this concept, susceptibility (potential) maps are generated to visualize the overall distribution of debris flow areas, which could help decision-makers evacuate the affected population from the disaster zone. Thus, improvements are made and the hazard levels are rationally clarified for various in situ conditions.