The multiple logistic regression recognition model for mine water inrush source based on cluster analysis

Zhang, Hao; Xing, Haofeng; Yao, Duoxi; Liu, Liangliang; Xue, Daorui; Guo, Fei

doi:10.1007/s12665-019-8624-2

Original Article
Published: 13 October 2019

Volume 78, article number 612, (2019)
Environmental Earth Sciences

Hao Zhang¹,
Haofeng Xing¹,
Duoxi Yao²,
Liangliang Liu¹,
Daorui Xue³ &
…
Fei Guo⁴

Abstract

Mine water inrush is one of the major geological hazards that threaten safe production in coal mines. Accurately identifying the source of a mine water inrush plays a vital role in mine water disaster control and is the key to preventing inrush incidents. Ninety-three water samples were collected from three types of aquifers in the Qinan coal mine. Cluster analysis was applied to the 93 original water samples, and the 11 water samples that did not meet the requirements were removed. The remaining 82 water samples were used as training samples and subjected to principal component analysis. Taking the scores of the principal components as the independent variables and the type of water inrush source as the dependent variable, a multiple logistic regression recognition model was established. The model was then used to recognize the source type of each of the 82 training samples in order to verify its accuracy. The comprehensive recognition accuracy reached 87.8%, which is much higher than that of the traditional recognition methods for water inrush sources. The multiple logistic regression recognition model based on cluster analysis takes into account both the measurement errors of the ion content and the complex relationships among the ions, which makes it more reasonable and improves the accuracy of water inrush source recognition. This paper provides a new method for recognizing water inrush sources and an effective basis for mine water inrush prevention and control.

Introduction

Mine water hazards are among the main geological hazards that threaten the safety of coal mining. Coal mine water inrush often submerges part of a mine, causing huge economic losses and human casualties (Gui and Lin 2016; Hu et al. 2011; Wu et al. 2016). The Qinan coal mine is located in the Suxian mining area. Since the mine became operational, there have been few occurrences of water inrush, and the water inflow has not been large. However, the Taoyuan coal mine, which belongs to the same hydrogeological unit as the Qinan coal mine, experienced a mine water inrush accident with a maximum water inflow of 29,000 m³/h in 2013, which caused serious flooding. To prevent similar large-scale water inrush accidents in the Qinan coal mine, mine water prevention and control work is necessary. Within this work, the accurate judgment of mine water inrush sources is a prerequisite for prevention and control, as well as an important part of preventing mine water inrush accidents (Ganyaglo et al. 2011; Zhang et al. 2017).

Experts and scholars have long proposed methods for judging water inrush sources under the heading of “recognition of mine water inrush”. These methods include geological analysis, hydrodynamic analysis, hydrochemical characteristics analysis, water temperature and water level dynamic observation, and geophysical prospecting (Biswas and Sharma 2017; Farnham et al. 2000; Panagopoulos et al. 2016; Keskin et al. 2015). Among them, hydrochemical characteristics analysis is a simple and effective way to identify a mine water inrush source. Li et al. (2017) used a hydrochemical approach to ascertain the mine water sources and to locate the potential seawater inrush seepage channels in the Xinli Mine. Wei et al. (2015) identified a water source by analyzing characteristic hydrochemical ions at water inrush points. Multivariate statistical analysis methods are now relatively mature, and the following are the most widely applied: principal component analysis, which reduces an original set of variables to a smaller number of uncorrelated components without losing much information (Jolliffe 2002; Kim et al. 2005; Meglen 1992; Qian et al. 2016); cluster analysis, which measures the similarities among samples (Bu et al. 2010; Reghunath et al. 2002); and discriminant analysis (including distance, Fisher and Bayes discriminants), which establishes an intuitive discriminant relation (Chen et al. 2009; Huang and Chen 2011; Lu et al. 2012; Huang and Wang 2018). Xu et al. (2012) selected six ion indexes (K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻) and total dissolved solids (TDS) as discriminant factors for designing a GRA–SDA coupled model. Liu et al. (2013) proposed a Fisher recognition analysis for identifying a coal mining inrush water source under mining-induced disturbances. A comprehensive identification model combining hydrochemistry analysis, water source detection, and water channel exploration was proposed by Liu et al. (2018). Based on conventional ion content test results, together with pH and total dissolved solids (TDS), Yin et al. (2006) used systematic clustering and stepwise discrimination to analyze the sources of the inrush water in the Wanbei mining area. The PCA–BP neural network model, based on laser-induced fluorescence technology, was also used to identify a water inrush source by Wang et al. (2017a, b).

However, current recognition methods for mine water inrush sources do not fully consider the measurement errors in ion content caused by external factors, and they ignore the complex relationships between the ions. As a result, they deviate to some extent from the actual identification process. The objective of this paper is therefore to propose a new method for the accurate identification of mine water inrush sources: a multiple logistic regression recognition model based on cluster analysis, which takes into account both the measurement errors of the ion content and the complex internal relationships among the ions. The method uses cluster analysis to measure the similarities among the original water samples in order to screen them. Principal component analysis is then used to extract the information in the hydrochemical indexes, converting multiple correlated indicator variables into new independent indicators. Finally, the multiple logistic regression recognition model predicts and classifies on the basis of the existing water samples. In this way the method extracts the variation information of the original water samples, eliminates the influence of overlapping information among variables, and recognizes the mine water inrush source. The recognition model was also applied to additional water samples to verify its accuracy. The results show that the multiple logistic regression recognition model based on cluster analysis has high accuracy, is easy to operate in the actual water source discrimination process, and gives straightforward discrimination results.

Hydrogeological conditions in the study area

The Qinan coal mine is located in the middle of the Huaibei plain, in the Huaihe River valley (Fig. 1). The Huihe River, a tributary of the Huaihe River, flows through the mining area, which has high vegetation coverage. The study area lies in the north temperate monsoon zone, with a marine-continental climate and four distinct seasons. As is typical of the central plain climate, the annual average temperature and annual average precipitation are about 14.6 °C and 756 mm, respectively. Rainfall is concentrated in July and August. Evaporation exceeds precipitation, and the annual average relative humidity is 71%.

The Qinan coal mine is located in the southwest region of the Sunan syncline. The inclination of the strata in the northern coal mine is steep, generally ranging from 20° to 30°. However, the inclination of the strata in the middle and eastern regions of the Qinan coal mine is gentler, generally ranging from 7° to 15°. The coal-bearing strata within the study area are covered by a loose layer from the Cenozoic period.

The groundwater regime in the mining areas of Qinan consists of four subsystems: the loose aquifer of the Cenozoic, the coal-bearing sandstone fissure aquifer of the Permian, the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous and the limestone karst fracture aquifer of the Ordovician. The hydrogeological characteristics of the aquifer and aquifuge are shown in Fig. 2.

Among them, the limestone karst fracture aquifer of the Ordovician is furthest away from the coal seam. Thus, under normal conditions, there is no direct water filling effect on the coal mine.

Materials and methods

Sampling and test

We collected a total of 93 original water samples from the Qinan coal mine between 2000 and 2017 (positions shown in Fig. 1), evenly distributed over that period. These water samples were used to establish the recognition model. Among them, there were 9 water samples from the fourth aquifer in the loose layer of the Cenozoic (referred to as “the fourth aquifer”), 39 water samples from the coal-bearing sandstone fissure aquifer of the Permian (referred to as “the coal-bearing sandstone aquifer”) and 45 water samples from the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous (referred to as “the limestone aquifer”). In addition, 16 water samples were taken on site in the Qinan mining area to verify the model. Among them, there were 2 water samples from the fourth aquifer, 4 water samples from the coal-bearing sandstone aquifer and 10 water samples from the limestone aquifer.

When the water samples were collected, the plastic bottles and caps were rinsed three to five times with the water being sampled. The water samples were then stored in clean 550 mL plastic bottles. Before testing, the water samples were kept at low temperature to inhibit redox reactions and biochemical activity (Chen et al. 2013; Faghih Nasiri et al. 2018). The conventional water chemistry tests covered the contents of K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻ and CO₃²⁻. HCO₃⁻ and CO₃²⁻ were determined by dilute sulfuric acid-methyl orange titrimetry, Cl⁻ and SO₄²⁻ by ion chromatography, Ca²⁺ and Mg²⁺ by EDTA titration, and K⁺ + Na⁺ by flame atomic absorption spectrophotometry. According to the hydrogeological data, the 109 water samples were taken from drain holes, hydrogeological observation wells, coal extraction faces and underground roadways. The water levels of the observation wells did not show any abnormal changes during the collection of water samples. This study therefore treats the 109 water samples from a static perspective only. The water sample data are shown in Table 1, where X₁, X₂, X₃, X₄, X₅, X₆ and X₇ represent the contents of K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻ and CO₃²⁻, respectively.

Table 1 The water inrush source samples from the Qinan coal mine

Cluster analysis

In cluster analysis, n samples are first regarded as n separate classes, and the two classes with the closest properties (the shortest distance) are merged into one class. The two closest of the remaining n − 1 classes are then merged, and the process continues until all the samples have been merged into a single class. Cluster analysis is usually divided into Q-type and R-type according to the objects being classified: Q-type cluster analysis classifies samples, while R-type cluster analysis classifies variables. The basic steps of the algorithm are as follows:

1. At the beginning, each sample is a separate class, and the pairwise distance matrix of the n classes is calculated, denoted as:
$$D_{0} = \left[ {\begin{array}{*{20}l} 0 \hfill & {} \hfill & {} \hfill & {} \hfill & {} \hfill \\ {d_{21} } \hfill & 0 \hfill & {} \hfill & {} \hfill & {} \hfill \\ {d_{31} } \hfill & {d_{32} } \hfill & 0 \hfill & {} \hfill & {} \hfill \\ \vdots \hfill & \vdots \hfill & \vdots \hfill & {} \hfill & {} \hfill \\ {d_{n1} } \hfill & {d_{n2} } \hfill & {d_{n3} } \hfill & \ldots \hfill & 0 \hfill \\ \end{array} } \right].$$
2. Find the minimum distance d_ij in the distance matrix, denote it as $d_{{i_{1} j_{1} }}$, and merge classes i₁ and j₁ into a new class, which leaves n − 1 classes.
3. Calculate the distance between the new class and each of the other classes.
4. Merge rows i₁ and j₁ of the initial distance matrix D₀ into a new row, and columns i₁ and j₁ into a new column, so that the number of classes is reduced by one. This gives the new distance matrix D₁.
5. Repeat steps (2), (3) and (4) until the n samples are clustered into one class.
6. Draw the clustering process as a cluster analysis diagram, and screen the original samples according to this diagram to eliminate the samples that do not meet the requirements.
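
The steps above can be sketched in a few lines of Python. This is only an illustration, not the MATLAB routine used in the study: the source does not state the distance measure or linkage rule, so Euclidean distance and single linkage are assumed here, and the function name `agglomerate` and the toy coordinates are hypothetical.

```python
import math

def agglomerate(samples):
    """Merge classes step by step; returns (class_a, class_b, distance) per merge.

    ASSUMPTION: Euclidean distance with single linkage (not stated in the text).
    """
    # Step 1: every sample starts as its own class, identified by its index.
    clusters = [[i] for i in range(len(samples))]
    history = []

    def linkage(a, b):
        # Single linkage: shortest distance between any two members.
        return min(math.dist(samples[i], samples[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Step 2: find the pair of classes with the minimum distance.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        history.append((clusters[p], clusters[q], d))
        # Steps 3-4: replace the two classes by their union (one class fewer).
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return history

# Hypothetical two-variable samples, for illustration only.
toy_samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(toy_samples)
```

The merge history is the information a cluster analysis diagram displays: each entry records which two classes were joined and at what distance.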

Principal component analysis

Principal component analysis is a method for original data compression and characteristic information extraction. It can replace many correlated variables with several comprehensive variables. These comprehensive variables not only express a great amount of information of the original variables but can also remain mutually independent (Jolliffe 2002; Kim et al. 2005; Meglen 1992; Qian et al. 2016; Huang et al. 2019). The basic principle is:

If X₁, X₂, …, X_n are defined as the original variables and Y₁, Y₂, …, Y_m (m ≤ n) are new variables, the relationship between the original and new variables is

$$\left. {\begin{array}{*{20}l} {Y_{1} = A_{11} X_{1} + A_{12} X_{2} + \cdots + A_{1n} X_{n} } \hfill \\ {Y_{2} = A_{21} X_{1} + A_{22} X_{2} + \cdots + A_{2n} X_{n} } \hfill \\ \ldots \hfill \\ {Y_{m} = A_{m1} X_{1} + A_{m2} X_{2} + \cdots + A_{mn} X_{n} } \hfill \\ \end{array} } \right\},$$

(1)

where Y_i is uncorrelated with Y_j (i ≠ j; i, j = 1, 2, …, m). Y₁ has the highest variance among all linear combinations of X₁, X₂, …, X_n, and Y₂ has the highest variance among all linear combinations of X₁, X₂, …, X_n that are uncorrelated with Y₁. The remaining components are defined in the same way. The new variables Y₁, Y₂, …, Y_m are the first, second, …, and m-th principal components of the original variables X₁, X₂, …, X_n.
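
As a rough illustration of how the first row of Formula (1) can be obtained, the sketch below finds the direction of maximum variance by power iteration on the sample covariance matrix. It is a minimal sketch under stated assumptions, not the SPSS procedure used later in the paper: the function name and toy data are hypothetical, and the covariance matrix is used for simplicity (standardizing the variables first would correspond to working from the correlation matrix).

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the centred variables.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # i.e. the coefficients A_11..A_1n of the highest-variance combination.
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(m) for b in range(m))
    return v, variance

# Hypothetical data in which the second variable is roughly twice the first.
toy = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
direction, variance = first_principal_component(toy)
```

For this toy data the direction has a slope close to 2 and carries almost all of the total variance, which is the sense in which one component can stand in for two correlated variables.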

Multiple logistic regression analysis

In a multiple logistic regression analysis with n types, let P_i (i = 1, 2, …, n) be the probability that a sample belongs to type i. Taking type n as the reference type, the multiple logistic regression analysis models are as follows (Wang 2010; Wang and Guo 2001; Zhang 2002):

$$\left. {\begin{array}{*{20}l} {G_{1} = \ln \frac{{P_{1} }}{{P_{n} }} = A_{1} X_{1} + A_{2} X_{2} + \cdots + A_{t} X_{t} + C} \hfill \\ {G_{2} = \ln \frac{{P_{2} }}{{P_{n} }} = B_{1} X_{1} + B_{2} X_{2} + \cdots + B_{t} X_{t} + D} \hfill \\ \ldots \hfill \\ {G_{n} = \ln \frac{{P_{n} }}{{P_{n} }} = 0} \hfill \\ \end{array} } \right\}.$$

(2)

Because the probabilities that a sample belongs to each of the n types sum to 1, we obtain Formula (3):

$$P_{1} + P_{2} + \cdots + P_{n} = 1.$$

(3)

Solving Formulas (2) and (3) simultaneously, we derive Formula (4):

$$\left. \begin{aligned} P_{1} = \frac{{e^{{G_{1} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ P_{2} = \frac{{e^{{G_{2} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ \ldots \hfill \\ P_{n} = \frac{1}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ \end{aligned} \right\},$$

(4)

where P₁, P₂, …, P_n are the probability functions of the recognition models for types 1, 2, …, n, respectively; X_i (i = 1, 2, …, t) are the values of the independent variables; A_i and B_i are the regression coefficients of the independent variables; and C and D are constants.
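
The step from Formulas (2) and (3) to Formula (4) can be written out explicitly. Exponentiating each line of Formula (2) expresses every probability in terms of the reference probability P_n, and substituting these expressions into Formula (3) fixes P_n:

$$P_{i} = P_{n} e^{{G_{i} }} \quad (i = 1,2, \ldots ,n - 1),\qquad P_{n} \left( {1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} } \right) = 1.$$

Dividing by the bracketed sum gives the last line of Formula (4), and multiplying that result by $e^{{G_{i} }}$ gives the remaining lines.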

Establishment and verification of recognition model

The sequence of steps taken using the water inrush source recognition methodology is described as follows (Fig. 3).

Q-type cluster analysis of the original water samples

External factors can distort the measured ion content: water samples may be polluted, water may evaporate because the container is poorly sealed, and human error may occur during testing. To reduce these deviations, we took the ion contents of the original water samples as the analysis variables and used MATLAB to carry out a Q-type cluster analysis of the 93 original water samples. The results of the cluster analysis are shown in Fig. 4.

Based on the Q-type cluster analysis results in Fig. 4 and the distances between the original water samples (Güler et al. 2002), the original water samples can be re-classified. For 11 of the 93 original water samples (samples 1, 4, 13, 31, 38, 39, 42, 45, 72, 82 and 84), the new classification differs from the original classification. Because these samples would distort the discrimination results, we eliminated them in order to improve the accuracy of discrimination.
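
The effect of this screening on the size of each training group can be tallied with a short sketch. It rests on an assumption that is not stated in the text: that the samples are numbered 1 to 93 in aquifer order (9 fourth aquifer samples, then 39 coal-bearing sandstone samples, then 45 limestone samples). The variable names are illustrative.

```python
# Samples whose cluster-based class differed from their original class.
inconsistent_ids = {1, 4, 13, 31, 38, 39, 42, 45, 72, 82, 84}

# ASSUMPTION: sample numbers follow aquifer order (9, then 39, then 45 samples).
original_ranges = {
    "fourth aquifer": range(1, 10),
    "coal-bearing sandstone aquifer": range(10, 49),
    "limestone aquifer": range(49, 94),
}

# Keep only the samples that the cluster analysis did not flag.
training_ids = {
    name: [i for i in ids if i not in inconsistent_ids]
    for name, ids in original_ranges.items()
}
training_counts = {name: len(ids) for name, ids in training_ids.items()}
total_training = sum(training_counts.values())
```

Under that numbering assumption the screened groups contain 7, 33 and 42 samples, which adds up to the 82 training samples used in the rest of the paper.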

R-type cluster analysis of training samples

This paper used MATLAB to complete the R-type cluster analysis of the training samples. We regarded the content of K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻ and CO₃²⁻ as the clustering bases, and the R-type cluster analysis results for the three types of aquifer were obtained (Fig. 5).

The R-type cluster analysis shows a high degree of similarity between Ca²⁺ and Mg²⁺ and between Cl⁻ and CO₃²⁻ in the fourth aquifer. In the coal-bearing sandstone aquifer, Ca²⁺, Mg²⁺ and SO₄²⁻ are closely related, as are K⁺ + Na⁺ and Cl⁻. These groupings reflect the origin of the ions: Ca²⁺, Mg²⁺ and SO₄²⁻ were derived from the dissolution of sulfate rocks, and K⁺ + Na⁺ and Cl⁻ came from soluble sodium–potassium salt rocks. In the limestone aquifer, Ca²⁺ and Mg²⁺, as well as K⁺ + Na⁺ and Cl⁻, are also closely related, because Ca²⁺ and Mg²⁺ came from the partial dissolution of poorly soluble carbonate rocks. However, CO₃²⁻ was only weakly related to Ca²⁺ and Mg²⁺, because the reaction CO₃²⁻ + H₂O ⇌ HCO₃⁻ + OH⁻ proceeds in the forward direction when the concentration of CO₃²⁻ increases. The concentration of CO₃²⁻ in the groundwater therefore decreased, and its relationship with Ca²⁺ and Mg²⁺ was weak. This is consistent with the alkalinity of the water samples from the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous.

The ions in coal mine water are internally connected, and these connections have often been ignored when establishing recognition models of mine water inrush sources, which leads to excessive deviation when the models are applied in practice. This deviation has had serious consequences for actual coal mine production. To reduce it, the training samples were preprocessed by factor analysis before the recognition model was established.

Principal component analysis of the training samples

To verify the results of the R-type cluster analysis, a Pearson correlation analysis was conducted on the training samples (Chen et al. 2013; Huang and Wang 2018; Kim et al. 2005; Qian et al. 2016), and the Pearson correlation coefficients of the three types of water samples were obtained (Tables 2, 3, 4).

Table 2 Pearson correlation coefficients of the fourth aquifer water samples

Table 3 Pearson correlation coefficients of the coal bearing sandstone aquifer water samples

Table 4 Pearson correlation coefficients of the limestone aquifer water samples

Tables 2, 3 and 4 show that the correlations between some ions in each aquifer were remarkable (Qian et al. 2016). In the water samples of the fourth aquifer, the concentrations of Ca²⁺ and Mg²⁺ were positively correlated (r = 0.767, p < 0.01), and Cl⁻ and CO₃²⁻ were significantly correlated (r = 0.971, p < 0.01). In the water samples of the coal-bearing sandstone aquifer, both Ca²⁺ and Mg²⁺ were moderately correlated with SO₄²⁻ (Ca²⁺ vs. SO₄²⁻: r = 0.399, p < 0.05; Mg²⁺ vs. SO₄²⁻: r = 0.359, p < 0.05; Table 3), and K⁺ + Na⁺ was positively correlated with SO₄²⁻ (r = 0.481, p < 0.01). In the water samples of the limestone aquifer, Ca²⁺ and Mg²⁺ were also significantly correlated (r = 0.877, p < 0.01). Comparing the results of the R-type cluster analysis with the Pearson correlation coefficients shows that the correlations between the ions of each aquifer were basically consistent. This supports the view that there is an internal connection between the ions in coal mine water.
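
For readers unfamiliar with the statistic, the Pearson correlation coefficient r between two ion concentration series can be computed as in the sketch below. The function name and the concentration values are hypothetical and are not taken from Tables 2, 3 and 4; significance testing (the p values) is not covered.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical concentrations (mg/L) for two ions in five samples.
ca = [40.0, 55.0, 61.0, 72.0, 90.0]
mg = [21.0, 30.0, 29.0, 41.0, 48.0]
r_ca_mg = pearson_r(ca, mg)
```

A value of r close to 1 indicates that the two ions rise and fall together across samples, which is the pattern reported for Ca²⁺ and Mg²⁺.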

To account for the connections among the ions, factor analysis of the training samples was carried out in SPSS. Principal component analysis was used to reduce the dimensionality of the 7 original factors, which were combined into 3 independent indicators that reflect the hydrochemical information.

Using principal component analysis within factor analysis, the initial factors were extracted from the correlation coefficient matrix of the ions, and the initial eigenvalues and the variance explained by each principal component were obtained (Table 5).

Table 5 Interpreting total variance

The number of principal components can be determined from the cumulative variance they explain. It is generally accepted that when the cumulative variance of the extracted principal components exceeds 80%, the selected components adequately reflect the hydrochemical information of the training samples (Chen et al. 2013; Wang et al. 2017a, b; Yin et al. 2006; Zhang et al. 2017). We therefore extracted three principal components. This choice is consistent with the alternative criterion of retaining the components whose eigenvalues are greater than 1. The eigenvalues of the principal components are shown in Fig. 6.

The varimax (maximum variance) method was used for the orthogonal rotation of the initial factor loading matrix so that the loadings of each ion on the same factor were distinctly different. The orthogonal rotation converged after 6 iterations, giving the orthogonally rotated factor loading matrix (Table 6) and the corresponding factor loading diagram (Fig. 7). After the orthogonal rotation, each principal component represented the hydrochemical information of different ions. Principal component 1 represented Ca²⁺, Mg²⁺ and SO₄²⁻ and explained 42.637% of the variance of the training samples. Principal component 2 represented Cl⁻ and CO₃²⁻ and explained 25.347%. Principal component 3 represented K⁺ + Na⁺ and explained 14.661%.
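
The three variance shares quoted above can be checked against the 80% cumulative variance criterion with a short sketch. The function name `components_needed` is illustrative; only the three retained shares are available in the text, so the shares of the remaining four components are not included.

```python
def components_needed(variance_percentages, threshold=80.0):
    """Smallest number of leading components whose cumulative share
    reaches the threshold, together with that cumulative share."""
    cumulative = 0.0
    for k, share in enumerate(variance_percentages, start=1):
        cumulative += share
        if cumulative >= threshold:
            return k, cumulative
    # Threshold never reached: report everything that was supplied.
    return len(variance_percentages), cumulative

# Variance shares (%) reported for principal components 1, 2 and 3.
reported_shares = [42.637, 25.347, 14.661]
k, cumulative = components_needed(reported_shares)
```

The first two components together explain 67.984%, which is below 80%, while adding the third brings the total to 82.645%, so three components are needed under this criterion.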

Table 6 Orthogonal rotation factor loading matrix

Principal component analysis of the 82 training water samples was carried out using SPSS, yielding the scores of each sample on the three principal components (Table 7). The scores are denoted Y₁, Y₂ and Y₃, respectively. The score coefficients relating the three principal components to the original variables are shown in Table 8.

Table 7 The scores of the principal components

Table 8 Principal component score coefficients

From the principal component score coefficients, the three extracted principal components Y₁, Y₂ and Y₃ can be expressed in terms of the original variables X₁, X₂, X₃, X₄, X₅, X₆ and X₇ as follows:

$$\left. \begin{aligned} Y_{1} & = 0.153X_{1} + 0.321X_{2} + 0.296X_{3} + 0.111X_{4} + 0.485X_{5} \\ {\kern 1pt} & \quad - 0.015X_{6} - 0.058X_{7} \\ Y_{2} & = - 0.033X_{1} + 0.107X_{2} + 0.079X_{3} + 0.531X_{4} - 0.130X_{5} \\ & \quad - 0.146X_{6} + 0.479X_{7} \\ Y_{3} & = 0.707X_{1} - 0.167X_{2} - 0.208X_{3} - 0.083X_{4} + 0.475X_{5} \\ & \quad - 0.185X_{6} - 0.010X_{7} \\ \end{aligned} \right\},$$

(5)

where Y₁, Y₂ and Y₃ represent the scores of principal component 1, principal component 2 and principal component 3 of the training samples; X₁, X₂, X₃, X₄, X₅, X₆ and X₇ represent the contents of K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻ and CO₃²⁻, respectively.

Construction of the recognition model

We regarded the principal component scores Y₁, Y₂ and Y₃ of the 82 training water samples as independent variables for implementing the multiple logistic regression recognition analysis. The parameters of the multiple logistic regression recognition model are shown in Table 9.

Table 9 Multiple logistic regression recognition analysis model parameters

The recognition function of the solution is as follows:

$$\left. \begin{aligned} G_{1} = - 0.940Y_{1} - 0.312Y_{2} + 0.675Y_{3} - 1.561 \hfill \\ G_{2} = - 6.630Y_{1} + 1.416Y_{2} + 3.166Y_{3} - 3.479 \hfill \\ \end{aligned} \right\}.$$

(6)

Substituting Formula (5) into Formula (6) gives Formula (7):

$$\left. \begin{aligned} G_{1} & = 0.343X_{1} - 0.448X_{2} - 0.443X_{3} - 0.326X_{4} \\ & \quad - 0.095X_{5} - 0.065X_{6} - 0.102X_{7} - 1.561 \\ G_{2} & = 0.027X_{1} - 3.395X_{2} - 3.254X_{3} - 1.016X_{4} \\ & \quad - 3.699X_{5} - 0.314X_{6} + 0.741X_{7} - 3.479 \\ \end{aligned} \right\}.$$

(7)
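
The substitution is a composition of two linear maps, and it can be reproduced from the printed coefficients. The sketch below does this for the G₁ row only, using the rounded values in Formulas (5) and (6); the helper name `compose` is hypothetical, and small differences from Formula (7) are expected because the printed coefficients are rounded.

```python
# Rows of Formula (5): coefficients of X1..X7 in Y1, Y2 and Y3.
SCORE_COEFFS = [
    [0.153, 0.321, 0.296, 0.111, 0.485, -0.015, -0.058],
    [-0.033, 0.107, 0.079, 0.531, -0.130, -0.146, 0.479],
    [0.707, -0.167, -0.208, -0.083, 0.475, -0.185, -0.010],
]
# G1 row of Formula (6): coefficients of Y1, Y2, Y3 (constant -1.561 unchanged).
G1_ON_SCORES = [-0.940, -0.312, 0.675]
# G1 row of Formula (7): coefficients of X1..X7 as printed.
G1_ON_IONS_PRINTED = [0.343, -0.448, -0.443, -0.326, -0.095, -0.065, -0.102]

def compose(weights, rows):
    """Coefficient of each X_j after substituting the Y rows into sum(w_k * Y_k)."""
    return [sum(w * row[j] for w, row in zip(weights, rows))
            for j in range(len(rows[0]))]

g1_on_ions = compose(G1_ON_SCORES, SCORE_COEFFS)
max_gap = max(abs(a - b) for a, b in zip(g1_on_ions, G1_ON_IONS_PRINTED))
```

With these inputs every recomputed G₁ coefficient agrees with the printed one to within 0.001, which is the size of discrepancy rounding alone would produce.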

Finally, the expressions for the probability functions of the three types of water inrush sources are as follows:

$$\left. \begin{aligned} P_{1} = \frac{{e^{{G_{1} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ P_{2} = \frac{{e^{{G_{2} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ P_{3} = \frac{1}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ \end{aligned} \right\},$$

(8)

where P₁, P₂ and P₃ are the probability functions of the respective recognition models of types 1, 2, and 3; X₁, X₂, X₃, X₄, X₅, X₆ and X₇ represent the contents of K⁺ + Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, HCO₃⁻ and CO₃²⁻, respectively; and the final item of the discriminant function is a constant.
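
Applying the model to a sample then amounts to evaluating Formula (6) on its principal component scores and converting the results to probabilities with Formula (8). The sketch below is illustrative only: the function name `recognize` and the example scores are hypothetical, and the mapping of types 1, 2 and 3 to particular aquifers is not assumed here.

```python
import math

def recognize(y1, y2, y3):
    """Return ([P1, P2, P3], predicted_type) for one sample's PC scores."""
    # Formula (6): recognition functions relative to the reference type 3.
    g1 = -0.940 * y1 - 0.312 * y2 + 0.675 * y3 - 1.561
    g2 = -6.630 * y1 + 1.416 * y2 + 3.166 * y3 - 3.479
    # Formula (8): probabilities of the three types.
    denom = 1.0 + math.exp(g1) + math.exp(g2)
    probs = [math.exp(g1) / denom, math.exp(g2) / denom, 1.0 / denom]
    # The sample is assigned to the type with the highest probability.
    predicted_type = probs.index(max(probs)) + 1
    return probs, predicted_type

# Hypothetical scores: a sample sitting at the origin of the score space.
probs, predicted_type = recognize(0.0, 0.0, 0.0)
```

For a sample at the origin only the constants contribute, so the reference type receives the largest probability; other score values shift the balance through the coefficients of Y₁, Y₂ and Y₃.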

Verification of water inrush source recognition model

The 82 training samples in Table 7 were substituted one by one into the established multiple logistic regression recognition model based on cluster analysis for cross-validation (Table 10). The overall correct discrimination rate was 87.8%. The recognition accuracy was 42.8% for water samples from the fourth aquifer, 96.9% for water samples from the coal-bearing sandstone aquifer and 88.1% for water samples from the limestone aquifer. The differences between aquifers stem from the number of training water samples. Because coal mining is less threatened by water inrush from the fourth aquifer, only a limited number of water samples were collected from it, whereas a recognition model needs a sufficient number of water samples of each type. The recognition accuracy for the fourth aquifer is therefore significantly lower than for the coal-bearing sandstone aquifer and the limestone aquifer. By comparison, the traditional multiple logistic regression recognition model incurred multiple errors in its rediscrimination step and had a correct discrimination rate of less than 78.5%. The multiple logistic regression recognition model based on cluster analysis was therefore more accurate and more stable, and it could meet the actual requirements of water inrush source recognition.
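
The reported per-aquifer and overall accuracies can be cross-checked with simple arithmetic. The counts in the sketch below are not quoted from Table 10: the group sizes (7, 33 and 42) assume that the 11 removed samples are numbered in aquifer order, and the numbers of correct classifications (3, 32 and 37) are inferred from the reported percentages.

```python
# ASSUMPTION: (correct, total) per aquifer, inferred from the reported
# percentages and screened group sizes; not copied from Table 10.
results = {
    "fourth aquifer": (3, 7),
    "coal-bearing sandstone aquifer": (32, 33),
    "limestone aquifer": (37, 42),
}

# Accuracy of each aquifer as a percentage.
per_class = {name: 100.0 * c / t for name, (c, t) in results.items()}

# Overall accuracy pools the counts rather than averaging the percentages.
total_correct = sum(c for c, _ in results.values())
total_samples = sum(t for _, t in results.values())
overall = 100.0 * total_correct / total_samples
```

Under these inferred counts the pooled accuracy is 72 out of 82, about 87.8%, and the small fourth aquifer group shows how few samples determine its much lower figure.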

Table 10 Classification results of cross-validation

In addition, to further verify the accuracy of the established multiple logistic regression recognition model based on cluster analysis, the 16 verification water samples from the Qinan mining area were substituted into the model for discrimination (Table 11). Table 11 shows that 15 of the 16 water samples are classified correctly and only one sample is wrongly discriminated, giving an accuracy of 93.8%. Water sample A2 is actually a fourth aquifer water sample, but the model discriminates it as a coal-bearing sandstone aquifer water sample. On balance, the multiple logistic regression recognition model based on cluster analysis is more accurate and more widely applicable than the traditional multiple logistic regression recognition model. It therefore has significant engineering relevance.

Table 11 Classification results of the water inrush source discriminant model

Results and discussion

Based on the hydrogeological conditions of the mining area, cluster analysis of the water quality samples was carried out in this paper. The results were used to screen the samples and extract typical water samples. Finally, the multiple logistic regression recognition model based on cluster analysis was established. From the results of the model recognition and the engineering application, the following conclusions were drawn:

1. Cluster analysis of the original water samples eliminated the nonconforming samples. From the 93 original water samples, 82 that accurately reflect the hydrochemical characteristics of the water inrush aquifers were screened out and used as training samples to establish the recognition model, which reduced the influence of water quality analysis errors on the accuracy of the model.
2. In establishing the recognition model, principal component analysis was adopted to eliminate the correlations between the ions. It reduces the dimension of the seven initial variables and combines them into a few independent indexes that comprehensively reflect the hydrochemical information.
3. The overall recognition accuracy of the multiple logistic regression recognition model based on cluster analysis reaches 86.6%. The model is easy to operate in the actual water source discrimination process and gives straightforward discrimination results. It provides a new way to discriminate mine water inrush sources and has important guiding significance for mine water prevention and control work.
4. The recognition model is based on a limited amount of hydrological data, and the number of water samples has a certain influence on its accuracy. Therefore, more water sample data should be collected to improve accuracy. In addition, given the influence of complex hydrogeological conditions, temperature, and human activities on the aquifers in the study area, future studies should fully consider the impact of these factors to promote the application of the model.
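The workflow summarized in these conclusions (cluster screening, principal component analysis, multiple logistic regression) can be sketched as follows. This is a minimal illustration with scikit-learn on synthetic data, not the authors' implementation: the random data, the Ward-linkage clustering, the majority-label screening rule, and the choice of three principal components are all assumptions made for the example. Only the overall sequence of steps, the 93 starting samples, the three aquifer types, and the seven input variables come from the text.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 93 samples, 3 aquifer types, 7 ion indexes.
n_per_class, n_ions = 31, 7
centers = rng.normal(0.0, 3.0, size=(3, n_ions))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_class, n_ions)) for c in centers])
y = np.repeat(np.arange(3), n_per_class)

# Step 1: cluster screening. Hypothetical rule: keep a sample only if its
# labelled aquifer matches the majority label of the cluster it falls in.
Z = StandardScaler().fit_transform(X)
clusters = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(Z)
keep = np.zeros(len(y), dtype=bool)
for c in np.unique(clusters):
    members = clusters == c
    majority = np.bincount(y[members]).argmax()
    keep |= members & (y == majority)
X_train, y_train = X[keep], y[keep]

# Step 2: PCA on the standardized training samples.
# The number of components (3) is an assumption for illustration.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))
scores = pca.transform(scaler.transform(X_train))

# Step 3: multiple (multinomial) logistic regression on the component scores,
# with the aquifer type as the dependent variable.
model = LogisticRegression(max_iter=1000).fit(scores, y_train)
train_accuracy = model.score(scores, y_train)

print(f"kept {keep.sum()} of {len(y)} samples; training accuracy = {train_accuracy:.3f}")
```

A new water sample would be classified by passing its seven ion values through `scaler.transform`, `pca.transform`, and `model.predict` in that order, so that it is projected with the same standardization and components as the training samples.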

References

Biswas A, Sharma SP (2017) Geophysical surveys for identifying source and pathways of subsurface water inflow at the Bangur chromite mine, Odisha, India. Nat Hazards 88(2):947–964
Article Google Scholar
Bu HM, Tan X, Li SY, Zhang QF (2010) Water quality assessment of the Jinshui River (China) using multivariate statistical techniques. Environ Earth Sci 60(8):1631–1639
Article Google Scholar
Chen HJ, Li XB, Liu AH, Peng SQ (2009) Identifying of mine water inrush sources by Fisher discriminant analysis method. J Cent South Univ 40:1114–1120
Google Scholar
Chen LW, Yin XX, Liu X, Gui HR (2013) Multivariate statistical analysis on hydrochemical evolution of groundwater in the concealed coal mines in North China. Coal Geol Explor 41(6):43–51
Google Scholar
Faghih Nasiri E, Yousefi Kebria D, Qaderi F (2018) An experimental study on the simultaneous phenol and chromium removal from water using titanium dioxide photocatalyst. Civ Eng J 4(3):585
Article Google Scholar
Farnham IM, Stetzenbach KJ, Singh AK, Johannesson KH (2000) Deciphering groundwater flow systems in Oasis Valley, Nevada, using trace element chemistry, multivariate statistics, and geographical information system. Math Geosci 32(8):943–968
Google Scholar
Ganyaglo SY, Banoeng-Yakubo B, Osae S, Dampare SB (2011) Water quality assessment of groundwater in some rock types in parts of the eastern region of Ghana. Environ Earth Sci 62:1055–1069
Article Google Scholar
Gui HR, Lin ML (2016) Types of water hazards in China coalmines and regional characteristics. Nat Hazards 84(2):1501–1512
Article Google Scholar
Güler C, Thyne GD, McCray JE, Turner KA (2002) Evaluation of graphical and multivariate statistical methods for classification of water chemistry data. Hydrogeol J 10(4):455–474
Article Google Scholar
Hu W, Dong S, Yan L (2011) Water hazard control technology for safe extraction of coal resources influenced by faulted zone. Procedia Earth Planet Sci 3:1–10
Article Google Scholar
Huang PH, Chen JS (2011) Fisher identify and mixing model based on multivariate statistical analysis of mine water inrush sources. J China Coal Soc 36(S1):131–136
Google Scholar
Huang PH, Wang XY (2018) Piper-PCA-Fisher recognition model of water inrush source: a case study of the Jiaozuo mining area. Geofluids 2018:1–10
Google Scholar
Huang PH, Yang ZY, Wang XY, Ding FF (2019) Research on Piper-PCA-Bayes-LOOCV discrimination model of water inrush source in mines. Arab J Geosci 12:334
Article Google Scholar
Jolliffe IT (2002) Principal component analysis. Wiley, Hoboken
Google Scholar
Keskin TE, Düğenci M, Kaçaroğlu F (2015) Prediction of water pollution sources using artificial neural networks in the study areas of Sivas, Karabük and Bartın (Turkey). Environ Earth Sci 73(9):5333–5347
Article Google Scholar
Kim JH, Kim RH, Lee J, Cheong TJ, Yum BW, Chang HW (2005) Multivariate statistical analysis to identify the major factors governing groundwater quality in the coastal area of Kimje. South Korea. Hydrol Process 19(6):1261–1276
Article Google Scholar
Li GQ, Meng ZP, Wang XQ, Yang J (2017) Hydrochemical prediction of mine water inrush at the Xinli Mine, China. Mine Water Environ 36(1):78–86
Article Google Scholar
Liu X, Chen LW, Lin ML, Li SD (2013) Fisher recognition analysis for coal mining inrush water source under mining-induced disturbance and inversion of groundwater recharge relation. Hydrol Eng Geol 40(4):36–43
Google Scholar
Liu Q, Sun YJ, Xu ZM, Xu G (2018) Application of the comprehensive identification model in analyzing the source of water inrush. Arab J Geosci 11(9):189
Article Google Scholar
Lu JT, Li XB, Gong FQ (2012) Recognizing of mine water inrush sources based on principal components analysis and fisher discrimination analysis method. China Saf Sci J 22(7):109–115
Google Scholar
Meglen RR (1992) Examining large databases: a chemometric approach using principal components analysis. Mar Chem 39(1):217–237
Article Google Scholar
Panagopoulos GP, Angelopoulou D, Tzirtzilakis EE, Giannoulopoulos P (2016) The contribution of cluster and discriminant analysis to the classification of complex aquifer systems. Environ Monit Assess 188:591
Article Google Scholar
Qian J, Wang L, Ma L, Lu YH, Zhao WD, Zhang Y (2016) Multivariate statistical analysis of water chemistry in evaluating groundwater geochemical evolution and aquifer connectivity near a large coal mine, Anhui, China. Environ Earth Sci 75(9):747
Article Google Scholar
Reghunath R, Murthy TRS, Raghavan BR (2002) The utility of multivariate statistical techniques in hydrogeochemical studies: an example from Karnataka, India. Water Res 36(10):2437–2442
Article Google Scholar
Wang LB (2010) Multivariate statistical analysis: models, case study and application of SPSS. Economic Science Press, Beijing
Google Scholar
Wang JC, Guo ZG (2001) Logistic regression model-methods and applications. Higher Education Press, Beijing
Google Scholar
Wang XY, Zhao W, Liu XM, Wang TT, Zhang JG, Guo JW, Chen GS, Zhang B (2017a) Identification of water inrush source from coalfield based on entropy weight-fuzzy variable set theory. J China Coal Soc 42(9):2433–2439
Google Scholar
Wang Y, Zhou MR, Yan PC, He CY, Liu D (2017b) Identification of coalmine water inrush source with PCA-BP model based on laser-induced fluorescence technology. Spectrosc Spectr Anal 37(3):978–983
Google Scholar
Wei WX, Han J, Shi LQ, Lu XM, Zhang XJ (2015) Application of modern data analysis in mine water gushing prediction. Coal Industry Press, Beijing
Google Scholar
Wu Q, Guo XM, Shen JJ, Xu S, Liu SQ, Zeng YF (2016) Risk assessment of water inrush from aquifers underlying the Gushuyuan coal mine, China. Mine Water Environ 36(1):1–8
Google Scholar
Xu B, Zhang Y, Jiang L (2012) Coupled model based on grey relational analysis and stepwise discriminant analysis for water source. Rock Soil Mech 33(10):3122–3228
Google Scholar
Yin XX, Xu GQ, Gui HR, Chen LW (2006) Analyzing for sources of inrush-water in Wanbei Mining Area by systemic clustering and stepwise distinguishing. Coal Geol Explor 34(2):61–64
Google Scholar
Zhang WT (2002) SPSS 11.0 statistical analysis tutorial (advanced). Beijing Hope Electronic Press, Beijing
Google Scholar
Zhang H, Yao DX, Lu HF, Zhu NN, Xue L (2017) Application of principal component analysis and bayes discrimination approach in water source identification. Coal Geol Explor 45:87–93
Google Scholar

Download references

Acknowledgements

The project was supported by the National Natural Science Foundation of China (Grant nos. 41672273, 51474008), the Fundamental Research Funds for the Central Universities (22120180313) and the Anhui Natural Science Foundation of China (1508085QE89). The research was also substantially supported by the Key Laboratory of Geotechnical and Underground Engineering of Ministry of Education (Tongji University).

Author information

Authors and Affiliations

Department of Geotechnical Engineering, Tongji University, Shanghai, 200092, China
Hao Zhang, Haofeng Xing & Liangliang Liu
College of Earth and Environment, Anhui University of Science and Technology, Huainan, 232001, China
Duoxi Yao
Shanghai Municipal Engineering Design Institute (Group) Co., Ltd, Shanghai, 200092, China
Daorui Xue
Institute of Crustal Dynamics, China Earthquake Administration, Beijing, 100085, China
Fei Guo

Corresponding author

Correspondence to Haofeng Xing.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, H., Xing, H., Yao, D. et al. The multiple logistic regression recognition model for mine water inrush source based on cluster analysis. Environ Earth Sci 78, 612 (2019). https://doi.org/10.1007/s12665-019-8624-2

Received: 24 February 2019
Accepted: 27 September 2019
Published: 13 October 2019
DOI: https://doi.org/10.1007/s12665-019-8624-2
