Introduction

Mine water hazards are among the main geological hazards threatening the safety of coal mining. Coal mine water inrush often causes the partial submergence of a coal mine, resulting in huge economic losses and casualties (Gui and Lin 2016; Hu et al. 2011; Wu et al. 2016). The Qinan coal mine is located in the Suxian mining area. Since the mine became operational, water inrush incidents have been few and the water inflow has not been large. However, the Taoyuan coal mine, which belongs to the same hydrogeological unit as the Qinan coal mine, experienced a water inrush accident in 2013 with a maximum water inflow of 29,000 m³/h, which caused serious flooding. To prevent similar large-scale water inrush accidents in the Qinan coal mine, mine water prevention and control work is necessary. The accurate identification of mine water inrush sources is a prerequisite for this work, as well as an important part of preventing mine water inrush accidents (Ganyaglo et al. 2011; Zhang et al. 2017).

For a long time, experts and scholars have proposed many methods for identifying mine water inrush sources. These methods include geological analysis, hydrodynamic analysis, hydrochemical characteristics analysis, water temperature measurement, dynamic water level observation and geophysical prospecting (Biswas and Sharma 2017; Farnham et al. 2000; Panagopoulos et al. 2016; Keskin et al. 2015). Among them, hydrochemical characteristics analysis is a simple and effective way to identify a mine water inrush source. Li et al. (2017) used a hydrochemical approach to ascertain the mine water sources and to locate the potential seawater inrush seepage channels in the Xinli Mine. Wei et al. (2015) identified a water source by analyzing characteristic hydrochemical ions at water inrush points. At present, multivariate statistical analysis methods are relatively mature. The following methods are the most widely applied: principal component analysis reduces an original set of variables into a smaller number of uncorrelated components without losing much information (Jolliffe 2002; Kim et al. 2005; Meglen 1992; Qian et al. 2016); cluster analysis measures the similarities among samples (Bu et al. 2010; Reghunath et al. 2002); and discriminant analysis (including distance, Fisher and Bayes discriminant analysis) establishes an intuitive discriminant relation (Chen et al. 2009; Huang and Chen 2011; Lu et al. 2012; Huang and Wang 2018). Xu et al. (2012) selected six ion indicators (K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3−) and total dissolved solids (TDS) as discriminant factors for designing a GRA–SDA coupled model. Liu et al. (2013) proposed a Fisher recognition analysis for identifying a coal mining inrush water source under mining-induced disturbances.
A comprehensive identification model combining hydrochemistry analysis, water source detection, and water channel exploration was proposed by Liu et al. (2018). Based on conventional ion content test results, including pH values and total dissolved solids (TDS), Yin et al. (2006) used systematic clustering and stepwise discrimination to analyze the sources of the inrush water in the Wanbei mining area. Wang et al. (2017a, b) also used a PCA–BP neural network model, based on laser-induced fluorescence technology, to identify water inrush sources.

However, current recognition methods for mine water inrush sources do not fully consider the measurement errors in ion content caused by external factors, and they ignore the complex relationships between the ions. As a result, these methods deviate to some extent from the actual identification process for mine water inrush sources. The objective of this paper is therefore to propose a new method for the accurate identification of mine water inrush sources: a multiple logistic regression recognition model based on cluster analysis, which considers both the measurement errors in ion content and the complex internal relationships between ions. The method uses cluster analysis to measure the similarities among the original water samples in order to screen them. Principal component analysis is then used to extract the information in the hydrochemical indexes, converting multiple correlated indicator variables into new independent indicators. Finally, the multiple logistic regression recognition model predicts and classifies on the basis of the existing water samples. The method can thus effectively extract the variation information of the original water samples, eliminate the influence of overlapping information among variables and recognize mine water inrush sources. In addition, the recognition model was applied to a set of water samples to be discriminated in order to verify its accuracy. The results show that the multiple logistic regression recognition model based on cluster analysis has high accuracy, is easy to operate in the actual water source discrimination process and gives straightforward discrimination results.

Hydrogeological conditions in the study area

The Qinan coal mine is located in the middle of the Huaibei plain, in the Huaihe River valley (Fig. 1). The Huihe River, a tributary of the Huaihe River, flows through the mining area, which has high vegetation coverage. The study area belongs to the north temperate monsoon region, with an ocean-continental climate and four distinct seasons. As is typical of a central plain climate, the annual average temperature and annual average precipitation are about 14.6 °C and 756 mm, respectively. Rainfall is concentrated in July and August. Evaporation exceeds precipitation, and the annual average relative humidity is 71%.

Fig. 1 Map showing the location of the study area, with the distribution of faults, folds, and sampling sites within the Qinan coal mine

The Qinan coal mine is located in the southwest region of the Sunan syncline. The strata in the northern part of the mine dip steeply, generally at 20° to 30°. In the middle and eastern parts of the mine the dip is gentler, generally 7° to 15°. The coal-bearing strata within the study area are covered by a loose Cenozoic layer.

The groundwater regime in the Qinan mining area consists of four subsystems: the loose aquifer of the Cenozoic, the coal-bearing sandstone fissure aquifer of the Permian, the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous and the limestone karst fracture aquifer of the Ordovician. The hydrogeological characteristics of the aquifers and aquifuges are shown in Fig. 2.

Fig. 2 Composite column map of the aquifers and aquifuges

Among them, the limestone karst fracture aquifer of the Ordovician is furthest from the coal seam. Thus, under normal conditions, it has no direct water-filling effect on the coal mine.

Materials and methods

Sampling and test

We collected a total of 93 original water samples from the Qinan coal mine during the period 2000–2017 (positions shown in Fig. 1), with sampling spread evenly over that period. These water samples were used to establish the recognition model. Among them, there were 9 water samples from the fourth aquifer in the loose layer of the Cenozoic (referred to as “the fourth aquifer”), 39 water samples from the coal-bearing sandstone fissure aquifer of the Permian (referred to as “the coal-bearing sandstone aquifer”) and 45 water samples from the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous (referred to as “the limestone aquifer”). In addition, 16 water samples were taken on site from the Qinan mining area to verify the model. Among them, there were 2 water samples from the fourth aquifer, 4 water samples from the coal-bearing sandstone aquifer and 10 water samples from the limestone aquifer.

When the water samples were collected, the plastic bottles and caps were rinsed three to five times with the water being sampled. The water samples were then stored in clean 550 mL plastic bottles. Before testing, the water samples were kept at low temperature to inhibit redox reactions and biochemical activity (Chen et al. 2013; Faghih Nasiri et al. 2018). The conventional water chemistry tests covered the contents of K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3− and CO32−. Among them, HCO3− and CO32− were determined by dilute sulfuric acid-methyl orange titrimetry, Cl− and SO42− by ion chromatography, Ca2+ and Mg2+ by EDTA titration and K+ + Na+ by flame atomic absorption spectrophotometry. According to the hydrogeological data, the 109 water samples were taken from drain holes, hydrogeological observation wells, coal working faces and underground roadways. The water levels of the observation wells did not show any abnormal changes during the collection of the water samples. As such, this study considers the 109 water samples only from a static perspective. The water sample data are shown in Table 1, where X1, X2, X3, X4, X5, X6 and X7 represent the contents of K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3− and CO32−, respectively.

Table 1 The water inrush source samples from the Qinan coal mine

Cluster analysis

The principle of cluster analysis is that n different samples are initially regarded as n different classes, and the two classes with the closest properties (that is, the shortest distance) are merged into one class. Then, among the remaining n − 1 classes, the two classes with the closest properties are merged in turn. This process continues until all the samples have been merged into a single class. Cluster analysis is usually divided into Q-type and R-type cluster analysis according to the objects being classified: Q-type cluster analysis classifies samples, while R-type cluster analysis classifies variables. The basic steps of the algorithm are as follows:

  1. At the beginning, each sample is a separate class, and the matrix of pairwise distances between the n classes is calculated, denoted as:

    $$D_{0} = \left[ {\begin{array}{*{20}l} 0 & {} & {} & {} & {} \\ {d_{21} } & 0 & {} & {} & {} \\ {d_{31} } & {d_{32} } & 0 & {} & {} \\ \vdots & \vdots & \vdots & \ddots & {} \\ {d_{n1} } & {d_{n2} } & {d_{n3} } & \ldots & 0 \\ \end{array} } \right].$$

  2. Find the minimum distance dij in the distance matrix, denoted as \(d_{{i_{1} j_{1} }}\), and combine classes i1 and j1 into a new class, leaving n − 1 classes.

  3. Calculate the distances between the new class and the other classes.

  4. Merge rows i1 and j1 of the initial distance matrix D0 into a new row, and columns i1 and j1 into a new column, so that the number of classes is reduced by one. This gives the new distance matrix D1.

  5. Repeat steps (2), (3) and (4) until the n samples are clustered into one class.

  6. Draw the clustering process as a cluster analysis diagram, and screen the original samples according to the diagram to eliminate the samples that do not meet the requirements.
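Steps (1) to (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATLAB procedure used in this study: the text does not specify the distance measure or the rule for computing distances between merged classes, so the sketch assumes Euclidean distance and single linkage (the distance between two classes is that of their closest members). The function names and the four demonstration samples are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(samples):
    """Merge the two closest classes repeatedly until one class remains.

    Returns the merge history as (members_a, members_b, distance) tuples.
    Single linkage is an assumption of this sketch.
    """
    # Step 1: every sample starts as its own class.
    clusters = [[i] for i in range(len(samples))]
    history = []
    while len(clusters) > 1:
        # Step 2: find the pair of classes with the minimum distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Step 3: class-to-class distance under single linkage.
                d = min(euclidean(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        # Step 4: replace the two classes with their union (one class fewer).
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # Step 5 is the loop itself: it stops when a single class is left.
    return history

# Hypothetical two-variable samples, for illustration only (not Table 1 data).
demo = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(demo)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

The merge history carries the same information as the cluster analysis diagram in step (6): reading it from the first merge to the last shows which samples join at short distances and which join only late.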

Principal component analysis

Principal component analysis is a method for compressing the original data and extracting its characteristic information. It replaces many correlated variables with a few comprehensive variables. These comprehensive variables not only express most of the information in the original variables but are also mutually independent (Jolliffe 2002; Kim et al. 2005; Meglen 1992; Qian et al. 2016; Huang et al. 2019). The basic principle is as follows:

If X1, X2, …, Xn are defined as the original variables and Y1, Y2, …, Ym (m ≤ n) are new variables, the relationship between the original and new variables is

$$\left. {\begin{array}{*{20}l} {Y_{1} = A_{11} X_{1} + A_{12} X_{2} + \cdots + A_{1n} X_{n} } \hfill \\ {Y_{2} = A_{21} X_{1} + A_{22} X_{2} + \cdots + A_{2n} X_{n} } \hfill \\ \ldots \hfill \\ {Y_{m} = A_{m1} X_{1} + A_{m2} X_{2} + \cdots + A_{mn} X_{n} } \hfill \\ \end{array} } \right\},$$
(1)

where Yi is independent of Yj (i ≠ j; i, j = 1, 2, …, m). Y1 has the highest variance among all linear combinations of X1, X2, …, Xn, and Y2 has the maximum variance among all linear combinations of X1, X2, …, Xn that are independent of Y1. The remaining components are defined in the same way. The new variables Y1, Y2, …, Ym are the first, second, …, and mth principal components of the original variables X1, X2, …, Xn.
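A minimal NumPy sketch of Formula (1) is given below. It assumes the variables are standardized and the components are taken from the eigenvectors of the correlation matrix, which is a common choice but is not confirmed by the text; the function name and the random demonstration data are hypothetical.

```python
import numpy as np

def principal_components(X, m):
    """Return (eigenvalues, coefficients A, scores Y) for the first m components.

    X is an (n_samples, n_variables) array. Variables are standardized first,
    so the decomposition is of the correlation matrix (an assumption here).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]      # largest variance first
    A = eigvecs[:, order].T                    # row i holds A_i1 ... A_in
    Y = Z @ A.T                                # Y_i = A_i1 X_1 + ... + A_in X_n
    return eigvals[order], A, Y

# Hypothetical data: two strongly related variables and one unrelated variable.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
demo = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
                  base + 0.1 * rng.normal(size=(50, 1)),
                  rng.normal(size=(50, 1))])
eigvals, A, Y = principal_components(demo, 2)
print(eigvals)
```

In this sketch the variance of each score column equals the corresponding eigenvalue, and the score columns are uncorrelated with one another, which is the property Formula (1) requires of Y1, Y2, …, Ym.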

Multiple logistic regression analysis

In a multiple logistic regression analysis model with n types, let Pi (i = 1, 2, …, n) be the probability that a sample belongs to type i. Taking type n as the reference type, the multiple logistic regression analysis models are as follows (Wang 2010; Wang and Guo 2001; Zhang 2002):

$$\left. {\begin{array}{*{20}l} {G_{1} = \ln \frac{{P_{1} }}{{P_{n} }} = A_{1} X_{1} + A_{2} X_{2} + \cdots + A_{t} X_{t} + C} \hfill \\ {G_{2} = \ln \frac{{P_{2} }}{{P_{n} }} = B_{1} X_{1} + B_{2} X_{2} + \cdots + B_{t} X_{t} + D} \hfill \\ \ldots \hfill \\ {G_{n} = \ln \frac{{P_{n} }}{{P_{n} }} = 0} \hfill \\ \end{array} } \right\}.$$
(2)

Because the probabilities that a sample belongs to each of the n types sum to 1, we obtain Formula (3):

$$P_{1} + P_{2} + \cdots + P_{n} = 1.$$
(3)

Combining Formulas (2) and (3), we derive Formula (4):

$$\left. \begin{aligned} P_{1} = \frac{{e^{{G_{1} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ P_{2} = \frac{{e^{{G_{2} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ \ldots \hfill \\ P_{n} = \frac{1}{{1 + e^{{G_{1} }} + e^{{G_{2} }} + \cdots + e^{{G_{n - 1} }} }} \hfill \\ \end{aligned} \right\},$$
(4)

where P1, P2, …, Pn are the probability functions of the recognition models for types 1, 2, …, n; Xi (i = 1, 2, …, t) are the values of the independent variables; Ai and Bi are the regression coefficients of the independent variables; and C and D are constants.
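Formula (4) can be evaluated directly once G1, …, Gn−1 are known. The short sketch below shows the calculation; the function name is hypothetical, and the input values are chosen only to make the arithmetic easy to follow.

```python
import math

def type_probabilities(g):
    """Formula (4): g holds G_1 ... G_{n-1}; the reference type has G_n = 0."""
    exps = [math.exp(v) for v in g]
    denom = 1.0 + sum(exps)
    # Probabilities of types 1 .. n-1, followed by the reference type n.
    return [e / denom for e in exps] + [1.0 / denom]

# With G_1 = G_2 = 0 the three types are equally likely.
print(type_probabilities([0.0, 0.0]))
# With G_1 = ln 2 the odds of type 1 against the reference type are 2:1.
print(type_probabilities([math.log(2.0), 0.0]))
```

A sample is assigned to the type with the largest probability, and by construction the probabilities always satisfy Formula (3).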

Establishment and verification of recognition model

The sequence of steps taken using the water inrush source recognition methodology is described as follows (Fig. 3).

Fig. 3 Flowchart of mine water inrush source recognition methodology

Q-type cluster analysis of the original water samples

The ion content of a water sample can deviate because of external factors, for example pollution of the sample, evaporation due to poor sealing of the container, or measurement errors introduced by the operator during testing. To reduce this deviation, we used the ion contents of the original water samples as the analysis variables and carried out a Q-type cluster analysis of the 93 original water samples in MATLAB. The results of the cluster analysis are shown in Fig. 4.

Fig. 4 Q-type cluster analysis diagram

From the results of the Q-type cluster analysis displayed in Fig. 4, and according to the distances between the original water samples (Güler et al. 2002), we re-classified the original water samples and obtained new classification results. For 11 of the 93 original water samples, the new classification differs from the original classification: samples 1, 4, 13, 31, 38, 39, 42, 45, 72, 82 and 84. Because these samples would distort the discrimination results, we eliminated them to improve the accuracy of discrimination, leaving 82 training samples.
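The screening step amounts to removing the listed samples from the original set. The sketch below reproduces the count, assuming that the 93 original samples are numbered 1 to 93 (the numbering scheme is an assumption; the sample numbers themselves are those listed in the text).

```python
# Sample numbers whose cluster-based class differs from the original class.
inconsistent = {1, 4, 13, 31, 38, 39, 42, 45, 72, 82, 84}

# Assumption: the 93 original samples are numbered 1..93.
original_ids = range(1, 94)

# Keep only the samples whose classification is consistent.
training_ids = [i for i in original_ids if i not in inconsistent]
print(len(inconsistent), len(training_ids))
```

The remaining 82 samples are the training samples used in the following sections.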

R-type cluster analysis of training samples

We used MATLAB to carry out the R-type cluster analysis of the training samples. The contents of K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3− and CO32− were taken as the clustering variables, and the R-type cluster analysis results for the three types of aquifer were obtained (Fig. 5).

Fig. 5 Ion content R-type cluster analysis diagram

The results of the R-type cluster analysis show that, in the fourth aquifer, the degree of similarity is high between Ca2+ and Mg2+ and between Cl− and CO32−. In the coal-bearing sandstone aquifer, Ca2+, Mg2+ and SO42− are closely related, as are K+ + Na+ and Cl−. This reflects the origin of the ions: Ca2+, Mg2+ and SO42− were derived from the dissolution of sulfate rocks, and K+ + Na+ and Cl− came from soluble sodium–potassium salt rocks. In the limestone aquifer, Ca2+ and Mg2+ are also closely related, as are K+ + Na+ and Cl−, because Ca2+ and Mg2+ came from the partial dissolution of poorly soluble carbonate rocks. However, the relationship of CO32− with Ca2+ and Mg2+ was relatively weak, because the reaction CO32− + H2O ⇌ HCO3− + OH− proceeds in the forward direction when the concentration of CO32− increases. Thus, the concentration of CO32− in the groundwater decreased, and its relationship with Ca2+ and Mg2+ was weak. This is consistent with the alkalinity of the water samples from the limestone-karst fissure aquifer in the Taiyuan formation of the Carboniferous.

The ions in coal mine water are internally connected, but these connections have often been ignored when establishing recognition models for mine water inrush sources, which has led to excessive deviation when the models are applied in practice. This deviation has had serious consequences for coal mine production. To reduce it, the training samples were preprocessed using factor analysis before the recognition model was established.

Principal component analysis of the training samples

To verify the results of the R-type cluster analysis, a Pearson correlation analysis was conducted on the training samples (Chen et al. 2013; Huang and Wang 2018; Kim et al. 2005; Qian et al. 2016), and the Pearson correlation coefficients of the three types of water samples were obtained (Tables 2, 3, 4).

Table 2 Pearson correlation coefficients of the fourth aquifer water samples
Table 3 Pearson correlation coefficients of the coal-bearing sandstone aquifer water samples
Table 4 Pearson correlation coefficients of the limestone aquifer water samples

Tables 2, 3 and 4 show that the correlations between some ions in each aquifer were remarkable (Qian et al. 2016). In the water samples of the fourth aquifer, the concentrations of Ca2+ and Mg2+ were positively correlated (r = 0.767, p < 0.01), and Cl− and CO32− were significantly correlated (r = 0.971, p < 0.01). In the water samples of the coal-bearing sandstone aquifer, both Ca2+ and Mg2+ were moderately correlated with SO42− (Ca2+ vs. SO42−: r = 0.399, p < 0.05; Mg2+ vs. SO42−: r = 0.359, p < 0.05; Table 3), and K+ + Na+ was positively correlated with SO42− (r = 0.481, p < 0.01). In addition, Ca2+ and Mg2+ were significantly correlated (r = 0.877, p < 0.01) in the water samples of the limestone aquifer. The correlations between the ions of each aquifer obtained from the R-type cluster analysis and from the Pearson correlation coefficients were basically consistent. This indicates that the ions in coal mine water are internally connected.
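For reference, the Pearson correlation coefficient r used above can be computed as in the following sketch. The function name and the two short series are hypothetical and are not taken from Tables 2 to 4; the significance levels (p values) reported in the text require an additional test that is not shown here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ion contents for two ions in five samples, for illustration only.
ca = [40.0, 55.0, 62.0, 70.0, 88.0]
mg = [18.0, 24.0, 25.0, 33.0, 39.0]
r = pearson_r(ca, mg)
print(round(r, 3))
```

Values of r close to 1 or −1 indicate a strong linear relationship between two ions, and values close to 0 indicate a weak one.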

To account for the internal connections among the ions, factor analysis of the training samples was carried out using SPSS. We used principal component analysis to reduce the dimensionality of the seven original factors, combining them into three independent indicators that reflect the hydrochemical information.

Using principal component analysis within the factor analysis, the initial factors were extracted from the correlation coefficient matrix of the ions, and the initial eigenvalues and the variances explained by the principal components were obtained (Table 5).

Table 5 Total variance explained

The number of principal components can be determined from their cumulative variance. It is generally accepted that when the cumulative variance of the extracted principal components exceeds 80%, the selected components can fully reflect the hydrochemical information of the training samples (Chen et al. 2013; Wang et al. 2017a, b; Yin et al. 2006; Zhang et al. 2017). Therefore, we extracted three principal components. This is consistent with the number selected according to the eigenvalues, using eigenvalues greater than 1 as the criterion. The eigenvalues of the principal components are shown in Fig. 6.
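The cumulative variance rule can be illustrated with the variance percentages reported in this section for the three retained components (42.637%, 25.347% and 14.661%). The function name is hypothetical, and the 80% threshold is the one stated above.

```python
def components_needed(variance_pct, threshold=80.0):
    """Smallest number of leading components whose cumulative variance
    percentage exceeds the threshold; None if it is never exceeded."""
    total = 0.0
    for k, pct in enumerate(variance_pct, start=1):
        total += pct
        if total > threshold:
            return k
    return None

# Variance percentages reported in the text for the three retained components.
reported = [42.637, 25.347, 14.661]
k = components_needed(reported)
cumulative = sum(reported)
print(k, round(cumulative, 3))
```

Two components account for 67.984% of the variance, which is below the threshold, while three components account for 82.645%, so three are retained.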

Fig. 6 Scree plot of the principal components

The maximum variance (varimax) method was used for the orthogonal rotation of the initial factor loading matrix so that the loadings of each ion on the same factor were distinctly different. The orthogonal rotation converged after 6 iterations, giving the orthogonal rotation factor loading matrix (Table 6) and the orthogonal rotation factor loading diagram (Fig. 7). After the orthogonal rotation, each principal component represented the hydrochemical information of different ions. Principal component 1 represented Ca2+, Mg2+ and SO42− and reflected 42.637% of the information in the training samples. Principal component 2 represented Cl− and CO32− and reflected 25.347% of the information. Principal component 3 represented K+ + Na+ and reflected 14.661% of the information.

Table 6 Orthogonal rotation factor loading matrix
Fig. 7 Orthogonal rotation factor loading diagram

Principal component analysis of the 82 training water samples was carried out using SPSS, and the scores of the three principal components were obtained for each sample (Table 7). The scores of the principal components are denoted Y1, Y2 and Y3, respectively. The score coefficients relating the three principal components to the original variables are shown in Table 8.

Table 7 The scores of the principal components
Table 8 Principal component score coefficients

From the principal component score coefficients, the expressions relating the three extracted principal components Y1, Y2 and Y3 to the original variables X1, X2, X3, X4, X5, X6 and X7 were obtained as follows:

$$\left. \begin{aligned} Y_{1} & = 0.153X_{1} + 0.321X_{2} + 0.296X_{3} + 0.111X_{4} + 0.485X_{5} \\ & \quad - 0.015X_{6} - 0.058X_{7} \\ Y_{2} & = - 0.033X_{1} + 0.107X_{2} + 0.079X_{3} + 0.531X_{4} - 0.130X_{5} \\ & \quad - 0.146X_{6} + 0.479X_{7} \\ Y_{3} & = 0.707X_{1} - 0.167X_{2} - 0.208X_{3} - 0.083X_{4} + 0.475X_{5} \\ & \quad - 0.185X_{6} - 0.010X_{7} \\ \end{aligned} \right\},$$
(5)

where Y1, Y2 and Y3 represent the scores of principal component 1, principal component 2 and principal component 3 of the training samples; X1, X2, X3, X4, X5, X6 and X7 represent the contents of K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3− and CO32−, respectively.
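Formula (5) is a linear map from the seven ion contents to the three scores, and can be applied as in the sketch below. The coefficients are copied from Formula (5); the function name and the demonstration input are hypothetical. The text does not state whether the X values are raw or standardized ion contents, so the sketch simply applies the coefficients to whatever values are passed in.

```python
# Rows are the coefficients of Y1, Y2, Y3 on X1..X7, copied from Formula (5).
SCORE_COEFFS = [
    [0.153, 0.321, 0.296, 0.111, 0.485, -0.015, -0.058],
    [-0.033, 0.107, 0.079, 0.531, -0.130, -0.146, 0.479],
    [0.707, -0.167, -0.208, -0.083, 0.475, -0.185, -0.010],
]

def component_scores(x):
    """Apply Formula (5) to one sample x = [X1, ..., X7]."""
    if len(x) != 7:
        raise ValueError("expected seven ion-content values")
    return [sum(c * v for c, v in zip(row, x)) for row in SCORE_COEFFS]

# Hypothetical input: only X1 is non-zero, so each score equals its X1 coefficient.
y = component_scores([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(y)
```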

Construction of the recognition model

We regarded the principal component scores Y1, Y2 and Y3 of the 82 training water samples as independent variables for implementing the multiple logistic regression recognition analysis. The parameters of the multiple logistic regression recognition model are shown in Table 9.

Table 9 Multiple logistic regression recognition analysis model parameters

The fitted recognition functions are as follows:

$$\left. \begin{aligned} G_{1} = - 0.940Y_{1} - 0.312Y_{2} + 0.675Y_{3} - 1.561 \hfill \\ G_{2} = - 6.630Y_{1} + 1.416Y_{2} + 3.166Y_{3} - 3.479 \hfill \\ \end{aligned} \right\}.$$
(6)

Formula (6) can be expressed in terms of the original variables by substituting Formula (5), which gives Formula (7).

$$\left. \begin{aligned} G_{1} & = 0.343X_{1} - 0.448X_{2} - 0.443X_{3} - 0.326X_{4} \\ & \quad - 0.095X_{5} - 0.065X_{6} - 0.102X_{7} - 1.561 \\ G_{2} & = 0.027X_{1} - 3.395X_{2} - 3.254X_{3} - 1.016X_{4} \\ & \quad - 3.699X_{5} - 0.314X_{6} + 0.741X_{7} - 3.479 \\ \end{aligned} \right\}.$$
(7)

Finally, the expressions for the probability functions of the three types of water inrush sources are as follows:

$$\left. \begin{aligned} P_{1} = \frac{{e^{{G_{1} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ P_{2} = \frac{{e^{{G_{2} }} }}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ P_{3} = \frac{1}{{1 + e^{{G_{1} }} + e^{{G_{2} }} }} \hfill \\ \end{aligned} \right\},$$
(8)

where P1, P2 and P3 are the probability functions of the respective recognition models of types 1, 2, and 3; X1, X2, X3, X4, X5, X6 and X7 represent the contents of K+ + Na+, Ca2+, Mg2+, Cl−, SO42−, HCO3− and CO32−, respectively; and the final term of each recognition function is a constant.
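As an illustration of how Formulas (6) and (8) are used together, the sketch below evaluates the two recognition functions from a sample's principal component scores and converts them into the three type probabilities. The coefficients are copied from Formula (6); the function name and the all-zero demonstration scores are hypothetical, and the sketch makes no claim about which aquifer each type number denotes.

```python
import math

def recognition_probabilities(y1, y2, y3):
    """Evaluate Formula (6) and then Formula (8) for one sample's scores."""
    g1 = -0.940 * y1 - 0.312 * y2 + 0.675 * y3 - 1.561
    g2 = -6.630 * y1 + 1.416 * y2 + 3.166 * y3 - 3.479
    denom = 1.0 + math.exp(g1) + math.exp(g2)
    return math.exp(g1) / denom, math.exp(g2) / denom, 1.0 / denom

# Hypothetical scores (all zero), for illustration only.
p1, p2, p3 = recognition_probabilities(0.0, 0.0, 0.0)

# The sample is assigned to the type with the largest probability.
predicted_type = max((p1, 1), (p2, 2), (p3, 3))[1]
print(round(p1, 3), round(p2, 3), round(p3, 3), predicted_type)
```

With all scores equal to zero only the constant terms remain, so the reference type (type 3) receives the largest probability in this example.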

Verification of water inrush source recognition model

The 82 training samples in Table 7 were substituted one by one into the established multiple logistic regression recognition model based on cluster analysis for cross-validation (Table 10). The results showed an overall correct discrimination rate of 87.8%. The recognition accuracy was 42.8% for water samples from the fourth aquifer, 96.9% for water samples from the coal-bearing sandstone aquifer and 88.1% for water samples from the limestone aquifer. The difference in recognition accuracy between the aquifers is due to the difference in the number of training water samples. Because coal mining is less threatened by water inrush from the fourth aquifer, only a limited number of water samples were collected from it, whereas the recognition model needs a sufficient number of water samples to be reliable. Therefore, the recognition accuracy for water samples from the fourth aquifer is significantly lower than that for the coal-bearing sandstone aquifer and the limestone aquifer. For comparison, the traditional multiple logistic regression recognition model incurred multiple errors in its rediscrimination steps and had a correct discrimination rate of less than 78.5%. Therefore, the multiple logistic regression recognition model based on cluster analysis was more accurate, had a higher degree of stability, and could meet the actual requirements of water inrush source recognition.

Table 10 Classification results of cross-validation

In addition, to further verify the accuracy of the established multiple logistic regression recognition model based on cluster analysis, the 16 water samples to be discriminated from the Qinan mining area were substituted into the model (Table 11). Table 11 shows that 15 of the 16 water samples are classified correctly by the model and only one sample is misclassified, giving an accuracy of 93.8%. Water sample A2 actually comes from the fourth aquifer, but the model discriminates it as a coal-bearing sandstone aquifer water sample. Overall, the multiple logistic regression recognition model based on cluster analysis was more accurate and more widely applicable than the traditional multiple logistic regression recognition model. Therefore, the multiple logistic regression recognition model based on cluster analysis has significant engineering relevance.
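As a check on the reported figure, with one of the 16 verification samples misclassified, the accuracy works out as:

$$\frac{16 - 1}{16} = \frac{15}{16} = 93.75\% \approx 93.8\% .$$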

Table 11 Classification results of the water inrush source discriminant model

Results and discussion

Based on the hydrogeological conditions of the mining area, cluster analysis of the water quality samples was carried out in this paper. The results were used to analyze and extract typical water samples. Finally, the multiple logistic regression recognition model based on cluster analysis was established. From the results of the model recognition and the engineering application, the following conclusions were drawn:

  1. Through the cluster analysis of the original water samples, the nonconforming water samples were eliminated. The 82 water samples that accurately reflect the hydrochemical characteristics of the water inrush aquifers were screened from the 93 original water samples and used as training samples to establish the recognition model, which reduced the influence of water quality analysis errors on the accuracy of the model.

  2. In establishing the recognition model, to eliminate the internal connections between the ions, this paper adopted principal component analysis to reduce the dimension of the initial seven variables, combining the original seven factors into a few independent indexes that comprehensively reflect the hydrochemical information.

  3. The overall recognition accuracy of the multiple logistic regression recognition model based on cluster analysis reaches 87.8%. The model is easy to operate in the actual water source discrimination process and gives straightforward discrimination results. It provides a new way to discriminate mine water inrush sources and has important guiding significance for mine water prevention and control work.

  4. The recognition model is built on a finite amount of hydrological data, and the number of water samples influences its accuracy. Therefore, more water sample data should be collected to improve accuracy. In addition, given the complexity of the hydrogeological conditions and the effects of temperature and human activities on the aquifers in the study area, future studies should fully consider the impact of these factors to promote the application of the model.