1 Introduction

Environmental quality assessment of rivers is a major issue in China’s water security. The South-North Water Transfer Middle Route, as an important water supply route for the Beijing-Tianjin-Hebei region, is crucial to the region’s economic development and people’s health. An effective water quality prediction method is therefore essential to prevent water quality deterioration and pollution in the South-North Water Transfer Project. The 13 automatic monitoring stations along the middle route have accumulated long and continuous water quality time-series data, which are well suited to training deep learning models for prediction. However, the existing platform is only capable of simple processing of the monitoring data and has the following shortcomings: 1) it cannot effectively use data analysis tools to extract valuable information from complex and massive time-series data; 2) it cannot use the safety thresholds of the monitoring data to raise alarms in advance; and 3) it cannot provide effective and comprehensive decision-making support for the staff. In this paper, based on the time-series data of the 13 automatic monitoring stations, and in view of the temporal, unstable and complex nature of water quality data, we study how to reduce the dimensionality of water quality indicators through rough sets theory, eliminating redundant attributes and extracting core attributes. We then complete water quality prediction with a deep learning model, providing forecasting and early-warning capability for the water supply and transmission system of the middle route. In the following, we introduce some necessary background for the paper.

1.1 A short review of rough sets theory

Water quality data sampled automatically by monitoring stations are characterized by high-dimensional, complex indicators, which calls for effective dimensionality reduction methods. Rough sets theory is a very effective tool for attribute reduction. In 1982, Pawlak [1] proposed rough sets theory by introducing lower and upper approximations through equivalence relations in information systems.

However, classical Pawlak rough sets apply only to discrete data, and discretizing continuous data inevitably causes information loss and fragmentation, so they cannot directly handle the numerical attributes in water quality data [2]. For this reason, based on Pawlak rough sets, Lin [3] proposed neighborhood rough sets in 1988, which can effectively deal with continuous numerical data. Yao [4] in 1998 and Wu [5] in 2002 proposed the concept of neighborhood granules, which differs from the notion of neighborhood in metric spaces. The properties of neighborhood rough sets were discussed by Hu et al. [6] in 2008. Since then many scholars have devoted themselves to the study of neighborhood rough sets. For example, in 2012, Ma [7] proposed a neighborhood rough set by using a fuzzy covering neighborhood. In 2016, Li et al. [8] proposed a decision-theoretic rough sets model based on neighborhood rough sets, and used neighborhood granularity to derive decision rules. In 2018, Miao et al. [9] proposed a three-way decision model with fuzzy neighborhood rough sets. In the same year, Fujita et al. [10] presented a proposal that allows granular computing (GrC) solutions to be systematically integrated into different phases of resilience analysis of critical infrastructures (CIs). In 2022, Yang et al. [11] explored a novel framework of sequential three-way decision for the fusion of mixed data from subjective and objective dynamic perspectives. In 2023, Liu et al. [12] proposed a novel multi-label feature selection method based on label distribution and neighborhood rough sets, known as LDRS. Also in 2023, Yin et al. [13] studied a noise-tolerant fuzzy neighborhood rough set model for multi-label learning and its feature selection strategy.

In terms of applications, a very important use of rough sets theory is attribute reduction. For instance, in 2017, Wang et al. [14] proposed an attribute reduction method based on a neighborhood discrimination index and a significance measure of candidate features. In 2019, Wang et al. [15] proposed another attribute reduction method which combines the advantages of both \(\delta \)-neighborhood and k-nearest neighbor. In 2021, Wan et al. [16] considered the characteristic of interaction in neighborhood rough sets and proposed a novel hybrid feature reduction method. Also in 2021, Sang et al. [17] proposed an attribute reduction method based on a fuzzy dominance neighborhood rough set. In 2022, Yang et al. [18] combined incremental technology and an accelerated strategy in attribute reduction. In the same year, Yang et al. [19] proposed an attribute reduction model based on the Student-t kernel and fuzzy divergence. Hu et al. [20] proposed an attribute reduction model based on weighted neighborhood rough sets in 2021, and in the following year Hu et al. [21] proposed another attribute reduction model based on k-nearest-neighbor rough sets. However, neighborhood rough sets have a non-negligible drawback on complex data: the strict definition of the lower approximation severely limits their applicability and gives them low error tolerance. In 1987, Wong et al. [22] introduced the 0.5-probabilistic rough set model to describe the generalized notion of rough sets (probabilistic approximate classification) through the concept of fuzzy sets. In 2015, Wang et al. [23] proposed an attribute reduction model based on a monotonic uncertainty measure in the probabilistic rough set model. In 2022, Xie et al. [24] proposed an attribute reduction approach based on a weighted neighborhood probabilistic rough set, which makes the application of attribute reduction more extensive.

1.2 A brief introduction of deep learning

Water quality prediction methods have evolved from mechanistic analysis through statistical analysis to deep learning methods. With the improvement of water quality databases and the long-term development of deep learning, such methods can be applied in depth to water quality prediction [25]. The first wave of neural network development began in the 1940s, when McCulloch and Pitts [26] in 1943 and Hebb [27] in 1949 made great progress in biological learning theory. In 1958, Rosenblatt [28] proposed the perceptron, the first neural network model, which enabled the training of individual neurons. The second wave started in the 1980s, when Rumelhart et al. [29] proposed the BP neural network in 1986; however, BP neural networks are prone to vanishing gradients, local optima and overfitting [30]. In 2006, Hinton [31], Bengio [32] and Ranzato [33] showed, both theoretically and in applications, that multilayer neural networks can be used in machine learning frameworks. In 2012, Hinton et al. [34] proposed the Dropout method to alleviate overfitting and reduce training time. Many other approaches have been proposed to solve these problems. For example, Zhang et al. [35] proposed a neural network model optimized by the particle swarm algorithm, and Wang et al. [36] proposed an evolutionary neural network optimized by an evolutionary algorithm.

Introducing rough sets into neural networks not only improves the networks’ ability to handle noisy, redundant or uncertain input patterns, but also exploits rough sets’ efficiency in processing data without prior knowledge. In 2011, Xu et al. [37] established a novel least-squares method based on neural networks. In 2018, Lei [38] combined a rough-set-based attribute reduction method with wavelet neural networks and successfully applied it to stock prediction. In 2021, He et al. [39] utilized rough set theory to overcome the exploding and vanishing gradient problems of GRU neural networks, successfully applying the result to dynamic load prediction of a scraper conveyor. These studies illustrate that the combination of rough set theory and neural network methods can be applied well to classification and prediction problems.

On the other hand, neural networks have also been widely applied to water quality prediction. In 2017, Wang et al. [40] used a Long Short-Term Memory (LSTM) neural network for water quality prediction. In 2019, Ren et al. [41] used a Recurrent Neural Network (RNN) for water quality prediction and early warning. In 2022, Peleato et al. [42] applied neural networks with fluorescence spectral downscaling to the prediction of drinking water disinfection by-products. Also in 2022, Wang et al. [43] used a neural network process-based model to analyze estuarine water quality. In 2023, Wang et al. [44] proposed a new method for water anomaly detection based on fluctuation feature analysis and LSTM. Also in 2023, Tao et al. [45] used Variational Mode Decomposition (VMD) to extract signal fluctuation characteristics and reduce noise interference, which improved the accuracy of LSTM for water quality prediction. Network structures with RNNs as the main body make better use of the historical information in water quality data and better extract the memory information of time-series, and have become the mainstream method for predicting data sets with temporal attributes such as stock prices and water quality.

1.3 Motivation and contribution of this research

In this paper, based on the demand for water quality data prediction in the middle route, we propose a water quality prediction method based on k-nearest-neighbor probability rough sets (KNPRSs) and PSO-LSTM. The KNPRSs are used to eliminate redundant attributes and extract key attributes from water quality indicators whose indicator system is complex and whose mechanisms are difficult to analyze, thereby streamlining the data set. The PSO-LSTM optimizes the corresponding neural network hyperparameters for the water quality of different stations, predicts the target water quality indicators, constructs a model structure that fits the actual situation of each station, and improves the accuracy of water quality prediction.

The paper’s motivations are delineated in the following.

  1. 1.

    For the complexity of the water quality indicators, the k-nearest-neighbor algorithm is introduced to remove the subjectivity involved in selecting a neighborhood threshold, and probabilistic rough sets are constructed to improve the error tolerance of the model. This allows us to complete the attribute reduction for the target water quality characteristics, eliminate redundant attributes, extract the key attributes, and streamline the input dimension of the subsequent neural network prediction model.

  2. 2.

    For the temporal and unstable nature of water quality data, the PSO algorithm is introduced to optimize the LSTM neural network. While the time-series characteristics of water quality data are handled well, the model hyperparameters are adaptively optimized for different indicators at different stations, so that the model achieves good results for the prediction of every indicator at every station along the middle route.

The paper’s contributions are delineated in the following.

  1. 1.

    Facing a complex water quality dataset, and compared with traditional attribute reduction methods, the attribute reduction method based on k-nearest-neighbor probabilistic rough sets ranks the distances between objects. It ensures that the grouped objects are similar while remaining free from the interference of subjective factors and the complexity of the water quality, so it better streamlines the data dimensions and thus improves the accuracy of the subsequent water quality prediction.

  2. 2.

    A PSO-LSTM water quality prediction model is constructed by combining the particle swarm algorithm with the long short-term memory network. The model is based on the LSTM neural network, which copes well with the temporal order of water quality data and selectively retains historical information through the “gate” structure and memory cells. At the same time, the PSO algorithm is used to optimize the hyperparameters of the LSTM neural network, and the model is adaptively optimized for different indicators at different stations, which greatly improves the stability of water quality prediction.

1.4 Layout of this paper

The content of this paper is organized as follows. Section 2 introduces the basic theoretical knowledge involved in the water quality prediction method based on KNPRSs and PSO-LSTM. Section 3 introduces the construction of the attribute reduction model based on KNPRSs, and then describes how the PSO algorithm is used to optimize the hyperparameters of the LSTM, which performs well on long time-series, thus completing the construction of the PSO-LSTM water quality prediction model. In Section 4, we compare our method with three other water quality prediction methods in data experiments to verify the feasibility and superiority of KNPRSs-PSO-LSTM. Section 5 summarizes the work of the entire paper.

2 Basic concepts

This section consists of three subsections, which mainly introduce the basic theoretical knowledge involved in the subsequent construction of KNPRSs and PSO-LSTM: probabilistic rough sets (PRS), Long Short-Term Memory (LSTM) and particle swarm optimization (PSO).

2.1 Probabilistic rough sets

In this subsection, we introduce the basic concept of probabilistic rough sets.

Definition 1

[6, 46] A quadruple \(IS=\langle U,A,V,I \rangle \) is said to be an information system, where \(U=\{x_{1},x_{2},...,x_{n}\}\) is a non-empty finite sample set, also called the universe, \(A=\{a_{1},a_{2},...,a_{m}\}\) is a non-empty finite set of attributes, V stands for the range of attribute values and \(I:U\times A\rightarrow V\) is an information (mapping) function.

For brevity, an information system can be written as \(IS=\langle U,A \rangle \). Specifically, if \(A=C\cup D\), which includes both the conditional attribute set C and the decision attribute set D, \(C\cap D=\varnothing \) and \(D\ne \varnothing \), then we call system \(DS=\langle U,C\cup D \rangle \) a decision information system.

Definition 2

[47] Given an information system \(\langle U, A \rangle \), \(0\le \beta <\alpha \le 1\), then \(\forall X\subseteq U\), the lower and upper approximations of X are defined as

$$\begin{aligned} \underline{Apr^{\alpha }}X=\{x\in U~|~P(X~|~[x]_{R})\ge \alpha \}, \end{aligned}$$
(1)
$$\begin{aligned} \overline{Apr^{\beta }}X=\{x\in U~|~P(X~|~[x]_{R})> \beta \}, \end{aligned}$$
(2)

where \(P(X~|~[x]_{R})=\frac{|[x]_{R} \cap X|}{|[x]_{R}|}\) and \([x]_{R}\) is an equivalence class by x with respect to R, i.e., \([x]_{R}=\{y\in U ~|~(x,y)\in IND(R)\}\) and \(IND(R)=\{(x,y)\in U \times U~|~ \forall a \in R, I(x,a)=I(y,a)\}\).

Meanwhile, from Definition 2, we can get that the universe is divided into three parts, which are the positive region POS(X), the boundary region BND(X) and the negative region NEG(X).

$$\begin{aligned}&POS(X)=\{x\in U~|~P(X~|~[x]_{R})\ge \alpha \},\end{aligned}$$
(3)
$$\begin{aligned}&BND(X)=\{x\in U~|~\beta<P(X~|~[x]_{R})< \alpha \},\end{aligned}$$
(4)
$$\begin{aligned}&NEG(X)=\{x\in U~|~P(X~|~[x]_{R})\le \beta \}. \end{aligned}$$
(5)

2.2 LSTM

LSTM is a kind of recurrent neural network that is specialized in prediction on long time-series data sets and is good at learning long-range dependencies; it is a variant of the recurrent neural network. The core concepts of LSTM are the “gate” structure and the memory cell. The “gate” structure determines what the network learns to save and forget during training, so that historical information can be added and deleted. The memory cell, on the other hand, acts as a pathway for information transfer, allowing historical information to be passed forward through the cell structure. As a result, information from many time steps earlier can still reach the memory cell of the latest time step, which eliminates the effect of short-term memory. This core structure allows the LSTM to selectively forget unimportant information during processing and to retain important information according to learned weights. Thus, the LSTM is able to retain cell states far from the current moment and remember longer sequences [48]. The structure of one hidden layer cell is shown in Fig. 1.

Fig. 1

LSTM neural network structure schematic

Here \(X_{t}\) is the input of the input layer at time t; \(h_{t}\) is the output of the hidden layer cell at time t; \(C_{t}\) is the cell state at time t; \(f_{t}\) is the forget gate state at time t; \(i_{t}\) is the input gate state at time t; \(Ce_{t}\) is the candidate cell state at time t; \(o_{t}\) is the output gate state at time t; and \(\sigma _{1},\sigma _{2},\sigma _{3}\) and tanh are activation functions. The gates and the memory cell work as follows.

  1. 1.

    Forget Gate: decides which information should be kept and which forgotten. The output of the previous hidden layer cell and the input of the current hidden layer cell are fed through the forget gate into the \(\sigma _{1}\) activation function, and the output value \(f_{t}\), between 0 and 1, determines how much of the previous cell state \(C_{t-1}\) is discarded. Here \(f_{t}\) is calculated as

    $$\begin{aligned} f_{t}=\sigma _{1}(W_{f}[h_{t-1},X_{t}]+b_{f}). \end{aligned}$$
    (6)
  2. 2.

    Input Gate: the role of the input gate is to update the cell state. The input gate feeds the output of the previous hidden layer cell and the input of the current hidden layer cell into the \(\sigma _{2}\) activation function, whose output value \(i_{t}\), between 0 and 1, determines which information to update; values closer to 0 mean less important and values closer to 1 mean more important. Next, the output of the previous hidden layer cell and the input of the current hidden layer cell are fed into the tanh function to create a new candidate cell state \(Ce_{t}\). Finally, the output of the \(\sigma _{2}\) activation function is multiplied by the output of tanh, so that \(i_{t}\) determines which parts of \(Ce_{t}\) are important enough to be saved. The equations for \(i_{t}\) and \(Ce_{t}\) are

    $$\begin{aligned}&i_{t}=\sigma _{2}(W_{i}[h_{t-1},X_{t}]+b_{i}),\end{aligned}$$
    (7)
    $$\begin{aligned}&Ce_{t}=tanh(W_{Ce}[h_{t-1},X_{t}]+b_{Ce}). \end{aligned}$$
    (8)
  3. 3.

    Cell Status: the cell state \(C_{t-1}\) of the previous layer is multiplied point by point with the forgetting vector \(f_{t}\). If the result is closer to 0, then it means that the old cellular memory information needs to be forgotten in the new cellular state. This result is then summed point by point with the output value of the input gate, and the newly discovered important information of the cell structure is updated into the cell state. By updating the cell state through the above steps, the new cell state \(C_{t}\) is calculated as

    $$\begin{aligned} C_{t}=f_{t}\cdot C_{t-1}+i_{t}\cdot Ce_{t}. \end{aligned}$$
    (9)
  4. 4.

    Output Gate: the role of the output gate is to form the output value of the current hidden layer cell, which also serves as part of the input to the next hidden layer cell. The output of the previous hidden layer cell and the input of the current hidden layer cell are first fed into the \(\sigma _{3}\) activation function to form the output gate state \(o_{t}\); the updated cell state \(C_{t}\) is passed through the tanh function. Finally, the output of tanh is multiplied by the output of the \(\sigma _{3}\) activation function to form the output of the hidden layer cell and the information to be passed on. The output \(h_{t}\) is used as the output of the current cell and as input information for the next hidden layer cell. The output gate state \(o_{t}\) and the output result \(h_{t}\) are calculated as follows

    $$\begin{aligned}&o_{t}=\sigma _{3}(W_{o}[h_{t-1},X_{t}]+b_{o}),\end{aligned}$$
    (10)
    $$\begin{aligned}&h_{t}=o_{t}\cdot tanh(C_{t}). \end{aligned}$$
    (11)

In the LSTM neural network, information from different historical moments is not equally important to the current moment. By combining the memory cell and the “gate” structure described above, the information of historical moments can be selectively filtered to complete the update of the current state. When the forget vector \(f_{t}\ne 0\) is output from the forget gate, some of the information of the historical moments is preserved and fused with the current state, which allows the LSTM to cope with long-range dependence and to complete long-distance time-series prediction.
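The gate equations (6)-(11) can be summarized in a short forward pass. The NumPy sketch below is illustrative only: the weight shapes, the random initialisation and the toy input sequence are our own assumptions, not the model configuration used later in the paper.

```python
import numpy as np

# Minimal NumPy sketch of one LSTM cell step, following Eqs. (6)-(11).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step; W and b hold the parameters of the four gates."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, X_t]
    f_t = sigmoid(W["f"] @ z + b["f"])            # forget gate, Eq. (6)
    i_t = sigmoid(W["i"] @ z + b["i"])            # input gate, Eq. (7)
    ce_t = np.tanh(W["c"] @ z + b["c"])           # candidate state, Eq. (8)
    c_t = f_t * c_prev + i_t * ce_t               # cell state update, Eq. (9)
    o_t = sigmoid(W["o"] @ z + b["o"])            # output gate, Eq. (10)
    h_t = o_t * np.tanh(c_t)                      # hidden output, Eq. (11)
    return h_t, c_t

# Toy dimensions: input size 3, hidden size 4.
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_h, n_h + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_in)):          # a short input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h)
```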

2.3 Particle swarm optimization

Particle Swarm Optimization (PSO), also known as the bird-flock foraging algorithm, was first introduced by Eberhart and Kennedy [49] in 1995 after observational studies of the feeding behavior of bird flocks. In a given area there are food targets of different sizes, and the individuals in the flock share the best food found so far, so that the whole flock can search out the best food target in the area. The essence of the PSO algorithm is a simplified model of swarm intelligence based on this kind of group behavior: by sharing the best information found by individuals, the whole group moves in a specific space from disorder to order and finally finds the optimal solution.

Definition 3

[50] Assume that in a d-dimensional target space, the search community consists of n particles, where the i-th d-dimensional particle can be expressed as

$$\begin{aligned} X_{i}=\{x_{i1},x_{i2},...,x_{id}\}. \end{aligned}$$
(12)

The flight speed can be expressed as

$$\begin{aligned} V_{i}=\{v_{i1},v_{i2},...,v_{id}\}. \end{aligned}$$
(13)

The historical best position can be expressed as

$$\begin{aligned} P_{i}=\{p_{i1},p_{i2},...,p_{id}\}. \end{aligned}$$
(14)

As the particle evolves from generation t to generation \(t+1\), its velocity and position are updated as follows

$$\begin{aligned}&v_{ij}(t+1)=wv_{ij}(t)+c_{1}r_{1}(t)(p_{ij}(t)-x_{ij}(t))+c_{2}r_{2}(t)(p_{gj}(t)-x_{ij}(t)),\end{aligned}$$
(15)
$$\begin{aligned}&x_{ij}(t+1)=x_{ij}(t)+v_{ij}(t+1), \end{aligned}$$
(16)

where w represents the inertia weight, which determines the influence of the particle’s previous velocity on the current iteration and balances global and local search; \(P_{g}=\{p_{g1},p_{g2},...,p_{gd}\}\) represents the best position found by the swarm up to the current iteration; \(c_{1}\) and \(c_{2}\) are learning factors that control, respectively, how strongly the particle is drawn toward its own historical best and toward the global best of the swarm; and \(r_{1},r_{2}\) are random numbers in [0, 1]. In addition, upper and lower bounds on \(X_{i}\) are imposed to ensure that the search stays within a given region.

According to Definition 3, we can get the implementation flow of the PSO algorithm as follows:

Step 1. Initialize: randomly generate the velocity and position of each individual in the population;

Step 2. Evaluate the particles: calculate the fitness of each individual. In general, the optimization goal is to minimize a function value, so the smaller the calculated function value, the greater the fitness;

Step 3. Update the best: compare the individual fitness with its historical best, if it is better than the historical best, then update its historical best \(P_{i}\); compare the individual fitness with the population best, if it is better than the population best, then update the population best \(P_{g}\);

Step 4. Update the particle: update the velocity and position of the particle according to (15) and (16);

Step 5. Output the optimal value: determine whether the termination condition is satisfied; if so, output the optimal value, otherwise return to Step 2.
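The update rules (15)-(16) and Steps 1-5 can be condensed into a few lines. The following sketch, with an illustrative sphere objective and arbitrarily chosen parameter values, is one possible realisation rather than the exact implementation used in this paper.

```python
import numpy as np

# Minimal PSO sketch following Eqs. (15)-(16) and Steps 1-5.

def pso(fitness, dim, n_particles=20, n_iter=100,
        w=0.7, c1=2.0, c2=2.0, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n_particles, dim))       # Step 1: positions
    v = np.zeros((n_particles, dim))                  # Step 1: velocities
    p_best, p_val = x.copy(), np.array([fitness(xi) for xi in x])
    g_best = p_best[p_val.argmin()].copy()            # global best P_g
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # Eq. (15)
        x = np.clip(x + v, lb, ub)                    # Eq. (16) plus bounds
        val = np.array([fitness(xi) for xi in x])     # Step 2: evaluate
        improved = val < p_val                        # Step 3: update bests
        p_best[improved], p_val[improved] = x[improved], val[improved]
        g_best = p_best[p_val.argmin()].copy()        # Step 4 done above
    return g_best, p_val.min()                        # Step 5: output

best_x, best_f = pso(lambda z: float(np.sum(z ** 2)), dim=3)
print(best_x, best_f)
```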

3 A water quality prediction method based on k-nearest-neighbor probability rough sets and PSO-LSTM

First, we establish an attribute reduction approach based on k-nearest-neighbor probability rough sets (KNPRSs) to reduce the many high-dimensional water quality indicators. This pre-processing makes the input dimension of the water quality prediction model more concise, thereby improving both the accuracy and the time efficiency of the prediction model.

3.1 K-nearest-neighbor probability rough sets

First of all, we introduce the parameter k to define the k-nearest-neighbor class and the corresponding k-nearest-neighbor relation on the universe.

Definition 4

Given a decision information system \(DS=\langle U,C\cup D\rangle \). For any \(x\in U\), \(B\subseteq C\), define the k-nearest-neighbor class \(R^{k}_{B}(x)\) of the attribute subset B with respect to the object x as

$$\begin{aligned} R^{k}_{B}(x)=\bigcap \limits _{a\in B}R^{k}_{a}(x), \end{aligned}$$
(17)

where \(R^{k}_{a}(x)\) denotes the set of k objects (including x itself) nearest to x with respect to the attribute a in the universe U and k is a given positive integer.

Besides, \(R^{k}_{B}(x)\) is also called the k-nearest-neighbor information granule of the object x, and its size is controlled by the parameter k. In particular, the granule is smallest when \(k=1\), where \(R^{k}_{B}(x)=\{x\}\), and largest when \(k=|U|\), where \(R^{k}_{B}(x)=U\).

Then, we can define k-nearest-neighbor relation on the universe U.

Definition 5

A certain k-nearest-neighbor relation on the universe U can be defined using the matrix relation \(N^{k}_{B}=(r^{k}_{ij})_{n\times n}\), where

$$\begin{aligned} r^{k}_{ij}={\left\{ \begin{array}{ll} 1 &{} x_{j}\in R^{k}_{B}(x_{i}),\\ 0 &{} \text{ Otherwise. } \end{array}\right. } \end{aligned}$$
(18)

Obviously, \(N^{k}_{B}\) satisfies \(r^{k}_{ii}=1\), but may not satisfy \(r^{k}_{ij}=r^{k}_{ji}\), which may be an asymmetric matrix.
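Definitions 4 and 5 can be implemented directly by ranking the per-attribute distances. The sketch below, on a small synthetic data set, computes \(R^{k}_{B}(x)\) for every object and assembles the relation matrix \(N^{k}_{B}\); the toy data and the use of absolute differences as the per-attribute distance are illustrative assumptions.

```python
import numpy as np

# Sketch of Definitions 4-5: k-nearest-neighbor classes R^k_B(x) and the
# relation matrix N^k_B on synthetic data.

def knn_class(data, B, k):
    """R^k_B(x_i) for every object i, as a boolean membership matrix."""
    n = data.shape[0]
    member = np.ones((n, n), dtype=bool)
    for a in B:                                   # intersect over attributes in B
        dist = np.abs(data[:, a][:, None] - data[:, a][None, :])
        # indices of the k objects closest to x_i w.r.t. attribute a
        idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
        mask = np.zeros((n, n), dtype=bool)
        np.put_along_axis(mask, idx, True, axis=1)
        member &= mask                            # Eq. (17): R^k_B = ∩_a R^k_a
    return member

def knn_relation(data, B, k):
    """N^k_B from Eq. (18): r_ij = 1 iff x_j ∈ R^k_B(x_i)."""
    return knn_class(data, B, k).astype(int)

data = np.array([[0.1, 1.0], [0.2, 0.9], [0.8, 0.2], [0.9, 0.3], [0.5, 0.5]])
N = knn_relation(data, B=[0, 1], k=3)
print(N)            # generally asymmetric, with an all-ones diagonal
```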

It is natural that we can obtain the following property.

Proposition 1

Given a decision information system \(\langle U,C\cup D\rangle \), the following statements hold.

  1. 1.

    If \(B_{1}\subseteq B_{2}\subseteq C\), then \(N^{k}_{B_{1}}\supseteq N^{k}_{B_{2}}\);

  2. 2.

    If \(k_{1}\le k_{2}\), then \(N^{k_{1}}_{B}\subseteq N^{k_{2}}_{B}\).

Proof

  1. 1.

    By Definition 4, it follows that \(R^{k}_{B_{1}\cup B_{2}}(x)=\bigcap \limits _{a\in B_{1}\cup B_{2}}R^{k}_{a}(x)=R^{k}_{B_{1}}(x)\bigcap R^{k}_{B_{2}}(x)\). Thus, one gets that \(N^{k}_{B_{1}\bigcup B_{2}}=N^{k}_{B_{1}}\bigcap N^{k}_{B_{2}}\). Since \(B_{2}=B_{1}\bigcup (B_{2}-B_{1})\), one gets that \(N^{k}_{B_{2}}=N^{k}_{B_{1}}\bigcap N^{k}_{B_{2}-B_{1}}\). Therefore, one concludes that \(N^{k}_{B_{1}}\supseteq N^{k}_{B_{2}}\).

  2. 2.

    Since \(k_{1}\le k_{2}\), one gets that \(\forall a\in B\), \(R^{k_{1}}_{a}(x)\subseteq R^{k_{2}}_{a}(x)\) and \(R^{k_{1}}_{B}(x)\subseteq R^{k_{2}}_{B}(x)\). Thus, one concludes that \(N^{k_{1}}_{B}\subseteq N^{k_{2}}_{B}\).

\(\square \)

In what follows, we give the definition of the k-nearest-neighbor probability rough sets by using the k-nearest-neighbor class.

Definition 6

Suppose that \(\langle U,C\cup D\rangle \) is a decision information system and \(R^{k}_{B}(x)\) is a k-nearest-neighbor class of any \(x\in U\) generated by the conditional attribute set \(B\subseteq C\). Then, for a pair of parameters \(\alpha \) and \(\beta \) with \(0\le \beta <\alpha \le 1\), the lower approximation and upper approximation of any \(X\subseteq U\) with respect to the conditional attribute set B are defined, respectively, as follows:

$$\begin{aligned}&\underline{Apr^{k,\alpha }_{B}}(X)=\{x\in U~|~P(X|R^{k}_{B}(x))\ge \alpha \},\end{aligned}$$
(19)
$$\begin{aligned}&\overline{Apr^{k,\beta }_{B}}(X)=\{x\in U~|~P(X|R^{k}_{B}(x))> \beta \}. \end{aligned}$$
(20)

If \(\underline{Apr^{k,\alpha }_{B}}(X)=\overline{Apr^{k,\beta }_{B}}(X)\), then we say that X is a definable set. Otherwise, X is a rough set.

Assume that the universe U is divided into the crisp equivalence classes \(U/D=\{X_{1},X_{2},...,X_{n}\}\) by the decision attribute D. Then, the upper and lower approximations of D can be defined as follows.

Definition 7

Given a decision information system \(\langle U,C\cup D\rangle \), \(U/D=\{X_{1},X_{2},...,X_{n}\}\), and \(R^{k}_{B}(x)\) is a k-nearest-neighbor class generated by the conditional attribute subset \(B\subseteq C\), with parameters \(0\le \beta <\alpha \le 1\). Then, we can define the lower approximation and upper approximation of D with respect to the conditional attribute set B as

$$\begin{aligned}&\underline{Apr^{k,\alpha }_{B}}(D)=\bigcup \limits _{i=1}^{n}\underline{Apr^{k,\alpha }_{B}}(X_{i}),\end{aligned}$$
(21)
$$\begin{aligned}&\overline{Apr^{k,\beta }_{B}}(D)=\bigcup \limits _{i=1}^{n}\overline{Apr^{k,\beta }_{B}}(X_{i}), \end{aligned}$$
(22)

where

$$\begin{aligned}&\underline{Apr^{k,\alpha }_{B}}(X_{i})=\{x\in U~|~P(X_{i}|R^{k}_{B}(x))\ge \alpha \},\end{aligned}$$
(23)
$$\begin{aligned}&\overline{Apr^{k,\beta }_{B}}(X_{i})=\{x\in U~|~P(X_{i}|R^{k}_{B}(x))> \beta \}. \end{aligned}$$
(24)

At the same time, we can also obtain the decision positive region of D with respect to the conditional attribute set B, which is listed as follows:

$$\begin{aligned} POS_{B}(D)=\underline{Apr^{k,\alpha }_{B}}(D)=\bigcup \limits _{i=1}^{n}\underline{Apr^{k,\alpha }_{B}}(X_{i}). \end{aligned}$$
(25)

The following example is employed to illustrate the novel model of KNPRSs.

Example 1

As shown in Table 1, given a decision information system \(\langle U,C\cup D\rangle \) with \(B\subseteq C\), there are 8 samples \(\{x_{1},x_{2},...,x_{8}\}\), 4 conditional attributes \(\{a_{1},a_{2},a_{3},a_{4}\}\) and a decision attribute d. According to the values of the decision attribute d, U is divided into \(U/D=\{D_{1},D_{2}\}\), where \(D_{1}=\{x_{1},x_{2},x_{3},x_{4}\}\) and \(D_{2}=\{x_{5},x_{6},x_{7},x_{8}\}\). We also divide the conditional attributes into two groups \(B_{1}=\{a_{1},a_{2}\}\) and \(B_{2}=\{a_{3},a_{4}\}\).

By Definition 4, with \(k=4\) and using the Euclidean distance, we get \(R^{4}_{a_{1}}(x_{1})=\{x_{1},x_{2},x_{3},x_{5}\}\), \(R^{4}_{a_{2}}(x_{1})=\{x_{1},x_{2},x_{3},x_{4}\}\), \(R^{4}_{a_{3}}(x_{1})=\{x_{1},x_{2},x_{5},x_{6}\}\) and \(R^{4}_{a_{4}}(x_{1})=\{x_{1},x_{2},x_{3},x_{6}\}\). Therefore, \(R^{4}_{B_{1}}(x_{1})=R^{4}_{a_{1}}(x_{1})\cap R^{4}_{a_{2}}(x_{1})=\{x_{1},x_{2},x_{3}\}\) and \(R^{4}_{B_{2}}(x_{1})=R^{4}_{a_{3}}(x_{1})\cap R^{4}_{a_{4}}(x_{1})=\{x_{1},x_{2},x_{6}\}\). The result is shown in Table 2. Then, we can separately obtain the k-nearest-neighbor relations \(N^{4}_{B_{1}}\) and \(N^{4}_{B_{2}}\) of the attribute subsets \(B_{1}\) and \(B_{2}\) as follows:

$$\begin{aligned} N^{4}_{B_{1}}= \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \end{pmatrix}, \quad N^{4}_{B_{2}}= \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0\\ 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0\\ 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1\\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}. \end{aligned}$$

We can find that the k-nearest-neighbor relations \(N^{4}_{B_{1}}\) and \(N^{4}_{B_{2}}\) are asymmetric. Let \(\alpha =0.8\) and \(\beta =0.3\). Then, according to Definition 6, the lower and upper approximations of \(D_{1}\) and \(D_{2}\) with respect to the attribute subsets \(B_{1}\) and \(B_{2}\) can be obtained as follows, respectively,

  1. 1.

    \( \underline{Apr^{4,0.8}_{B_{1}}}(D_{1})=\{x_{1},x_{2},x_{3},x_{4}\}\), \( \underline{Apr^{4,0.8}_{B_{1}}}(D_{2})=\{x_{5},x_{6},x_{7},x_{8}\}\);

  2. 2.

    \( \overline{Apr^{4,0.3}_{B_{1}}}(D_{1})=\{x_{1},x_{2},x_{3},x_{4}\}\), \( \overline{Apr^{4,0.3}_{B_{1}}}(D_{2})=\{x_{5},x_{6},x_{7},x_{8}\}\);

  3. 3.

    \( \underline{Apr^{4,0.8}_{B_{2}}}(D_{1})=\{x_{4}\}\), \( \underline{Apr^{4,0.8}_{B_{2}}}(D_{2})=\{x_{5}\}\);

  4. 4.

    \( \overline{Apr^{4,0.3}_{B_{2}}}(D_{1})=\{x_{1},x_{2},x_{3},x_{4},x_{6},x_{7},x_{8}\}\), \( \overline{Apr^{4,0.3}_{B_{2}}}(D_{2})=\{x_{1},x_{2},x_{3},x_{6},x_{7},x_{8}\}\).

Table 1 Decision information system
Table 2 k-nearest-neighbor class

Now, we present some fundamental properties of KNPRSs.

Proposition 2

Given a decision information system \(\langle U,C\cup D\rangle \) with parameters \(0\le \beta <\alpha \le 1\). Then, for any \(B\subseteq C\), the following statements hold.

  1. 1.

    \(\underline{Apr^{k,\alpha }_{B}}(\emptyset )=\overline{Apr^{k,\beta }_{B}}(\emptyset )=\emptyset \);

  2. 2.

    \(\underline{Apr^{k,\alpha }_{B}}(U)=\overline{Apr^{k,\beta }_{B}}(U)=U\);

  3. 3.

    \(\forall X\subseteq U\), \(\underline{Apr^{k,\alpha }_{B}}(X)\subseteq X\subseteq \overline{Apr^{k,\beta }_{B}}(X)\);

  4. 4.

    \(\forall X_{1},X_{2}\subseteq U\), \(\underline{Apr^{k,\alpha }_{B}}(X_{1}\cup X_{2})\supseteq \underline{Apr^{k,\alpha }_{B}}(X_{1})\cup \underline{Apr^{k,\alpha }_{B}}(X_{2})\);

  5. 5.

    \(\forall X_{1},X_{2}\subseteq U\), \(\overline{Apr^{k,\beta }_{B}}(X_{1}\cup X_{2})\supseteq \overline{Apr^{k,\beta }_{B}}(X_{1})\cup \overline{Apr^{k,\beta }_{B}}(X_{2})\);

  6. 6.

    \(\forall X_{1},X_{2}\subseteq U\), \(\underline{Apr^{k,\alpha }_{B}}(X_{1}\cap X_{2})\subseteq \underline{Apr^{k,\alpha }_{B}}(X_{1})\cap \underline{Apr^{k,\alpha }_{B}}(X_{2})\);

  7. 7.

    \(\forall X_{1},X_{2}\subseteq U\), \(\overline{Apr^{k,\beta }_{B}}(X_{1}\cap X_{2})\subseteq \overline{Apr^{k,\beta }_{B}}(X_{1})\cap \overline{Apr^{k,\beta }_{B}}(X_{2})\).

Proof

We only prove items 4 and 6. The others can be directly obtained by Definitions 6 and 7.

4. \(\forall x\in \underline{Apr^{k,\alpha }_{B}}(X_{1})\cup \underline{Apr^{k,\alpha }_{B}}(X_{2})\), at least one of the following two equations holds

$$\begin{aligned} P(X_{1}~|~R^{k}_{B}(x))=\frac{|X_{1}\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\ge \alpha ,\\ P(X_{2}~|~R^{k}_{B}(x))=\frac{|X_{2}\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\ge \alpha . \end{aligned}$$

Since

$$\begin{aligned} (X_{1}\cup X_{2})\cap R^{k}_{B}(x)=(X_{1}\cap R^{k}_{B}(x))\cup (X_{2}\cap R^{k}_{B}(x)). \end{aligned}$$

Then, we can get

$$\begin{aligned} P(X_{1}\cup X_{2}~|~R^{k}_{B}(x))&=\frac{|(X_{1}\cup X_{2})\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\\&\ge \max \{P(X_{1}~|~R^{k}_{B}(x)),P(X_{2}~|~R^{k}_{B}(x))\}\ge \alpha . \end{aligned}$$

Thus, one concludes that \(\forall X_{1},X_{2}\subseteq U\), \(\underline{Apr^{k,\alpha }_{B}}(X_{1}\cup X_{2})\supseteq \underline{Apr^{k,\alpha }_{B}}(X_{1})\cup \underline{Apr^{k,\alpha }_{B}}(X_{2})\).

6. \(\forall x\in \underline{Apr^{k,\alpha }_{B}}(X_{1}\cap X_{2})\), we can get

$$\begin{aligned} P(X_{1}\cap X_{2}~|~R^{k}_{B}(x))=\frac{|(X_{1}\cap X_{2})\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\ge \alpha . \end{aligned}$$

Since \((X_{1}\cap X_{2})\cap R^{k}_{B}(x)\) is a subset of \((X_{1}\cap R^{k}_{B}(x))\) and \((X_{2}\cap R^{k}_{B}(x))\), one gets that

$$\begin{aligned} P(X_{1}~|~R^{k}_{B}(x))= & {} \frac{|X_{1}\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\\\ge & {} \frac{|(X_{1}\cap X_{2})\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\ge \alpha ,\\ P(X_{2}~|~R^{k}_{B}(x))= & {} \frac{|X_{2}\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\\\ge & {} \frac{|(X_{1}\cap X_{2})\cap R^{k}_{B}(x)|}{|R^{k}_{B}(x)|}\ge \alpha . \end{aligned}$$

i.e., \(x\in \underline{Apr^{k,\alpha }_{B}}(X_{1})\cap \underline{Apr^{k,\alpha }_{B}}(X_{2})\). Therefore, one concludes that \(\underline{Apr^{k,\alpha }_{B}}(X_{1}\cap X_{2})\subseteq \underline{Apr^{k,\alpha }_{B}}(X_{1})\cap \underline{Apr^{k,\alpha }_{B}}(X_{2})\).

When \(\alpha \) and \(\beta \) take the endpoint values \(\alpha =1\) and \(\beta =0\), the KNPRSs also have the following properties.

Proposition 3

Given a decision information system \(\langle U,C\cup D\rangle \), \(B_{1}\subseteq B_{2}\subseteq C\), the following statements hold.

  1. 1.

    \(\forall X\subseteq U\), \(\underline{Apr^{k,1}_{B_{1}}}(X)\subseteq \underline{Apr^{k,1}_{B_{2}}}(X)\);

  2. 2.

    \(\forall X\subseteq U\), \(\overline{Apr^{k,0}_{B_{1}}}(X)\supseteq \overline{Apr^{k,0}_{B_{2}}}(X)\).

Proof

Since \(B_{1}\subseteq B_{2}\), by Definition 4, we get that \(\forall x\in U\), \(R^{k}_{B_{2}}(x)\subseteq R^{k}_{B_{1}}(x)\).

  1. 1.

    According to Definition 6, we can get

    $$\begin{aligned} \underline{Apr^{k,1}_{B}}(X)&=\{x~|~P(X|R^{k}_{B}(x))\ge 1\}\\&=\{x~|~X\cap R^{k}_{B}(x)=R^{k}_{B}(x)\}. \end{aligned}$$

    So \(\forall x\in \underline{Apr^{k,1}_{B_{1}}}(X)\), we can get

    $$\begin{aligned} X\cap R^{k}_{B_{1}}(x)=R^{k}_{B_{1}}(x)\supseteq R^{k}_{B_{2}}(x). \end{aligned}$$

    Since \(R^{k}_{B_{2}}(x)\subseteq R^{k}_{B_{1}}(x)\subseteq X\), it follows that \(X\cap R^{k}_{B_{2}}(x)=R^{k}_{B_{2}}(x)\), i.e., \(x\in \underline{Apr^{k,1}_{B_{2}}}(X)\). Thus, one concludes that \(\underline{Apr^{k,1}_{B_{1}}}(X)\subseteq \underline{Apr^{k,1}_{B_{2}}}(X)\).

  2. 2.

    According to Definition 6, we can get

    $$\begin{aligned} \overline{Apr^{k,0}_{B}}(X)=\{x~|~P(X|R^{k}_{B}(x))>0\}=\{x~|~X\cap R^{k}_{B}(x)\ne \emptyset \}. \end{aligned}$$

    So \(\forall x\in \overline{Apr^{k,0}_{B_{2}}}(X)\), we can get

    $$\begin{aligned} X\cap R^{k}_{B_{2}}(x)\ne \emptyset . \end{aligned}$$

    Since \(X\cap R^{k}_{B_{2}}(x)\subseteq X\cap R^{k}_{B_{1}}(x)\), one has that \(X\cap R^{k}_{B_{1}}(x)\ne \emptyset \), i.e., \(x\in \overline{Apr^{k,0}_{B_{1}}}(X)\). Therefore, one concludes that \(\overline{Apr^{k,0}_{B_{1}}}(X)\supseteq \overline{Apr^{k,0}_{B_{2}}}(X)\).

\(\square \)

When \(\alpha \) and \(\beta \) take other, non-endpoint values, it can be shown that \(\underline{Apr^{k,\alpha }_{B}}(X)\) is left-continuous with respect to \(\alpha \) and \(\overline{Apr^{k,\beta }_{B}}(X)\) is right-continuous with respect to \(\beta \).

Proposition 4

Given a decision information system \(\langle U,C\cup D\rangle \), for any \(0<\gamma <1\), \(B\subseteq C\) and \(X\subseteq U\), the following statements hold.

  1. 1.

    \(\underset{\alpha \rightarrow \gamma ^{+}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)=\underset{\gamma <\alpha \le 1}{\bigcup }\underline{Apr^{k,\alpha }_{B}}(X)\);

  2. 2.

    \(\underset{\alpha \rightarrow \gamma ^{-}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)=\underset{0<\alpha <\gamma }{\bigcap }\underline{Apr^{k,\alpha }_{B}}(X)=\underline{Apr^{k,\gamma }_{B}}(X)\);

  3. 3.

    \(\underset{\beta \rightarrow \gamma ^{+}}{\lim }\overline{Apr^{k,\beta }_{B}}(X)=\underset{\gamma <\beta \le 1}{\bigcup }\overline{Apr^{k,\beta }_{B}}(X)=\overline{Apr^{k,\gamma }_{B}}(X)\);

  4. 4.

    \(\underset{\beta \rightarrow \gamma ^{-}}{\lim }\overline{Apr^{k,\beta }_{B}}(X)=\underset{0<\beta <\gamma }{\bigcap }\overline{Apr^{k,\beta }_{B}}(X)\).

Proof

  1. 1.

    When \(\alpha \rightarrow \gamma ^{+}\), obviously \(\underset{\alpha \rightarrow \gamma ^{+}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \underset{\gamma <\alpha \le 1}{\bigcup }\underline{Apr^{k,\alpha }_{B}}(X)\) holds. Hence, we only need to show that

    $$\begin{aligned} \bigcup _{\gamma <\alpha \le 1}\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \lim _{\alpha \rightarrow \gamma ^{+}}\underline{Apr^{k,\alpha }_{B}}(X). \end{aligned}$$

    \(\forall x\in \underset{\gamma <\alpha \le 1}{\bigcup }\underline{Apr^{k,\alpha }_{B}}(X)\), \(\exists ~\xi >0\), \(\xi \rightarrow 0\),

    $$\begin{aligned} P(X~|~R^{k}_{B}(x))\ge \gamma +\xi >\alpha . \end{aligned}$$

    Then, we can get \(x\in \underset{\alpha \rightarrow \gamma ^{+}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\), i.e., \(\underset{\gamma <\alpha \le 1}{\bigcup }\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \underset{\alpha \rightarrow \gamma ^{+}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\). Therefore, one concludes that \(\underset{\alpha \rightarrow \gamma ^{+}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)=\underset{\gamma <\alpha \le 1}{\bigcup }\underline{Apr^{k,\alpha }_{B}}(X)\).

  2. 2.

    When \(\alpha \rightarrow \gamma ^{-}\), obviously \(\underset{0<\alpha <\gamma }{\bigcap }\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \underset{\alpha \rightarrow \gamma ^{-}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\). Similarly, we only need to show that

    $$\begin{aligned} \lim _{\alpha \rightarrow \gamma ^{-}}\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \bigcap _{0<\alpha <\gamma }\underline{Apr^{k,\alpha }_{B}}(X). \end{aligned}$$

    \(\forall x\in \underset{\alpha \rightarrow \gamma ^{-}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\), \(\exists ~\xi >0\), \(\xi \rightarrow 0\),

    $$\begin{aligned} P(X~|~R^{k}_{B}(x))\ge \alpha >\gamma -\xi . \end{aligned}$$

    Then, we can get \(x\in \underset{0<\alpha <\gamma }{\bigcap }\underline{Apr^{k,\alpha }_{B}}(X)\),

    $$\begin{aligned} \lim _ {\alpha \rightarrow \gamma ^{-}}\underline{Apr^{k,\alpha }_{B}}(X)=\bigcap _{0<\alpha <\gamma }\underline{Apr^{k,\alpha }_{B}}(X). \end{aligned}$$

    Since \(\alpha <\gamma \), one gets that \(\underline{Apr^{k,\alpha }_{B}}(X)\supseteq \underline{Apr^{k,\gamma }_{B}}(X)\). Therefore, we can get

    $$\begin{aligned} \lim _{\alpha \rightarrow \gamma ^{-}}\underline{Apr^{k,\alpha }_{B}}(X)\supseteq \underline{Apr^{k,\gamma }_{B}}(X). \end{aligned}$$

    \(\forall x\in \underset{\alpha \rightarrow \gamma ^{-}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)\), when \(\xi \rightarrow 0\), \(P(X~|~R^{k}_{B}(x))\ge \alpha >\gamma \), we can get

    $$\begin{aligned} \lim _{\alpha \rightarrow \gamma ^{-}}\underline{Apr^{k,\alpha }_{B}}(X)\subseteq \underline{Apr^{k,\gamma }_{B}}(X). \end{aligned}$$

    Thus, one concludes that \(\underset{\alpha \rightarrow \gamma ^{-}}{\lim }\underline{Apr^{k,\alpha }_{B}}(X)=\underset{0<\alpha <\gamma }{\bigcap }\underline{Apr^{k,\alpha }_{B}}(X)=\underline{Apr^{k,\gamma }_{B}}(X)\).

Items 3 and 4 can be proved in the same way, so we do not repeat them here.\(\square \)

Then, in order to more precisely express the accuracy of the set, we define the degree of dependence of the decision attribute D on the conditional attribute set B as follows.

Definition 8

Given a decision information system \(\langle U,C\cup D\rangle \), \(B\subseteq C\), k is a given positive integer, and the parameters \(0\le \beta <\alpha \le 1\), then the dependence of the decision D on the conditional attribute set B can be defined as

$$\begin{aligned} \gamma _{B}(D)=\frac{|POS_{B}(D)|}{|U|}. \end{aligned}$$
(26)

The positive region \(POS_{B}(D)=\underline{Apr^{k,\alpha }_{B}}(D)\) abstractly represents all deterministic rules, where the number of elements is the number of all deterministic rules, and the dependence function \(\gamma _{B}(D)\) is the proportion of deterministic rules among all rules, quantifying the consistency of the conditional attribute B with respect to the decision D for a given positive integer k.

According to \(POS_{B}(D)\subseteq U\), we have \(0\le \gamma _{B}(D)\le 1\). Obviously, \(POS_{B}(D)=U\Leftrightarrow \gamma _{B}(D)=1\Leftrightarrow \langle U,C\cup D\rangle \) is a consistent decision information system. Meanwhile, the larger the dependency \(\gamma _{B}(D)\), the greater the role of the conditional attribute B in determining the decision D.

Example 2

(Continued from Example 1) According to \(POS_{B}(D)=\bigcup \limits _{i=1}^{n}\underline{Apr^{k,\alpha }_{B}}(D_{i})\), we can get that \(POS_{B_{1}}(D)=U\) and \(POS_{B_{2}}(D)=\{x_{4}, x_{5}\}\). By Definition 8, we get that \(\gamma _{B_{1}}(D)=1\) and \(\gamma _{B_{2}}(D)=0.25\). From the dependence of KNPRSs, the consistency of \(B_{1}\) is better than that of \(B_{2}\), and the conditional attribute \(B_{1}\) has a greater role in determining the decision D than the conditional attribute \(B_{2}\).
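Definitions 6-8 reduce to a few set operations once the k-nearest-neighbor classes are available. The sketch below takes a Boolean membership matrix (row i encoding \(R^{k}_{B}(x_{i})\), e.g., as produced by the knn_class sketch earlier in this section) and the decision labels, and returns the dependence \(\gamma _{B}(D)\); the toy matrix and labels are illustrative.

```python
import numpy as np

# Sketch of Definitions 6-8: positive region and dependence degree gamma_B(D).
# `member` encodes the k-nearest-neighbor classes; `labels` encodes U/D.

def dependence(member, labels, alpha):
    n = member.shape[0]
    pos = np.zeros(n, dtype=bool)
    for d in np.unique(labels):                    # each decision class X_i
        X = labels == d
        # P(X_i | R^k_B(x)) = |R^k_B(x) ∩ X_i| / |R^k_B(x)| for every x
        p = (member & X).sum(axis=1) / member.sum(axis=1)
        pos |= p >= alpha                          # Eqs. (23) and (25)
    return pos.sum() / n                           # Eq. (26)

member = np.array([[1, 1, 0, 0],                   # R^k_B(x_1) = {x_1, x_2}
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 1, 1, 1]], dtype=bool)
labels = np.array([0, 0, 1, 1])
print(dependence(member, labels, alpha=0.8))       # -> 0.75 for this toy case
```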

3.2 An attribute reduction approach of water quality indicators based on KNPRSs

From the previous subsection we know that the dependence of the decision attribute D reflects the classification ability of an attribute subset. In this subsection we therefore measure the importance of a single attribute to an attribute subset by the difference between the dependence of D on the original attribute set and its dependence on the attribute set with that attribute removed.

Definition 9

Given a decision information system \(\langle U,C\cup D\rangle \), \(B\subseteq C\), \(a\in B\), define the importance of attribute a for the conditional attribute B under decision D as

$$\begin{aligned} sig(a,B,D)=\gamma _{B}(D)-\gamma _{B-\{a\}}(D). \end{aligned}$$
(27)

From Definition 9, we find that if \(sig(a,B,D)\le 0\), then the elimination of attribute a improves the classification ability of the attribute set or has no effect on the classification ability, and a is said to be unnecessary in B with respect to D and can be eliminated. Conversely, if \(sig(a,B,D)>0\), then the elimination of attribute a has a negative effect on the classification ability of the attribute set, and a is necessary in B with respect to D and should be retained.

Definition 10

Assume that \(\langle U,C\cup D\rangle \) is a decision information system and \(P\subseteq C\). Then, P is said to be an attribute reduction of C with respect to D if there is \(\gamma _{P}(D)\ge \gamma _{C}(D)\) and any of the conditional attributes in P is necessary in P with respect to D.

Based on Definitions 9 and 10, by iterating through all the attributes a in C, we can calculate the importance sig(a, B, D). Whenever \(sig(a,B,D)\le 0\), we eliminate the attribute a, and we continue until every remaining attribute is necessary; in this way we obtain the final core attributes and complete the attribute reduction. The specific decision-making steps and process are illustrated by Algorithm 1.

Algorithm 1

Attribute reduction algorithm based on KNPRSs.

In Algorithm 1, steps 2-8 compute the k-nearest-neighbor classes of the candidate reduct red, with time complexity \(\textrm{O}(|U|^{2}\times |C|)\). The time complexity of steps 10-15 is \(\textrm{O}(|U/D|)\). The time complexity of step 18 is \(\textrm{O}((|C|+1)\times |C|\div 2)\). The time complexity of steps 17-23 is \(\textrm{O}(|U|^{2}\times |C|\times (|C|+1)\times |C|\div 2 )\), which is also the overall time complexity of Algorithm 1.
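Since Algorithm 1 itself is presented only as a figure, the following Python sketch shows one plausible realisation of the backward-elimination procedure implied by Definitions 9-10: attributes with non-positive significance are removed until every remaining attribute is necessary. It is not necessarily the authors' exact procedure; the synthetic data, the per-attribute distance and the safeguard against emptying the attribute set are our own assumptions.

```python
import numpy as np

# Plausible sketch of KNPRSs-based attribute reduction (Definitions 9-10),
# re-implementing compactly the membership and dependence computations
# sketched earlier so that this block is self-contained.

def knn_member(data, B, k):
    """Boolean matrix whose row i encodes R^k_B(x_i) (Definition 4)."""
    n = data.shape[0]
    member = np.ones((n, n), dtype=bool)
    for a in B:
        dist = np.abs(data[:, a][:, None] - data[:, a][None, :])
        idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
        mask = np.zeros((n, n), dtype=bool)
        np.put_along_axis(mask, idx, True, axis=1)
        member &= mask
    return member

def dependence(data, labels, B, k, alpha):
    """gamma_B(D) from Eq. (26)."""
    member = knn_member(data, B, k)
    pos = np.zeros(data.shape[0], dtype=bool)
    for d in np.unique(labels):
        X = labels == d
        pos |= (member & X).sum(axis=1) / member.sum(axis=1) >= alpha
    return pos.mean()

def reduce_attributes(data, labels, k, alpha):
    """Drop attributes with sig(a, B, D) <= 0 until all remaining are necessary."""
    B = list(range(data.shape[1]))
    changed = True
    while changed:
        changed = False
        for a in list(B):
            sig = dependence(data, labels, B, k, alpha) - \
                  dependence(data, labels, [x for x in B if x != a], k, alpha)
            if sig <= 0 and len(B) > 1:        # Definition 9: a is unnecessary
                B.remove(a)
                changed = True
    return B

rng = np.random.default_rng(1)
data = rng.random((60, 5))                      # toy data: 60 samples, 5 attributes
labels = (data[:, 0] + data[:, 2] > 1.0).astype(int)
print(reduce_attributes(data, labels, k=5, alpha=0.8))
```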

Definition 11

Given a water quality indicator system \(\langle U,C\cup D\rangle \), where D is the target water quality indicator to be predicted and \(red \subseteq C\), we call red an attribute reduction of C if and only if \(\gamma _{red}(D)\ge \gamma _{C}(D)\) and every conditional attribute in red is necessary in red with respect to D.

3.3 A water quality prediction approach based on PSO-LSTM

Water quality data are collected mainly through wireless sensor monitoring equipment at the 13 automatic monitoring stations along the route; the equipment automatically collects water quality data every 6 hours to monitor the physical and biochemical characteristics and health of the water. The automatic monitoring stations have been in operation since the opening of the route, so a complete, continuous water quality time-series database is available. This paper therefore uses a recurrent-neural-network deep learning framework for targeted model training and prediction on the relevant water quality data. In this section, we optimize the Long Short-Term Memory (LSTM) network with respect to the temporal, unstable and non-linear characteristics of water quality monitoring data, and construct a model suitable for water quality prediction.

The LSTM neural network performs well on long time-series data thanks to its memory cell and “gate” structure, which suits the long span of data stored at the automatic monitoring stations of the South-North Water Transfer Middle Route. However, the 13 automatic monitoring stations along the middle route have different water quality conditions and different levels of complexity, so it is difficult to set LSTM hyperparameters suitable for every station by manual tuning without a prior analysis of the water quality mechanisms, which affects the accuracy of the LSTM water quality prediction model.

The PSO algorithm is a bionic global optimization algorithm that simulates the foraging process of bird flocks. We use PSO to optimize the initial hyperparameters of the LSTM: through iterative population updates, the optimal combination of hyperparameters is found, and the LSTM water quality prediction model is trained with this combination to achieve the best prediction performance. Therefore, in this paper, while constructing the LSTM network structure, the PSO algorithm is added to optimize the initial hyperparameters, producing a Long Short-Term Memory network based on Particle Swarm Optimization (PSO-LSTM). The specific iterative process is as follows.

Step 1 Initialize: initialize the hyperparameters of the LSTM: time step, number of hidden layer units, learning rate. Initialize the relevant parameters of the population;

Step 2 Input data: input the preprocessed data, after training/test set partitioning, to the LSTM model for training, and output the loss function;

Step 3 Update the particles: input the loss function as the objective function into the PSO algorithm, calculate the fitness value of each individual to evaluate the particles, and update the particle positions and velocities;

Step 4 Judge iteration: if the stopping condition is not satisfied, the current hyperparameters are updated in the LSTM model and Step 2 is repeated;

Step 5 Complete the final LSTM construction: if the stopping condition is satisfied, the optimized hyperparameters are output and the final LSTM is updated;

Step 6 Predict water quality: the final optimized PSO-LSTM is used to predict the water quality results.
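A possible shape of the fitness evaluation in Steps 2-3 is sketched below: each particle encodes the time step, the number of hidden units and the learning rate, and its fitness is the validation MSE of a briefly trained LSTM. The synthetic series, the PyTorch implementation, the two-layer stack and the epoch budget are illustrative assumptions, not the authors' exact training setup; the pso() sketch from Section 2.3 could then search the hyperparameter bounds given later in Section 4.1.3.

```python
import numpy as np
import torch
from torch import nn

# Sketch of a PSO fitness function for PSO-LSTM hyperparameter search.

def make_windows(series, timestep):
    X = np.stack([series[i:i + timestep] for i in range(len(series) - timestep)])
    y = series[timestep:]
    return torch.tensor(X, dtype=torch.float32).unsqueeze(-1), \
           torch.tensor(y, dtype=torch.float32).unsqueeze(-1)

def fitness(particle, series, epochs=10):
    timestep, units, lr = int(particle[0]), int(particle[1]), float(particle[2])
    X, y = make_windows(series, timestep)
    split = int(0.8 * len(X))
    lstm = nn.LSTM(input_size=1, hidden_size=units, num_layers=2, batch_first=True)
    head = nn.Linear(units, 1)
    opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):                       # short training run per particle
        opt.zero_grad()
        out, _ = lstm(X[:split])
        loss = loss_fn(head(out[:, -1, :]), y[:split])
        loss.backward()
        opt.step()
    with torch.no_grad():                         # validation MSE = fitness
        out, _ = lstm(X[split:])
        return loss_fn(head(out[:, -1, :]), y[split:]).item()

# Illustrative stand-in for a water quality series.
series = np.sin(np.linspace(0, 20, 400)) + 0.05 * np.random.randn(400)
print(fitness(np.array([20, 20, 0.002]), series))
```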

Algorithm 2

Water quality prediction method based on PSO-LSTM.

Fig. 2

PSO-LSTM algorithm flowchart

Table 3 Dissolved oxygen data ADF test results of station A

As shown in Fig. 2, the data are first input into the LSTM and the initial loss function is output as the fitness function of the population. Next, the velocity and position of the particles in the population are updated, and the algorithm checks whether the current optimum satisfies the termination condition. If not, the current hyperparameter combination is passed back to the LSTM model, which is retrained and outputs a new loss value; if it is satisfied, the optimal hyperparameter combination is output and the final LSTM is constructed. The specific decision-making steps and process are illustrated by Algorithm 2.

In Algorithm 2, the time complexity of PSO is \(\textrm{O}(n\times m)\), where n represents the population size and m the number of iterations. The time complexity of the LSTM in step 7 is \(\textrm{O}(h\times i + h^{2} + h)\), where h represents the hidden size and i the input size. The overall time complexity of Algorithm 2 is \(\textrm{O}((n\times m)(h\times i + h^{2} + h))\).

4 Data experiment

In this section, we validate the feasibility and superiority of water quality prediction method based on KNPRSs and PSO-LSTM.

The data in this paper are the water quality records of the 13 automatic monitoring stations of the South-North Water Transfer Middle Route in China from January 2019 to June 2021. Each station collects data automatically every 6 hours, giving 3000-3500 records per station. Twelve water quality indicators were monitored: water temperature, pH, conductivity, turbidity, dissolved oxygen (DO), ammonia nitrogen (NH3-N), permanganate index (COD Mn), dissolved organic matter, integrated biological toxicity, total phosphorus (TP), total nitrogen (TN) and chlorophyll a (chla).

4.1 Experiment preparation

4.1.1 Stability of time-series data

Stationarity is a very important statistical property of a time-series: only stationary time-series are valuable for prediction, otherwise the prediction results will suffer from lag and other problems. In this paper, the ADF (Augmented Dickey-Fuller) test is used to check the stationarity of the water quality series, taking the dissolved oxygen data of station A as an example. The results are shown in Table 3: the p-value of the test is 0.5172 and the T-statistic is -1.5328, which is greater than the critical value at the 10% significance level, so the series is non-stationary. After first-order differencing of the data, the p-value is less than 0.0001 and the T-statistic is -12.6644, which is less than the critical value at the 1% significance level, so the differenced series is stationary.

Then, the dissolved oxygen time-series of station A is first-order differenced to obtain a stationary time-series. As shown in Fig. 3, predictions made on the non-stationary water quality data are lagged and have low accuracy.
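For reproducibility, a small sketch of this stationarity check using statsmodels' ADF test is given below; the synthetic random-walk series merely stands in for the dissolved-oxygen record of station A, which is not reproduced here, so the printed statistics will differ from Table 3.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Sketch of the ADF stationarity check and first-order differencing (Section 4.1.1).

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500)) + 8.0      # a non-stationary random walk

stat, pvalue, *_ = adfuller(series)
print(f"raw series:   ADF statistic = {stat:.4f}, p-value = {pvalue:.4f}")

diff = np.diff(series)                              # first-order difference
stat, pvalue, *_ = adfuller(diff)
print(f"differenced:  ADF statistic = {stat:.4f}, p-value = {pvalue:.4f}")
```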

Fig. 3

Predicted results of the DO of station A

4.1.2 Parameters of KNPRSs

Suppose that the water quality warning class of indicator i at time \(t+1\) is taken as the prediction target, that is, as the decision attribute of the sample at time t.

Each target water quality indicator corresponds to a decision attribute \(d_{i}\), and \(D=\{0,1,2\}\). Among them, \(d_{i}=0\) means the water quality indicator is in the normal range, \(d_{i}=1\) means the water quality indicator is in the yellow warning range, \(d_{i}=2\) means the water quality indicator has reached the red warning range.

Taking the next warning class of the target water quality indicator i as the decision attribute, the attribute reduction model based on KNPRSs (Algorithm 1) yields the reduct red for indicator i, i.e., the core attributes of the target water quality indicator. The value of k is generally set to about 5% of the length of the data set, so this paper takes \(k=150\); \(\alpha \) generally lies in the interval [0.5, 1], and this paper takes the value 0.9.

4.1.3 Parameters of PSO-LSTM

The overall network structure of PSO-LSTM consists of four layers: an input layer, two hidden layers and a fully connected output layer. The number of hidden units in the hidden layer is used as a hyperparameter to be optimized and trained in the PSO algorithm. The number of neurons in the input and output layers is matched to the dimensionality of the input and output and the number of hidden units in the hidden layer.

The time step is also an important parameter, which determines how long ago the cell structure can remember information. Too large a time step can lead to the interference of historical useless information in the prediction, while too small a time step can lead to the lack of sufficient memory information. Therefore, in this paper, the time step is used as a hyperparameter for the optimization of the PSO algorithm.

In this paper, the Adam optimizer is used to update the weights of the LSTM network. If the initial learning rate is set too small, convergence will be too slow; if it is set too large, training may diverge or fall into a local minimum. It is therefore important to choose a suitable initial learning rate, and the initial learning rate is also used as one of the hyperparameters optimized by the PSO algorithm in this paper.

The loss function is the mean square error (MSE), i.e., the average of the squared differences between the true and predicted values, which is convenient to differentiate. The specific formula is as follows.

$$\begin{aligned} MSE=\frac{1}{n}\sum \limits ^{n}_{i=1}(y_{i}-y^{\prime }_{i})^{2} \end{aligned}$$
(28)

The training set is used to train the model and the test set is used to evaluate its performance. Based on several preliminary experiments, the initial values are set to \(unit=20\) hidden units per layer, time step \(timestep=20\), learning rate \(learnrate=0.002\), and a maximum of 200 iterations. The loss function (28) is input to the PSO algorithm as the fitness function.
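Under these settings, the fitness evaluation handed to the PSO could look like the sketch below, which trains an LSTM with a particle's hyperparameters and returns the MSE of formula (28); the windowing helper and the training budget are illustrative assumptions, and build_lstm is the constructor sketched above.

# Minimal sketch of the PSO fitness function.
import numpy as np

def make_windows(data: np.ndarray, timestep: int):
    """Slice a (n_samples, n_features) array into sliding windows; the target
    indicator is assumed to sit in column 0."""
    x = np.stack([data[i:i + timestep] for i in range(len(data) - timestep)])
    y = data[timestep:, 0]
    return x, y

def fitness(particle, train_data, val_data):
    units, timestep, learn_rate = int(particle[0]), int(particle[1]), float(particle[2])
    x_tr, y_tr = make_windows(train_data, timestep)
    x_va, y_va = make_windows(val_data, timestep)
    model = build_lstm(train_data.shape[1], units, timestep, learn_rate)
    model.fit(x_tr, y_tr, epochs=50, batch_size=64, verbose=0)   # illustrative budget
    y_pred = model.predict(x_va, verbose=0).ravel()
    return float(np.mean((y_va - y_pred) ** 2))                   # MSE, formula (28)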

The PSO algorithm needs better global search ability in the early stage and stronger local search ability in the late stage. Therefore, in this paper, the inertia weight w is taken as a linearly decreasing function of time, that is, \(w=w_{max}-(w_{max}-w_{min})\frac{t}{t_{max}}\), where \(w_{max}=1.4\), \(w_{min}=0.4\), t is the current iteration and \(t_{max}\) is the maximum number of iterations.

Similar to the weight w, the learning factors \(c_{1}\) and \(c_{2}\) balance local and global search: the larger the learning factors, the better the convergence of the algorithm and the stronger the local search ability, which is the opposite of the effect of the weight w. In accordance with the chosen weight w, this paper sets \(c_{1}=c_{2}=2\).

Finally, the search area boundaries are set as \(unit\in [10, 30]\), \(timestep\in [10, 30]\) and \(learnrate\in [0, 0.1]\). Following Algorithm 2, the optimized PSO-LSTM model is obtained and the final water quality prediction is carried out.
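The PSO loop itself, with the linearly decreasing weight and the search boundaries given above, can be sketched as follows (reusing the fitness function sketched above; reading the 200 iterations of the previous subsection as \(t_{max}\) and using 10 particles are assumptions).

# Minimal sketch of PSO over (unit, timestep, learnrate) with a linearly
# decreasing inertia weight.
import numpy as np

rng = np.random.default_rng(0)
LOWER = np.array([10.0, 10.0, 0.0])    # lower bounds of unit, timestep, learnrate
UPPER = np.array([30.0, 30.0, 0.1])    # upper bounds
W_MAX, W_MIN, C1, C2 = 1.4, 0.4, 2.0, 2.0
N_PARTICLES, T_MAX = 10, 200           # swarm size and t_max are assumptions

def pso(fitness_fn, train_data, val_data):
    pos = LOWER + rng.random((N_PARTICLES, 3)) * (UPPER - LOWER)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness_fn(p, train_data, val_data) for p in pos])
    gbest = pbest[pbest_val.argmin()]
    for t in range(T_MAX):
        w = W_MAX - (W_MAX - W_MIN) * t / T_MAX              # linearly decreasing weight
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, LOWER, UPPER)               # stay inside the search area
        vals = np.array([fitness_fn(p, train_data, val_data) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()]
    return gbest                                             # best (unit, timestep, learnrate)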

Fig. 4

Predicted results of the six target water quality indicators of station A

4.2 Experimental procedure and analysis

In order to verify the performance of the water quality prediction method based on KNPRSs and PSO-LSTM (KNPRSs-PSO-LSTM) proposed in this paper, six target water quality indicators (DO, NH3-N, CODMn, TP, TN and chla) at three randomly chosen stations among the 13 automatic monitoring stations along the middle route are used to carry out the prediction task. At the same time, the method is compared with three other water quality prediction models: the Long Short-Term Memory network (LSTM), the combination of neighborhood rough sets and LSTM (NRS-LSTM), and the combination of k-nearest-neighbor probability rough sets and LSTM (KNPRSs-LSTM).

Table 4 Prediction model evaluation results of the six target water quality indicators at station A

Three different evaluation indicators are used to analyze the water quality prediction results so as to comprehensively reflect the performance of the models: Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). They are calculated as follows.

$$\begin{aligned}&MAE=\frac{1}{n}\sum \limits ^{n}_{i=1}|y_{i}-y^{\prime }_{i}|,\end{aligned}$$
(29)
$$\begin{aligned}&RMSE=\sqrt{\frac{1}{n}\sum \limits ^{n}_{i=1}(y_{i}-y^{\prime }_{i})^{2}},\end{aligned}$$
(30)
$$\begin{aligned}&MAPE=\frac{1}{n}\sum \limits ^{n}_{i=1}\frac{|y_{i}-y^{\prime }_{i}|}{y_{i}}. \end{aligned}$$
(31)

where \(y_{i}\) represents the true value of the water quality data, \(y^{\prime }_{i}\) represents the predicted value, and n represents the number of samples.
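These metrics follow formulas (29)-(31) directly and can be computed, for instance, as below, assuming y_true and y_pred are NumPy arrays of the real and predicted values.

# Direct implementations of formulas (29)-(31).
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred) / y_true))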

All methods are implemented in Python and run on a machine with an AMD Ryzen 5 5600H CPU (with Radeon Graphics, 3.30 GHz) and 16 GB RAM.

The predicted results of the six target water quality indicators of automatic monitoring station A are shown in Fig. 4. The curve “TRUE” represents the real test set data.

The prediction model evaluation results of the six target water quality indicators of automatic monitoring station A are shown in Table 4.

The corresponding bar graphs of the evaluation results are shown in Fig. 5.

Fig. 5

Predictive model evaluation results of the six target water quality indicators of station A

The predicted results of the six target water quality indicators of automatic monitoring station B are shown in Fig. 6. The curve “TRUE” represents the real test set data.

Fig. 6

Predicted results of the six target water quality indicators of station B

The prediction model evaluation results of the six target water quality indicators of automatic monitoring station B are shown in Table 5.

Table 5 Prediction model evaluation results of the six target water quality indicators at station B
Fig. 7

Predictive model evaluation results of the six target water quality indicators of station B

The corresponding bar graphs of the evaluation results are shown in Fig. 7.

The predicted results of the six target water quality indicators of automatic monitoring station C are shown in Fig. 8. The curve “TRUE” represents the real test set data.

The prediction model evaluation results of the six target water quality indicators of automatic monitoring station C are shown in Table 6.

The corresponding bar graphs of the evaluation results are shown in Fig. 9.

Based on the above experimental results, the following analysis is conducted:

Fig. 8

Predicted results of the six target water quality indicators of station C

  1.

    From an overall view, the four LSTM-based models show high prediction accuracy in water quality prediction: based on MAPE, the prediction accuracy exceeds \(90\%\) except for the NH3-N and TP indicators of station A. This illustrates the advantage of the memory cell and “gate” structure of LSTM in time-series prediction, and shows that the KNPRSs-PSO-LSTM built on LSTM is feasible and effective for the time-series nature of the water quality data in this paper.

  2.

    From panel (c) of Figs. 5, 7 and 9, it can be seen that the performance of LSTM on the target water quality indicators is inconsistent across stations. From the MAPE, it can be found that the prediction of NH3-N and TP is poor at station A, the prediction of TP and chla is poor at station B, and the prediction of NH3-N, TP and chla is poor at station C. This reflects the complexity of the water quality situation: with the original indicator system and unified model hyperparameters, a high prediction accuracy cannot be achieved for all target water quality indicators. After attribute reduction with NRS and KNPRSs, which eliminates the interference of redundant attributes, the prediction accuracy improves, and KNPRSs performs better than NRS, verifying the superiority of the proposed KNPRSs-based attribute reduction model over the traditional attribute reduction method. Further, on the basis of KNPRSs-LSTM, the model hyperparameters are optimized by the PSO algorithm, so that KNPRSs-PSO-LSTM achieves very high prediction accuracy for the six target water quality indicators at all three stations.

  3.

    From Figs. 4, 6 and 8, it can be seen that when the water quality time-series data show good stability, the predictions of all four models fit the true values well. However, for less stable time-series data, and without redundant attributes eliminated, the LSTM with the initial hyperparameters fits poorly. NRS-LSTM and KNPRSs-LSTM, which remove part of the noise interference, only improve the fit to a certain extent. KNPRSs-PSO-LSTM, in contrast, can adaptively find suitable hyperparameters for time-series data with low stability, which greatly improves the fit. This reflects the advantage of the proposed KNPRSs-PSO-LSTM in handling the poor stability of water quality data.

Table 6 Prediction model evaluation results of the six target water quality indicators at station C
Fig. 9

Predictive model evaluation results of the six target water quality indicators of station C

In summary, for the complexity of water quality indicators, this paper screens the feature attributes through KNPRSs, which achieves effective data dimensionality reduction and reduces noise interference; for the instability of water quality data, PSO-LSTM optimizes the hyperparameters and adapts to the data of different indicators in different areas, which improves the stability of the model and greatly increases the prediction accuracy. The above experimental results show that KNPRSs-PSO-LSTM is an effective, reliable and efficient method for water quality prediction.

5 Conclusions

For the water quality data from the 13 automatic monitoring stations along the South-North Water Transfer Middle Route in China, this paper proposes a novel water quality prediction method based on k-nearest-neighbor probability rough sets and PSO-LSTM, addressing the time-series nature, instability and indicator complexity of the data, with the goal of water quality forecasting and early warning. The main work is summarized as follows:

  1.

    First, we propose a water quality attribute reduction method based on the novel KNPRSs model. The water quality data of the South-North Water Transfer Middle Route are characterized by indicator complexity: each automatic monitoring station collects 12 related indicators, the water quality situation varies from station to station, and different target indicators at different stations have different correlated indicators. Fully considering the complexity of the data set and the limitations of traditional neighborhood rough sets, we construct probabilistic rough sets based on the k-nearest-neighbor algorithm. In this way, the attribute reduction of the target water quality attributes is completed, redundant attributes are eliminated and the core attributes are extracted. The input dimension of the subsequent neural network prediction model is thus streamlined, and the accuracy of the water quality prediction is verified through the relevant numerical experiments.

  2.

    Secondly, we study a water quality prediction method based on PSO-LSTM. In view of the time-series nature and instability of water quality data, this paper combines the Long Short-Term Memory network and the Particle Swarm Optimization algorithm to build a PSO-LSTM water quality prediction model. With the LSTM neural network as the main body, the model copes well with the time-series nature of the data, selectively retaining historical information through the “gate” structure and memory cells. At the same time, the PSO algorithm is used to optimize the hyperparameters of the LSTM network, so that the model adapts to different indicators at different stations. The relevant numerical experiments verify that the MAE, RMSE and MAPE are reduced compared with the common LSTM network, and that PSO-LSTM performs better in water quality data prediction.

For the proposed water quality attribute reduction method based on the novel KNPRSs model, the selection of the parameter \(\alpha \) also affects the final reduction results. In the future, we will study how to support the selection of \(\alpha \) according to the characteristics of the data set. In addition, how to improve the training speed of the model while ensuring the prediction accuracy is another main direction for future work.