
1 Introduction

Underdetermined blind source separation (UBSS) has attracted wide attention as an effective speech signal processing tool [1,2,3]. It recovers unknown source signals from observed signals that are mixtures of the sound sources, in the case where the number of observed signals is much smaller than the number of sources.

Generally, the sparse component analysis (SCA) approach is used to solve the UBSS problem after transforming the signals into a sparse domain [4]. In the SCA approach, the underdetermined mixture matrix has to be estimated before the unknown sources can be recovered, so the quality of the mixture matrix estimate has a strong effect on the performance of SCA. The task of this paper is to design an improved mixture matrix estimation algorithm with higher precision, which in turn improves the recovery of the speech signals.

To solve the underdetermined mixture matrix estimation problem, K-means clustering is a popular approach, in which the column vectors of the mixture matrix are obtained as the clustering centers [5]. However, traditional K-means clustering performs poorly on outliers, and its convergence efficiency still needs to be improved. To improve the clustering performance, optimization methods are generally applied to the K-means clustering problem. This paper therefore proposes a clustering approach that combines a neural network with a genetic algorithm: the neural network generates the initial population for the genetic algorithm, and the genetic algorithm then searches for the global optimum of the clustering. The proposed approach thus improves the global search ability and avoids falling into local optima.

In this paper, the UBSS problem model and the preprocessing of the observed signals are given in Sect. 2. Section 3 introduces the proposed clustering algorithm, which combines the neural network algorithm [6] with the genetic algorithm [7]. Simulation results with four speech source signals and two observed signals are presented in Sect. 4. Finally, conclusions and future work are summarized in Sect. 5.

2 Problem Statement

2.1 Model

Assume an underdetermined blind source separation model is given as

$$ \mathbf{X}(t) = \mathbf{A}\mathbf{S}(t), \quad t = 1, 2, \ldots, T $$
(1)

where \( \mathbf{X}(t) = [x_{1}(t), x_{2}(t), \ldots, x_{m}(t)]^{T} \) denotes the observed signals, \( \mathbf{A} \) is the mixture matrix of size \( m \times n \), and \( \mathbf{S}(t) = [s_{1}(t), s_{2}(t), \ldots, s_{n}(t)]^{T} \) denotes the source signals. In the underdetermined system, the number of source signals is larger than the number of observed signals, i.e., n > m. Only X(t) is known; A and S(t) are unknown, without any prior information. Our task is to estimate the mixture matrix A from the observed signals X(t).

When the sparsity of the speech signals is poor in the time domain, the source signals should be transformed into a sparse domain, as in (2).

$$ \wp\{\mathbf{X}(t)\} = \mathbf{A}\,\wp\{\mathbf{S}(t)\} $$
(2)

where \( \wp\{\cdot\} \) denotes a transform from the time domain to a sparse domain, such as the Discrete Cosine Transform (DCT) or the Short-Time Fourier Transform (STFT) [8]. The choice depends on the characteristics of the source signals \( \mathbf{S}(t) \). For example, voice signals are mostly non-stationary over long intervals but stationary over short intervals, so \( \wp\{\cdot\} \) can be the STFT.

2.2 Preprocessing of Speech Signals

In the signal preprocessing, both sides of (1) are transformed to the sparse domain simultaneously, giving (3).

$$ \mathbf{X}(\omega) = \mathbf{A}\mathbf{S}(\omega) $$
(3)

where \( \mathbf{X}(\omega) = [x_{1}(\omega), x_{2}(\omega), \ldots, x_{m}(\omega)]^{T} \) is the spectrum of the observed signals and \( \mathbf{S}(\omega) = [s_{1}(\omega), s_{2}(\omega), \ldots, s_{n}(\omega)]^{T} \) is the spectrum of the source signals. If only one source \( s_{i}(\omega) \) has a non-zero value and all other sources are zero at a frequency point \( \omega \), (3) can be abbreviated as:

$$ \left[ x_{1}(\omega) \;\; x_{2}(\omega) \;\; \ldots \;\; x_{m}(\omega) \right]^{T} = \mathbf{a}_{i}\, s_{i}(\omega) $$
(4)

where \( \mathbf{a}_{i} \) denotes the i-th column of \( \mathbf{A} \). Then we obtain (5).

$$ s_{i}(\omega) = \frac{x_{1}(\omega)}{a_{1i}} = \frac{x_{2}(\omega)}{a_{2i}} = \cdots = \frac{x_{m}(\omega)}{a_{mi}} $$
(5)

Obviously, when the source vector is sparse, the observed signals have a linear clustering characteristic, and each cluster corresponds to one column of the mixture matrix. On this basis, n source signals determine n straight lines, so the mixture matrix estimation problem is transformed into estimating the directions of the clustering lines in the observation space. However, the observed points in (4) are only distributed in the vicinity of the clustering lines, so a clustering algorithm is needed to mine the underlying clustering structure of these points.

In order to reduce the interference among data, the absolute values of the observed signals are used instead of their real or imaginary parts. In addition, mirror processing is applied to improve the processing efficiency. Considering that points near the origin have no obvious linear clustering characteristic, a suitable radius can be selected according to the data, and the points inside this circle are filtered out. The selected data are shown in Fig. 1.

Fig. 1. Scatter distribution after filtering
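As an illustration of this preprocessing chain, the following Python sketch (assuming NumPy and SciPy; the STFT window length and the radius threshold are illustrative choices, not values from this paper) transforms two observed channels to the sparse domain, takes absolute values, and filters out points near the origin.

```python
import numpy as np
from scipy.signal import stft

def preprocess(x1, x2, fs, radius_ratio=0.05):
    """Transform two observed channels to the STFT domain, take
    absolute values, and drop low-energy points near the origin.
    The radius_ratio threshold is a hypothetical choice."""
    _, _, X1 = stft(x1, fs=fs, nperseg=512)
    _, _, X2 = stft(x2, fs=fs, nperseg=512)
    # Absolute values are used instead of the real or imaginary parts
    p1, p2 = np.abs(X1).ravel(), np.abs(X2).ravel()
    # Filter out points inside a circle around the origin
    r = np.hypot(p1, p2)
    mask = r > radius_ratio * r.max()
    return p1[mask], p2[mask]
```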

The points in Fig. 1 show the two-dimensional observed signals after the short-time Fourier transform. The points within one cluster have almost the same slope, a conclusion also supported by (5). Thus the tangent value of each point is a good index for distinguishing the clusters: the closer the tangent values of two points are, the higher the possibility that they belong to the same cluster. According to (5), the tangent value of all points in a cluster equals the tangent value of the corresponding column of the mixture matrix. Therefore, the problem of solving the mixture matrix is converted into a clustering analysis of the two-dimensional observed signals, in which the clustering is performed on the one-dimensional tangent values instead of the two-dimensional signals. This reduces the computational load. The tangent distribution is shown in Fig. 2.

Fig. 2. Distribution of tangent values

From Fig. 2, it can be seen that several runs of values change slowly, meaning that these values are concentrated in the vicinity of a classification value, which exhibits the clustering characteristic. Therefore, for two observed signals, it is feasible to use the tangent values of each column of the mixture matrix, corresponding to the two observed vectors, as the input data of the cluster analysis.

Finally, in order to reduce the clustering error, normalization is applied to project the tangent values onto the interval [−1, 1].
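A minimal sketch of the tangent extraction and normalization is given below; the paper does not specify the normalization method, so max-absolute scaling is assumed here, and the scale factor is returned so that cluster centers can be mapped back before the arctangent in (6) is applied.

```python
import numpy as np

def tangent_features(p1, p2):
    """Convert 2-D scatter points to 1-D tangent values and project
    them onto [-1, 1] (assumed max-abs scaling); the scale factor is
    returned for de-normalizing the cluster centers later."""
    t = p2 / p1                      # slope of each scatter point
    scale = np.max(np.abs(t))
    return t / scale, scale
```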

3 Improved Clustering Algorithm Based on Neural Network and Genetic Algorithm

The number of clustering centers equals the number of columns of the mixture matrix. Assume n = 4; then the unknown mixture matrix can be calculated by:

$$ \hat{\mathbf{A}} = \begin{bmatrix} \cos\hat{\alpha}_{1} & \cos\hat{\alpha}_{2} & \cos\hat{\alpha}_{3} & \cos\hat{\alpha}_{4} \\ \sin\hat{\alpha}_{1} & \sin\hat{\alpha}_{2} & \sin\hat{\alpha}_{3} & \sin\hat{\alpha}_{4} \end{bmatrix} $$
(6)

where \( \hat{\alpha}_{i} = \arctan C_{i},\; i = 1, 2, 3, 4 \), and \( C_{i} \) are the clustering centers in the tangent domain. The input data of the neural network clustering algorithm are all the tangent values.
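In code, (6) amounts to a few lines; the helper below is an illustrative sketch assuming NumPy.

```python
import numpy as np

def mixture_from_centers(C):
    """Rebuild the 2 x n mixture matrix estimate from tangent-valued
    cluster centers C, following (6): alpha_i = arctan(C_i)."""
    alpha = np.arctan(np.asarray(C, dtype=float))
    return np.vstack([np.cos(alpha), np.sin(alpha)])
```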

In order to obtain four clustering centers, the output layer of the neural network is set to four neurons. The normalized tangent values are one-dimensional input data, so the input layer of the neural network has one neuron.

The number of training iterations of the neural network is initialized so that the clustering remains applicable to samples with large data volumes. After all training iterations are finished, the distribution of the tangent values has been learned, and clustering analysis can then be performed on the normalized tangent values.
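The update rule of the competitive network is not detailed here, so the following is only a sketch of a standard winner-take-all scheme with one input neuron and four output neurons, using the iteration count (400) and initial step size (0.2) reported in Sect. 4; the step-size decay schedule is an assumption.

```python
import numpy as np

def competitive_clustering(t, n_centers=4, iters=400, lr=0.2, seed=0):
    """Winner-take-all competitive learning on 1-D tangent values;
    each output neuron's weight converges to one cluster center."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(t.min(), t.max(), n_centers)  # initial centers
    for _ in range(iters):
        for x in rng.permutation(t):
            k = np.argmin(np.abs(w - x))          # winning neuron
            w[k] += lr * (x - w[k])               # move winner toward sample
        lr *= 0.99                                # assumed decay schedule
    return np.sort(w)
```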

In order to improve the clustering accuracy, a genetic algorithm (GA) is then used to optimize the preliminary clustering results obtained by the competitive neural network above; the clustering problem is thereby converted into an optimization problem [9].

We use floating-point coding to represent the clustering centers. The genetic individuals are coded as follows.

$$ pop(k).gene = \begin{bmatrix} C_{1} & C_{2} & C_{3} & C_{4} \end{bmatrix} $$
(7)

where \( pop(k) \) is the k-th individual in the population and the four floating-point numbers denote the values of the four cluster centers, respectively.

The optimization objective is the clustering performance, evaluated as the mean of the squared distances of the clustered points to their respective clustering centers, that is,

$$ pop(k).value = \sum\limits_{i = 1}^{4} \frac{1}{S_{i}} \sum\limits_{j = 1}^{S_{i}} \left( x_{ij} - C_{i} \right)^{2} $$
(8)

where \( x_{ij} \) denotes the j-th data point in class i, \( C_{i} \) is the cluster center of class i, and \( S_{i} \) is the total number of data points in class i. The individuals are then sorted in ascending order of this objective, and the rank of each individual is denoted by \( pop(k).index \). The fitness is calculated as:

$$ pop(k).fitness = a(1 - a)^{pop(k).index - 1} $$
(9)

where parameter a is commonly set as a = 0.6.
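A sketch of the objective (8) and the rank-based fitness (9), under the assumption that each point is assigned to its nearest center:

```python
import numpy as np

def objective(t, centers):
    """Objective (8): mean squared distance of the points of each
    class to their own center, summed over the classes."""
    labels = np.argmin(np.abs(t[:, None] - centers[None, :]), axis=1)
    value = 0.0
    for i, c in enumerate(centers):
        pts = t[labels == i]
        if pts.size > 0:
            value += np.mean((pts - c) ** 2)
    return value

def fitness_from_rank(index, a=0.6):
    """Rank-based fitness (9); index is the ascending rank (1 = best)."""
    return a * (1 - a) ** (index - 1)
```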

In the selection step, a selection array \( cFitness[popSize] \) is built from the calculated fitness values \( pop(k).fitness \) of the individuals. The formula is as follows:

$$ cFitness[k] = \frac{{\sum\limits_{u = 1}^{k} {pop(u).fitness} }}{S} $$
(10)

where \( S = \sum\limits_{k = 1}^{popSize} {pop(k).fitness} \). Then a random number rand is generated on the interval (0,1), and the k-th individual is copied into the population of the next generation, where k is the smallest index satisfying \( rand < cFitness[k] \).
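A minimal roulette-wheel sketch of this selection step (here `population` is assumed to be a list of gene arrays):

```python
import numpy as np

def select(population, fitnesses, rng):
    """Roulette-wheel selection via the cumulative array (10): copy
    the first individual whose cumulative normalized fitness exceeds
    the random draw."""
    c = np.cumsum(fitnesses) / np.sum(fitnesses)  # cFitness[k]
    k = int(np.searchsorted(c, rng.random()))     # smallest k with rand < cFitness[k]
    return population[k].copy()
```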

In the crossover step, the clustering centers of the two parent individuals are sorted in ascending order before crossover. The new individual is calculated as follows:

$$ pop(k_{new} ).gene = a_{1} \times pop(k_{1} ).gene + a_{2} \times pop(k_{2} ).gene $$
(11)

where parameters \( a_{1} \) and \( a_{2} \) are 0.6 and 0.4, respectively.
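A direct sketch of (11):

```python
import numpy as np

def crossover(g1, g2, a1=0.6, a2=0.4):
    """Arithmetic crossover (11); both parents' genes are sorted in
    ascending order first so that matching centers are combined."""
    return a1 * np.sort(g1) + a2 * np.sort(g2)
```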

In the mutation step, a random number is generated on the interval (0,1) for each gene of every individual in the population. When the random number is less than the mutation probability, a random perturbation of between −10% and +10% of the gene's own value is added.
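The corresponding mutation step can be sketched as:

```python
import numpy as np

def mutate(gene, pm, rng):
    """With probability pm per gene, add a random perturbation of
    between -10% and +10% of the gene's own value."""
    g = np.asarray(gene, dtype=float).copy()
    for i in range(g.size):
        if rng.random() < pm:
            g[i] += g[i] * rng.uniform(-0.1, 0.1)
    return g
```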

The flow chart of the proposed algorithm based on the neural network and genetic algorithm is shown in Fig. 3. Afterwards, the mixture matrix is calculated by (6) from the tangent values given by the optimal clustering centers.

Fig. 3. The flow chart of the proposed algorithm

4 Simulation Results

Experiments on four voice signals are carried out in Matlab 2010b. The speech examples are available online [10]. The signal format is .wav, the coding method is PCM, and the number of sampling points is 65365. The mixture matrix that mixes the four source signals into two observed signals is defined as follows:

$$ \mathbf{A} = \begin{bmatrix} \mathbf{a}_{1} & \mathbf{a}_{2} & \mathbf{a}_{3} & \mathbf{a}_{4} \end{bmatrix} = \begin{bmatrix} \cos 18^{\circ} & \cos 36^{\circ} & \cos 54^{\circ} & \cos 72^{\circ} \\ \sin 18^{\circ} & \sin 36^{\circ} & \sin 54^{\circ} & \sin 72^{\circ} \end{bmatrix} $$
(12)
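For reference, the mixing setup of (12) can be reproduced as below; random placeholder sources stand in for the speech files of [10].

```python
import numpy as np

# Ground-truth mixing matrix of (12): four sources, two observations.
deg = np.deg2rad([18.0, 36.0, 54.0, 72.0])
A = np.vstack([np.cos(deg), np.sin(deg)])                 # 2 x 4 matrix
# Placeholder sources; the experiments use four speech signals instead.
S = np.random.default_rng(0).standard_normal((4, 65365))  # 4 x T samples
X = A @ S                                                 # two observed channels
```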

Figure 4 shows the classification result of the tangent values by neural network clustering, where the number of iterations is set to 400 and the weight adjustment step size is initialized to 0.2. It can be seen that the four classes are clearly separated. To illustrate the performance, the clustering centers obtained by the neural network (NN) are compared with the standard setting in Table 1.

Fig. 4. The classification result of the tangent values

Table 1. Comparison between Neural Network clustering result and the standard setting

According to Table 1, the clustering result has certain errors compared with the standard setting when only the neural network is used for clustering. The resulting estimate \( \hat{\mathbf{A}}_{\text{NN}} \) is

$$ \hat{\mathbf{A}}_{\text{NN}} = \begin{bmatrix} 0.9427 & 0.7895 & 0.6349 & 0.4231 \\ 0.3335 & 0.6138 & 0.7726 & 0.9061 \end{bmatrix} $$
(13)

In a general neural network, the optimization strategy is gradient descent, which tends to make the solution fall into a local optimum. Therefore, in the proposed method, the genetic algorithm and the neural network are combined to reduce the probability of falling into local optima. Since the clustering solution of the neural network is stable and convergent, its error range can be considered valid, so the clustering results of the neural network can be used as the initial solution, and the genetic algorithm (GA) then searches for a better optimum near the results obtained by the neural network. In addition, GA can simultaneously optimize multiple interval regions near the local optimum, which improves the convergence efficiency. Therefore, the proposed method obtains solutions with lower error ratios than the single neural network method. The results of the proposed method are shown in Table 2.

Table 2. Comparison between Neural-Genetic clustering result and the standard setting

Comparing Tables 1 and 2, it is found that the error ratios of C1, C2, C3, and C4 are all improved by the proposed algorithm. The estimate obtained by the proposed algorithm based on the neural network and genetic algorithm is

$$ \hat{\mathbf{A}}_{\text{NN-GA}} = \begin{bmatrix} 0.9460 & 0.8061 & 0.6173 & 0.3715 \\ 0.3242 & 0.5917 & 0.7867 & 0.9284 \end{bmatrix} $$
(14)

5 Conclusion

This article studies an improved algorithm for mixture matrix estimation to solve the underdetermined blind speech separation problem. In the preprocessing stage, the two-dimensional observations are converted into one-dimensional tangent values, which are found to have clustering characteristics in the scatter plot. Clustering analysis is carried out by the neural network algorithm to obtain the initial clustering centers, and the genetic algorithm is then used to optimize the neural network result, which improves the stability and convergence of the clustering. Experimental results on four speech signals prove that the proposed algorithm estimates the mixture matrix with better accuracy than the neural network alone, and that the adoption of tangent values is useful in the clustering analysis. In the future, we will extend the proposed algorithm to other cases; for example, it could be applied to separate power signals to improve power quality.