1 Introduction

Automatic singing voice separation, which intends to extract the singing voice from the music mixture, has received much attention in the field of audio signal processing in recent years [1,2,3,4]. It has broad applications in singer identification [5, 6], automatic singing transcription [7], automatic lyrics alignment [8], music information retrieval [9], and content-based music retrieval [10, 11].

The singing voice separation task aims to blindly separate the singer's voice from music recordings. However, musical sound sources are often strongly correlated in time and frequency, so a decomposition of a musical recording is often infeasible without additional knowledge about the sources. The task is extremely difficult for computer systems, although the human auditory system can easily distinguish the vocal and instrumental parts of a music recording. The main challenges come from the variety of simultaneous sound sources as well as the rich pitch and timbre variations of the singing voice. As summarized below, many algorithms have recently been proposed to separate the singing voice from music recordings, yet progress is still limited.

Robust Principal Component Analysis (RPCA) [12] is a matrix factorization algorithm for recovering underlying low-rank and sparse matrices: given a large data matrix X, it is assumed that X can be decomposed as \(X=A+E\), where A is a low-rank matrix and E is a sparse matrix. Based on RPCA, Huang et al. [13] separated the singing voice from the music accompaniment with the assumption that the repetitive music accompaniment lies in a low-rank subspace, while the singing voice can be regarded as relatively sparse within songs. The main drawback of this approach is that the resulting sparse matrix often contains instrumental solos or percussion [14, 15]. Yang [14] further incorporated harmonicity priors and a back-end drum removal procedure to improve the separation. Su and Yang proposed a novel artist identification method based on sparse features learned from both the magnitude and phase parts of the spectrum [15]. Papadopoulos et al. [16] presented an adaptive formulation of RPCA that incorporates music content information to guide the decomposition.

Further, the data samples can be represented as linear combinations of the bases in a given dictionary [17], i.e., \(X=DZ+E\), where X is the input matrix, D is a dictionary, Z is a coefficient matrix and E is a sparse matrix. For music spectrograms, the repetitive instrumental part is assumed to lie in a low-rank matrix, while the vocal part is assumed to lie in a sparse matrix. Because the vocal part of a song can sometimes be low-rank as well, the quality of such a separation may be limited. Yang [18] considered both the singing voice and the music accompaniment as low-rank matrices and employed an online dictionary learning algorithm [19] to learn the structures of singing voice and accompaniment sounds from clean vocal and instrumental signals, which were introduced into the decomposition process as prior knowledge.

Chen and Ellis applied the RPCA framework to speech enhancement under the assumption that the background noise is low-rank and the speech is sparse [20, 21]. Unlike the original RPCA, they decomposed the sparse component into the product of a pre-learned speech dictionary and a sparse activation matrix, and modeled the background noise as the sum of a low-rank matrix and a residual [20]. This approach, however, cannot be directly used for singing voice separation in music, because background music is often much more non-stationary than background noise and may not be well represented by a low-rank matrix plus a residual. To solve this problem, Yu et al. [22] proposed Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) under the Alternating Direction Method of Multipliers (ADMM) framework for singing voice separation. First, they pre-learned universal voice and music dictionaries from isolated singing voice and background music training data. Then, in addition to using a sparse spectrogram and a low-rank spectrogram to model the singing voice and the background music, respectively, they added a residual term to capture the components that are not well modeled by either the sparse or the low-rank term.

A partial solution to the problem that the sparse matrix contains instrumental solos or percussion is to incorporate reliable annotations for the sparse part using informed RPCA (hereafter RPCAi) [23]. Chan et al. [24] presented a modified RPCA algorithm that uses vocal activity information to separate the singing voice from the instrumental accompaniment; this work represents one of the first attempts to incorporate vocal activity information into the RPCA algorithm, while vocal activity detection itself has been studied extensively [25, 26]. Ikemiya et al. [27] proposed a framework that improves both vocal F0 estimation and singing voice separation by exploiting the mutual dependency of the two tasks. Motivated by these works, we present a first model named Low-rank, Sparse representation with Pre-learned Dictionaries and side Information (LSPDi) under the ADMM framework. First, we pre-learn two dictionaries for the foreground singing voice and the background music and use a sparse spectrogram and a low-rank spectrogram to model them, respectively. Then, a residual term is added to capture the components that are not well modeled by either the sparse or the low-rank term. Finally, we incorporate the voice spectrogram reconstructed from the annotation when separating the vocal and the music.

However, low-rank optimization is computationally inefficient due to the use of singular value decompositions. To motivate a new representation, Chan et al. [28] proposed to separate the singing voice by an informed group-sparse representation, incorporating pitch annotations in the spirit of informed separation. In jazz and popular music, a few chord symbols are enough to compactly represent the harmonic structure of a piece, so the accompaniment admits a representation with many empty rows. A promising strategy for the inverse problem is therefore to encourage row sparsity given an instrumental dictionary. On this basis, we present a second model named Group-Sparse, Sparse representation with Pre-learned Dictionaries (GSPD) and a third model named Group-Sparse, Sparse representation with Pre-learned Dictionaries and side Information (GSPDi). First, we use a sparse spectrogram and a group-sparse spectrogram to model the singing voice and the background music, respectively. In addition, a residual term is added to capture the components that are not identified by either the group-sparse or the sparse part. Specifically, we pre-learn the voice and music dictionaries from clean singing voice and background music recordings. Evaluations on the iKala dataset [29] show that the proposed models outperform existing methods.

The rest of this paper is organized as follows. An overview of the existing model is presented in Sect. 2. Section 3 presents the proposed methods. In Sect. 4, the dictionary learning and the vocal activity information are described. The simulation results are presented in Sect. 5, and Sect. 6 concludes this work.

2 Existing algorithm

The Low-rank and Sparse representation with Pre-learned Dictionaries (LSPD) method [22] is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2}}\lambda _{1}\Vert Z_{1}\Vert _{*}+\lambda _{2}\Vert Z_{2}\Vert _{1} +\lambda _{3}\Vert E\Vert _{1}\\&s.t.\,X=D_{1}Z_{1}+D_{2}Z_{2}+E \end{aligned} \end{aligned}$$
(1)

where X is the input spectrogram, \(D_{1}\in R^{m\times k_{1}}\) is a pre-learned dictionary of the music accompaniment, \(D_{2}\in R^{m\times k_{2}}\) is a pre-learned dictionary of the singing voice, \(D_{1}Z_{1}\) is the separated instrumentals, \(D_{2}Z_{2}\) is the separated voice, E denotes the residual part. \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are three weighting parameters for balancing the different regularization terms in this model.

It is worth noting that LSPD uses a sparse spectrogram and a low-rank spectrogram to model the singing voice and the background music, respectively, and adds a residual term to capture the components that are not well modeled by either the sparse or the low-rank term, which improves on related existing methods to some extent [13, 18, 30]. In the following section, we present three new algorithms.

3 The proposed method

3.1 Low-rank, sparse representation with pre-learned dictionaries and side information (LSPDi)

In order to further improve the separation quality of LSPD, we consider additional prior information, i.e., the voice spectrogram reconstructed from the annotation, and add it to the following model,

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2}} \lambda _{1}\Vert Z_{1}\Vert _{*}+\lambda _{2}\Vert Z_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1} +\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\&s.t.\,X=D_{1}Z_{1}+D_{2}Z_{2}+E \end{aligned} \end{aligned}$$
(2)

Here, all parameters in model (2) are the same as in model (1), and \(E_{0}\) denotes the voice spectrogram reconstructed from the annotation. \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm (the square root of the sum of the squares of the elements), so the quadratic term penalizes the difference between the separated voice \(D_{2}Z_{2}\) and the reconstructed voice spectrogram \(E_{0}\). In the following, we again use the ADMM algorithm [31] to solve the optimization problem by introducing two auxiliary variables \(J_{1}\) and \(J_{2}\) as well as three equality constraints,

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2},J_{1},J_{2}} \lambda _{1}\Vert J_{1}\Vert _{*}+\lambda _{2}\Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}\\&\quad +\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\&s.t. \,X=D_{1}Z_{1}+D_{2}Z_{2}+E, Z_{1}=J_{1}, Z_{2}=J_{2} \end{aligned} \end{aligned}$$
(3)

The unconstrained augmented Lagrangian \({\mathcal {L}}\) is given by

$$\begin{aligned} \begin{aligned} {\mathcal {L}} &=\lambda _{1}\Vert J_{1}\Vert _{*}+\lambda _{2} \Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}+\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\ & \quad +\,<Y_{1},X-D_{1}Z_{1}-D_{2}Z_{2}-E>\\ & \quad +\,<Y_{2},Z_{1}-J_{1}>+<Y_{3},Z_{2}-J_{2}>\\ & \quad +\,\frac{\mu }{2}\left( \Vert X-D_{1}Z_{1}-D_{2}Z_{2}-E\Vert _{F}^{2}+\Vert Z_{1}-J_{1}\Vert _{F}^{2}+\Vert Z_{2}-J_{2}\Vert _{F}^{2}\right) \end{aligned} \end{aligned}$$
(4)

where \(Y_{1}\), \(Y_{2}\), \(Y_{3}\) are the Lagrange multipliers. We then iteratively update the solutions for \(J_{1}\), \(Z_{1}\), \(J_{2}\) and \(Z_{2}\).

Specifically, \(J_{2}\) is updated first,

$$\begin{aligned} J_{2}= & {} \arg \min \limits _{J_{2}}\lambda _{2}\Vert J_{2}\Vert _{1}+\frac{\mu }{2}\Vert J_{2}-\left( Z_{2}+\mu ^{-1}Y_{3}\right) \Vert _{F}^{2} \end{aligned}$$
(5)

that can be solved by the soft-threshold operator

$$\begin{aligned} J_{2}=S_{\frac{\lambda _{2}}{\mu }}\left( Z_{2}+\mu ^{-1}Y_{3}\right) \end{aligned}$$
(6)

Since the spectrogram is non-negative, we further project onto the non-negative orthant,

$$\begin{aligned} J_{2}=\max \left\{ S_{\frac{\lambda _{2}}{\mu }}\left( Z_{2}+\mu ^{-1}Y_{3}\right) ,0\right\} \end{aligned}$$
(7)

where 0 is an all-zero matrix of the same size as \(J_2\).
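For reference, Eqs. (6)-(7) amount to a single NumPy operation. The sketch below is our own illustration (the function names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(V, tau):
    """Element-wise soft-thresholding S_tau(V) = sign(V) * max(|V| - tau, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def update_J2(Z2, Y3, mu, lam2):
    """J2 update of Eqs. (6)-(7): soft-threshold, then project onto the
    non-negative orthant because the spectrogram is non-negative."""
    return np.maximum(soft_threshold(Z2 + Y3 / mu, lam2 / mu), 0.0)
```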

Then, update \(Z_{2}\), setting \(\frac{\partial {\mathcal {L}}}{\partial Z_{2}}=0\),

$$\begin{aligned} Z_{2}=\left( \left( \frac{\gamma }{\mu }+1\right) D_{2}^{T}D_{2}+I\right) ^{-1}\left( D_{2}^{T}\left( X-D_{1}Z_{1}-E+\frac{\gamma }{\mu }E_{0}+\frac{1}{\mu }Y_{1}\right) -\frac{1}{\mu }Y_{3}+J_{2}\right) \end{aligned}$$
(8)

The other variables can be updated in a similar way. The detailed algorithm is given below.

LSPDi

Input: X, \(D_{1}\), \(D_{2}\), \(E_{0}\)

Output: \(Z_{1}\), \(Z_{2}\)

Initialization: \(Z_{1}=0\), \(Z_{2}=0\), \(J_{1}=0\), \(J_{2}=0\), \(E=0\), \(Y_{1}=0\), \(Y_{2}=0\), \(Y_{3}=0\)

while not converged do

Update \(Z_{1}\), \(Z_{2}\):

\(Z_{1}=(D_{1}^{T}D_{1}+I)^{-1}(D_{1}^{T}(X-D_{2}Z_{2}-E+\mu ^{-1}Y_{1})-\mu ^{-1}Y_{2}+J_{1})\)

\(Z_{2}=\left(\left(\frac{\gamma }{\mu}+1\right)D_{2}^{T}D_{2}+I\right)^{-1}\left(D_{2}^{T}\left(X-D_{1}Z_{1}-E+\frac{\gamma}{\mu }E_{0}+\frac{1}{\mu }Y_{1}\right)-\frac{1}{\mu}Y_{3}+J_{2}\right)\)

Update \(J_{1}\), \(J_{2}\), E:

\(U\varSigma V^{T}={\text {svd}}(Z_{1}+\frac{1}{\mu }Y_{2})\), \(J_{1}=US_{\frac{\lambda _{1}}{\mu }}[\varSigma ]V^{T}\)

\(J_{2}=S_{\frac{\lambda _{2}}{\mu }}(Z_{2}+\frac{1}{\mu }Y_{3})\)

\(E=S_{\frac{\lambda _{3}}{\mu }}(X-D_{1}Z_{1}-D_{2}Z_{2}+\mu ^{-1}Y_{1})\)

Update \(Y_{1}\), \(Y_{2}\), \(Y_{3}\):

\(Y_{1}=Y_{1}+\mu (X-D_{1}Z_{1}-D_{2}Z_{2}-E)\)

\(Y_{2}=Y_{2}+\mu (Z_{1}-J_{1})\)

\(Y_{3}=Y_{3}+\mu (Z_{2}-J_{2})\)

end while
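For concreteness, the loop above can be written compactly in NumPy. The following sketch is our own illustration of the LSPDi updates, not the authors' released code; it assumes a fixed penalty \(\mu\), a fixed iteration budget, and applies the non-negative projection of Eq. (7) to \(J_{2}\).

```python
import numpy as np

def soft(V, tau):
    """Element-wise soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def lspdi(X, D1, D2, E0, lam1, lam2, lam3, gamma, mu=1.0, n_iter=200, tol=1e-6):
    """Illustrative ADMM loop for LSPDi (Eq. (3)); D1 @ Z1 approximates the
    accompaniment and D2 @ Z2 the singing voice."""
    m, n = X.shape
    k1, k2 = D1.shape[1], D2.shape[1]
    Z1, J1, Y2 = (np.zeros((k1, n)) for _ in range(3))
    Z2, J2, Y3 = (np.zeros((k2, n)) for _ in range(3))
    E, Y1 = np.zeros((m, n)), np.zeros((m, n))
    I1, I2 = np.eye(k1), np.eye(k2)
    for _ in range(n_iter):
        # Z-updates obtained by setting the gradient of the Lagrangian to zero
        Z1 = np.linalg.solve(D1.T @ D1 + I1,
                             D1.T @ (X - D2 @ Z2 - E + Y1 / mu) - Y2 / mu + J1)
        Z2 = np.linalg.solve((gamma / mu + 1.0) * (D2.T @ D2) + I2,
                             D2.T @ (X - D1 @ Z1 - E + (gamma / mu) * E0 + Y1 / mu)
                             - Y3 / mu + J2)
        # J1: singular-value thresholding; J2: soft-thresholding + non-negativity (Eq. (7))
        U, s, Vt = np.linalg.svd(Z1 + Y2 / mu, full_matrices=False)
        J1 = U @ np.diag(np.maximum(s - lam1 / mu, 0.0)) @ Vt
        J2 = np.maximum(soft(Z2 + Y3 / mu, lam2 / mu), 0.0)
        # residual term
        E = soft(X - D1 @ Z1 - D2 @ Z2 + Y1 / mu, lam3 / mu)
        # multiplier (dual) updates
        R = X - D1 @ Z1 - D2 @ Z2 - E
        Y1, Y2, Y3 = Y1 + mu * R, Y2 + mu * (Z1 - J1), Y3 + mu * (Z2 - J2)
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(X), 1.0):
            break
    return Z1, Z2
```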

3.2 Group-sparse, sparse representation with pre-learned dictionaries (GSPD)

Due to the use of singular value decompositions, low-rank optimization is computationally inefficient, which limits its application to large-scale data. Here, group sparsity is used as a replacement when designing the model,

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2}}\lambda _{1} \Vert Z_{1}^{T}\Vert _{2,1}+\lambda _{2}\Vert Z_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}\\&s.t. \, X=D_{1}Z_{1}+D_{2}Z_{2}+E \end{aligned} \end{aligned}$$
(9)

where \(\Vert Z^{T}\Vert _{2,1}=\sum _{i}\sqrt{\sum _{j}Z_{ij}^{2}}\) is the row-sparsity norm, i.e., the sum of the \(l_{2}\)-norms of the rows of Z, which encourages Z to have many empty rows in its representation [32].

The above formulation is not trivial to solve, since the \(\Vert \cdot \Vert _{2,1}\) and \(\Vert \cdot \Vert _{1}\) norms are not smooth and an additional equality constraint must be handled. Therefore, the alternating direction method of multipliers (ADMM) [33] is applied to this model. ADMM works by first rewriting the constraint(s) into an augmented Lagrange function and then updating each variable in an alternating fashion until convergence. Thus, to solve (9), we first introduce two auxiliary variables \(J_{1}\) and \(J_{2}\) for the alternating updates and rewrite the optimization problem as follows:

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2},J_{1},J_{2}} \lambda _{1} \Vert J_{1}^{T}\Vert _{2,1}+\lambda _{2} \Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}\\&s.t.\, X=D_{1}Z_{1}+D_{2}Z_{2}+E, Z_{1}=J_{1}, Z_{2}=J_{2} \end{aligned} \end{aligned}$$
(10)

The unconstrained augmented Lagrangian \({\mathcal {L}}\) is given by

$$\begin{aligned} \begin{aligned} {\mathcal {L}}&=\lambda _{1} \Vert J_{1}^{T}\Vert _{2,1}+\lambda _{2}\Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}\\&\quad +\,<Y_{1},X-D_{1}Z_{1}-D_{2}Z_{2}-E>\\&\quad +\,<Y_{2},Z_{1}-J_{1}>+<Y_{3},Z_{2}-J_{2}>\\&\quad +\,\frac{\mu }{2}\left( \Vert X-D_{1}Z_{1}-D_{2}Z_{2}-E\Vert _{F}^{2}+\Vert Z_{1}-J_{1}\Vert _{F}^{2}+\Vert Z_{2}-J_{2}\Vert _{F}^{2}\right) \end{aligned} \end{aligned}$$
(11)

where \(Y_{1}\), \(Y_{2}\) and \(Y_{3}\) are the Lagrange multipliers. Model (11) can be minimized with respect to \(J_1\), \(J_2\), \(Z_1\), \(Z_2\) and E in turn, fixing the other variables, after which the Lagrange multipliers \(Y_1\), \(Y_2\) and \(Y_3\) are updated. For example, the minimization over \(J_1\) reduces to

$$\begin{aligned} \begin{aligned} J_{1}&=\arg \min \limits _{J_{1}} \lambda _{1}\Vert J_{1}^{T}\Vert _{2,1}+\frac{\mu }{2}\Vert J_{1}-\left( Z_{1}+\mu ^{-1}Y_{2}\right) \Vert _{F}^{2}\\&=\left( \left( 1-\frac{\lambda _{1}}{\mu \Vert \left( Z_{1}+\mu ^{-1}Y_{2} \right) _{i}\Vert }\right) _{+}\left( Z_{1}+\mu ^{-1}Y_{2}\right) _{i}\right) _{i=1}^{k} \end{aligned} \end{aligned}$$
(12)

where k is the number of rows of \(J_1\), \(A_i\) denotes the ith row of A, \((B_i)_{i=1}^{k}=(B_{1}^{T}, \ldots , B_{k}^{T})\) and \(C_{+}=\max (0,C)\).
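The row-wise shrinkage of Eq. (12) can be vectorized directly. The following NumPy sketch is our own illustration, with a small constant added to avoid division by zero; it applies the threshold \(\lambda _{1}/\mu\) to each row.

```python
import numpy as np

def row_group_shrink(V, tau):
    """Scale each row v_i of V by (1 - tau / ||v_i||_2)_+ ; rows with l2-norm
    below tau are set to zero (proximal operator of tau * ||.||_{2,1})."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * V

# J1 update of Eq. (12):
# J1 = row_group_shrink(Z1 + Y2 / mu, lam1 / mu)
```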

Update \(Z_{1}\), setting \(\frac{\partial {\mathcal {L}}}{\partial Z_{1}}=0\),

$$\begin{aligned} \begin{aligned} Z_{1}&=\left( D_{1}^{T}D_{1}+I\right) ^{-1}\left( D_{1}^{T} \left( X-D_{2}Z_{2}-E+\mu ^{-1}Y_{1}\right) \right. \\&\quad \left. -\,\mu ^{-1}Y_{2}+J_{1}\right) \end{aligned} \end{aligned}$$
(13)

The other variables can be updated in a similar way. The complete algorithm is given below.

GSPD

Input: X, \(D_{1}\), \(D_{2}\)

Output: \(Z_{1}\), \(Z_{2}\)

Initialization: \(Z_{1}=0\), \(Z_{2}=0\), \(J_{1}=0\), \(J_{2}=0\), \(E=0\), \(Y_{1}=0\), \(Y_{2}=0\), \(Y_{3}=0\)

while not converged do

Update \(Z_{1}\), \(Z_{2}\):

\(Z_{1}=(D_{1}^{T}D_{1}+I)^{-1}(D_{1}^{T}(X-D_{2}Z_{2}-E+\mu ^{-1}Y_{1})-\mu ^{-1}Y_{2}+J_{1})\)

\(Z_{2}=(D_{2}^{T}D_{2}+I)^{-1}(D_{2}^{T}(X-D_{1}Z_{1}-E+\mu ^{-1}Y_{1})-\mu ^{-1}Y_{3}+J_{2})\)

Update \(J_{1}\), \(J_{2}\), E:

\(J_{1}=\left(\left(1-\frac{\lambda _{1}}{\mu \Vert (Z_{1}+\mu ^{-1}Y_{2})_{i}\Vert }\right)_{+}(Z_{1}+\mu ^{-1}Y_{2})_{i}\right)_{i=1}^{k}\)

\(J_{2}=S_{\frac{\lambda _{2}}{\mu }}(Z_{2}+\frac{1}{\mu }Y_{3})\)

\(E=S_{\frac{\lambda _{3}}{\mu }}(X-D_{1}Z_{1}-D_{2}Z_{2}+\mu ^{-1}Y_{1})\)

Update \(Y_{1}\), \(Y_{2}\), \(Y_{3}\):

\(Y_{1}=Y_{1}+\mu (X-D_{1}Z_{1}-D_{2}Z_{2}-E)\)

\(Y_{2}=Y_{2}+\mu (Z_{1}-J_{1})\)

\(Y_{3}=Y_{3}+\mu (Z_{2}-J_{2})\)

end while

3.3 Group-sparse, sparse representation with pre-learned dictionaries and side information (GSPDi)

Furthermore, additional prior information, i.e., the voice spectrogram reconstructed from the annotation, is incorporated in the following model,

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2}} \lambda _{1}\Vert Z_{1}^{T}\Vert _{2,1}+\lambda _{2}\Vert Z_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1} +\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\&s.t.\, X=D_{1}Z_{1}+D_{2}Z_{2}+E, \end{aligned} \end{aligned}$$
(14)

To solve (14), we first introduce two auxiliary variables \(J_{1}\) and \(J_{2}\) for the alternating updates and rewrite the optimization problem as follows:

$$\begin{aligned} \begin{aligned}&\min \limits _{Z_{1},Z_{2},J_{1},J_{2}} \lambda _{1}\Vert J_{1}^{T}\Vert _{2,1}+\lambda _{2} \Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1}+\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\&s.t.\, X=D_{1}Z_{1}+D_{2}Z_{2}+E, Z_{1}=J_{1}, Z_{2}=J_{2}, \end{aligned} \end{aligned}$$
(15)

The unconstrained augmented Lagrangian \({\mathcal {L}}\) is given by

$$\begin{aligned} \begin{aligned} {\mathcal {L}}&= \lambda _{1}\Vert J_{1}^{T}\Vert _{2,1}+\lambda _{2}\Vert J_{2}\Vert _{1}+\lambda _{3}\Vert E\Vert _{1} +\frac{\gamma }{2}\Vert D_{2}Z_{2}-E_{0}\Vert ^{2}_{F}\\&\quad +\,<Y_{1},X-D_{1}Z_{1}-D_{2}Z_{2}-E>\\&\quad +\,<Y_{2},Z_{1}-J_{1}>+<Y_{3},Z_{2}-J_{2}>\\&\quad +\,\frac{\mu }{2}\left( \Vert X-D_{1}Z_{1}-D_{2}Z_{2}-E\Vert _{F}^{2}+\Vert Z_{1}-J_{1}\Vert _{F}^{2}+\Vert Z_{2}-J_{2}\Vert _{F}^{2}\right) \end{aligned} \end{aligned}$$
(16)

where \(Y_{1}\), \(Y_{2}\), and \(Y_{3}\) are the Lagrange multipliers, \(<\cdot ,\cdot>\) denotes the trace inner product, and \(\mu\) is a positive penalty parameter. We then iteratively update the solutions for \(J_{1}\), \(Z_{1}\), \(J_{2}\) and \(Z_{2}\). The variable updates are similar to those of the two algorithms in Sects. 3.1 and 3.2.

The proposed algorithm is described below.

GSPDi

Input: X, \(D_{1}\), \(D_{2}\), \(E_{0}\)

Output: \(Z_{1}\), \(Z_{2}\)

Initialization: \(Z_{1}=0\), \(Z_{2}=0\), \(J_{1}=0\), \(J_{2}=0\), \(E=0\), \(Y_{1}=0\), \(Y_{2}=0\), \(Y_{3}=0\)

while not converged do

Update \(Z_{1}\), \(Z_{2}\):

\(Z_{1}=(D_{1}^{T}D_{1}+I)^{-1}(D_{1}^{T}(X-D_{2}Z_{2}-E+\mu ^{-1}Y_{1})-\mu ^{-1}Y_{2}+J_{1})\)

\(Z_{2}=\left(\left(\frac{\gamma }{\mu }+1\right)D_{2}^{T}D_{2}+I\right)^{-1}\left(D_{2}^{T}\left(X-D_{1}Z_{1}-E+\frac{\gamma }{\mu }E_{0}+\frac{1}{\mu }Y_{1}\right)-\frac{1}{\mu }Y_{3}+J_{2}\right)\)

Update \(J_{1}\), \(J_{2}\), E:

\(J_{1}=\left(\left(1-\frac{\lambda _{1}}{\mu \Vert (Z_{1}+\mu ^{-1}Y_{2})_{i}\Vert }\right)_{+}(Z_{1}+\mu ^{-1}Y_{2})_{i}\right)_{i=1}^{k}\)

\(J_{2}=S_{\frac{\lambda _{2}}{\mu }}(Z_{2}+\frac{1}{\mu }Y_{3})\)

\(E=S_{\frac{\lambda _{3}}{\mu }}(X-D_{1}Z_{1}-D_{2}Z_{2}+\mu ^{-1}Y_{1})\)

Update \(Y_{1}\), \(Y_{2}\), \(Y_{3}\):

\(Y_{1}=Y_{1}+\mu (X-D_{1}Z_{1}-D_{2}Z_{2}-E)\)

\(Y_{2}=Y_{2}+\mu (Z_{1}-J_{1})\)

\(Y_{3}=Y_{3}+\mu (Z_{2}-J_{2})\)

end while

4 Dictionary and \(E_0\)

4.1 Dictionary

We adopt the idea of Online Dictionary Learning for Sparse Coding (ODL) [19] to learn the singing voice and background music dictionaries from the isolated training signals.

Given N signals \(x_{i} \in \mathbb {R}^{m}\), ODL learns a dictionary D by solving the following joint optimization problem,

$$\begin{aligned} \begin{aligned}&\min \limits _{D\ge 0,\alpha }\frac{1}{N}\sum \limits _{i=1}^{N} \left( \frac{1}{2}\Vert x_{i}-D\alpha _{i}\Vert _{2}^{2}+\lambda \Vert \alpha _{i}\Vert _{1}\right) \\&s.t.\, d_{j}^{T}d_{j}\le 1,\alpha _{i}\ge 0 \end{aligned} \end{aligned}$$
(17)

where \(\Vert \cdot \Vert _{2}\) denotes the Euclidean norm and \(\lambda\) is a regularization parameter. The input frames are extracted from the training set after the short-time Fourier transform (STFT). Our implementation of ODL is based on the SPAMS toolbox [19]. Following [28], we set the dictionary size to 100 atoms.
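Purely as an illustration, a rough equivalent of this dictionary-learning step can be sketched with scikit-learn's MiniBatchDictionaryLearning in place of the SPAMS call used here; the library choice, function name and parameter values below are our assumptions, not the exact setup of the paper.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(frames, n_atoms=100, lam=0.1):
    """Learn a non-negative spectral dictionary from magnitude-STFT frames of an
    isolated source.  frames: (n_frames, n_bins); returns D of shape (n_bins, n_atoms)."""
    odl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=lam,
                                      positive_dict=True, positive_code=True,
                                      transform_algorithm='lasso_lars')
    odl.fit(frames)                       # rows of components_ are the learned atoms
    D = odl.components_.T                 # one atom per column
    # enforce d_j^T d_j <= 1 as in Eq. (17)
    D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D
```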

4.2 The reconstructed voice spectrogram from the annotation (\(E_{0}\))

To get the voice spectrogram reconstructed from the annotation, we first transform the human-labeled vocal pitch contours into a time-frequency binary mask. The authors of [27] proposed a harmonic mask, similar to that of [34], which only passes integer multiples of the vocal fundamental frequency [35, 36],

$$\begin{aligned} M(f,t) =\left\{ \begin{aligned} 1,&\quad \text {if}\ |f-nF_{0}(t)|<w/2,\ \exists n\in N^{+}\\ 0,&\quad \text {otherwise}. \end{aligned} \right. \end{aligned}$$
(18)

Here, \(F_{0}(t)\) is the vocal fundamental frequency at time t, n is the order of the harmonic, and w is the width of the mask. Then, we simply define the vocal annotations as \(E_{0}=X\circ M\), where \(\circ\) denotes the Hadamard product.
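A minimal sketch of this mask construction is given below, assuming the annotated pitch contour has already been resampled to the STFT frame rate; the function and variable names are our own.

```python
import numpy as np

def harmonic_mask(freqs, f0, width):
    """Binary harmonic mask of Eq. (18).
    freqs: (n_bins,) bin centre frequencies in Hz; f0: (n_frames,) annotated vocal
    F0 per frame in Hz (0 for unvoiced frames); width: mask width w in Hz."""
    M = np.zeros((len(freqs), len(f0)))
    for t, f0_t in enumerate(f0):
        if f0_t <= 0:                       # no vocal annotation in this frame
            continue
        for n in range(1, int(freqs[-1] // f0_t) + 1):
            M[np.abs(freqs - n * f0_t) < width / 2.0, t] = 1.0
    return M

# E0 = X * M   (Hadamard product with the mixture magnitude spectrogram X)
```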

5 Evaluation

5.1 Dataset

Our evaluation is based on the iKala dataset [29]. The iKala dataset contains 352 30-s clips of Chinese popular songs in CD quality. The singers and musicians are professionals.

Following [28], in our experiments we randomly select 44 songs for training (i.e., learning the dictionaries \(D_{1}\) and \(D_{2}\)), leaving 208 songs for testing the separation performance. The vocals and instrumentals are mixed at a signal-to-noise ratio of 0 dB. To reduce the computational cost and the memory footprint of the proposed algorithms, we downsample all the audio recordings from 44,100 to 22,050 Hz. Then, the STFT is computed by sliding a Hamming window of 1411 samples with 75% overlap to obtain the spectrogram [29].
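For illustration, this preprocessing might look as follows using librosa; the library choice and function name are assumptions, not the exact tooling of the paper.

```python
import numpy as np
import librosa

def mixture_spectrogram(path, sr=22050, win=1411):
    """Load a clip, resample to 22,050 Hz and return the magnitude STFT computed
    with a 1411-sample Hamming window and 75% overlap."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    hop = win // 4                                   # 75% overlap
    S = librosa.stft(y, n_fft=win, hop_length=hop,
                     win_length=win, window='hamming')
    return np.abs(S)                                 # spectrogram X (n_bins x n_frames)
```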

5.2 Evaluation

To measure the quality of the separated singing voice \(\widehat{v}\) with respect to the original clean singing voice v, we use the source-to-interference ratio (SIR), source-to-artifacts ratio (SAR) and source-to-distortion ratio (SDR) provided in the commonly used BSS EVAL toolbox version 3.0.

The source-to-distortion ratio (SDR) [37] is computed as follows,

$$\begin{aligned} {\text {SDR}}(\widehat{v},v)=10\log _{10}\left[ \frac{<\widehat{v},v>^{2}}{\Vert \widehat{v}\Vert ^{2}\Vert v\Vert ^{2}-<\widehat{v},v>^{2}}\right] . \end{aligned}$$
(19)

Normalized SDR (NSDR) is the improvement of SDR from the original mixture x to the separated singing voice \(\widehat{v}\) [38, 39], and is commonly used to measure the separation performance for each mixture:

$$\begin{aligned} {\text {NSDR}}(\widehat{v},v,x) = {\text {SDR}}(\widehat{v},v)-{\text {SDR}}(x,v). \end{aligned}$$
(20)

For overall performance evaluation, the global NSDR (GNSDR) is calculated as,

$$\begin{aligned} {\text {GNSDR}} = \frac{\sum ^{N}_{i=1}w_{i}{\text{NSDR}}\left( \widehat{v}_{i},v_{i},x_{i}\right) }{\sum ^{N}_{i=1}w_{i}}, \end{aligned}$$
(21)

where N is the total number of songs and \(w_{i}\) is the length of the ith song. We calculate the weighted averages of SDR, SIR and SAR over the different clips in a similar way, giving the Global SDR (GSDR), Global SIR (GSIR) and Global SAR (GSAR). Higher values of SDR, SIR, SAR, GSDR, GSIR, GSAR and GNSDR indicate better separation quality.
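The metrics themselves are computed with the BSS EVAL toolbox; purely to illustrate Eqs. (19)-(21), a direct NumPy evaluation could look like the following sketch (function names are ours).

```python
import numpy as np

def sdr(v_hat, v):
    """Source-to-distortion ratio of Eq. (19) for 1-D signals."""
    cross = np.dot(v_hat, v) ** 2
    return 10.0 * np.log10(cross / (np.dot(v_hat, v_hat) * np.dot(v, v) - cross))

def gnsdr(estimates, references, mixtures):
    """Length-weighted global NSDR of Eqs. (20)-(21) over a list of clips."""
    num = den = 0.0
    for v_hat, v, x in zip(estimates, references, mixtures):
        num += len(v) * (sdr(v_hat, v) - sdr(x, v))   # w_i * NSDR_i
        den += len(v)                                 # w_i
    return num / den
```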

5.3 Parameter selection

The proposed methods for singing voice separation fall into two categories.

  1. The first category uses a low-rank representation:

     • LSPD: supervised method proposed by Yu et al. [22].

     • LSPDi: the proposed method with low-rank representation and the voice spectrogram reconstructed from the annotation.

  2. The second category uses a group-sparse representation:

     • GSPD: the proposed method with group-sparse representation.

     • GSPDi: the proposed method with group-sparse representation and the voice spectrogram reconstructed from the annotation.

Note that LSPD and GSPD each have three parameters, while LSPDi and GSPDi each have four.

During parameter selection, we use the global normalized source-to-distortion ratio (GNSDR) as the evaluation index; the higher the value, the better the separation quality. For all algorithms, i.e., LSPD, LSPDi, GSPD and GSPDi, we set \(\lambda _{2}=\lambda _{3}=1/\sqrt{\max (m,n)}\) for each \(X\in R^{m\times n}\), similar to the work in [29]. Hence, we only tune \(\lambda _{1}\) and \(\gamma\) in LSPDi and GSPDi.

Fig. 1 Separation performance measured by GNSDR for the singing voice (left) and background music (right), using the proposed method LSPDi

Figure 1 shows the GNSDR values of LSPDi for the separated singing voice and background music. First, fixing \(\lambda _{1}=1\) (or any other value), the GNSDR for the vocal part rises monotonically until it reaches its maximum at \(\gamma = 5\) and then decreases. Next, fixing \(\gamma = 5\), the GNSDR for the accompaniment part first increases, reaches its maximum at \(\lambda _{1}= 1\), and then shows a clear downward trend. Therefore, in LSPDi we use \(\gamma = 5\) and \(\lambda _{1}= 1\). As with LSPDi, we set \(\lambda _{1}= 1\) in LSPD.

Fig. 2 Separation performance measured by GNSDR for the singing voice (left) and background music (right), using the proposed method GSPDi

Figure 2 presents the GNSDR of GSPDi for the separated singing voice and background music. For the vocal part, we can see that, for any value of \(\lambda _{1}\), the GNSDR always reaches its maximum at \(\gamma = 5\). Fixing \(\gamma = 5\), the GNSDR for the accompaniment part reaches its maximum at \(\lambda _{1}= 24\) and then shows a clear downward trend. Therefore, in GSPDi we set \(\lambda _{1}= 24\) and \(\gamma = 5\). As with GSPDi, we set \(\lambda _{1}= 24\) in GSPD.

5.4 Comparison of the proposed method

Experimental results show that incorporating the voice spectrogram reconstructed from the annotation (\(E_0\)) can greatly improve the separation performance. As shown in Table 1 and Figs. 3 and 4, LSPDi and GSPDi are comparable and both achieve higher performance than LSPD and GSPD. Therefore, LSPDi and GSPDi are used in the following comparisons with existing methods.

Fig. 3 Separation performance for the music (left) and vocal (right), via the GNSDR, using LSPD, LSPDi, GSPD and GSPDi from left to right

Table 1 Separation quality for the vocal and music on the iKala dataset using LSPD, LSPDi, GSPD and GSPDi

Fig. 4 Separation performance for the vocal (left) and music (right), via the SDR (top row), SIR (middle row) and SAR (bottom row), using LSPD, LSPDi, GSPD and GSPDi from left to right. The central mark (red horizontal line) in each box is the median of the distribution. Higher values are better

5.5 Algorithms for comparison

The proposed methods are compared with three existing state-of-the-art singing voice separation algorithms: RPCA, the unsupervised method proposed by Huang et al. [13]; LRR, the supervised method proposed by Liu et al. [17]; and GSRi, the supervised method proposed by Chan et al. [28]. For these three baseline algorithms, the original parameter settings are used: \(\lambda\) is set to \(\frac{1}{\sqrt{\max (m,n)}}\) and \(\gamma\) is set to \(\frac{2}{\sqrt{\max (m,n)}}\).

As shown in Fig. 5, the proposed methods achieve a higher global normalized source-to-distortion ratio (GNSDR), which indicates that introducing prior knowledge improves the separation performance.

Fig. 5 Separation performance for the music (left) and vocal (right), via the GNSDR, using RPCA, LRR, GSRi, LSPDi and GSPDi from left to right

Table 2 Separation quality for the vocal and music on the iKala dataset using RPCA, LRR, GSRi, LSPDi and GSPDi

Several results are shown in Table 2. For the vocal part, the proposed algorithms achieve higher GSDR and GSIR than RPCA, LRR and GSRi, which shows that LSPDi and GSPDi have better global separation performance and a better ability to remove the instrumental sounds. For the background music part, the proposed algorithms achieve higher GSIR and GSAR than RPCA, LRR and GSRi, which suggests that LSPDi and GSPDi have better global separation performance, a better ability to remove the singing voice, and better performance in limiting artifacts during the separation process. The differences in GSDR and GSIR suggest that the proposed methods perform particularly well on the background music. These observations show that LSPDi and GSPDi are better able to handle the separation of singing voice and accompaniment.

Fig. 6 Separation performance for the vocal (left) and music (right), via the SDR (top row), SIR (middle row) and SAR (bottom row), using RPCA, LRR, GSRi, LSPDi and GSPDi from left to right. The central mark (red horizontal line) in each box is the median of the distribution. Higher values are better

Figure 6 shows the separation performance for the vocal and background music, respectively, on the iKala dataset, via the SDR (top row), SIR (middle row) and SAR (bottom row). In each column, the boxes from left to right represent Huang’s method RPCA, Liu’s method LRR, Chan’s method GSRi and the proposed LSPDi and GSPDi, respectively.

From Fig. 6, we can see that, for the separation of singing voice and music accompaniment, the proposed algorithms achieve the highest SDR, SIR and SAR. These results indicate that the proposed LSPDi and GSPDi algorithms have better overall separation performance for both the singing voice and the music components (highest SDR), and exhibit a better capability of limiting artifacts and removing interferences during separation.

6 Conclusions

In this paper, we have presented two categories of time-frequency-based source separation algorithms for music signals. LSPD [22] and LSPDi model the vocal and instrumental spectrograms as a sparse matrix and a low-rank matrix, respectively. GSPD and GSPDi model the vocal and instrumental spectrograms as a sparse matrix and a group-sparse matrix, respectively. Moreover, the dictionaries for the singing voice and the background music, pre-learned from isolated singing voice and background music training data, respectively, are successfully utilized to capture more features of the vocal and background music spectrograms, and additional prior information, such as vocal annotation information, is introduced. Future developments of the presented methods could exploit further properties of background music and singing voice; for example, some recent works [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64] that employ signal classification may further motivate such extensions.