1 Introduction

Today the success of deep learning extends beyond academic circles into public knowledge; it is perhaps the most influential machine learning paradigm of the last decade. Dictionary learning, on the other hand, has enjoyed success, but only within academia. The recently proposed ‘deep dictionary learning’ (DDL) [1] combines these two representation learning frameworks.

Dictionary learning is a synthesis approach to representation learning: it learns a dictionary so that the data can be generated/synthesized from the learned coefficients. It is a shallow approach, learning only one level of dictionary. DDL extends it to multiple levels. The technique was proposed in [1]; thorough experimentation [2] showed that it performs better than other unsupervised representation learning tools such as the stacked denoising autoencoder (SDAE) and the deep belief network (DBN). DDL has also shown promise in hyperspectral imaging [3], where it significantly surpasses other deep learning techniques when training samples are limited.

However, the existing solution to DDL is far from optimal; it is greedy. In the first level, the dictionary and the coefficients are learnt from the training data as input. In subsequent levels, the coefficients from the previous level act as the input to dictionary learning. Therefore deeper layers are influenced by shallower ones, but not vice versa. Other deep learning tools (stacked autoencoder, DBN) also follow a greedy learning paradigm, but there the issue of feedback from deeper to shallower layers is resolved during the fine-tuning stage.

In this work we propose to rectify this issue: we learn all the levels of dictionaries (and the coefficients) in a single optimization problem. We do not follow the heuristic greedy pre-training followed by fine-tuning paradigm usually employed in deep learning. Instead, our solution is mathematically principled: the entire DDL problem is solved in one go using the Majorization Minimization approach.

The rest of the paper is organized as follows. Deep dictionary learning and its relationship with other deep learning tools are discussed in the following section. Our proposed solution is derived in Sect. 3. Experimental results are presented in Sect. 4. Finally, the conclusions of this work and future directions of research are discussed in Sect. 5.

2 Background

2.1 Representation Learning

Although deep learning has its roots in neural networks, today its reach extends well beyond simple classification. What is more profound is its abstract representation learning ability. It is believed that by going deeper, one can learn more abstract representations which facilitate analysis tasks.

Fig. 1 a Single representation layer neural network. b Segregated neural network

Figure 1a shows the diagram of a simple neural network with one representation (hidden) layer. The problem is to learn the network weights between the input and the representation and between the representation and the target. This can be thought of as a segregated problem, see Fig. 1b. Learning the mapping between the representation and the target is straightforward: once the representation is known, solving for the network weights between the hidden layer and the output target boils down to a simple non-linear least squares problem. The challenge is to learn the network weights (from the input) and the representation together, since two unknowns must be solved from a single input. Broadly speaking, this is the topic of representation learning.

Fig. 2 a Restricted Boltzmann machine. b Deep Boltzmann machine

The restricted Boltzmann machine (RBM) [4] is one technique to learn the representation layer. The objective is to learn the network weights (W) and the representation (H). This is achieved by optimizing the Boltzmann cost function given by:

$$\begin{aligned} p(W,H)=e^{H^{T}WX} \end{aligned}$$
(1)

Basically, the RBM learns the network weights and the representation/features by maximizing the similarity between the projection of the input (onto the network) and the features in a probabilistic sense. Since the usual constraints of probability apply, degenerate solutions are prevented. The traditional RBM is restrictive: it can handle only binary data. The Gaussian–Bernoulli RBM [5] partially overcomes this limitation and can handle real values between 0 and 1. However, it cannot handle arbitrary valued inputs (real or complex).

The deep Boltzmann machine (DBM) [6, 7] is an extension of the RBM, formed by stacking multiple hidden layers on top of each other (Fig. 2b). The RBM and DBM are undirected graphical models, and both are unsupervised representation learning techniques. For training a deep neural network, targets are attached to the final layer and the whole network is fine-tuned with backpropagation.

The other prevalent technique to train the representation layer of a neural network is the autoencoder [8, 9]. The architecture is shown in Fig. 3a.

$$\begin{aligned} \mathop {\min }\limits _{W,W^{\prime }} \left\| {X-W^{\prime }\varphi (WX)} \right\| _F^2 \end{aligned}$$
(2)

The cost function for the autoencoder is expressed above. W is the encoder and \(W'\) is the decoder. The activation function \(\varphi \) is usually tanh or sigmoid, so that it squashes its input to normalized values (between 0 and 1, or between −1 and +1). The autoencoder learns the encoder and decoder weights such that the reconstruction error is minimized. Essentially, it learns the weights so that the representation \(\varphi (WX)\) retains almost all the information (in the Euclidean sense) of the data, so that the data can be reconstructed from it. Once the autoencoder is learnt, the decoder portion is removed and the target is attached after the representation layer.
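As a concrete illustration, the sketch below evaluates the cost in (2) with NumPy; the function name, matrix shapes and the choice of tanh are ours, not part of the original formulation.

```python
import numpy as np

def autoencoder_cost(X, W, W_dec, act=np.tanh):
    """Reconstruction cost of Eq. (2): ||X - W' act(W X)||_F^2.

    X     : (d, n) data matrix, one sample per column
    W     : (h, d) encoder weights
    W_dec : (d, h) decoder weights (W' in the text)
    """
    H = act(W @ X)        # hidden representation, shape (h, n)
    R = W_dec @ H         # reconstruction, shape (d, n)
    return np.linalg.norm(X - R, 'fro') ** 2
```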

Fig. 3 a Autoencoder. b Stacked autoencoder

To learn multiple layers of representation, autoencoders are nested into one another. This architecture is called the stacked autoencoder, see Fig. 3b. For such a stacked autoencoder, the optimization problem is more involved. For a two-layer stacked autoencoder, the formulation is,

$$\begin{aligned} \mathop {\min }\limits _{W_1 ,W_2 ,W_1^{\prime } ,W_2^{\prime } } \left\| {X-W_1^{\prime } \varphi ( {W_2^{\prime } \varphi \left( {W_2 \varphi \left( {W_1 X} \right) } \right) } )} \right\| _F^2 \end{aligned}$$
(3)

The workaround is to learn the layers in a greedy fashion [10]. First the outermost layer is learnt (see Fig. 4); then, using the features from the outer layer as input, the encoding and decoding weights of the inner layer are learnt.

Fig. 4 Greedy learning

For training deep neural networks, the decoder portion is removed and targets are attached to the innermost encoder layer. The complete structure is then fine-tuned with backpropagation.

2.2 Deep Dictionary Learning

In recent times, the concept of deep learning has extended beyond neural networks. Recent studies on deep multi-task learning [11] and deep distance metric learning [12] exemplify this point. Shallow dictionary learning continues to be an active area of research in machine learning (e.g. [13]). The proposal of DDL [1,2,3] follows in a similar vein.

Fig. 5 a Dictionary learning. b Deep dictionary learning

The standard interpretation of dictionary learning is shown in Fig. 5a. Given the data X, one learns a dictionary \(D_{1}\) so as to synthesize the data from the learnt coefficients Z. Mathematically this is expressed as,

$$\begin{aligned} X=D_1 Z \end{aligned}$$
(4)

There are several versions of supervised dictionary learning for machine learning applications [14, 15]. However, in this work we are interested only in the unsupervised version.

In DDL, the idea is to extend shallow dictionary learning into multiple layers, learning several levels of dictionaries; see Fig. 5b.

Mathematically, the representation at the second layer can be written as:

$$\begin{aligned} X=D_1 \varphi (D_2 Z_2 ) \end{aligned}$$
(5)

Here the definition of \(\varphi \) remains the same as before. Extending this idea, a multi-level dictionary learning problem with non-linear activation can be expressed as,

$$\begin{aligned} X=D_1 \varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) \end{aligned}$$
(6)

In dictionary learning one usually employs a sparsity penalty on the coefficients. This is required for solving inverse problems [16] like denoising, deconvolution and compression; but there is no reason (theoretical or intuitive) to add a sparsity penalty for machine learning problems. The seminal paper that started dictionary learning [17] did not impose any sparsity penalty. Most recent studies in dictionary learning based computer vision use the K-SVD algorithm (originally developed for solving inverse problems) as the workhorse and hence are bound to the sparsity constraint.

Without the sparsity penalty, DDL leads to,

$$\begin{aligned} \mathop {\min }\limits _{D_1 ,\ldots D_N ,Z} \left\| {X-D_1 \varphi ( {D_2 \varphi (\ldots \varphi (D_N Z))} )} \right\| _F^2 \end{aligned}$$
(7)

This problem is highly non-convex and requires solving for a huge number of parameters. With a limited amount of data, it will lead to over-fitting. To address these issues, a greedy approach is followed [1,2,3]. With the substitution \(Z_1 =\varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) \), Eq. (7) can be written as \(X=D_1 Z_1 \), so that it can be solved as single-layer dictionary learning.

$$\begin{aligned} \mathop {\min }\limits _{D_1 ,Z_1 } \left\| {X-D_1 Z_1 } \right\| _F^2 \end{aligned}$$
(8)

This is solved using the method of optimal directions (MOD) [18].
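For readers unfamiliar with MOD, the following is a minimal NumPy sketch of MOD-style alternating least squares for Eq. (8), written without any sparsity penalty as discussed above; the function name, random initialization and iteration count are illustrative assumptions, and the actual implementation in [18] may differ.

```python
import numpy as np

def mod_dictionary_learning(X, n_atoms, n_iter=50, seed=0):
    """Alternating least-squares (MOD-style) fit of X ~ D Z, Eq. (8).

    Without a sparsity penalty both sub-problems are plain least squares
    with closed-form solutions via the pseudoinverse.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)       # unit-norm atoms
    for _ in range(n_iter):
        Z = np.linalg.pinv(D) @ X                        # coefficient update
        D = X @ np.linalg.pinv(Z)                        # dictionary (MOD) update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    Z = np.linalg.pinv(D) @ X                            # coefficients for the final D
    return D, Z
```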

For the second layer, one substitutes \(Z_2 =\varphi (D_3 \ldots \varphi (D_N Z))\), which leads to \(Z_1 =\varphi (D_2 Z_2 )\), or alternately, \(\varphi ^{-1}(Z_1 )=D_2 Z_2 \); this too is a single-layer dictionary learning problem that can be solved using MOD,

$$\begin{aligned} \mathop {\min }\limits _{D_2 ,Z_2 } \left\| {\varphi ^{-1}(Z_1 )-D_2 Z_2 } \right\| _F^2 \end{aligned}$$
(9)

Continuing in a similar fashion till the final layer one has \(Z_{N-1} =\varphi (D_N Z)\) or \(\varphi ^{-1}(Z_{N-1} )=D_N Z\). As before, the final level of dictionary and coefficients can be solved using MOD.
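Putting the levels together, a compact sketch of the greedy training pass is given below. It assumes a tanh activation (so that \(\varphi^{-1}\) is arctanh), reuses the hypothetical mod_dictionary_learning sketch above, and clips coefficients into (−1, 1) purely for numerical safety; none of these choices are prescribed by [1,2,3].

```python
import numpy as np

def greedy_ddl_train(X, atoms_per_level, act_inv=np.arctanh):
    """Greedy deep dictionary learning, Eqs. (8)-(9) and onwards.

    Each level is a single-layer problem solved by MOD; the coefficients of
    one level (after the elementwise inverse activation) become the "data"
    for the next level.
    """
    dictionaries = []
    data = X
    for level, n_atoms in enumerate(atoms_per_level):
        D, Z = mod_dictionary_learning(data, n_atoms)    # sketch from Sect. 2.2
        dictionaries.append(D)
        if level < len(atoms_per_level) - 1:
            Z = np.clip(Z, -0.999, 0.999)                # keep arctanh finite
            data = act_inv(Z)                            # phi^{-1}(Z) feeds the next level
    return dictionaries, Z
```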

This concludes the training stage. During testing, one uses the learnt multi-level dictionaries to generate the coefficients from the test sample. Mathematically one needs to solve,

$$\begin{aligned} \mathop {\min }\limits _{z_{test} } \left\| {x_{test} -D_1 \varphi \left( {D_2 \varphi (\ldots \varphi (D_N z_{test} ))} \right) } \right\| _2^2 \end{aligned}$$
(10)

Using the substitution \(z_1 =\varphi ( {D_2 \varphi (\ldots \varphi (D_N z_{test} ))})\), learning the feature from the first layer turns out to be,

$$\begin{aligned} \mathop {\min }\limits _{z_1 } \left\| {x_{test} -D_1 z_1 } \right\| _2^2 \end{aligned}$$
(11)

This has a simple analytic solution in the form of a pseudoinverse.

With the substitution \(z_2 =\varphi (D_3 \ldots \varphi (D_N z_{test} ))\), one can generate the features at the second level by solving,

$$\begin{aligned} \mathop {\min }\limits _{z_2 } \left\| {z_1 -\varphi (D_2 z_2 )} \right\| _2^2 \equiv \mathop {\min }\limits _{z_2 } \left\| {\varphi ^{-1}(z_1 )-D_2 z_2 } \right\| _2^2 \end{aligned}$$
(12)

The equivalent form has a closed form solution as well. Continuing in this fashion till the final layer, one has

$$\begin{aligned} \mathop {\min }\limits _{z_{test} } \left\| {z_{N-1} -\varphi (D_N z_{test} )} \right\| _2^2 \equiv \mathop {\min }\limits _{z_{test} } \left\| {\varphi ^{-1}(z_{N-1} )-D_N z_{test} } \right\| _2^2 \end{aligned}$$
(13)

One can note that the test phase is not very time consuming. One can precompute the pseudoinverse of the dictionary at each level, and multiply the inputs (after applying the inverse of the activation wherever necessary) by these pseudoinverses. Thus, during testing, one just needs to compute a few matrix–vector products; this is the same as any other deep learning tool in the test phase.
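A sketch of this test-time pipeline (Eqs. (11)–(13)) is shown below, again assuming tanh and clipping only to keep arctanh finite; the function name and shapes are illustrative.

```python
import numpy as np

def ddl_test_features(x_test, dictionaries, act_inv=np.arctanh):
    """Deepest-level feature for a test sample, Eqs. (11)-(13).

    Each level is a closed-form pseudoinverse solve; the inverse activation
    is applied elementwise between levels.
    """
    pinvs = [np.linalg.pinv(D) for D in dictionaries]    # can be precomputed offline
    z = pinvs[0] @ x_test                                # Eq. (11)
    for P in pinvs[1:]:
        z = np.clip(z, -0.999, 0.999)                    # keep arctanh finite (assumption)
        z = P @ act_inv(z)                               # Eqs. (12)-(13)
    return z
```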

It must be noted that greedy DDL is not the same as deep matrix factorization [19]. Deep matrix factorization is a special case of DDL in which the activation functions are linear. Owing to this linearity, one may combine all the levels into a single one; this would collapse the entire deep structure into an equivalent shallow one.

3 Proposed Optimal Algorithm

Our goal is to solve (7). Prior studies on deep dictionary learning were only able to solve it greedily, in a sub-optimal fashion. For convenience, the problem is repeated here.

$$\begin{aligned} \mathop {\min }\limits _{D_1 ,\ldots D_N ,Z} \left\| {X-D_1 \varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) } \right\| _F^2 \end{aligned}$$
(14)

This will be solved using a Majorization Minimization approach. The general outline is discussed in the next sub-section. This is a popular technique in signal processing [20, 21] and machine learning [22, 23], but to the best of our knowledge it has not been used for solving deep learning problems.

3.1 Majorization Minimization

Figure 6 shows the geometrical interpretation behind the Majorization-Minimization (MM) approach. The figure depicts the solution path for a simple scalar problem but essentially captures the MM idea.

Fig. 6 Majorization Minimization

Let \(J(x)\) be the function to be minimized. Start with an initial point \(x_k\) (at \(k=0\)), as in Fig. 6a. A smooth function \(G_k(x)\) is constructed through \(x_k\) which has a higher value than \(J(x)\) for all values of x apart from \(x_k\), at which the two are equal. This is the majorization step. The function \(G_k(x)\) is constructed so that it is smooth and easy to minimize. At each step, \(G_k(x)\) is minimized to obtain the next iterate \(x_{k+1}\) (Fig. 6b). A new majorizer \(G_{k+1}(x)\) is constructed through \(x_{k+1}\) and minimized to obtain the next iterate \(x_{k+2}\) (Fig. 6c). As can be seen, the iterates move closer to the minimizer at every iteration.
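As a small numerical illustration (ours, not from the paper), the snippet below applies MM to a least squares cost \(J(u)=\|x-Du\|_2^2\) with the quadratic majorizer \(G_k(u)=J(u)+(u-u_k)^T(aI-D^TD)(u-u_k)\) used later in this paper; minimizing the majorizer reduces to a Landweber-type step, and J decreases monotonically.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
x = rng.standard_normal(20)

a = np.linalg.eigvalsh(D.T @ D).max()      # majorization constant
u = np.zeros(10)                           # initial point u_0
J = lambda u: np.sum((x - D @ u) ** 2)

for k in range(200):
    # minimiser of the quadratic majorizer G_k(u): a Landweber-type step
    u = u + (D.T @ (x - D @ u)) / a
print(J(u))                                # close to min_u ||x - D u||^2
```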

3.2 Algorithm Derivation

We follow an alternating minimization technique for solving for the multiple levels of dictionaries and the final level of coefficients. In every iteration we need to solve for the N dictionaries and the final-level coefficients Z.

For the first level of dictionary, we need to solve,

$$\begin{aligned} \mathop {\min }\limits _{D_1 } \left\| {X-D_1 \varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) } \right\| _F^2 \end{aligned}$$
(15)

Here it is assumed that the dictionaries \(D_{2}\) to \(D_{N}\) and Z are kept constant while updating \(D_{1}\). For convenience, we express \(Z_1 =\varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) \). Thus (15) can be written as,

$$\begin{aligned} \mathop {\min }\limits _{D_1 } \left\| {X-D_1 Z_1 } \right\| _F^2 \end{aligned}$$
(16)

One does not need Majorization Minimization to solve this: (16) is a simple least squares problem with a closed-form solution.
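For completeness, this closed-form update is a one-liner (an illustrative sketch; the helper name is ours).

```python
import numpy as np

def update_D1(X, Z1):
    """Closed-form least-squares update of Eq. (16): D1 = X Z1^+ (pseudoinverse)."""
    return X @ np.linalg.pinv(Z1)
```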

Once we have solved \(D_{1}\), we need to solve \(D_{2}\), i.e.

$$\begin{aligned} \mathop {\min }\limits _{D_2 } \left\| {X-D_1 \varphi \left( {D_2 \varphi (\ldots \varphi (D_N Z))} \right) } \right\| _F^2 \end{aligned}$$
(17)

Expressing \(Z_2 =\varphi (D_3 \ldots \varphi (D_N Z))\), we get

$$\begin{aligned} \mathop {\min }\limits _{D_2 } \left\| {X-D_1 \varphi \left( {D_2 Z_2 } \right) } \right\| _F^2 \end{aligned}$$
(18)

We need to apply Majorization Minimization from this point onwards. Here \(J(D_2)=\Vert X-D_1 \varphi ( D_2 Z_2 ) \Vert _F^2 \). The majorizer (at the \(k\)th iteration) is,

$$\begin{aligned} G_k (D_2 )= & {} \left\| {X-D_1 \varphi \left( {D_2 Z_2 } \right) } \right\| _F^2 +(\varphi \left( {D_2 Z_2 } \right) -\varphi \left( {D_2 Z_2 } \right) _k )^{T}(aI-D_1^T D_1 )(\varphi \left( {D_2 Z_2 } \right) -\varphi \left( {D_2 Z_2 } \right) _k )\\= & {} X^{T}X-2X^{T}D_1 \varphi \left( {D_2 Z_2 } \right) +\varphi \left( {D_2 Z_2 } \right) ^{T}D_1^T D_1 \varphi \left( {D_2 Z_2 } \right) \\&+\,(\varphi \left( {D_2 Z_2 } \right) -\varphi \left( {D_2 Z_2 } \right) _k )^{T}(aI-D_1^T D_1 )(\varphi \left( {D_2 Z_2 } \right) -\varphi \left( {D_2 Z_2 } \right) _k ) \\= & {} X^{T}X+\varphi \left( {D_2 Z_2 } \right) _k^T (aI-D_1^T D_1 )\varphi \left( {D_2 Z_2 } \right) _k \\&-\,2(X^{T}D_1 +\varphi \left( {D_2 Z_2 } \right) _k^T (aI-D_1^T D_1 ))\varphi \left( {D_2 Z_2 } \right) +a\varphi \left( {D_2 Z_2 } \right) ^{T}\varphi \left( {D_2 Z_2 } \right) \\= & {} a\left( \varphi \left( {D_2 Z_2 } \right) ^{T}\varphi \left( {D_2 Z_2 } \right) -2B_1^T \varphi \left( {D_2 Z_2 } \right) \right) +c \end{aligned}$$

where \(B_1 =\varphi \left( {D_2 Z_2 } \right) _k +\frac{1}{a}D_1^T (X-D_1 \varphi \left( {D_2 Z_2 } \right) _k )\), \(c=X^{T}X+\varphi \left( {D_2 Z_2 } \right) _k^T (aI-D_1^T D_1 )\varphi \left( {D_2 Z_2 } \right) _k \), and a is the maximum eigenvalue of \(D_1^T D_1 \).

Using the identity \(\left\| {X-Y} \right\| _2^2 =X^{T}X-2X^{T}Y+Y^{T}Y\), one can write,

$$\begin{aligned} G_k (D_2 )=a\left\| {B_1 -\varphi \left( {D_2 Z_2 } \right) } \right\| _F^2 -aB_1^T B_1 +c \end{aligned}$$
(19)

Minimizing (19) is therefore the same as minimizing its first term, leaving aside the constants independent of the variable \(D_{2}\). Hence, one can instead minimize

$$\begin{aligned} G_k^{\prime } (D_2 )=\left\| {B_1 -\varphi \left( {D_2 Z_2 } \right) } \right\| _F^2 \end{aligned}$$
(20)

where \(B_1 =\varphi \left( {D_2 Z_2 } \right) _k +\frac{1}{a}D_1^T (X-D_1 \varphi \left( {D_2 Z_2 } \right) _k )\).

Now (20) can be equivalently expressed as,

$$\begin{aligned} \mathop {\min }\limits _{D_2 } \left\| {\varphi ^{-1}(B_1 )-D_2 Z_2 } \right\| _F^2 \end{aligned}$$
(21)

Computing \(\varphi ^{-1}\) is easy since it is an elementwise operation. Since \(Z_{2}\) is held constant, (21) is a simple least squares problem; as mentioned before, it has an analytic solution. This concludes the update for \(D_{2}\).
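A hedged sketch of this majorized \(D_2\) update, combining (20) and (21), is given below; tanh is assumed for \(\varphi\), and the clipping only keeps arctanh finite. Names and shapes are ours.

```python
import numpy as np

def update_D2(X, D1, D2_prev, Z2, act=np.tanh, act_inv=np.arctanh):
    """One majorization-minimization update of D2, Eqs. (20)-(21)."""
    a = np.linalg.eigvalsh(D1.T @ D1).max()       # majorization constant
    U_k = act(D2_prev @ Z2)                       # current phi(D2 Z2)
    B1 = U_k + (D1.T @ (X - D1 @ U_k)) / a        # Landweber-type step, Eq. (20)
    B1 = np.clip(B1, -0.999, 0.999)               # keep arctanh finite (assumption)
    return act_inv(B1) @ np.linalg.pinv(Z2)       # least squares: phi^{-1}(B1) ~ D2 Z2
```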

The same technique is continued for the deeper layers. For example, solving for \(D_{3}\) requires expressing \(Z_3 =\varphi (D_4 \ldots \varphi (D_N Z))\).

Expanding \(Z_{2}\) in \(G_k^{\prime } (D_2 )\) leads to,

$$\begin{aligned} \left\| {\varphi ^{-1}(B_1 )-D_2 \varphi (D_3 \ldots \varphi (D_N Z))} \right\| _F^2 \end{aligned}$$
(22)

Now, substituting \(Z_3 =\varphi (D_4 \ldots \varphi (D_N Z))\) in (22) leads to,

$$\begin{aligned} \mathop {\min }\limits _{D_3 } \left\| {\varphi ^{-1}(B_1 )-D_2 \varphi (D_3 Z_3 )} \right\| _F^2 \end{aligned}$$
(23)

Note that problem (23) has exactly the same form as (18). Majorization Minimization of (23) leads to

$$\begin{aligned} \mathop {\min }\limits _{D_3 } \left\| {B_2 -\varphi \left( {D_3 Z_3 } \right) } \right\| _F^2 \end{aligned}$$
(24)

where \(B_2 =\varphi \left( {D_3 Z_3 } \right) _k +\frac{1}{a^{\prime }}D_2^T (\varphi ^{-1}(B_1 )-D_2 \varphi \left( {D_3 Z_3 } \right) _k )\); \(a'\) being the maximum eigenvalue of \(D_2^T D_2 \).

As before, solving \(D_{3}\) from the equivalent expression \(\mathop {\min }\limits _{D_3 } \left\| {\varphi ^{-1}(B_2 )-D_3 Z_3 } \right\| _F^2 \) is straightforward.

We continue in this manner until the penultimate layer; after solving for \(D_{N-1}\) we are left with the final level of dictionary \(D_{N}\) and the coefficients Z. Majorization Minimization leads to an expression similar to (24); we have

$$\begin{aligned} \mathop {\min }\limits _{D_N ,Z} \left\| {B_{N-1} -\varphi \left( {D_N Z} \right) } \right\| _F^2 \end{aligned}$$
(25)

Unlike the other layers, we can solve for both the dictionary and the coefficients of the final layer by simple alternating least squares (ALS)/MOD of the following equivalent form.

$$\begin{aligned} \mathop {\min }\limits _{D_N ,Z} \left\| {\varphi ^{-1}(B_{N-1} )-D_N Z} \right\| _F^2 \end{aligned}$$
(26)

The ALS/MOD algorithm is succinctly shown below.

figure a
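As a rough sketch (assuming a tanh activation and reusing the hypothetical mod_dictionary_learning routine from Sect. 2.2), the final-level update of Eq. (26) can be written as:

```python
import numpy as np

def final_level_als(B_prev, n_atoms, act_inv=np.arctanh):
    """ALS/MOD solution of Eq. (26): phi^{-1}(B_{N-1}) ~ D_N Z (tanh assumed)."""
    T = act_inv(np.clip(B_prev, -0.999, 0.999))   # elementwise inverse activation
    return mod_dictionary_learning(T, n_atoms)    # alternating least squares (MOD)
```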

Note that our method requires no tuning parameters; once the number of dictionary levels and the number of atoms in each level are fixed by the user, there is nothing else to tune.

Our derivation results in a nested algorithm: for one update of \(D_{1}\), the update for \(D_{2}\) runs in a loop; similarly, for one update of \(D_{2}\), the update for \(D_{3}\) runs in a loop, and so on. Succinctly, the algorithm can be expressed as follows:

figure b
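The following is a condensed, illustrative NumPy sketch of the whole procedure for a three-level network (N = 3), with the loop counts of the practical setting described below (a few inner ALS/MOD passes, intermediate updates run once per outer iteration) and with column normalization after every update as discussed below. Initialization here is random for brevity, whereas we actually use the deterministic SVD initialization described next; all names and defaults are ours.

```python
import numpy as np

def _normalise(D):
    """Normalise dictionary columns to unit l2 norm (prevents degenerate scaling)."""
    return D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)

def deep_ddl_mm(X, atoms=(256, 128, 64), outer_iter=100, inner_iter=10,
                act=np.tanh,
                act_inv=lambda t: np.arctanh(np.clip(t, -0.999, 0.999))):
    """Joint (non-greedy) DDL via majorization-minimization, three levels (N = 3).

    Each outer iteration updates D1 in closed form (Eq. 16), D2 through its
    majorized surrogate (Eqs. 20-21), and (D3, Z) by ALS/MOD on the innermost
    surrogate (Eqs. 25-26).
    """
    rng = np.random.default_rng(0)
    sizes = [X.shape[0]] + list(atoms)
    D = [_normalise(rng.standard_normal((sizes[i], sizes[i + 1]))) for i in range(3)]
    Z = rng.standard_normal((atoms[-1], X.shape[1]))

    for _ in range(outer_iter):
        # ---- D1: plain least squares, Eq. (16) ----
        Z1 = act(D[1] @ act(D[2] @ Z))
        D[0] = _normalise(X @ np.linalg.pinv(Z1))

        # ---- D2: one majorized update, Eqs. (20)-(21) ----
        a = np.linalg.eigvalsh(D[0].T @ D[0]).max()
        Z2 = act(D[2] @ Z)
        U = act(D[1] @ Z2)
        B1 = U + (D[0].T @ (X - D[0] @ U)) / a
        D[1] = _normalise(act_inv(B1) @ np.linalg.pinv(Z2))

        # ---- (D3, Z): majorize once more, then a few ALS/MOD passes, Eqs. (25)-(26) ----
        a2 = np.linalg.eigvalsh(D[1].T @ D[1]).max()
        V = act(D[2] @ Z)
        B2 = V + (D[1].T @ (act_inv(B1) - D[1] @ V)) / a2
        T = act_inv(B2)
        for _ in range(inner_iter):
            Z = np.linalg.pinv(D[2]) @ T
            D[2] = _normalise(T @ np.linalg.pinv(Z))
        Z = np.linalg.pinv(D[2]) @ T
    return D, Z
```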

To prevent degenerate solutions in which the entries of some dictionaries grow very large while others shrink, the columns of all the dictionaries are normalized after every update.

The initialization is done deterministically. First the SVD of X is computed \(( X=USV^{T})\) and \(D_{1}\) is initialized with the top left singular vectors of X. For \(D_{2}\), the SVD of \({SV}^{T}\) is computed and the corresponding top left singular vectors are used to initialize \(D_{2}\). The remaining dictionaries are initialized in the same fashion. At the last level, the coefficient matrix Z is initialized by the product of the singular values and the right singular vectors of the last SVD. Randomized initialization techniques might yield better results, but our deterministic initialization is repeatable and has been shown to yield good results consistently.
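A sketch of this deterministic initialization is given below; truncating \(SV^{T}\) to the level's width at each step is our dimensional reading of the description above, and the function name is ours.

```python
import numpy as np

def svd_init(X, atoms):
    """Deterministic SVD initialisation of the dictionaries and Z.

    Each D_l is taken from the top left singular vectors at its level; the
    (truncated) S V^T feeds the next SVD, and the last S V^T initialises Z.
    """
    dictionaries, M = [], X
    for n_atoms in atoms:
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        dictionaries.append(U[:, :n_atoms])           # top left singular vectors
        M = np.diag(s[:n_atoms]) @ Vt[:n_atoms, :]    # truncated S V^T for the next level
    Z = M                                             # last-level coefficients
    return dictionaries, Z
```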

Ideally, one would run each loop of the proposed algorithm for several iterations. This would be very time consuming; in practice we found it is not required. Only the innermost loop, updating \(D_{N}\) and Z, is run for 5–10 iterations. The loops for levels 2 to N\(-1\) are run only once. The outermost loop is run for a large number of iterations (\(\sim \)100).

The testing phase is unchanged. As discussed before, once the dictionaries are learnt, feature generation during testing is fast: one only needs a few (equal to the number of levels) matrix–vector multiplications.

4 Experimental Evaluation

4.1 Classification

We carried out our experiments on several benchmark datasets. The first is the MNIST dataset, which consists of 28 \(\times \) 28 images of handwritten digits ranging from 0 to 9. The dataset has 60,000 images for training and 10,000 images for testing. No preprocessing has been done on this dataset.

We also tested on the variations of MNIST, which are more challenging primarily because they have fewer training samples (10,000 plus 2000 for validation) and a larger number of test samples (50,000). This benchmark was created specifically to evaluate deep learning algorithms [24]. The variations are listed below.

1. basic (smaller subset of MNIST)
2. basic-rot (smaller subset with random rotations)
3. bg-rand (smaller subset with uniformly distributed noise in background)
4. bg-img (smaller subset with random image background)
5. bg-img-rot (smaller subset with random image background plus rotation)

We have also evaluated on the problem of classifying documents into their corresponding newsgroup topic. We used a version of the 20-newsgroup dataset [25] in which the training and test sets contain documents collected at different times, a setting that is more reflective of a practical application. The training set consists of 11,269 samples and the test set contains 7505 examples. We used the 5000 most frequent words as binary input features, and follow the same protocol as outlined in [26].

Our third dataset is the GTZAN music genre dataset [27, 28]. The dataset contains 10,000 three-second audio clips, equally distributed among 10 musical genres: blues, classical, country, disco, hip-hop, pop, jazz, metal, reggae and rock. Each example is represented by 592 Mel-Phon coefficient (MPC) features; these are a simplified formulation of the Mel-frequency cepstral coefficients (MFCCs) that have been shown to yield better classification performance. Since there is no predefined standard split and the number of examples is smaller, we used 10-fold cross validation (following the procedure in [29]), where each fold consists of 9000 training examples and 1000 test examples.

In this work our goal is to test the representation capability of the different learning tools; therefore the training is fully unsupervised (no class labels are used). We compare against several state-of-the-art unsupervised deep learning tools: the stacked denoising autoencoder (SDAE) [29], the K-sparse autoencoder (KSAE) [30], the contractive autoencoder (CAE) [31] and the deep belief network (DBN) [32]. Since our goal is to show that our proposed optimal learning algorithm improves upon the greedy deep dictionary learning technique proposed before [1,2,3], we compare with it as well. Learned models for the popular datasets used in this work are publicly available. For DDL (both the previous [1] and the proposed version), a three-layer architecture is used in which the number of atoms is halved in each subsequent layer.

The features generated at the deepest level are used to train two non-parametric classifiers, nearest neighbour (NN) (Table 1) and sparse representation based classification (SRC) [33] (Table 2), and one parametric classifier, a support vector machine (SVM) with an RBF kernel (Table 3). The results show that our proposed method yields the best results on average.

Table 1 Comparison on KNN
Table 2 Comparison on SRC
Table 3 Comparison on SVM

The results are as expected. Prior studies [1, 2] already showed that greedy DDL outperforms the SDAE and DBN. We now see that it also improves upon the K-sparse autoencoder and the contractive autoencoder.

Since this is a new (optimal) algorithm for solving the unsupervised DDL problem, we also need to assess its speed. The training and testing times for the large MNIST dataset and the relatively smaller MNIST basic dataset are shown in Tables 4 and 5. All the algorithms are run until convergence on a machine with an Intel Core i5 running at 3 GHz, 8 GB RAM and Windows 10 (64 bit), running Matlab 2014a.

Table 4 Training time in seconds
Table 5 Testing time in seconds

The training time of our proposed algorithm is significantly larger than that of the greedy approach; this is expected. We are still faster, by several orders of magnitude, than the other deep learning tools. In terms of testing time, we are faster than greedy DDL. This is because the greedy technique uses standard dictionary learning tools in each level, which are always regularized by sparsity promoting penalties on the coefficients; during testing, one therefore needs to solve an iterative optimization problem. Our formulation, on the other hand, does not include a sparsity promoting \(l_{1}/l_{0}\)-norm penalty, so each level has an analytic solution (pseudoinverse) and we only need matrix–vector multiplications. Consequently, our testing time is almost the same as that of the other deep learning tools.

4.2 Clustering

The prior study on greedy DDL [1] also addressed the problem of clustering. In this work we show that our proposed optimal algorithm improves the clustering results as well. The experimental results are compared with three studies: the first is [1]; the other two are GraphEncoder [34] and deep subspace clustering [35]. The study [35] is the most recent, and we follow the experimental protocol described therein.

In [35], experiments were carried out on the COIL20 (object recognition) and Extended YaleB (face recognition) datasets. For both datasets, DSIFT (dense scale invariant feature transform) and HOG (histogram of oriented gradients) features were extracted and further reduced by PCA to a dimensionality of 300. Since the ground truths (class labels) for these datasets are available, clustering accuracy was measured in terms of NMI (normalized mutual information), ARI (adjusted Rand index) and F-score. The results are shown in Table 6 (COIL20) and Table 7 (YaleB).

The configuration of the deep network for [35] is taken from the said reference. For the rest (GraphEncoder, greedy DDL and the proposed method), a four-tier architecture of 600–300–150–75 is used, as this yielded the best results; all of them apply K-means clustering to the features from the deepest level.

Note that the GraphEncoder is referred to as a stacked autoencoder (SAE) in [35], where a different configuration is used. We found that the configuration used here performs better than the one in [35]; hence we use the better configuration in this work. Except for the GraphEncoder, all methods (including ours) use the tanh activation function; the GraphEncoder uses sigmoid.

Table 6 Clustering results on COIL20
Table 7 Clustering results on YaleB

In [35], thorough experimentation was carried out with a host of other clustering techniques, and the deep subspace clustering method yielded the best results; so we compare only with [35]. We see that the previous greedy DDL technique does marginally worse than deep subspace clustering. With our proposed optimal algorithm, however, the results improve significantly and we are able to outperform [35] (only in the F-score on the COIL20 dataset do we do slightly worse, in the second decimal place).

5 Conclusion

A new deep learning tool called DDL has recently been proposed. The idea is to represent the training data as a non-linear combination of several layers of dictionaries. All prior studies were only able to solve the ensuing problem in a greedy fashion; this is sub-optimal, as there is no flow of information from the deeper to the shallower layers. This is the first work that proposes an optimal solution to the deep dictionary learning problem, in which all the levels of dictionaries are solved simultaneously as a single optimization problem. We invoke the Majorization Minimization framework to solve the said problem. The resulting algorithm requires no tuning parameters.

Experiments have been carried out on both clustering and classification problems, all on benchmark datasets. In all of them, our method performs best. The only downside of our algorithm is that it is slower to train than the existing greedy DDL technique. Nevertheless, we are still several orders of magnitude faster than other unsupervised deep learning tools.