1 Introduction

Meta-learning has been widely used in various fields [1]. In particular, model-agnostic meta-learning can be combined with unsupervised learning, few-shot learning, and reinforcement learning [2]. These learning systems are trained and tested on tasks, and they pursue the objective of meta-learning, which is to minimize the generalization error [3,4,5]. The goal of meta-learning is to learn a function through a set of learning algorithms, as in model-agnostic meta-learning, which has been widely used recently [6]. Maximum likelihood estimation finds the maximum of the log-likelihood function, which forms an unconstrained optimization problem.

In this paper, we also train on tasks, and we mainly focus on the maximum likelihood estimation of our model [3,4,5]. We use it to update the parameters so that the objective function reaches its global optimal solution, which can greatly reduce the training time of the model while the model achieves a better training effect within the allowable range. We adopt a residual network as our embedding model [7, 8].

The goal of the present study is to achieve stability of the algorithm, minimize the training error during training, and at the same time achieve good generalization ability at test time. For the parameters of logistic regression, we form an unconstrained convex optimization problem, unlike SVM, which solves a constrained convex optimization problem [4]. We can use the iteratively reweighted least squares (IRLS) method to obtain the solution of the model [5].

2 Proposed Method

2.1 Problem Formulation

We conducted experiments on two data sets, CIFAR-FS and FC100, with three K-way N-shot settings (5-way 5-shot, 5-way 1-shot, and 5-way 2-shot) for classification. Our method is divided into two stages. The first is the base learner stage, which learns how to calculate the value of \({w}^{i}\) through differentiable logistic regression. As shown in Fig. 9.1, \({w}^{i}\) are the weights of the linear classifier. The second is the meta-learning stage, which improves the learning ability by back-propagating the error.

Fig. 9.1 A general overview of our method

We mainly use gradient-based meta-learning methods for few-shot learning, which adapt to new tasks by gradient descent [9, 10]. Meta-learning enables a few steps of gradient descent to reach good parameters in parameter space. In logistic regression, maximum likelihood estimation can be transformed into an unconstrained minimization problem [11]. Meanwhile, each update of logistic regression can be written in closed form, as in ridge regression [5]. Our method requires a large amount of computation, so a GPU is needed to calculate the gradient and the solution of the model. Fig. 9.1 gives an overview of our method; it illustrates 1-way 3-shot classification tasks, and we adopt logistic regression as our classifier. The embedding features of the training samples are learned and the corresponding weights are obtained; the test examples are embedded in the same way. In the few-shot setting, a task is a tuple of a training set and a test set. Finally, the errors are minimized by the meta-learner.

We revisit previous work on the meta-learning framework, explore convex base learners again, and propose logistic regression as the base learner [12]. We compare it with other convex base learners, such as linear SVM and ridge regression.

Previous meta-learning algorithms have two components, namely the base learner and the meta-learner [12]. Meta-learning is learning to learn, and it is a good way to improve learning skills [13]. The goal of meta-learning is to make the base learning algorithm adapt well to new episodes.

Given a data set \(S=\{{x}_{i},{y}_{i}{\}}_{i=1}^{n}\), we split it into a meta-training set and a meta-test set. Each of them in turn contains a training set and a test set, which we call the support set and the query set. The support set is used for training and the query set is used for testing, so that together they form a task. In this paper, a group of tasks is used as the meta-training set \(I={\{({D}_{i}^{train},{D}_{i}^{test})\}}_{i=1}^{I},{D}_{i}^{train}\cap {D}_{i}^{test}=\varnothing\). The embedding model is parameterized by \(\varnothing\) and is learned mainly from the support sets of the meta-training set. The meta-test set consists of J tasks, \(\mathrm{J}={\{({D}_{j}^{train},{D}_{j}^{test})\}}_{j=1}^{J}\). Fig. 9.2 explains the partition of the data set. The data set is composed of two parts, the training set and the test set, and each of them includes a support set and a query set.

Fig. 9.2 The partition of data set

In this paper, the base learner estimates the parameter \(\theta\) of \(f(x;\theta )\). Here we use universal function approximation [14], \(y=f(x;\theta )\), and the base learner \(\mathcal{B}\) is used to achieve better generalization ability. We write it as:

$$\theta =\mathcal{B}\left({D}^{train};\varnothing \right)=\mathit{arg}\underset{\theta }{\mathit{min}}{\mathcal{L}}^{base}\left({D}^{train};\theta ,\varnothing \right)+R(\theta )$$
(9.1)

where \({\mathcal{L}}^{base}\) is the loss function computed by the base learner, such as the negative log-likelihood function, and \(R(\theta )\) is a regularization term, which plays an important role in generalization [15]. As with most meta-learning methods, we organize training into episodes, so each episode can be regarded as a few-shot classification problem. Few-shot classification usually adopts the K-way N-shot setting [16], so we need to consider the values of K and N. Generally, \(N\in \{1,\dots ,n\}\). As described above, a task (or episode) is \({\daleth }_{i}=({D}_{i}^{train},{D}_{i}^{test})\), with \({D}_{i}^{train}\cap {D}_{i}^{test}=\varnothing\), and \({D}_{i}^{val}\) is also disjoint from both.
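The construction of one K-way N-shot episode can be sketched as follows. This is a minimal illustration rather than the sampling code used in the experiments; the function name `sample_episode`, the query size `n_query`, and the toy label array are assumptions introduced only for this example.

```python
import numpy as np

def sample_episode(labels, k_way, n_shot, n_query, rng):
    """Return (support_idx, query_idx) for one K-way N-shot episode.

    Hypothetical helper: picks k_way classes, then n_shot support samples
    and n_query query samples per class, with no overlap between the two.
    """
    classes = rng.choice(np.unique(labels), size=k_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:n_shot])                  # D^train of the episode
        query.extend(idx[n_shot:n_shot + n_query])    # D^test of the episode
    return np.array(support), np.array(query)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(20), 30)   # toy data: 20 classes, 30 samples each
support_idx, query_idx = sample_episode(labels, k_way=5, n_shot=1, n_query=6, rng=rng)
```

Because the support and query indices are taken from disjoint slices of the same permutation, the condition that the training and test sets of an episode do not intersect holds by construction.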

2.2 Efficient Logistic Regression Convex Optimization

The base learner is based on logistic regression, which leads to an unconstrained optimization problem. Therefore, we need to discuss the first-order and second-order optimality conditions [17, 18]. We first give the unconstrained optimization problem:

$$\theta =\mathcal{B}\left({D}^{train};\varnothing \right)=\mathit{arg}min-\sum_{i=1}^{N}lnp({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M})+\frac{\lambda }{2}{w}^{T}w$$
(9.2)

where \(\lambda\) is the regularization coefficient, \({D}^{train}=\{({x}_{n},{y}_{n})\}\), \({Y}_{i}\) are the labels of the data set, and \(\theta =\{{w}_{k}{\}}_{k=1}^{K}\). Because our objective function is convex and continuously differentiable, practical optimality conditions can be obtained from the properties of continuously differentiable functions.

Theorem 9.1

(First-order necessary condition) If f is continuously differentiable and \({x}^{*}\) is a local optimal solution of the unconstrained optimization problem [19], then \(\nabla f\left({x}^{*}\right)=0\).

Theorem 9.2

(Second-order sufficient condition) Suppose that f(x) is twice continuously differentiable in a neighborhood of the point \({x}^{*}\). If

$$\nabla f\left({x}^{*}\right)=0 \;and \; {\nabla }^{2}f\left({x}^{*}\right)>0$$
(9.3)

where \({\nabla }^{2}f\left({x}^{*}\right)>0\) means that the Hessian matrix is positive definite, then \({x}^{*}\) is a strict local optimal solution of f(x).
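As a small illustration of Theorems 9.1 and 9.2, the two conditions can be checked numerically at a candidate point. The quadratic function below (matrix `Q`, vector `b`) is a toy example chosen for this sketch and is not taken from the experiments.

```python
import numpy as np

# Toy strictly convex quadratic f(x) = 0.5 x^T Q x - b^T x (illustrative only).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return Q @ x - b          # gradient of f

def hessian(x):
    return Q                  # Hessian of f is constant for a quadratic

x_star = np.linalg.solve(Q, b)   # candidate point where the gradient vanishes

# First-order condition: the gradient is zero at x_star.
first_order_ok = np.allclose(grad(x_star), 0.0)
# Second-order condition: all Hessian eigenvalues are positive.
second_order_ok = bool(np.all(np.linalg.eigvalsh(hessian(x_star)) > 0))
```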

Now we consider the multi-class logistic regression problem. The data have a total of M classes, and each sample \({x}_{i}\) corresponds to an M-dimensional one-hot label vector \({y}_{i}=[{y}_{i1},\dots ,{y}_{iM}{]}^{T}\). Each element of \({y}_{i}\) is 0 or 1: if \({x}_{i}\) belongs to the m-th class, then \({y}_{im}=1\), and all other elements are 0. The multinomial logistic regression model uses the following softmax function as the conditional probability that a sample x belongs to the m-th class [20].

$$p\left({y}_{m}=1|x\right)=\frac{exp({w}_{m}^{T}x)}{\sum_{j=1}^{M}exp({w}_{j}^{T}x)}$$
(9.4)

where \({w}_{1},\dots ,{w}_{M}\) are the parameters of our model.

Taking the M-th class as the reference class, we use the following distribution:

$$p\left({y}_{m}=1|x\right)=\sigma \left({w}_{m}^{T}x\right)=\frac{exp({w}_{m}^{T}x)}{1+\sum_{j=1}^{M-1}exp({w}_{j}^{T}x)},m=1,\dots ,M-1$$
(9.5)
$$p\left({y}_{M}=1|x\right)=1-\sum_{m=1}^{M-1}\sigma \left({w}_{m}^{T}x\right)=\frac{1}{1+\sum_{j=1}^{M-1}exp({w}_{j}^{T}x)}$$
(9.6)

The likelihood function of a single sample is:

$$p\left({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M}\right)=\prod_{m=1}^{M}p({y}_{im}=1|{x}_{i}{)}^{{y}_{im}}$$
(9.7)

Therefore, the likelihood function for the training set of N samples is:

$$p\left(Y|X,{w}_{1},\dots ,{w}_{M}\right)=\prod_{i=1}^{N}p\left({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M}\right)=\prod_{i=1}^{N}\prod_{m=1}^{M}p({y}_{im}=1|{x}_{i}{)}^{{y}_{im}}$$
(9.8)

And we can get the log-likelihood function:

$$lnp\left(Y|X,{w}_{1},\dots ,{w}_{M}\right)=\sum_{i=1}^{N}lnp\left({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M}\right)=\sum_{i=1}^{N}\sum_{m=1}^{M}{y}_{im}lnp({y}_{im}=1|{x}_{i})$$
(9.9)
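The regularized negative log-likelihood in (9.2) can be evaluated from the softmax probabilities (9.4) and the log-likelihood (9.9), as in the following sketch. The function names and the toy data are assumptions for illustration; the samples are stored as columns of `X` and the one-hot labels as columns of `Y`.

```python
import numpy as np

def softmax_columns(scores):
    """Softmax over classes (rows) for each sample (column), as in Eq. (9.4)."""
    scores = scores - scores.max(axis=0, keepdims=True)   # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

def regularized_nll(W, X, Y, lam):
    """Negative log-likelihood plus (lam / 2) * ||W||^2, as in Eq. (9.2).

    W: (d, M) class weights, X: (d, N) samples, Y: (M, N) one-hot labels.
    """
    P = softmax_columns(W.T @ X)        # (M, N) class probabilities
    log_lik = np.sum(Y * np.log(P))     # Eq. (9.9)
    return -log_lik + 0.5 * lam * np.sum(W * W)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))             # toy data: d = 3, N = 4
Y = np.eye(3)[:, [0, 1, 2, 0]]          # one-hot labels for M = 3 classes
loss_at_zero = regularized_nll(np.zeros((3, 3)), X, Y, lam=0.1)
```

With all weights equal to zero every class has probability 1/M, so the loss reduces to N ln M, which gives a quick sanity check of the implementation.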

Newton’s Method and Solving Unconstrained Optimization Problems

Newton’s method is a descent method. The difference between Newton’s method and the gradient descent method lies in the choice of descent direction [21, 22]. Consider the unconstrained optimization problem:

$$\mathit{min}f(x)$$
(9.10)

Assuming that f is convex and twice differentiable (with an open domain), the second-order Taylor approximation of f near x is:

$$\widehat{f}\left(x+v\right)=f\left(x\right)+g(x{)}^{T}v+\frac{1}{2}{v}^{T}H(x)v$$
(9.11)

where \(g\left(x\right)=\nabla f\left(x\right)\) is the gradient and \(H\left(x\right)={\nabla }^{2}f\left(x\right)\) is the Hessian matrix. Note that the above is only a quadratic approximation, not a complete Taylor expansion.

If x is regarded as a constant, the above expression is a quadratic function of v. Minimizing it with respect to v by setting its gradient to zero gives:

$$g+Hv=0\to v=-{H}^{-1}g$$
(9.12)

This v is the Newton step, denoted \(\Delta {x}_{nt}\). Since H is positive definite, its inverse is also positive definite, so for \(g\ne 0\)

$${g}^{T}\Delta {x}_{nt}=-{g}^{T}{H}^{-1}g<0$$
(9.13)

Hence, unless g = 0, \(\Delta {x}_{nt}\) is a descent direction. When f is a quadratic function, \(x+\Delta {x}_{nt}\) is its minimum point; when f is close to quadratic, \(x+\Delta {x}_{nt}\) is a good estimate of its minimum point [23]. Since f is twice differentiable, the quadratic approximation is very accurate near the minimum, so \(x+\Delta {x}_{nt}\) is a good estimate of the minimum point there [24]. The steps of Newton’s method are similar to those of gradient descent, except that the descent direction is \(\Delta {x}_{nt}=-{H}^{-1}g\).
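The Newton iteration can be sketched in a few lines. The version below takes full Newton steps without a line search, which is a simplification assumed for this example, and the test function \(f(x)=\sum_{j}(exp({x}_{j})-{x}_{j})\) is a toy convex function chosen only for illustration.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a convex function with pure Newton steps (no line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:             # first-order condition reached
            break
        step = -np.linalg.solve(hess(x), g)     # Newton step v = -H^{-1} g
        x = x + step
    return x

# Toy convex function f(x) = sum(exp(x_j) - x_j), minimized at x = 0.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
x_min = newton_minimize(grad, hess, x0=[1.0, -0.5])
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse Hessian explicitly, which is usually the more stable choice.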

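We now return to the objective function (9.2) and check whether its Hessian matrix is positive definite. For the binary case with labels \({y}_{i}\in \{-1,+1\}\), the negative log-likelihood of a sample is \(\mathrm{ln}[1+\mathrm{exp}(-{y}_{i}{w}^{T}{x}_{i})]\). We first calculate the gradient:

$$g=\lambda w+\sum_{i=1}^{N}\frac{-{y}_{i}{x}_{i}exp(-{y}_{i}{w}^{T}{x}_{i})}{1+exp(-{y}_{i}{w}^{T}{x}_{i})}=\lambda w+\sum_{i=1}^{N}-{y}_{i}{x}_{i}[1-\sigma ({y}_{i}{w}^{T}{x}_{i})]$$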
(9.14)
$${g}_{k}=\lambda {w}_{k}+\sum_{i=1}^{N}-{y}_{i}{x}_{ik}[1-\sigma ({y}_{i}{w}^{T}{x}_{i})]$$
(9.15)

where \({w}_{l}\) is the l-th element of w, \({x}_{ik}\) is the k-th element of sample \({x}_{i}\), and \(\sigma ({y}_{i}{w}^{T}{x}_{i})\) is the sigmoid function. To calculate the Hessian matrix, we need:

$$\frac{\partial \sigma ({y}_{i}{w}^{T}{x}_{i})}{\partial {w}_{l}}=\frac{\mathit{exp}\left(-{y}_{i}{w}^{T}{x}_{i}\right)}{{\left[1+\mathit{exp}\left(-{y}_{i}{w}^{T}{x}_{i}\right)\right]}^{2}}\left({y}_{i}{x}_{il}\right)=\sigma \left({y}_{i}{w}^{T}{x}_{i}\right)[1-\sigma \left({y}_{i}{w}^{T}{x}_{i}\right)]\left({y}_{i}{x}_{il}\right)$$
(9.16)

We now calculate the element in row k and column l of the Hessian matrix, k, l = 0, 1, …, K. When \(k\ne l\),

$$\begin{aligned}{H}_{kl}=\frac{\partial {g}_{k}}{\partial {w}_{l}} & =\sum_{i=1}^{N}{y}_{i}{x}_{ik}\frac{\partial \sigma \left({y}_{i}{w}^{T}{x}_{i}\right)}{\partial {w}_{l}} \\ & =\sum_{i=1}^{N}\sigma \left({y}_{i}{w}^{T}{x}_{i}\right)[1-\sigma \left({y}_{i}{w}^{T}{x}_{i}\right)]({y}_{i}{x}_{il})({y}_{i}{x}_{ik})\\&=\sum_{i=1}^{N}\sigma \left({w}^{T}{x}_{i}\right)[1-\sigma \left({w}^{T}{x}_{i}\right)]{x}_{il}{x}_{ik} \end{aligned}$$
(9.17)

When \(k=l\),

$${H}_{kl}=\frac{\partial {g}_{k}}{\partial {w}_{l}}=\lambda +\sum_{i=1}^{N}\sigma \left({w}^{T}{x}_{i}\right)[1-\sigma \left({w}^{T}{x}_{i}\right)]{x}_{il}{x}_{ik}$$
(9.18)

Denoting the matrix \(X=\left[{x}_{1},{x}_{2},\dots ,{x}_{N}\right]\) and \({A}_{ii}=\sigma \left({w}^{T}{x}_{i}\right)\left[1-\sigma \left({w}^{T}{x}_{i}\right)\right],\) the Hessian matrix of (9.2) is

$$H=\lambda I+\sum_{i=1}^{N}\sigma \left({{y}_{i}w}^{T}{x}_{i}\right)\left[1-\sigma \left({{y}_{i}w}^{T}{x}_{i}\right)\right]{x}_{i}{x}_{i}^{T}=\lambda I+\sum_{i=1}^{N}{{A}_{ii}x}_{i}{x}_{i}^{T}=\lambda I+XA{X}^{T}$$
(9.19)

where A is a diagonal matrix of order N whose i-th diagonal element is \({A}_{ii}\), with \({A}_{ii}>0\).

Because \({u}^{T}Hu=\lambda {u}^{T}u+({X}^{T}u{)}^{T}A\left({X}^{T}u\right)>0,\forall u\ne 0,\) H is positive definite. Hence function (9.2) is convex, and the problem \(min\sum_{i=1}^{N}\mathrm{ln}[1+\mathrm{exp}(-{y}_{i}{w}^{T}{x}_{i})]+\frac{\lambda }{2}{w}^{T}w\) is an unconstrained convex optimization problem.
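The gradient (9.14) and the Hessian (9.19) can be checked numerically. The sketch below assumes the binary setting with labels in {-1, +1} and samples stored as columns of `X`; the function names and the random toy data are introduced only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, lam):
    """sum_i ln(1 + exp(-y_i w^T x_i)) + (lam / 2) w^T w."""
    margins = y * (X.T @ w)
    return np.sum(np.logaddexp(0.0, -margins)) + 0.5 * lam * w @ w

def gradient(w, X, y, lam):
    """Eq. (9.14): lam * w - sum_i y_i x_i [1 - sigma(y_i w^T x_i)]."""
    margins = y * (X.T @ w)
    return lam * w - X @ (y * (1.0 - sigmoid(margins)))

def hessian(w, X, lam):
    """Eq. (9.19): lam * I + X A X^T with A_ii = sigma_i (1 - sigma_i)."""
    s = sigmoid(X.T @ w)
    A = s * (1.0 - s)                                   # diagonal of A
    return lam * np.eye(X.shape[0]) + (X * A) @ X.T     # scales column i by A_ii

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))              # toy data: d = 3, N = 20
y = rng.choice([-1.0, 1.0], size=20)
w = rng.normal(size=3)
H = hessian(w, X, lam=0.1)
min_eig = np.linalg.eigvalsh(H).min()     # smallest eigenvalue of H
```

Since \(XA{X}^{T}\) is positive semidefinite, the smallest eigenvalue of H is at least \(\lambda\), which is what makes the regularized problem strictly convex.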

2.3 Approach to the Objective of Meta-learning

Before solving an unconstrained optimization problem [25], we must determine whether it is a convex optimization problem. Convexity is determined by the Hessian matrix of the objective function: the objective built from \({\mathcal{L}}^{base}\) is convex when its Hessian matrix \(H=\frac{{\partial }^{2}f(w)}{\partial w{\partial w}^{T}}\) is positive definite, where f(w) denotes the objective function.

$$\begin{aligned}\theta & =\mathcal{B}\left({D}^{train};\varnothing \right)=\mathit{arg}\underset{\theta }{\mathit{min}}{\mathcal{L}}^{base}\left({D}^{train};\theta ,\varnothing \right)+R\left(\theta \right) \\ & =\mathit{arg}min-\sum_{i=1}^{N}lnp({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M})+\frac{\lambda }{2}{w}^{T}w \end{aligned}$$
(9.20)

We can confirm that the Hessian matrix of our objective function satisfies the condition of Theorem 9.2.

To obtain the solution, we use an iterative method in which each update has a closed form. Here we adopt the iteratively reweighted least squares (IRLS) method to solve the problem, with the following iteration [26]:

$${w}^{i}={w}^{i-1}-{H}^{-1}g$$
(9.21)

Here \(H\) is the Hessian matrix of the objective function and g is its gradient. The Newton step, which involves the Hessian matrix, is obtained from the second-order Taylor approximation of the objective function. At the i-th iteration, the Hessian matrix and the gradient are

$${H}_{i}=\lambda I+XA{X}^{T},{g}_{i}=\lambda w-XAt$$
(9.22)

where \({t}_{i}=\frac{{y}_{i}[1-\sigma \left({{y}_{i}w}^{T}{x}_{i}\right)]}{{A}_{i}}\), A is the diagonal matrix with entries \(A=\sigma \left({w}^{T}X\right)\left[1-\sigma \left({w}^{T}X\right)\right]\) (applied element-wise), \(\sigma\) is the sigmoid function, and \({g}_{i}\) is the gradient. Substituting (9.22) into (9.21), we obtain:

$${w}^{i}=(XA{X}^{T}+\lambda I{)}^{-1}XAz$$
(9.23)

where

$$z=\left({X}^{T}{w}^{i-1}+t\right)$$
(9.24)
$${z}_{i}={x}_{i}^{T}{w}^{i-1}+{t}_{i}={x}_{i}^{T}{w}^{i-1}+\frac{{y}_{i}[1-\sigma \left({{y}_{i}w}^{T}{x}_{i}\right)]}{{A}_{i}}$$
(9.25)
$${A}_{i}=\sigma \left({w}^{T}{x}_{i}\right)\left[1-\sigma \left({w}^{T}{x}_{i}\right)\right]$$
(9.26)
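The update (9.23) to (9.26) can be sketched as below for the binary case with labels in {-1, +1}. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the zero initialization, the fixed number of iterations, and the toy data are all choices made for this example, and samples are stored as columns of `X`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(w, X, y, lam):
    """One IRLS update, Eq. (9.23); X is (d, N), y has entries in {-1, +1}."""
    scores = X.T @ w                                # x_i^T w for every sample
    s = sigmoid(scores)
    A = s * (1.0 - s)                               # Eq. (9.26), diagonal of A
    t = y * (1.0 - sigmoid(y * scores)) / A         # t_i as defined with Eq. (9.22)
    z = scores + t                                  # Eq. (9.24)
    H = (X * A) @ X.T + lam * np.eye(X.shape[0])    # X A X^T + lam * I
    return np.linalg.solve(H, X @ (A * z))          # weighted least-squares solve

def irls(X, y, lam, n_iter=10):
    """Run a fixed number of IRLS updates starting from w = 0."""
    w = np.zeros(X.shape[0])
    for _ in range(n_iter):
        w = irls_step(w, X, y, lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 40))              # toy data: d = 3, N = 40
y = rng.choice([-1.0, 1.0], size=40)
w_hat = irls(X, y, lam=1.0)
```

Each update is a ridge-regression-like linear solve, which is why the base learner can be written with closed-form steps even though the overall solution is reached iteratively. In practice the weights \({A}_{i}\) may need a small lower bound to avoid division by values close to zero; that safeguard is omitted here.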

The term \(-\sum_{i=1}^{N}lnp({Y}_{i}|{X}_{i},{w}_{1},\dots ,{w}_{M})\) is also called the cross-entropy error function of multi-class logistic regression [27].

Although there are many options for measuring the loss, here we use the negative log-likelihood function, as in the prototypical network paper [28, 29]. The negative log-likelihood function measures the performance on the meta-test samples, and we consider it an effective choice.

$${L}^{meta}\left({D}^{test};\theta ,\varnothing ,\alpha \right)=\sum_{(x,y)\in {D}^{test}}[-\alpha {w}^{y}{f}_{\varnothing }\left(x\right)+log\sum_{k}exp(\alpha {w}^{k}{f}_{\varnothing }(x))]$$
(9.27)

where \(\theta =\mathcal{B}\left({D}^{train};\varnothing \right)=\{{w}^{k}{\}}_{k=1}^{K}\), \({w}^{y}\) is the weight vector of the true class y, and \(\alpha\) is a scale parameter that is learned during training.
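The meta-loss (9.27) can be computed from the class weights and the embedded query samples as sketched below. The sketch reads the first term of (9.27) as the scaled score of the true class; the function name `meta_loss`, the array layout (class weights as rows of `W`), and the toy inputs are assumptions made for this example.

```python
import numpy as np

def meta_loss(W, features, labels, alpha):
    """Negative log-likelihood of the query samples, following Eq. (9.27).

    W: (K, d) class weight vectors, features: (n, d) embeddings f(x),
    labels: (n,) integer class indices, alpha: scale parameter.
    """
    logits = alpha * features @ W.T                        # (n, K) scaled scores
    m = logits.max(axis=1, keepdims=True)                  # for numerical stability
    log_norm = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    true_logit = logits[np.arange(len(labels)), labels]    # score of the true class
    return np.sum(-true_logit + log_norm)

W = np.zeros((5, 4))               # toy weights for K = 5 classes, d = 4
features = np.ones((3, 4))         # three embedded query samples
labels = np.array([0, 2, 4])
loss = meta_loss(W, features, labels, alpha=1.0)
```

With zero weights all classes are equally likely, so the loss equals n ln K for n query samples.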

3 Results and Discussions

In this paper, we mainly use ResNet and prototypical networks as our embedding model. For the experiments on the CIFAR-FS and FC100 data sets, the network architecture is R64-MP-DB(0.9,1)-R160-MP-DB(0.9,1)-R320-MP-DB(0.9,2)-R640-MP-DB(0.9,2). We initially set the learning rate to 0.1 and change it to 0.006 at epoch 20. These parameter choices are in compliance with the criteria of gradient descent. We referred to the corresponding parameter settings in the work on meta-learning with differentiable convex optimization [2]. To make full use of the available device and memory, we repeatedly ran training with 20 epochs; this was a practical choice because the GPU has to carry out a lot of computation when there are many tasks, which costs a lot of time. Each minibatch consists of 8 episodes and every epoch consists of 1000 episodes. Table 9.1 shows the results of our method and compares it with other base learners.

Table 9.1 Comparison with other algorithms on CIFAR-FS and FC100. Average few-shot classification accuracies (%) with the ResNet-12 backbone. ‘R2D2’ and ‘Ridge’ both stand for ridge regression, in two different forms. ‘LR’ stands for logistic regression

As shown in Table 9.1, LR as our base learner achieves better performance and is more stable on the CIFAR-FS data set. As shown in Figs. 9.3 and 9.4, we compare four base learners under the same K-way N-shot settings (5-way 1-shot, 5-way 2-shot, and 5-way 5-shot) on the CIFAR-FS, FC100, and MiniImagenet data sets; the comparison shows that our method obtains stable results. On the FC100 data set, however, we find that the SVM method is more effective on the test tasks. Thus, although logistic regression does not achieve sufficiently good results on FC100, the results confirm that it is stable for classification. This also reflects the authenticity of the experiments: we did not know beforehand that the accuracy of the LR algorithm on FC100 would be lower than that of the others. We believe that LR meta-learning has better stability than the other three algorithms, so in future work we can explore how the logistic regression meta-learning algorithm can better adapt to all downstream tasks. However, when we apply our method to the MiniImagenet data set, the SVM base learner has the lowest accuracy in Table 9.2, and LR as the base learner reaches 62.48% accuracy with 5-way 5-shot. As shown in Table 9.1, the more samples there are, the higher the accuracy. 5-shot means that five samples are given per class, and 2-shot means that only two samples are given. Therefore, two or five samples give higher accuracy than one sample, and the same is expected for 5-way 10-shot or 5-way 15-shot (Table 9.2).

Fig. 9.3 Comparison for four base learners with the same k-way n-shot on CIFAR-FS data set

Fig. 9.4 Comparison for four base learners with the same k-way n-shot on FC100 data set

Table 9.2 Comparison with other algorithms on the MiniImagenet data set. Average few-shot classification accuracies (%) with the 64-64-64-64 backbone. ‘R2D2’ and ‘Ridge’ both stand for ridge regression, in two different forms. ‘LR’ stands for logistic regression

4 Conclusion

In this paper, we show the performance of logistic regression as the base learner and compare it with other base learners. Our method considers an unconstrained optimization problem whose solution is obtained through an iterative method with closed-form updates. Moreover, experiments have been carried out on three data sets, and the results are reflected in the figures. Finally, we conclude that the logistic regression method runs more stably than the other base learners when there are fewer epochs, as can be seen in Figs. 9.3, 9.4, and 9.5. We only adopt three K-way N-shot settings in the experiments with our convex base learner, and our method performs well on CIFAR-FS. At the running level, we also reduce the running time, because the data sets are large and the process would otherwise be long and complex. Logistic regression is also an effective classifier to use as a base learner on top of embedding features.

Fig. 9.5 Comparison for four base learners with the same 5-way 5-shot on MiniImagenet data set