
1 Introduction

In recent years, the performance of machine learning algorithms has improved rapidly. Many machine learning techniques have been proposed, such as the support vector machine [24], the neural network [4], the convolutional neural network [6], and so on. Since these models can approximate arbitrary non-linear functions, they are effective for classification [11, 13, 20, 21], person recognition [7, 10], object detection [25], and so on.

To approximate arbitrary non-linear functions, almost all deep learning models use non-linear activation functions. The rectified linear unit (ReLU) is the most commonly used non-linear activation function for the hidden layers of deep learning models, while the sigmoid or softmax function is often used as the non-linear activation function in the output layer.

The softplus function is a continuous version of the ReLU function and is obtained by integrating the sigmoid function. The sigmoid and softmax functions are defined using the exponential function and have a close relation to the Gaussian distribution; in other words, the input of the sigmoid or softmax function is implicitly assumed to follow a Gaussian distribution. The exponential linear unit (ELU) [19], the sigmoid-weighted linear unit (SiLU) [9], swish [18], and mish [16] have been proposed as extensions of the ReLU function. Such activation functions are derived from the ReLU function or the sigmoid function.

In machine learning and statistics, most techniques assume a Gaussian distribution for the prior or conditional distribution because the Gaussian distribution is mathematically easy to handle. For example, the exponential family is often assumed in information geometry, which connects various branches of mathematical science dealing with uncertainty and information through unifying geometric concepts. In information geometry, it is well known that the exponential family is flat under the e-connection. The Gaussian distribution is a member of the exponential family.

However, some well-known probability distributions, such as the t-distribution, do not belong to the exponential family. As an extension of information geometry, the q-space has been defined [22]. In the q-space, the q-multiplication, q-division, q-exponential, and q-logarithm are defined with a hyperparameter q as natural extensions of the corresponding operations in the standard space. In the q-space, the q-Gaussian distribution is derived by maximizing the Tsallis entropy under appropriate constraints. The q-Gaussian distribution includes the Gaussian distribution, obtained by setting the hyperparameter to \(q=1\), and the t-distribution, obtained by setting \(q=2.0\). Since the q-Gaussian distribution can be written with a single scalar parameter, we can handle a family of probability distributions as flat in the q-space.

The authors previously proposed to use the q-Gaussian distribution for dimensionality reduction. The t-distributed stochastic neighbor embedding (t-SNE) [15] and the parametric t-SNE [14] were extended by using the q-Gaussian distribution instead of the t-distribution as the probability distribution in the low-dimensional space; the resulting methods are called q-SNE [1] and parametric q-SNE [17].

In this paper, we propose to define activation functions and loss functions by using the q-exponential and q-logarithm of the q-space. In particular, we define the q-softplus function as an extension of the softplus function. This extension introduces a hyperparameter q that controls the shape of the function; for example, we can recover the standard softplus function or a shifted ReLU function by changing the hyperparameter q of the q-softplus function. To make the origin of the proposed q-softplus function coincide with that of the ReLU function, we also define the shifted q-softplus function.

To show the effectiveness of the proposed shifted q-softplus function, we have performed experiments in which the shifted q-softplus function is used as the activation function of a convolutional neural network instead of the standard ReLU function. We have also performed experiments in which the q-softplus function is used in the loss functions of metric learning with Siamese [5, 8, 10] and Triplet [12, 23] networks instead of the max function. Through these experiments, the proposed q-softplus function shows better results on the CIFAR10, CIFAR100, STL10, and Tiny ImageNet datasets.

2 Related Work

2.1 q-Space

Information geometry is an interdisciplinary field that applies the techniques of differential geometry to probability theory and statistics [3]. It studies statistical manifolds, which are Riemannian manifolds whose points correspond to probability distributions. Tanaka [22] extended the information geometry developed for the exponential family to the q-Gaussian distribution.

To do so, the standard multiplication, division, exponential, and logarithm are extended to the q-multiplication, q-division, q-exponential, and q-logarithm in [22]. We can then consider a space in which these q-arithmetic operations are defined; in this paper, we call this space the q-space.

In q-space, the q-multiplication and q-division of two functions f and g are respectively defined as

$$\begin{aligned} f\otimes _{q}g = \left( f^{1-q}+g^{1-q}-1\right) ^{\frac{1}{1-q}}, \end{aligned}$$
(1)

and

$$\begin{aligned} f\oslash _{q}g = \left( f^{1-q}-g^{1-q}+1\right) ^{\frac{1}{1-q}}, \end{aligned}$$
(2)

where q is a hyperparameter.

Similarly the q-exponential and q-logarithm are defined as

$$\begin{aligned} exp_q(x) = \left( 1+\left( 1-q\right) x\right) ^{\frac{1}{1-q}}, \end{aligned}$$
(3)

and

$$\begin{aligned} log_q(x) = \frac{1}{1-q}\left( x^{1-q}-1\right) . \end{aligned}$$
(4)

These q-arithmetic operations converge to the corresponding standard operations when \(q \rightarrow 1\). In the q-space, the q-Gaussian distribution is derived by maximizing the Tsallis entropy under appropriate constraints. The q-Gaussian distribution includes the Gaussian distribution and the t-distribution as special cases. Since the q-Gaussian distribution can be written with a single scalar parameter q, we can handle a family of probability distributions as flat in the q-space.
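For concreteness, a small NumPy sketch of the q-arithmetic of Eqs. (1)-(4) is given below. The function names and test values are ours, not from the paper, and the q-exponential is clamped at zero for negative bases, as in the q-softplus definition of Sect. 3.

```python
import numpy as np

def q_exp(x, q):
    """q-exponential of Eq. (3), clamped at 0 when 1 + (1 - q) x < 0."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_log(x, q):
    """q-logarithm of Eq. (4), for x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_mul(f, g, q):
    """q-multiplication of Eq. (1)."""
    return (f ** (1.0 - q) + g ** (1.0 - q) - 1.0) ** (1.0 / (1.0 - q))

def q_div(f, g, q):
    """q-division of Eq. (2)."""
    return (f ** (1.0 - q) - g ** (1.0 - q) + 1.0) ** (1.0 / (1.0 - q))

# as q -> 1 the q-operations approach their standard counterparts
x = np.linspace(0.5, 2.0, 4)
print(np.allclose(q_exp(x, 0.999), np.exp(x), atol=5e-2))   # True
print(np.allclose(q_log(x, 0.999), np.log(x), atol=1e-3))   # True
print(np.allclose(q_mul(x, x, 0.999), x * x, atol=1e-2))    # True
```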

2.2 Activation Function

In a neural network, a non-linear activation function is used so that the network can approximate non-linear functions. The ReLU function is the most widely used activation function in deep neural networks and is defined as

$$\begin{aligned} ReLU(x) = max(0,x). \end{aligned}$$
(5)

The main reason why the ReLU function is used in deep neural networks is that it helps prevent the vanishing gradient problem. The ReLU function is very simple and works well in deep neural networks. This function is also called the plus function.

The softplus function is a continuous version of the ReLU function and is defined as

$$\begin{aligned} Softplus(x) = \log {(1+\exp {x})}. \end{aligned}$$
(6)

The first derivative of this function is continuous at 0, while that of the ReLU function is not. The softplus function can also be derived as the integral of the sigmoid function.
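As a quick numerical illustration (ours, not from the paper), the derivative of the softplus is the sigmoid function, which can be checked by finite differences:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))        # Eq. (6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 13)
numeric = (softplus(x + 1e-6) - softplus(x - 1e-6)) / 2e-6
print(np.allclose(numeric, sigmoid(x), atol=1e-6))   # True: d/dx softplus(x) = sigmoid(x)
```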

Recently, many activation functions have been proposed for deep neural networks [9, 16, 18, 19]. Almost all of them are defined based on the ReLU function, the sigmoid function, or a combination of the two.

These functions are also used to define loss functions. For example, the max (ReLU) function or the softplus function is used in the contrastive loss and the triplet loss of metric learning.

2.3 Metric Learning

The Siamese network and the Triplet network have been proposed and are often used for metric learning.

The Siamese network consists of two networks with shared weights and learns a metric between the two outputs. During training, two samples are fed to the two networks, and the shared weights are modified so that the two outputs become closer together when the two samples belong to the same class and move farther apart when they belong to different classes.

Let \(\{(\boldsymbol{x}_i, y_i)|i=1\ldots N\}\) be a set of training samples, where \(\boldsymbol{x}_i\) is an image and \(y_i\) is the class label of the i-th sample. The loss function of the Siamese network is defined as

$$\begin{aligned} L_{siamese}&=\frac{1}{2}t_{ij}d_{ij}^2 + \frac{1}{2}(1-t_{ij})max(m-d_{ij}, 0)^2,\end{aligned}$$
(7)
$$\begin{aligned} d_{ij}&=\Vert f(\boldsymbol{x}_i;\theta ) - f(\boldsymbol{x}_j;\theta )\Vert ^2 \end{aligned}$$
(8)

where \(t_{ij}\) is a binary indicator that shows whether the i-th and j-th samples belong to the same class, m is a margin, f is the function computed by the network, and \(\theta \) is the set of shared weights of the network, which is learned by minimizing the loss \(L_{siamese}\). The Siamese loss is called the contrastive loss. Note that the max (ReLU) function is used in this loss; the softplus function can be used instead of the max function.
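A minimal PyTorch sketch of a contrastive loss in the form of Eq. (7) is given below; it is our illustration, not the authors' code, and the Euclidean distance between the two embeddings is used for \(d_{ij}\).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, t, margin=1.0):
    """Eq. (7): z_i, z_j are (batch, dim) embeddings; t is a float tensor, 1 for same-class pairs, else 0."""
    d = torch.norm(z_i - z_j, p=2, dim=1)
    same = 0.5 * t * d ** 2
    diff = 0.5 * (1 - t) * F.relu(margin - d) ** 2    # max(m - d, 0)^2
    return (same + diff).mean()
```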

The Triplet network consists of three networks with shared weights and learns a metric among the three outputs. During training, three samples are fed to the networks. One sample is called the anchor; a sample of the same class as the anchor is called a positive sample, and a sample of a different class is called a negative sample. The networks are trained so that the outputs of the anchor and the positive sample become closer together, while the outputs of the anchor and the negative sample move farther apart.

Let \(x_a\), \(x_p\), and \(x_n\) be the anchor, the positive, and the negative sample respectively. The loss function of the Triplet network is defined as

$$\begin{aligned} L_{triplet}&=max(d_{ap} - d_{an} + m, 0), \end{aligned}$$
(9)

where m is a margin and \(d_{ij}\) is the same distance as in the contrastive loss. Note that the max (ReLU) function is also used in this loss, and the softplus function can be used instead of the max function. Since the max and softplus functions are linear for \(x \gg 0\), they are effective in moving samples farther apart, which is very important for metric learning.
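Similarly, a minimal PyTorch sketch of the triplet loss of Eq. (9) (ours, with the Euclidean distance) is:

```python
import torch
import torch.nn.functional as F

def triplet_loss(z_a, z_p, z_n, margin=1.0):
    """Eq. (9): anchor, positive, and negative embeddings of shape (batch, dim)."""
    d_ap = torch.norm(z_a - z_p, p=2, dim=1)
    d_an = torch.norm(z_a - z_n, p=2, dim=1)
    return F.relu(d_ap - d_an + margin).mean()        # max(d_ap - d_an + m, 0)
```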

Fig. 1.

This figure shows the graphs of the activation functions. (A) shows the max (ReLU) function, the softplus function, and the q-softplus function with different values of the hyperparameter q. When \(q=0.999\) (q close to 1), the q-softplus function overlaps the softplus function. (B) shows the max (ReLU) function and the shifted q-softplus function with different values of the hyperparameter q. When \(q=0.0\), the shifted q-softplus function overlaps the max function.

Fig. 2.

This figure shows the network architecture in which the q-softplus or shifted q-softplus function is used. As an activation function, the shifted q-softplus function replaces the ReLU function. In the triplet loss, the q-softplus function replaces the max function.

3 q-Softplus Function and Shifted q-Softplus Function

The q-space is defined to extend the information geometry developed for the exponential family; by using the q-space, we can work in a naturally extended setting. In this paper, we propose extensions of standard activation functions and loss functions based on the q-space. Since the q-exponential and q-logarithm can express various shapes of a graph depending on the hyperparameter q, we can control the shape of the activation function or the loss function by selecting a better parameter q in the q-space. In particular, we propose the q-softplus function as an extension of the softplus function.

3.1 q-Softplus Function

The q-softplus function is defined as

$$\begin{aligned} qsoftplus(x)&= log_q(1 + exp_q(x))\nonumber \\&= \frac{1}{1-q}\left( \left( 1+max\left( 1+\left( 1-q\right) x,0\right) ^{\frac{1}{1-q}}\right) ^{1-q}-1\right) . \end{aligned}$$
(10)

When \(q\rightarrow 1\), the q-softplus function approaches the original softplus function. Figure 1 (A) shows the shape of the q-softplus function compared with the max (ReLU) function and the softplus function. When \(q=0.999\) (q close to 1), the q-softplus function overlaps the softplus function. Moreover, when \(q=0.0\), the q-softplus function becomes a shifted max function. From Fig. 1 (A), it can be seen that the q-softplus function can represent various shapes including the max (ReLU) function and the softplus function. When \(1+\left( 1-q\right) x>0\), the first derivative with respect to x is

$$\begin{aligned} \frac{dqsoftplus(x)}{dx}&= \left( 1+\left( 1+(1-q)x\right) ^{\frac{1}{1-q}}\right) ^{-q}\left( 1+(1-q)x\right) ^\frac{q}{1-q} \nonumber \\&= \left( 1+exp_q(x)\right) ^{-q}\left( exp_q(x)\right) ^q, \end{aligned}$$
(11)

and it is 0 otherwise. When \(q\rightarrow 1\), Eq. 11 approaches the first derivative of the softplus function.
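As a concrete illustration, the following NumPy sketch (ours, assuming \(q \ne 1\)) implements the q-softplus of Eq. (10), checks its limiting behavior, and verifies the derivative of Eq. (11) by finite differences.

```python
import numpy as np

def q_softplus(x, q):
    """q-softplus of Eq. (10); assumes q != 1 (the limit q -> 1 is the softplus)."""
    exp_q = np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))
    return ((1.0 + exp_q) ** (1.0 - q) - 1.0) / (1.0 - q)

x = np.linspace(-3.0, 3.0, 13)
# q close to 1 recovers the softplus, q = 0 recovers the shifted max function
print(np.allclose(q_softplus(x, 0.999), np.log1p(np.exp(x)), atol=1e-2))  # True
print(np.allclose(q_softplus(x, 0.0), np.maximum(x + 1.0, 0.0)))          # True

# finite-difference check of the derivative in Eq. (11) where 1 + (1 - q) x > 0
q = 0.5
x = np.linspace(0.5, 2.0, 4)
exp_q = (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))
analytic = (1.0 + exp_q) ** (-q) * exp_q ** q
numeric = (q_softplus(x + 1e-6, q) - q_softplus(x - 1e-6, q)) / 2e-6
print(np.allclose(analytic, numeric, atol=1e-4))                          # True
```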

3.2 Shifted q-Softplus Function

The q-softplus function becomes a shifted max function when \(q=0.0\). To make the q-softplus function with \(q=0.0\) identical to the max function, we propose to shift the q-softplus function by introducing a shift term. We call this function the shifted q-softplus function, which is defined as

$$\begin{aligned} sqsoftplus(x)&= log_q(1 + exp_q(x-\frac{1}{1-q}))\nonumber \\&= \frac{1}{1-q}\left( \left( 1+max\left( 1+\left( 1-q\right) (x-\frac{1}{1-q}),0\right) ^{\frac{1}{1-q}}\right) ^{1-q}-1\right) . \end{aligned}$$
(12)

When \(q=0.0\), the shifted q-softplus function becomes identical to the max function. Figure 1 (B) shows the shapes of the shifted q-softplus function. From this figure, it can be seen that the shifted q-softplus function can represent various shapes including the max function.
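The effect of the shift can be checked numerically; the following short NumPy sketch (ours, assuming \(q \ne 1\)) verifies that the shifted q-softplus of Eq. (12) with \(q=0\) coincides with the max (ReLU) function.

```python
import numpy as np

def shifted_q_softplus(x, q):
    """Shifted q-softplus of Eq. (12); assumes q != 1."""
    z = x - 1.0 / (1.0 - q)                                            # shift term
    exp_q = np.maximum(1.0 + (1.0 - q) * z, 0.0) ** (1.0 / (1.0 - q))
    return ((1.0 + exp_q) ** (1.0 - q) - 1.0) / (1.0 - q)

x = np.linspace(-2.0, 2.0, 9)
print(np.allclose(shifted_q_softplus(x, 0.0), np.maximum(x, 0.0)))     # True: recovers max(0, x)
```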

3.3 Loss Function for Metric Learning

In the loss functions of the Siamese network and the Triplet network, the max or softplus function plays an important role in moving samples farther apart because it is linear for \(x \gg 0\). We propose new loss functions, called the q-contrastive loss and the q-triplet loss, by using the q-softplus function. The q-contrastive loss is defined as

$$\begin{aligned} L_{qsiamese}&=\frac{1}{2}t_{ij}d_{ij}^2 + \frac{1}{2}(1-t_{ij})qsoftplus(m-d_{ij})^2. \end{aligned}$$
(13)

Similarly, the q-triplet loss is defined as

$$\begin{aligned} L_{qtriplet}&=qsoftplus(d_{ap} - d_{an} + m). \end{aligned}$$
(14)
Table 1. This table shows the classification accuracy on CIFAR10, CIFAR100, STL10, and Tiny ImageNet. The same hyperparameter q is used for all activation functions in VGG11. The accuracy is given in percent for the training and test samples, respectively.
Table 2. This table shows the test classification accuracy on CIFAR10, CIFAR100, STL10, and Tiny ImageNet when the hyperparameters q are tuned with Optuna. The hyperparameters q of the shifted q-softplus functions found by Optuna are shown in Table 3. The accuracy is given in percent.
Table 3. This table shows the hyperparameter q found for each shifted q-softplus function in VGG11 by using Optuna. VGG11 has 10 shifted q-softplus activation functions; qk denotes the k-th shifted q-softplus function counted from the first layer.

By using the q-softplus function, we can control the effect of moving samples farther apart. Since the first derivative of the q-softplus function is continuous at 0, it can move samples farther apart than the given margin. We can also use the shifted q-softplus function in the loss function; since the shifted q-softplus function has a distorted linear shape, it allows us to control the effect of the loss.
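A minimal PyTorch sketch of the q-contrastive loss of Eq. (13) and the q-triplet loss of Eq. (14) is given below. It is our illustration, not the authors' implementation; the Euclidean distance between embeddings is used for \(d_{ij}\), and the margin and q are placeholder arguments to be tuned.

```python
import torch

def q_softplus(x, q):
    """q-softplus of Eq. (10), element-wise; assumes q != 1."""
    exp_q = torch.clamp(1.0 + (1.0 - q) * x, min=0.0) ** (1.0 / (1.0 - q))
    return ((1.0 + exp_q) ** (1.0 - q) - 1.0) / (1.0 - q)

def q_contrastive_loss(z_i, z_j, t, margin, q):
    """Eq. (13): z_i, z_j are (batch, dim) embeddings; t is a float tensor, 1 for same-class pairs."""
    d = torch.norm(z_i - z_j, p=2, dim=1)
    return (0.5 * t * d ** 2 + 0.5 * (1 - t) * q_softplus(margin - d, q) ** 2).mean()

def q_triplet_loss(z_a, z_p, z_n, margin, q):
    """Eq. (14): anchor, positive, and negative embeddings of shape (batch, dim)."""
    d_ap = torch.norm(z_a - z_p, p=2, dim=1)
    d_an = torch.norm(z_a - z_n, p=2, dim=1)
    return q_softplus(d_ap - d_an + margin, q).mean()
```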

Figure 2 shows an example of the network architecture in which the q-softplus function or the shifted q-softplus function is used. In this figure, an example with the triplet loss is shown.

4 Experiments

4.1 Experimental Dataset

To confirm the effectiveness of the proposed q-softplus based activation function and loss function, we have performed experiments using MNIST, FashionMNIST, CIFAR10, CIFAR100, STL10, and Tiny ImageNet datasets.

Table 4. This table shows the classification accuracy of the test samples obtained by the Siamese network on MNIST, FashionMNIST, and CIFAR10. The accuracy is given in percent for the training and test samples, respectively, measured by k-nn.

The MNIST dataset consists of grey-scale images of hand-written digits from 10 classes. The size of each image is 28 \(\times \) 28 pixels, and there are 60,000 training samples and 10,000 test samples. The FashionMNIST dataset consists of grey-scale images of 10 classes of fashion items; each image is 28 \(\times \) 28 pixels, with 60,000 training samples and 10,000 test samples. The CIFAR10 dataset consists of color images of 10 object classes; each image is 32 \(\times \) 32 pixels, with 50,000 training samples and 10,000 test samples. The CIFAR100 dataset consists of color images of 100 object classes; each image is 32 \(\times \) 32 pixels, with 50,000 training samples and 10,000 test samples. The STL10 dataset consists of color images of 10 object classes; each image is 96 \(\times \) 96 pixels, with 500 training samples and 800 test samples per class. The Tiny ImageNet dataset consists of color images of 200 object classes; each image is 64 \(\times \) 64 pixels, with 100,000 training samples and 10,000 test samples.

Table 5. This table shows the classification accuracy of the test samples obtained by the Triplet network on MNIST, FashionMNIST, and CIFAR10. The accuracy is given in percent for the training and test samples, respectively, measured by k-nn.

4.2 Shifted q-Softplus as an Activation Function

To confirm the effectiveness of the shifted q-softplus function as an activation function, we have performed experiments in which the shifted q-softplus function is used in a CNN instead of the ReLU function. The classification accuracy is measured on the CIFAR10, CIFAR100, STL10, and Tiny ImageNet datasets. VGG11 [20] is used as the CNN model, and the effect of batch normalization (BN) is also investigated. Stochastic gradient descent (SGD) with a momentum of 0.9 is used for optimization. The learning rate is initially set to 0.01 and is multiplied by 0.1 at 20 and 40 epochs. The weight decay parameter is set to 0.0001. The batch size is set to 100 training samples, and training is run for 100 epochs.
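As an illustration of this setup, the sketch below builds a VGG11 in PyTorch, replaces its ReLU activations with a shifted q-softplus module, and configures the optimizer and learning-rate schedule described above. The module and helper are ours, and torchvision's vgg11_bn is used only as a stand-in for the VGG11 model of [20].

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11_bn

class ShiftedQSoftplus(nn.Module):
    """Shifted q-softplus of Eq. (12) as a drop-in activation (assumes q != 1)."""
    def __init__(self, q=0.2):
        super().__init__()
        self.q = q

    def forward(self, x):
        q = self.q
        z = x - 1.0 / (1.0 - q)
        exp_q = torch.clamp(1.0 + (1.0 - q) * z, min=0.0) ** (1.0 / (1.0 - q))
        return ((1.0 + exp_q) ** (1.0 - q) - 1.0) / (1.0 - q)

def replace_relu(module, q=0.2):
    """Recursively swap every nn.ReLU for the shifted q-softplus."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, ShiftedQSoftplus(q))
        else:
            replace_relu(child, q)

model = vgg11_bn(num_classes=10)   # stand-in for the VGG11 with BN used in the paper
replace_relu(model, q=0.2)

# SGD with momentum 0.9, lr 0.01 decayed by 0.1 at epochs 20 and 40, weight decay 1e-4
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)
```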

Table 1 shows the classification accuracy for different values of q. Each score is the average of 5 trials with different random seeds. From this table, the shifted q-softplus function gives better classification accuracy than the ReLU function, and the best hyperparameter q is around 0.2. When the hyperparameter q is positive, namely \(q>0.0\), the shape of the shifted q-softplus function lies below that of the ReLU function. This means that better classification accuracy is obtained when the outputs of each layer are smaller than the outputs of the ReLU function.

We have also performed experiments to find the best hyperparameter q of the shifted q-softplus function for each dataset by using Optuna [2]. Optuna is a Python library for finding the best hyperparameters of machine learning models. The objective function for finding the best hyperparameter q is the validation loss, and we used 0.1% of the training dataset as validation samples. The number of search trials is set to 30.
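A hedged sketch of this search with Optuna is shown below; `build_vgg11_with_qs` and `train_and_validate` are hypothetical helpers standing in for the actual training routine, and the search range for q is our assumption.

```python
import optuna

def objective(trial):
    # one q per shifted q-softplus in VGG11 (10 in total, cf. Table 3); range is an assumption
    qs = [trial.suggest_float(f"q{k + 1}", -0.5, 0.9) for k in range(10)]
    model = build_vgg11_with_qs(qs)      # hypothetical helper building the network
    return train_and_validate(model)     # hypothetical helper returning the validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)   # 30 trials, as in the experiments
print(study.best_params)
```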

The test accuracies for each dataset are shown in Table 2. Again, the values in the table are averages over 5 trials with different random seeds. The best hyperparameters q of the shifted q-softplus functions for each dataset are shown in Table 3. It can be seen that the best hyperparameter q is larger than 0.0 and smaller than 0.2 in almost all cases.

4.3 q-Softplus as a Loss Function of Metric Learning

To confirm the effectiveness of the q-softplus function as a loss function, we have performed experiments in which the q-softplus function is used to define the loss functions of the Siamese network and the Triplet network instead of the max function. We call these loss functions the q-contrastive loss and the q-triplet loss. The MNIST, FashionMNIST, and CIFAR10 datasets are used in the experiments. A simple CNN with 2 convolutional layers and 3 fully connected layers is used for the MNIST and FashionMNIST datasets, with the ReLU function as the activation function in the hidden layers. For the CIFAR10 dataset, VGG11 with batch normalization is used. The dimension of the final output is 10 for all datasets. Stochastic gradient descent (SGD) with a momentum of 0.9 is used for optimization. The learning rate is initially set to 0.01 and is multiplied by 0.1 at 20 and 40 epochs. The weight decay parameter is set to 0.0001. The batch size is set to 100 samples, and training is run for 100 epochs. The margin in the loss function is determined by preliminary experiments.
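For reference, a hedged sketch of an embedding network of the described size (2 convolutional and 3 fully connected layers with a 10-dimensional output) for 28 \(\times \) 28 grey-scale inputs is shown below; the channel widths and kernel sizes are our assumptions, as the paper does not specify them.

```python
import torch.nn as nn

# assumed channel widths and kernel sizes; only the layer counts and the
# 10-dimensional output follow the text
embedding_net = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),                 # 10-dimensional embedding
)
```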

The quality of the feature vectors obtained by the trained network is evaluated by the classification accuracy of a k-nearest-neighbor (k-nn) classifier in the 10-dimensional feature space. In the following experiments, k is set to 5. Since the q-softplus function becomes a shifted max function when \(q=0.0\), we also include experiments with margin \(m-1\).
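The evaluation can be sketched with scikit-learn as below; `embed` is a hypothetical helper that maps a dataset to its 10-dimensional feature vectors and labels with the trained network.

```python
from sklearn.neighbors import KNeighborsClassifier

train_feats, train_labels = embed(trained_network, train_set)   # hypothetical helper
test_feats, test_labels = embed(trained_network, test_set)      # hypothetical helper

knn = KNeighborsClassifier(n_neighbors=5)                        # k = 5, as in the experiments
knn.fit(train_feats, train_labels)
print(100.0 * knn.score(test_feats, test_labels))                # accuracy in percent
```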

Table 4 shows the classification accuracy obtained by the Siamese network, and Table 5 shows the classification accuracy obtained by the Triplet network. Each score is the average of 5 trials with different random seeds.

It can be seen that the q-softplus function gives better classification accuracy than the max function. The best hyperparameter q is around −0.5. Since the shape of the q-softplus function lies above that of the max function when \(q<0.0\), making the outputs larger is probably better for moving samples farther apart.

5 Conclusion

In this paper, we proposed the q-softplus function and the shifted q-softplus function as extensions of the softplus function. Through classification experiments, we confirmed that a network using the shifted q-softplus function as the activation function in the hidden layers gives better classification accuracy than a network using the ReLU function, and we found that the best q for the shifted q-softplus function is around 0.2. This result suggests that better classification accuracy is obtained when the outputs of each layer are smaller than the outputs of the ReLU function. Through the metric learning experiments, we confirmed that the q-softplus function can improve the contrastive loss of the Siamese network and the triplet loss of the Triplet network; for metric learning, the best q is around −0.5. This result suggests that better features are obtained when the outputs are larger than the outputs of the max function.