
1 Introduction

The activation function is an important component of neural networks. The most popular activation function is the Rectified Linear Unit (ReLU) [14]. ReLU can speed up the learning process at a low computational cost, as observed in [3, 11, 12]. Although many other activation functions have been proposed in the last few years [6, 9, 13, 17, 21], ReLU is still the most commonly used activation function for CNNs in image classification, due to its simplicity and the fact that alternatives such as ELU [2] and GELU [9] offer no significant advantage over it.

Despite being heavily over-parameterized, deep neural networks have been shown to be remarkably good at generalizing to natural data. A phenomenon known as the spectral bias [16] or frequency principle [1, 23, 24] states that activation functions such as ReLU make networks prioritize learning the low-frequency modes, and that the lower-frequency components of trained networks are more robust to random parameter perturbations. This has raised the important problem of understanding the implicit regularization effect of deep neural networks and is regarded as one main reason for their good generalization accuracy [15, 16, 20, 24]. Recently, a theoretical explanation for the spectral bias of ReLU neural networks was provided in [10] by leveraging connections with the theory of the finite element method and the hat activation function: for hat neural networks, the different frequency components of the error decay at roughly the same rate, so the hat function does not exhibit the spectral bias observed for networks based on ReLU, Tanh, and other activation functions.

In this paper, we consider CNNs with the hat activation function, which is defined as

$$\begin{aligned} \text{Hat}(x) = \left\{ \begin{array}{ll} x, &{} x\in [0,1],\\ 2-x, &{} x\in [1,2],\\ 0, &{} \text{otherwise,} \end{array}\right. \end{aligned}$$
(1)

and is in fact the first-order B-spline. Unlike the ReLU function, the hat function has compact support. We use the hat activation function for CNNs including MgNet [4] and ResNet [7].
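For concreteness, the following is a minimal sketch of (1) in PyTorch; the helper name is ours and not from [4, 5].

```python
import torch

def hat(x: torch.Tensor) -> torch.Tensor:
    # Hat(x) = x on [0, 1], 2 - x on [1, 2], and 0 elsewhere,
    # written compactly as max(0, min(x, 2 - x)).
    return torch.clamp(torch.minimum(x, 2.0 - x), min=0.0)
```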

MgNet [4] is strongly connected to the multigrid method, and a systematic numerical study of MgNet in [5] shows its success on image classification problems and its advantages over established networks. We use MgNet in our experiments because its multiscale structure can handle the frequency variation across data of different resolutions. Although the hat function has compact support, which makes the vanishing gradient problem more likely and removes the spectral bias, it still obtains comparable generalization accuracy and can even perform slightly better than CNNs with the ReLU activation function on the MNIST, CIFAR10/100, and ImageNet datasets. This also calls into question whether the spectral bias is truly significant for regularization. We illustrate that the scale of the hat activation function in each resolution layer matters and should be set properly so that MgNet adapts to the frequency variation in the network. Furthermore, since the performance of a neural network also depends on how its parameters are initialized, we test several initialization methods, including Xavier initialization [3] and Kaiming initialization [6]; the results show that all of these initialization methods work well for CNNs with the hat activation function.

2 Hat Function for MgNet

Unlike the ReLU function, the hat function has compact support [0, 2]. Neural networks with the hat function also have the universal approximation property [18, 19, 22]. The hat function is closely related to the finite element method, and we can adjust its compact support to change its frequency content. Thus, we define the following scaled hat activation function with parameter M:

$$\begin{aligned} \text{Hat}(x;M) = \left\{ \begin{array}{ll} x, &{} x\in [0,\frac{M}{2}],\\ M-x, &{} x\in [\frac{M}{2},M],\\ 0, &{} \text{otherwise.} \end{array}\right. \end{aligned}$$
(2)

Its frequency can thus be varied by changing the parameter M. It is shown in [10] that the different frequency components of the error for hat neural networks decay at roughly the same rate, so the hat function does not have the spectral bias observed for ReLU networks. To make full use of this property, we propose to use MgNet with the hat activation function.
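The scaled version (2) admits an equally short sketch; as before, the function name is ours, and M = 2 recovers the plain hat function (1).

```python
import torch

def hat_scaled(x: torch.Tensor, M: float = 2.0) -> torch.Tensor:
    # Hat(x; M) = x on [0, M/2], M - x on [M/2, M], and 0 elsewhere.
    return torch.clamp(torch.minimum(x, M - x), min=0.0)
```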

MgNet [4] is a convolutional neural network that is strongly connected to multigrid methods and performs well in comparison with existing CNN models [5]. The network consists of several iterative blocks, shown in Fig. 1, acting in both the data space and the feature space.

Fig. 1. MgNet iterative block.

This block is related to the following residual correction step in the multigrid method:

$$\begin{aligned} u^{i} = u^{i-1} + B^{i} * (f - A^i * u^{i-1}), \end{aligned}$$
(3)

where \(B^i, A^i\) are convolution operators, and f and u denote the source term (image) and the solution (feature), respectively. In addition, downscaling the primary image f to a coarser resolution requires the following steps, which project high-resolution data to low-resolution data:

$$\begin{aligned} u^{\ell +1} = \varPi _\ell ^{\ell +1} *_2 u^{\ell }, \end{aligned}$$
(4)
$$\begin{aligned} f^{\ell +1} = R^{\ell +1}_\ell *_2 (f^\ell - A^\ell *u^{\ell })+ A^{\ell +1} *u^{\ell +1}, \end{aligned}$$
(5)

where \(\varPi _\ell ^{\ell +1} *_2\) and \(R^{\ell +1}_\ell *_2\) denote convolution with stride 2. MgNet applies a nonlinear activation function within the iterative steps above to extract features from the image, as shown in Algorithm 1.

Algorithm 1. MgNet.

In this algorithm, \(B^{\ell ,i}, A^\ell , \varPi _\ell ^{\ell +1}, R_\ell ^{\ell +1}\) are convolution operators of a given kernel size (we often use size 3), \(\theta \) is an encoding layer that increases the number of channels, Avg is the average pooling layer, and \(\sigma \) is the activation function. The hyperparameters J and \(\nu _i\) are given in advance.

We note that if we remove the underlined variables \(\underline{\sigma }\) and \(\underline{\theta }\) in Algorithm 1, we recover exactly a classic multigrid method. From the convergence theory of multigrid methods, we know that the iterative step (6) is associated with the elimination of high-frequency error, and that as the layers get deeper, the frequency content of the data gets lower. We can therefore set hat activation functions with different values of M in different layers of the network to adapt to this frequency variation.
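To make this concrete, the following is a hedged PyTorch sketch of one resolution level of MgNet, combining the residual-correction step (3) with the restriction steps (4)-(5) and a scaled hat activation whose M is chosen per level. The class name, channel handling, and kernel sizes are our own simplifications; the actual implementation in [4, 5] may differ.

```python
import torch
import torch.nn as nn

def hat(x: torch.Tensor, M: float) -> torch.Tensor:
    # Scaled hat activation Hat(x; M) from (2).
    return torch.clamp(torch.minimum(x, M - x), min=0.0)

class MgNetLevel(nn.Module):
    """One resolution level of MgNet: nu residual-correction (smoothing)
    steps followed by restriction to the next coarser level."""

    def __init__(self, channels: int, nu: int, hat_scale: float):
        super().__init__()
        self.A = nn.Conv2d(channels, channels, 3, padding=1)
        self.Bs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(nu)]
        )
        # Stride-2 restrictions Pi and R of (4)-(5), plus the next-level A.
        self.Pi = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.R = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.A_next = nn.Conv2d(channels, channels, 3, padding=1)
        self.hat_scale = hat_scale

    def forward(self, u: torch.Tensor, f: torch.Tensor):
        # Smoothing: u <- u + sigma(B * sigma(f - A * u)), cf. (3) and (9).
        for B in self.Bs:
            r = f - self.A(u)
            u = u + hat(B(hat(r, self.hat_scale)), self.hat_scale)
        # Restriction to the next coarser level, cf. (4)-(5).
        u_next = self.Pi(u)
        f_next = self.R(f - self.A(u)) + self.A_next(u_next)
        return u_next, f_next
```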

The MgNet algorithm is quite basic and can be generalized in many different ways. It can also be used as guidance to modify and extend many existing CNN models [5]. In particular, Algorithm 1 admits the identities

$$\begin{aligned} r^{\ell ,i} = r^{\ell ,i-1} - A^\ell \circ \sigma \circ B^{\ell ,i} \circ \sigma (r^{\ell ,i-1}), \quad i=1:\nu _\ell \end{aligned}$$
(9)

where

$$\begin{aligned} r^{\ell ,i} = f^\ell - A^\ell *u^{\ell ,i}, \end{aligned}$$
(10)

and (9) corresponds to pre-activation ResNet [8].
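A hedged sketch of the corresponding pre-activation residual block with a hat activation in place of ReLU; the class name is ours, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class PreActHatBlock(nn.Module):
    # Pre-activation residual update r -> r - A(sigma(B(sigma(r)))), cf. (9):
    # the activation is applied before each convolution.
    def __init__(self, channels: int, hat_scale: float = 2.0):
        super().__init__()
        self.B = nn.Conv2d(channels, channels, 3, padding=1)
        self.A = nn.Conv2d(channels, channels, 3, padding=1)
        self.hat_scale = hat_scale

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        def sigma(t: torch.Tensor) -> torch.Tensor:
            return torch.clamp(torch.minimum(t, self.hat_scale - t), min=0.0)
        return r - self.A(sigma(self.B(sigma(r))))
```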

3 Experiments

Since MgNet is strongly connected with ResNet [7], we evaluate the performance of the hat activation function on image classification for both MgNet and ResNet, comparing against the same networks with ReLU. For MgNet, we use \(J=4\) and \(\nu _1=\nu _2=\nu _3=\nu _4=2\) as stated in Algorithm 1, so there are four different resolution levels.

The following benchmark datasets are used: (i) MNIST, (ii) CIFAR10, (iii) CIFAR100, and (iv) ImageNet.

We train with SGD using a batch size of 256 for 150 epochs. Parameters are initialized with Kaiming’s uniform initialization [6]. The initial learning rate is 0.1, decayed by a factor of 0.1 every 50 epochs. The reported results are averages over 3 runs. In Table 1, the numbers [5, 10, 15, 20] denote the scales of the hat activation in the different resolution layers, since the data pass through four resolution levels in MgNet and ResNet18.
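The optimizer and schedule described above can be sketched as follows; the momentum and weight-decay values are typical choices that are not specified in the text, and the model and data loader are placeholders.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 150):
    # SGD with batch size 256 (set in the data loader) for 150 epochs;
    # momentum and weight decay are assumed values, not from the text.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    # Decay the learning rate by a factor of 0.1 every 50 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```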

Table 1 shows that the hat activation function has slightly better generalization than the ReLU activation function for both MgNet and ResNet. To illustrate the frequency argument for MgNet, we evaluate the performance of MgNet with different scale settings of the hat activation function. As shown in Table 2, it is better to use a hat function with larger support on the coarser-resolution data, which is consistent with the frequency variation in MgNet.

To exclude the influence of the training schedule, we also train the CNNs for 300 epochs. As shown in Table 3, the test accuracy increases for both hat MgNet and ReLU MgNet, and the hat activation function still maintains slightly better generalization than ReLU, indicating that its advantage is not an artifact of the shorter schedule.

Since a fixed scale for the hat function can be a potential disadvantage, we also treat the scale values as trainable parameters of the network. Table 4 gives the results for MgNet with trainable-scale hat activations on the CIFAR10/100 datasets, where we also record the learned scale values. With two different settings of the initial scale, the results both show that it is better to use hat functions with larger support at the coarser resolution levels, and that the supports tend to shrink during training. The generalization accuracy of MgNet remains competitive with much smaller supports in the first few layers, without adding any neurons. We also note that the hat function and the ReLU function can be combined in CNNs with trainable-scale hat activations; for example, the activation of the encoding layer can be replaced with ReLU.
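A hedged sketch of a hat activation with a trainable scale, as used for Table 4; the module name and initialization are ours.

```python
import torch
import torch.nn as nn

class TrainableHat(nn.Module):
    """Hat activation whose support [0, M] is a learnable parameter,
    so each resolution level can adapt its own scale during training."""

    def __init__(self, init_scale: float = 2.0):
        super().__init__()
        self.M = nn.Parameter(torch.tensor(float(init_scale)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(torch.minimum(x, self.M - x), min=0.0)
```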

Kaiming’s initialization has been shown to work well for ReLU networks, and our experiments show that hat CNNs also work well with this initialization method. Furthermore, we consider Xavier’s uniform initialization [3] for hat CNNs on the CIFAR10/100 datasets. The results in Table 5 and Fig. 2 show that the choice of initialization makes almost no difference in test accuracy, although on CIFAR100 the loss with Kaiming’s initialization converges slightly faster.
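For reference, a minimal sketch of applying the two initialization schemes compared here; the helper function is ours, and the initializers are the standard PyTorch ones.

```python
import torch.nn as nn

def initialize(model: nn.Module, scheme: str = "kaiming") -> None:
    # Apply Kaiming (He) or Xavier (Glorot) uniform initialization to
    # every convolutional and linear layer of the model.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            if scheme == "kaiming":
                nn.init.kaiming_uniform_(m.weight)
            else:
                nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```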

Table 1. Comparison of hat CNNs and ReLU CNNs for image classification.
Table 2. Comparison of different scale setting of hat function for MgNet.
Table 3. MgNet results of 300 epochs.
Table 4. MgNet with hat function of trainable scale (300 epochs).
Table 5. Comparison of different initialization methods for hat-CNNs.
Fig. 2. Comparison of the loss and test-accuracy curves of MgNet on the CIFAR10 and CIFAR100 datasets for the two initialization methods.

4 Conclusion

In this paper we introduce the hat activation function, a compactly supported activation function for CNNs, and evaluate its performance on several datasets. The results show that the hat activation function, despite its compact activation region, remains competitive with the ReLU activation function for MgNet and ResNet, even though it lacks the properties of ReLU that are deemed important. Moreover, although an activation function with small compact support can easily cause vanishing gradients, this does not hurt the performance of CNNs with the hat function. The experiments also show that the scale setting of the hat activation function influences performance, which is related to the frequency variation in the network. Furthermore, commonly used initialization methods are shown to be viable for hat CNNs. Altogether, these experiments show that the hat function is a viable choice of activation function for CNNs and indicate that the spectral bias is not essential for generalization accuracy.