1 Introduction

Many types of CNN architecture have been developed by researchers during the last few years aiming at achieving good scores on computer vision tasks. Despite the success of CNNs, a question remains given recent developments: what CNN architectures are good and how can we design such architectures? One possible direction to address this question is neural architecture search (NAS) [5], in which CNN architectures are automatically designed by an algorithm such as evolutionary computation or reinforcement learning to maximize performance on targeted tasks. NAS can automate the design process of neural networks and helps reduce the trial and error required of developers.

This chapter is based on the works of [34,35,36] and explains a genetic programming-based approach to automatically design CNN architectures. In the next section, we briefly review NAS methods by categorizing them into three approaches: evolutionary computation, reinforcement learning, and gradient-descent-based approaches. Then, we describe the Cartesian genetic programming (CGP)-based NAS method for a CNN, which is categorized as an evolutionary-computation-based approach. In Sect. 7.3, the CGP-based architecture search method for image classification, termed CGP-CNN, is explained. In Sect. 7.4, the CGP-based architecture search method is extended to the convolutional autoencoder (CAE), a type of CNN, for image restoration.

2 Progress of Neural Architecture Search

Automatic design of neural network structures is an active topic initially presented several decades ago, e.g., [30, 33, 45]. These methods optimize the connection weights and/or network structure of low-level neurons using an evolutionary algorithm, and are also known as evolutionary neural networks. These traditional structure optimization methods target relatively small neural networks whereas recent deep neural networks, including CNNs, have greater than one million parameters though the architectures are still designed by human experts. Aiming at the automatic design of deep neural network architectures, various architecture search methods have been developed since 2017. Nowadays, the automatic design method of deep neural network architectures is termed a neural architecture search (NAS) [5].

To address large-scale architectures, neural network architectures are designed using a certain search method but the network weights are optimized by a stochastic gradient descent method through back-propagation. Evolutionary algorithms are often used to search the architectures. Real et al. [28] optimized large-scale neural networks using an evolutionary algorithm and achieved better performance than that of modern CNNs in image classification tasks. In this method, they represent the CNN architecture as a graph structure and optimize it via the evolutionary algorithm. The connection weights of the reproduced architecture are optimized by stochastic gradient descent as typical neural network training; the accuracy for the architecture evaluation dataset is assigned as the fitness. Miikkulainen et al. [20] proposed a method termed CoDeepNEAT that is an extended version of NeuroEvolution of Augmenting Topologies (NEAT). This method designs the network architectures using blueprints and modules. The blueprint chromosome is a graph in which each node has a pointer to a particular module species. Each module chromosome is a graph that represents a small DNN. Specifically, each node in the blueprint is replaced with a module selected from a particular species to which that node points. During the evaluation phase, the modules and blueprints are combined to generate assembled networks and the networks are evaluated. Xie and Yuille [42] designed CNN architectures using the genetic algorithm with a binary string representation. They proposed a method for encoding a network structure in which the connectivity of each layer is defined by a binary string representation. The type of each layer, number of channels, and size of a receptive field are not evolved in this method. The method explained in this chapter is also an evolutionary-algorithm-based NAS. 
Different from the aforementioned methods, it optimizes the architecture based on genetic programming and adopts well-designed modules as the node function.

Another approach is to use reinforcement learning to search the neural architectures. In [49], a recurrent neural network (RNN) was used to generate neural network architectures. The RNN was trained with policy-gradient-based reinforcement learning to maximize the expected accuracy on a learning task. Baker et al. [2] proposed a meta-modeling approach based on reinforcement learning to produce CNN architectures. A Q-learning agent explores and exploits a space of model architectures with an 𝜖-greedy strategy and experience replay.

As these methods need neural network training to evaluate the candidate architectures, they often require a considerable computational cost. For instance, the work of [49] used 800 graphics processing units (GPUs). Reducing the computational cost of NAS is an active topic. A promising approach is to jointly optimize the architecture parameters and connection weights. This approach, termed one-shot NAS (also known as weight sharing), finds a better architecture during a single training run. In one-shot NAS, the non-differentiable objective function consisting of discrete architecture parameters is transformed into a differentiable objective by continuous [17, 43] or stochastic relaxation [1, 27, 31]; both the architecture parameters and connection weights are then optimized by gradient-based optimizers.

3 Designing CNN Architecture for Image Classification

In this section, we introduce the architecture search method based on CGP for image classification. We term the method CGP-CNN. In CGP-CNN, we directly encode the CNN architectures based on CGP and use highly functional modules as node functions. The CNN architecture defined by CGP is trained by stochastic gradient descent using a model training dataset, and its fitness value is assigned based on its accuracy on a separate dataset held out from the training data (i.e., the architecture evaluation dataset). Then, the architecture is optimized by the evolutionary algorithm to maximize the accuracy on the architecture evaluation dataset. Figure 7.1 shows an overview of CGP-CNN. In the following, we describe the network representation and the evolutionary algorithm used in CGP-CNN.

Fig. 7.1

Overview of CGP-CNN. The method represents the CNN architectures based on CGP. The CNN architecture is trained on a learning task and assigned a fitness based on the accuracies of the trained model for the architecture evaluation dataset. The evolutionary algorithm searches for better architectures

3.1 Representation of CNN Architectures

For CNN architecture representation, we use the CGP encoding scheme that represents an architecture of CNNs as directed acyclic graphs with a two-dimensional grid. CGP was proposed as a general form of genetic programming in [22]. The graph corresponding to a phenotype is encoded to a string termed a genotype and optimized using the evolutionary algorithm.

Let us assume that the grid has \(N_r\) rows by \(N_c\) columns; then, the number of intermediate nodes is \(N_r \times N_c\) and the number of inputs and outputs depends on the task. The genotype consists of a string of integers of a fixed length and each gene determines the function type of the node and the connection between nodes. A node in the c-th column may only take its inputs from nodes in the (c − l)-th to (c − 1)-th columns, in which l is termed a level-back parameter. Figure 7.2 shows an example of the genotype, phenotype, and corresponding CNN architecture. As seen in Fig. 7.2, the CGP encoding scheme allows for nodes that are not connected to the output nodes (e.g., node No. 5 in Fig. 7.2). We term these nodes inactive nodes. Whereas the genotype in CGP is a fixed-length representation, the number of nodes in the phenotypic network varies because of the inactive nodes. This is a desirable feature because the number of layers can be determined using the evolutionary algorithm.
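The distinction between active and inactive nodes can be illustrated with a minimal sketch that walks backwards from the output node. The genotype layout used here (a list of `(function, input_a, input_b)` tuples plus a separate output gene), the `ARITY` table, and the function name `active_nodes` are assumptions made for illustration and are not taken from the authors' implementation.

```python
# Number of inputs each (hypothetical) node function actually uses.
ARITY = {"conv": 1, "pool": 1, "sum": 2, "concat": 2}


def active_nodes(genes, output_gene, n_inputs=1):
    """Return the sorted indices of nodes reachable backwards from the output.

    genes[i] = (function_name, in_a, in_b) describes node n_inputs + i.
    Indices below n_inputs denote the network inputs.
    """
    active = set()
    stack = [output_gene]
    while stack:
        idx = stack.pop()
        if idx < n_inputs or idx in active:
            continue  # a network input, or a node that was already visited
        active.add(idx)
        func, *inputs = genes[idx - n_inputs]
        # Single-input functions ignore their second connection gene.
        stack.extend(inputs[:ARITY[func]])
    return sorted(active)


# Hypothetical genotype: index 0 is the input, nodes are numbered 1..5.
genes = [
    ("conv", 0, 0),  # node 1
    ("pool", 1, 1),  # node 2
    ("conv", 0, 1),  # node 3 (second connection unused)
    ("sum", 2, 3),   # node 4
    ("conv", 1, 1),  # node 5, never reached from the output
]
result = active_nodes(genes, output_gene=4)
```

In this toy genotype, node 5 has valid genes but does not appear in `result`, so it contributes nothing to the decoded network; the number of layers in the phenotype is therefore smaller than the number of nodes in the grid.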

Fig. 7.2

Examples of a genotype and phenotype. The genotype (left) defines the CNN architecture (right). Node No. 5 on the left is inactive and does not appear in the path from the inputs to the outputs. The summation node applies max pooling to downsample the first input to the same size as the second input

Referring to modern CNN architectures, we select the highly functional modules as the node function. The frequently used processes in the CNN are convolution and pooling; the convolution processing uses local connectivity and spatially shares the learnable weights and the pooling is nonlinear downsampling. We prepare the six types of node functions, termed ConvBlock, ResBlock, max pooling, average pooling, concatenation, and summation. These nodes operate on the three-dimensional (3-D) tensor (also known as the feature map) defined by the dimensions of the row, column, and channel.

The ConvBlock consists of a convolutional layer with a stride of one followed by batch normalization [10] and the rectified linear unit (ReLU) [23]. To maintain the size of the input, we pad the input with zero values around the border before the convolutional operation. Therefore, the ConvBlock takes an M × N × C tensor as an input and produces an M × N × C′ tensor, where M, N, C, and C′ are the number of rows, columns, input channels, and output channels, respectively. We prepare several ConvBlocks with different output channels and receptive field sizes (kernel sizes) in the function set of CGP.

As shown in Fig. 7.3, the ResBlock is composed of the ConvBlock, batch normalization, ReLU, and tensor summation. The ResBlock is a building block of modern successful CNN architectures, e.g., [8, 13, 47]. Following this recent trend of human architecture design, we decided to use ResBlock as the building block in CGP-CNN. The ResBlock performs identity mapping via the shortcut connection as described in [8]. The row and column sizes of the input are preserved in the same manner as those of the ConvBlock after convolution. As shown in Fig. 7.3, the output feature maps of the ResBlock are calculated via the ReLU activation and the summation with the input. The ResBlock takes an M × N × C tensor as an input and produces an M × N × C′ tensor. We prepare several ResBlocks with different output channels and receptive field sizes (kernel sizes) in the function set of CGP.

Fig. 7.3

The ResBlock architecture

The max and average poolings perform the maximum and average operations, respectively, over the local neighbors of the feature maps. We use pooling with a 2 × 2 receptive field size and a stride of two. The pooling layer takes an M × N × C tensor and produces an M′ × N′ × C tensor, where M′ = ⌊M∕2⌋ and N′ = ⌊N∕2⌋.

The concatenation function takes two feature maps and concatenates them in the channel dimension. When concatenating feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling so that both inputs have the same size. Let us assume that we have two inputs of size \(M_1 \times N_1 \times C_1\) and \(M_2 \times N_2 \times C_2\); then the size of the output feature maps is \(\min (M_1, M_2) \times \min (N_1, N_2) \times (C_1 + C_2)\).

The summation performs element-wise summation of two feature maps, channel-by-channel. Similar to the concatenation, when summing two feature maps with different numbers of rows and columns, we downsample the larger feature map by max pooling. In addition, if the inputs have different numbers of channels, we expand the feature map with fewer channels by filling the additional channels with zeros. Let us assume that we have two inputs of size \(M_1 \times N_1 \times C_1\) and \(M_2 \times N_2 \times C_2\); then the size of the output feature maps is \(\min (M_1, M_2) \times \min (N_1, N_2) \times \max (C_1, C_2)\). In Fig. 7.2, the summation node applies max pooling to downsample the first input to the same size as the second input. By using the summation and concatenation operations, our method can express the shortcut connection or branch layers, such as those used in GoogLeNet [37] and residual network (ResNet) [8].
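The shape rules stated above for the node functions can be summarized in a short sketch. The function names are hypothetical, and shapes are written as `(rows, columns, channels)` tuples; the sketch only tracks tensor sizes and performs no actual convolution or pooling.

```python
def block_shape(shape, out_channels):
    """ConvBlock or ResBlock: zero padding and stride one keep rows and columns."""
    rows, cols, _ = shape
    return (rows, cols, out_channels)


def pool_shape(shape):
    """Max or average pooling with a 2 x 2 window and a stride of two."""
    rows, cols, channels = shape
    return (rows // 2, cols // 2, channels)


def concat_shape(a, b):
    """Concatenation: the larger map is downsampled, channels are stacked."""
    return (min(a[0], b[0]), min(a[1], b[1]), a[2] + b[2])


def sum_shape(a, b):
    """Summation: the larger map is downsampled, channels are zero-filled."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]))


# Example: a 32 x 32 x 3 input passed through two branches and merged.
x = (32, 32, 3)
branch_a = block_shape(x, 64)                    # (32, 32, 64)
branch_b = pool_shape(block_shape(x, 128))       # (16, 16, 128)
merged_by_concat = concat_shape(branch_a, branch_b)
merged_by_sum = sum_shape(branch_a, branch_b)
```

Here the concatenation yields a 16 × 16 × 192 tensor and the summation a 16 × 16 × 128 tensor, following the min and max rules given in the text.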

The output node represents the softmax function to produce a distribution over the target classes. The outputs fully connect to all elements of the input. The node functions used in the experiments are listed in Table 7.1.

Table 7.1 Node functions and abbreviated symbols used in the experiments

3.2 Evolutionary Algorithm

Following the standard CGP, we use a point mutation as the genetic operator. The function and the connection of each node randomly change to valid values according to the mutation rate. The fitness evaluation of the CNN architecture involves CNN training and requires approximately 0.5 to 1 hour in our setting. Therefore, we need to efficiently evaluate several candidate solutions in parallel at each generation. To use the computational resources efficiently, we reapply the mutation operator until at least one active node changes, and evaluate the resulting candidate solutions. We term this mutation forced mutation. Moreover, to maintain a neutral drift, which is effective for CGP evolution [21, 22], we modify the parent by neutral mutation if the fitness of the offspring does not improve. The neutral mutation operates only on the genes of inactive nodes and therefore does not modify the phenotype. We use a modified (1 + λ) evolution strategy (with λ = 2 in the experiment) that incorporates these two mutation operators. The procedure of our evolutionary algorithm is listed in Algorithm 1.

The (1 + λ) evolution strategy, the default evolutionary algorithm in CGP, has only a few strategy parameters: the mutation rate and offspring size. We do not need to expend considerable effort to tune such strategy parameters. Thus, we use the (1 + λ) evolution strategy in CGP-CNN.

Algorithm 1 Evolutionary algorithm for CGP-CNN and CGP-CAE
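As a rough illustration of the procedure described in this section, the following sketch runs a (1 + λ) evolution strategy with forced and neutral mutation on a toy problem. It is not the authors' Algorithm 1: the fitness function is a stand-in for CNN training, the set of active genes is fixed (in CGP it depends on the connection genes), and accepting a child whose fitness ties with the parent is an assumption.

```python
import random

N_GENES, N_ACTIVE, N_VALUES = 8, 4, 10  # toy sizes; genes 0..3 count as "active"


def fitness(genotype):
    # Toy stand-in for "train the CNN and measure its validation accuracy".
    return sum(genotype[:N_ACTIVE])


def point_mutation(genotype, rate, rng, positions):
    # Each gene at the given positions is resampled with probability `rate`.
    child = list(genotype)
    for i in positions:
        if rng.random() < rate:
            child[i] = rng.randrange(N_VALUES)
    return child


def forced_mutation(parent, rate, rng):
    # Reapply mutation until at least one active gene differs from the parent.
    while True:
        child = point_mutation(parent, rate, rng, range(N_GENES))
        if child[:N_ACTIVE] != parent[:N_ACTIVE]:
            return child


def neutral_mutation(parent, rate, rng):
    # Mutate inactive genes only, so the phenotype and fitness are unchanged.
    return point_mutation(parent, rate, rng, range(N_ACTIVE, N_GENES))


def evolve(generations=200, lam=2, rate=0.1, seed=0):
    rng = random.Random(seed)
    parent = [rng.randrange(N_VALUES) for _ in range(N_GENES)]
    parent_fit = fitness(parent)
    history = [parent_fit]
    for _ in range(generations):
        children = [forced_mutation(parent, rate, rng) for _ in range(lam)]
        fits = [fitness(c) for c in children]  # run in parallel in practice
        best = max(range(lam), key=fits.__getitem__)
        if fits[best] >= parent_fit:  # accepting ties is an assumption
            parent, parent_fit = children[best], fits[best]
        else:
            parent = neutral_mutation(parent, rate, rng)
        history.append(parent_fit)
    return parent, parent_fit, history


best_genotype, best_fitness, history = evolve()
```

Because every offspring differs from the parent in at least one active gene, no evaluation is spent on a phenotype identical to the parent, which matters when each evaluation costs up to an hour of training.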

3.3 Experiment on Image Classification Tasks

3.3.1 Experimental Setting

We apply CGP-CNN to the CIFAR-10 and CIFAR-100 datasets consisting of 60,000 color images (32 × 32 pixels) in 10 and 100 classes, respectively. Each dataset is split into a training set of 50,000 images and a test set of 10,000 images. We randomly sample 45,000 examples from the training set to train the CNN and the remaining 5000 examples are used for architecture evaluation (i.e., fitness evaluation of CGP).

To assign the fitness value to the candidate CNN architecture, we train the CNN by stochastic gradient descent (SGD) with a mini-batch size of 128. The softmax cross-entropy loss is used as the loss function. We initialize the weights using the method described in [7] and use the Adam optimizer [11] with an initial learning rate α = 0.01 and momentum parameters \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\). We train each CNN for 50 epochs and use the maximum accuracy of the last 10 epochs as the fitness value. We reduce the learning rate by a factor of 10 at the 30th epoch.
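The fitness assignment and the learning rate schedule used during the search can be written down in a few lines. This is a minimal sketch: the function names are hypothetical, the accuracy curve is made up for illustration, and it is assumed that the reduced learning rate applies from the 30th epoch onward.

```python
def search_phase_lr(epoch, base_lr=0.01, drop_epoch=30, factor=10.0):
    """Learning rate for a 1-indexed epoch during the architecture search."""
    # Assumption: the reduced rate is used from the 30th epoch onward.
    return base_lr / factor if epoch >= drop_epoch else base_lr


def fitness_from_history(val_accuracies, last_k=10):
    """Fitness = best validation accuracy over the final `last_k` epochs."""
    if not val_accuracies:
        raise ValueError("at least one epoch of accuracies is required")
    return max(val_accuracies[-last_k:])


# Made-up accuracy curve for 50 epochs, for illustration only.
history = [0.50 + 0.008 * e for e in range(40)] + [
    0.80, 0.82, 0.81, 0.83, 0.82, 0.84, 0.83, 0.85, 0.84, 0.83,
]
fitness = fitness_from_history(history)
```

Taking the maximum over the last 10 epochs rather than the final value makes the fitness less sensitive to epoch-to-epoch fluctuations at the end of a short training run.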

We preprocess the data with pixel-mean subtraction. To prevent overfitting, we use a weight decay with the coefficient \(1.0 \times 10^{-4}\). We also use data augmentation based on [8]: padding 4 pixels on each side and randomly cropping a 32 × 32 patch from the padded image or its horizontally flipped image.
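The augmentation step can be sketched in plain Python on a single-channel image stored as a list of rows. The function name `augment`, the use of zero padding, and the flip probability of 0.5 are assumptions made for illustration; the text specifies only the padding width and the crop size.

```python
import random


def augment(image, pad=4, rng=random):
    """Pad by `pad` pixels per side, optionally flip, and crop to the input size."""
    height, width = len(image), len(image[0])
    if rng.random() < 0.5:  # flip probability is an assumption
        image = [row[::-1] for row in image]
    blank = [0] * (width + 2 * pad)
    padded = (
        [list(blank) for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [list(blank) for _ in range(pad)]
    )
    # Choose the top-left corner of the crop inside the padded image.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + width] for row in padded[top:top + height]]


rng = random.Random(0)
image = [[1] * 32 for _ in range(32)]  # dummy 32 x 32 image
patch = augment(image, rng=rng)
```

The crop always has the same size as the original image, so each augmented example is a randomly shifted (by up to 4 pixels in each direction) and possibly mirrored copy of the input.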

The parameter setting for CGP is shown in Table 7.2. We use a relatively large number of columns to generate deep architectures. The number of active nodes in an individual of CGP is restricted. Therefore, we apply the mutation operator until a CNN architecture that satisfies the restriction on the number of active nodes is generated. The offspring size λ is two, the same as the number of GPUs in our experimental machines. We test two node function sets termed ConvSet and ResSet for CGP-CNN. The ConvSet contains ConvBlock, max pooling, average pooling, summation, and concatenation in Table 7.1 and the ResSet contains ResBlock, max pooling, average pooling, summation, and concatenation. The difference between these two function sets is whether the set contains ConvBlock or ResBlock. The number of generations is 500 for ConvSet and 300 for ResSet.

Table 7.2 Parameter setting for the CGP-CNN on image classification tasks

The best CNN architecture from the CGP process is retrained using all 50,000 images in the training set. Then, we compute the test accuracy. We optimize the weights of the obtained architecture for 500 epochs using a different training procedure; we use SGD with a momentum of 0.9, a mini-batch size of 128, and a weight decay of \(5.0 \times 10^{-4}\). Following the learning rate schedule in [8], we start with a learning rate of 0.01 and set it to 0.1 at the 5th epoch. We reduce it by a factor of 10 at the 250th and 370th epochs. We report the test accuracy at the 500th epoch as the final performance.
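The retraining schedule, with its short warm-up followed by two step decays, can be expressed as a single function of the epoch. This is a sketch under the assumption that epochs are 1-indexed and that each change takes effect from the stated epoch onward; the function name `retrain_lr` is hypothetical.

```python
def retrain_lr(epoch):
    """Learning rate for a 1-indexed epoch when retraining the best architecture."""
    if epoch < 5:
        return 0.01  # warm-up phase with a smaller rate
    lr = 0.1
    if epoch >= 250:
        lr /= 10.0  # first reduction by a factor of 10
    if epoch >= 370:
        lr /= 10.0  # second reduction by a factor of 10
    return lr


# Learning rates at a few representative epochs.
schedule = {epoch: retrain_lr(epoch) for epoch in (1, 5, 249, 250, 370, 500)}
```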

We implement CGP-CNN using the Chainer framework [40] (version 1.16.0) and run it on a machine with two NVIDIA GeForce GTX 1080 or two GTX 1080 Ti GPUs. We use the GTX 1080 and the GTX 1080 Ti for the experiments on the CIFAR-10 and CIFAR-100 datasets, respectively. Because GPU memory is limited, a candidate CNN occasionally exhausts it, and the network training process fails with an out-of-memory error. In this case, we assign a zero fitness to the candidate architecture.

3.3.2 Experimental Result

We run CGP-CNN 10 times on each dataset and report the classification errors. We compare the classification performance to the hand-designed CNNs and automatically designed CNNs using the architecture search methods on the CIFAR-10 and 100 datasets. A summary of the classification performances is provided in Tables 7.3 and 7.4. The models, Maxout, Network in Network, VGG, ResNet, FractalNet, and Wide ResNet, are the hand-designed CNN architectures whereas MetaQNN, Neural Architecture Search, Large-Scale Evolution, Genetic CNN, and CoDeepNEAT are the models obtained using the architecture search methods. The values of other models, except for VGG and ResNet on CIFAR-100, are referenced from the literature. We implement the VGG net and ResNet for CIFAR-100 because they were not applied to the dataset in [32] and [8]. The architecture of VGG is identical to that of configuration D in [32]. In Tables 7.3 and 7.4, the number of learnable weight parameters in the models is also listed. In CGP-CNN, the number of learnable weight parameters of the best architecture is reported.

Table 7.3 Comparison of the error rates (%), number of learnable weight parameters, and search costs on the CIFAR-10 dataset
Table 7.4 Comparison of the error rates (%) and number of learnable weight parameters on the CIFAR-100 dataset

On the CIFAR-10 dataset, the CGP-CNNs outperform most of the hand-designed models and show a good balance between the classification errors and the number of parameters. CGP-CNN (ResSet) shows better performance compared to that of CGP-CNN (ConvSet). Compared to other architecture search methods, CGP-CNN (ConvSet and ResSet) outperforms MetaQNN [2], Genetic CNN [42], and CoDeepNEAT [20]. The best architecture of CGP-CNN (ResSet) outperforms Large-Scale Evolution [28]. The Neural Architecture Search [49] achieved the best error rate, but this method used 800 GPUs and required considerable computational costs to search for the best architecture. Table 7.3 also lists the number of GPU days (the computational time multiplied by the number of GPUs used during the experiments) for the architecture search. As seen, CGP-CNN can find a good architecture at a reasonable computational cost. We assume that CGP-CNN, particularly with ResSet, could reduce the search space and find better architectures in an early iteration by using the highly functional modules. The CIFAR-100 dataset is a very challenging task because there are many classes. CGP-CNN finds the competitive network architectures within a reasonable computational time. Even though the obtained architecture is not at the same level as the state-of-the-art architectures, it shows a good balance between the classification errors and number of parameters.

The error rates of the architecture search methods (not only CGP-CNN) do not reach those of Wide ResNet, a human-designed architecture. However, these human-designed architectures are developed with the expenditure of tremendous human effort. An advantage of architecture search methods is that they can automatically find a good architecture for a new dataset. Another advantage of CGP-CNN is that the number of weight parameters in the discovered architectures is less than that in the human-designed architectures, which is beneficial when we want to implement CNN on a mobile device. Note that we did not introduce any criteria for the architecture complexity in the fitness function. It might be possible to find more compact architectures by introducing the penalty term into the fitness function, which is an important research direction, such as in [4, 29, 39].

Figure 7.4 shows examples of the CNN architectures obtained by CGP-CNN (ConvSet and ResSet). These architectures are complex and would be difficult to design manually. Specifically, CGP-CNN (ConvSet) uses the summation and concatenation nodes, leading to a wide network and allowing for the formation of skip connections. Therefore, the CGP-CNN (ConvSet) architecture is wider than that of CGP-CNN (ResSet). We also observe that CGP-CNN (ResSet) has a similar structure to that of ResNet [8]. ResNet consists of a series of two types of modules: a module with several convolutions and shortcut connections without downsampling and a downsampling convolution with a stride of 2. Although CGP-CNN cannot downsample in the ConvBlock and ResBlock, we see that CGP-CNN (ResSet) uses a pooling layer as an alternative to the downsampling convolution. We can say that CGP-CNN can find an architecture similar to that designed by human experts.

Fig. 7.4

CNN architectures obtained by CGP-CNN with ConvSet (left) and ResSet (right) on the CIFAR-10 dataset

4 Designing CNN Architectures for Image Restoration

In this section, we apply the CGP-based architecture search method to an image restoration task of recovering a clean image from its degraded version. We term this method CGP-CAE. Recently, learning-based approaches based on CNNs have been applied to image restoration tasks and have significantly improved the state-of-the-art performance. Researchers have approached this problem mainly from three directions: designing new network architectures, loss functions, and training strategies. In this section, we focus on designing a new network architecture for image restoration and report that simple convolutional autoencoders (CAEs) designed by evolutionary algorithms can outperform existing image restoration methods which are designed manually.

4.1 Search Space of Network Architectures

In this work, we consider CAEs that are built only on convolutional layers with downsampling and skip connections. In addition, we use symmetric CAEs such that their first half (encoder part) is symmetric to the second half (decoder part). The final layer is attached to the top of the decoder part to obtain images with a fixed number of channels (i.e., single-channel grayscale or three-channel color images), for which either one or three filters of 3 × 3 size are used. Therefore, specifying the encoder part of a CAE determines its entire architecture. The encoder part can have an arbitrary number of convolutional layers up to a specified maximum, which is selected by the evolutionary algorithm. Each convolutional layer can have an arbitrary number and size of filters, and is followed by ReLU [23]. In addition, each layer can have an optional skip connection [8, 18] that connects the layer to its mirrored counterpart in the decoder part. Specifically, the output feature maps (obtained after ReLU) of the layer are passed to and added element-wise to the output feature maps (obtained before ReLU) of the counterpart layer. We can use additional downsampling after each convolutional layer depending on the task. Whether to use downsampling is determined in advance and thus it is not selected by the architecture search, as explained later.
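To make the symmetry constraint concrete, the following sketch expands an encoder specification into the full layer list of a CAE. The function name `build_symmetric_cae`, the dictionary layout, and the choice that each decoder layer reuses the filter count and size of its encoder counterpart are assumptions made for illustration; the text states only that the two halves are symmetric.

```python
def build_symmetric_cae(encoder, out_channels=3):
    """Expand encoder layers [(filters, kernel, skip), ...] into a full CAE.

    Returns a list of layer descriptions: encoder layers, mirrored decoder
    layers, and a final 3 x 3 output layer with `out_channels` filters.
    """
    layers = []
    for i, (filters, kernel, _skip) in enumerate(encoder):
        layers.append({"name": f"enc{i}", "filters": filters,
                       "kernel": kernel, "skip_from": None})
    # The decoder mirrors the encoder in reverse order (assumed same F and k).
    for i in reversed(range(len(encoder))):
        filters, kernel, skip = encoder[i]
        layers.append({"name": f"dec{i}", "filters": filters, "kernel": kernel,
                       "skip_from": f"enc{i}" if skip else None})
    layers.append({"name": "out", "filters": out_channels,
                   "kernel": 3, "skip_from": None})
    return layers


# Hypothetical two-layer encoder for a grayscale task.
cae = build_symmetric_cae([(64, 3, True), (128, 5, False)], out_channels=1)
names = [layer["name"] for layer in cae]
```

An encoder with N layers thus yields 2N + 1 layers in total, and a skip connection on encoder layer i appears as an additional input to decoder layer i.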

4.2 Representation of CAE Architectures

Following [34], we represent architectures of CAEs via a directed acyclic graph which is defined on a two-dimensional grid. This graph is optimized by the evolutionary algorithm, in which the graph is termed a phenotype and is encoded by a data structure termed a genotype.

Figure 7.5 shows an example of a genotype and a phenotype of CGP-CAE. Each node of the graph represents a convolutional layer followed by a ReLU in a CAE. An edge connecting two nodes represents the connectivity of the two corresponding layers. The graph has two additional special nodes termed input and output nodes. The former represents the input layer of the CAE and the latter represents the output of the encoder part, or equivalently the input of the decoder part of the CAE. As the input of each node is connected to at most one node, there is a single unique path starting from the input node and ending at the output node. This unique path identifies the architecture of the CAE, as shown in the middle row of Fig. 7.5. Note that the nodes depicted in the neighboring two columns are not necessarily connected. Thus, the CAE can have a different number of layers depending on how the nodes are connected. Because the maximum number of layers of the encoder part of the CAE is \(N_{\max}\), the total number of layers is at most \(2N_{\max} + 1\), including the output layer. To control how the number of layers will be chosen, we introduce a hyper-parameter termed level-back l, such that nodes in the c-th column are allowed to be connected from nodes in the columns ranging from c − l to c − 1. If we use a smaller l, then the resulting CAEs will tend to be deeper.

Fig. 7.5

An example of a genotype and a phenotype of CGP-CAE. A phenotype is a graph representation of a network architecture and a genotype encodes a phenotype. They encode only the encoder part of a CAE and its decoder part is automatically created such that it is symmetrical to the encoder part. In this example, the phenotype is defined on a grid of three rows and three columns

A genotype encodes a phenotype and is manipulated by the evolutionary algorithm. The genotype encoding a phenotype with \(N_r\) rows and \(N_c\) columns has \(N_r N_c + 1\) genes, each of which represents attributes of a node with two integers (i.e., type and connection). The type specifies the number F and size k of the filters of the node, and whether the layer has a skip connection or not, by an integer encoding their combination. The connection specifies the node that is connected to the input of this node. The last, \((N_r N_c + 1)\)-st gene represents the output node and stores only the connection determining the node connected to the output node. An example of a genotype is shown in the top row of Fig. 7.5, where F ∈ {64, 128, 256} and k ∈ {1 × 1, 3 × 3, 5 × 5}.
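Decoding such a genotype amounts to following the connection genes back from the output node to the input node. The sketch below assumes a particular integer encoding of the type (the order produced by `itertools.product`), a node numbering in which index 0 is the input node, and a valid feed-forward genotype; none of these details are specified in the text, and the names are hypothetical.

```python
from itertools import product

FILTERS = (64, 128, 256)
KERNELS = (1, 3, 5)
# All 18 (filters, kernel, skip) combinations; the ordering is an assumption.
TYPES = list(product(FILTERS, KERNELS, (False, True)))


def decode_path(node_genes, output_connection):
    """Return the encoder layers on the unique input-to-output path.

    node_genes[i - 1] = (type_id, connection) describes node i (i >= 1);
    index 0 denotes the input node. Connections must point to lower indices.
    """
    path = []
    idx = output_connection
    while idx != 0:
        type_id, connection = node_genes[idx - 1]
        path.append(TYPES[type_id])
        idx = connection
    path.reverse()  # collected from output to input, so restore forward order
    return path


# Hypothetical genotype with four nodes; node 2 is not on the path.
node_genes = [(0, 0), (4, 0), (17, 1), (9, 3)]
encoder = decode_path(node_genes, output_connection=4)
```

Because each node has a single input, the decoded phenotype is always a chain; nodes that are skipped, such as node 2 here, remain in the genotype as inactive genes.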

We use the same evolutionary algorithm as used in the previous section to perform a search in the architecture space (see Algorithm 1).

4.3 Experiment on Image Restoration Tasks

We conducted experiments to test the effectiveness of CGP-CAE. We chose two tasks: image inpainting and denoising.

4.3.1 Experimental Settings

Inpainting

We followed the procedures suggested in [46] for experimental design. We used three benchmark datasets: the CelebFaces Attributes Dataset (CelebA) [16], the Stanford Cars Dataset (Cars) [12], and the Street View House Numbers (SVHN) [24]. CelebA contains 202,599 images, from which we randomly selected 100,000, 1000, and 2000 images for training, architecture evaluation, and testing, respectively. All images were cropped to properly contain the entire face and resized to 64 × 64 pixels. For Cars and SVHN, we used the provided training and testing split. The images of Cars were cropped according to the provided bounding boxes and resized to 64 × 64 pixels. The images of SVHN were resized to 64 × 64 pixels.

We generated images with missing regions of the following three types: a central square block mask (Center), random pixel masks such that 80% of all the pixels were randomly masked (Pixel), and half-image masks such that a randomly chosen vertical or horizontal half of the image was masked (Half). For the latter two, a mask was randomly generated for each training mini-batch and each test image.
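The three mask types can be generated as follows for a 64 × 64 image, with 1 marking a masked pixel. This is an illustrative sketch: the function names, the 32 × 32 size of the central block, and the choice to mask exactly 80% of the pixels (rather than each pixel independently with probability 0.8) are assumptions not stated in the text.

```python
import random


def center_mask(size=64, block=32):
    """Central square block mask; the block size is an assumed value."""
    lo = (size - block) // 2
    hi = lo + block
    return [[1 if lo <= r < hi and lo <= c < hi else 0 for c in range(size)]
            for r in range(size)]


def pixel_mask(size=64, ratio=0.8, rng=random):
    """Random pixel mask covering a fixed fraction of all pixels."""
    n_pixels = size * size
    masked = set(rng.sample(range(n_pixels), round(ratio * n_pixels)))
    return [[1 if r * size + c in masked else 0 for c in range(size)]
            for r in range(size)]


def half_mask(size=64, rng=random):
    """Mask a randomly chosen vertical or horizontal half of the image."""
    side = rng.choice(["left", "right", "top", "bottom"])
    half = size // 2
    tests = {
        "left": lambda r, c: c < half,
        "right": lambda r, c: c >= half,
        "top": lambda r, c: r < half,
        "bottom": lambda r, c: r >= half,
    }
    is_masked = tests[side]
    return [[1 if is_masked(r, c) else 0 for c in range(size)]
            for r in range(size)]


def count(mask):
    return sum(sum(row) for row in mask)


rng = random.Random(0)
masks = {"Center": center_mask(), "Pixel": pixel_mask(rng=rng),
         "Half": half_mask(rng=rng)}
```

For the Pixel and Half types, calling the generator anew for every mini-batch and test image reproduces the per-sample randomness described above, whereas the Center mask is deterministic.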

Considering the nature of this task, we consider CAEs endowed with downsampling. To be specific, the same numbers of downsampling and upsampling operations with a stride of 2 were employed such that the entire network had a symmetric hourglass shape. For simplicity, we used a skip connection and downsampling in an exclusive manner; in other words, every layer (in the encoder part) employed either a skip connection or downsampling.

Denoising

We followed the experimental procedures described in [18, 38]. We used 300 and 200 grayscale images belonging to the BSD500 dataset [19] to generate training and test images, respectively. For each image, we randomly extracted 64 × 64 patches, to each of which Gaussian noise with σ = 30, 50, or 70 was added. As in the previous studies, we trained a single model for all noise levels.

For this task, we used CAE models without downsampling following the previous studies [18, 38]. We zero-padded the input feature maps computed in each convolution layer not to change the size of the input and output feature space of the layer.

Configurations of the Architectural Search

For the evolutionary algorithm, we chose the mutation probability as r = 0.1, number of children as λ = 4, and number of generations as G = 250. For the phenotype, we used the graph with \(N_r = 3\), \(N_c = 20\), and level-back l = 5. For the number F and size k of the filters at each layer, we chose them from {64, 128, 256} and {1 × 1, 3 × 3, 5 × 5}, respectively. During an evolution process, we trained each CAE for I = 20,000 iterations with a mini-batch of size b = 16. We set the learning rate of the Adam optimizer to 0.001. For the training loss, we used the mean squared error (MSE) between the restored images and their ground truths:

$$ L(\theta_{D}) = \frac{1}{|S|} \sum_{i=1}^{|S|} \left\| D(y_i;\theta_{D}) - x_i \right\|_{2}^{2}, \tag{7.1} $$

where the CAE and its weight parameters are \(D\) and \(\theta_D\), respectively; \(S\) is the training set, \(x_i\) is a ground truth image, and \(y_i\) is a corrupted image. For the fitness function of the evolutionary algorithm, we use the peak signal-to-noise ratio (PSNR), for which a higher value indicates better image restoration.
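The two quantities used for training and for fitness can be computed as in the following sketch, which operates on images flattened into lists of pixel values. It uses the standard PSNR definition, 10 log10(peak² / MSE), with a per-pixel mean squared error; the peak value of 255 and the function names are assumptions, since the text does not state the intensity range.

```python
import math


def mse(restored, target):
    """Per-pixel mean squared error between two flattened images."""
    if len(restored) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((r - t) ** 2 for r, t in zip(restored, target)) / len(target)


def psnr(restored, target, peak=255.0):
    """Peak signal-to-noise ratio in decibels; higher is better."""
    error = mse(restored, target)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / error)


# Tiny made-up example: every pixel is off by 25.5 intensity levels.
target = [100.0, 150.0, 200.0, 50.0]
restored = [value + 25.5 for value in target]
score = psnr(restored, target)
```

Since PSNR is a monotonically decreasing function of the MSE, minimizing the training loss and maximizing the fitness pull in the same direction, although the loss is measured on the training set and the fitness on the architecture evaluation set.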

Following completion of the evolution process, we fine-tuned the best CAE using the training set of images for an additional 500,000 iterations, in which the learning rate is reduced by a factor of 10 at the 200,000th and 400,000th iterations. We then calculated its performance using the test set of images. We implemented CGP-CAE using PyTorch [25] and performed the experiments using four P100 GPUs. Execution of the evolutionary algorithm and the fine-tuning of the best model took approximately 3 days for the inpainting tasks and 4 days for the denoising tasks.

4.3.2 Results of the Inpainting Tasks

We use two standard evaluation measures, the PSNR and structural similarity index (SSIM) [41], to evaluate the restored images. Higher values of these measures indicate better image restoration.

As previously mentioned, we follow the experimental procedure employed in [46]. In the paper, the authors reported the performances of their proposed method, Semantic Image Inpainting (SII), and Context Autoencoder (CE) [26]. However, we found that CE can provide considerably better results than those reported in [46] in terms of PSNR. Thus, we report here PSNR and SSIM values for CE that we obtained by running the code provided by the authors (footnote 1). To calculate SSIM values of SII, which were not reported in [46], we ran the authors' code (footnote 2) for SII.

To further validate the effectiveness of the evolutionary search, we evaluate two baseline architectures: an architecture generated by a random search (RAND), and an architecture with the same depth as the best-performing architecture found by CGP-CAE but with a constant number (64) of fixed-size (3 × 3) filters in each layer and a skip connection (BASE). In the random search, we generate 10 architectures at random in the same search space as ours and report their average PSNR and SSIM values. All other experimental setups are the same.
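The difference between the two baselines can be illustrated with a deliberately simplified sketch that describes each architecture only by a list of (F, k) pairs, one per layer. This ignores the connectivity encoded by the CGP genotype, fixes the depth to an arbitrary value, and uses hypothetical function names, so it should be read as an illustration of the idea rather than the procedure used in the experiments.

```python
import random

FILTER_COUNTS = (64, 128, 256)  # candidate numbers of filters F
FILTER_SIZES = (1, 3, 5)        # candidate filter sizes k (k x k)


def base_layers(depth):
    """BASE: every layer uses 64 filters of size 3 x 3."""
    return [(64, 3)] * depth


def random_layers(depth, rng):
    """Simplified RAND: draw (F, k) uniformly and independently per layer."""
    return [(rng.choice(FILTER_COUNTS), rng.choice(FILTER_SIZES))
            for _ in range(depth)]


def mean(values):
    """Average used to summarize the PSNR or SSIM of the random architectures."""
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
depth = 10              # arbitrary depth chosen for illustration
rand_candidates = [random_layers(depth, rng) for _ in range(10)]
base = base_layers(depth)
```

Each of the 10 random candidates would then be trained and scored like any other CAE, and `mean` would be applied to the resulting PSNR and SSIM values.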

Table 7.5 shows the PSNR and SSIM values obtained using five methods on three datasets and three masking patterns. We ran the evolutionary algorithm three times and report the average PSNR and SSIM values of the three optimized CAEs. As shown, CGP-CAE outperforms the other four methods for each of the dataset-mask combinations. Notably, CE and SII use mask patterns for inference. Specifically, their networks estimate only the pixel values of the missing regions specified by the provided masks, and these estimates are then merged with the unmasked regions of clean pixels. Thus, the pixel intensities of the unmasked regions are identical to their ground truths. In contrast, CGP-CAE does not use masks and outputs complete images in which the missing regions are expected to be inpainted correctly. We then calculate the PSNR of the output image against the ground truth without identifying the missing regions. This difference should help CE and SII achieve high PSNR and SSIM values; nevertheless, CGP-CAE performs better.
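To make the effect of mask-based merging concrete, the following sketch merges a network prediction with the clean pixels of the corrupted input and compares the squared error with and without merging. The function names, the toy pixel values, and the convention that a mask value of 1 marks a missing pixel are assumptions made for illustration.

```python
def merge_with_mask(predicted, corrupted, mask):
    """Take predictions where mask == 1 (missing) and clean pixels elsewhere."""
    return [p if m == 1 else c for p, c, m in zip(predicted, corrupted, mask)]


def squared_error(a, b):
    """Sum of squared pixel differences between two flat images."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


truth = [0.2, 0.4, 0.6, 0.8]
mask = [0, 1, 1, 0]                   # the two middle pixels are missing
corrupted = [0.2, 0.0, 0.0, 0.8]      # missing pixels zeroed out
predicted = [0.25, 0.45, 0.55, 0.75]  # hypothetical full-image network output

merged = merge_with_mask(predicted, corrupted, mask)
error_merged = squared_error(merged, truth)        # masked regions only
error_unmerged = squared_error(predicted, truth)   # every pixel contributes
```

With merging, the unmasked pixels contribute no error at all, whereas a method that outputs the whole image, as CGP-CAE does, is also penalized for small deviations in regions that were never corrupted.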

Table 7.5 Inpainting results

Sample inpainted images obtained by CGP-CAE, along with the masked inputs and the ground truths, are shown in Fig. 7.6. Overall, CGP-CAE performs stably; the output images do not have large errors for any type of mask. It performs particularly well for random pixel masks (the middle column of Fig. 7.6), where the images are realistic and sharp. CGP-CAE tends to yield less sharp images for inputs with a filled region of missing pixels. However, it can still accurately infer their contents, as shown in the examples of inpainting images of numbers (the rightmost column of Fig. 7.6).

Fig. 7.6 Examples of inpainting results obtained by CGP-CAE (CAEs designed by the evolutionary algorithm)

4.3.3 Results of the Denoising Task

We compare CGP-CAE to two baseline architectures (i.e., RAND and BASE described in Sect. 7.4.3.2) and two state-of-the-art methods, RED [18] and MemNet [38]. Table 7.6 shows the PSNR and SSIM values for three versions of the BSD200 test set with different noise levels σ = 30, 50, and 70, in which the performance values of RED and MemNet are obtained from [38]. CGP-CAE again achieves the best performance in all cases except one (MemNet for σ = 30). It is worth noting that the networks of RED and MemNet have 30 and 80 layers, respectively, whereas our best CAE has only 15 layers (including the decoder part and output layer), showing that our evolutionary method was able to find simpler architectures that can provide more accurate results.

Table 7.6 Denoising results on BSD200

An example of an image recovered by CGP-CAE is shown in Fig. 7.7. As we can see, CGP-CAE correctly removes the noise and produces an image as sharp as the ground truth.

Fig. 7.7 Examples of images reconstructed by CGP-CAE for the denoising task. The first column shows the input image with noise level σ = 50

4.3.4 Analysis of Optimized Architectures

Table 7.7 shows the top five best-performing architectures designed by CGP-CAE for the image inpainting task using center masks on the CelebA dataset and the denoising task, along with their performances measured on their test datasets. One of the best-performing architectures for each task is shown in Fig. 7.8. We can see that although their overall structures do not appear unique, mostly because of the limited search space of CAEs, the number and size of filters are quite different across layers, which is difficult to manually determine. Although it is difficult to provide a general interpretation of why the parameters of each layer are selected, we can make the following observations: (1) regardless of the task, almost all networks have a skip connection in the first layer, implying that the input images contain essential information to yield accurate outputs; (2) 1 × 1 convolution seems to be an important ingredient for both tasks; 1 × 1 convolution layers dominate the denoising networks, and all the inpainting networks employ two 1 × 1 convolution layers; (3) when comparing the inpainting networks to the denoising networks, the following differences are apparent: the largest filters of size 5 × 5 tend to be employed by the former more often than the latter (2.8 vs. 0.8 layers on average), and 1 × 1 filters tend to be employed by the former less often than the latter (2.0 vs. 3.2 layers on average).

Fig. 7.8 One of the best-performing architectures given in Table 7.7 for inpainting (upper) and denoising (lower) tasks

Table 7.7 Five best-performing architectures of CGP-CAE

5 Summary

This chapter introduced a neural architecture search for CNNs: a CGP-based approach for designing deep CNN architectures. Specifically, the methods, CGP-CNN for image classification and CGP-CAE for image restoration, were explained. The methods generate CNN architectures based on the CGP encoding scheme with highly functional modules and use the evolutionary algorithm to find good architectures. The effectiveness and potential of CGP-CNN and CGP-CAE were verified through numerical experiments. The experimental results of image classification showed that CGP-CNN can find a well-performing CNN architecture. In the experiment on image restoration tasks, we showed that CGP-CAE can find a simple yet high-performing architecture of a CAE. We believe that evolutionary computation is a promising solution for NAS.

The bottleneck of architecture search for DNNs is the computational cost. Simple yet effective acceleration techniques, termed rich initialization and early termination of network training, can be found in [36]. Another possible acceleration technique is to start with a small amount of training data and increase it as the generations progress. Moreover, to simplify and compress the CNN architectures, we may introduce regularization techniques into the architecture search process. Alternatively, we may be able to manually simplify the obtained CNN architectures by removing redundant or less effective layers.

Considerable room remains for exploration of search spaces of architectures of classical convolutional networks, which may apply to other tasks such as single image colorization [48], depth estimation [3, 44], and optical flow estimation [9].