1 Introduction

Semantic segmentation is a fundamental task in computer vision. It aims at partitioning an image into semantically meaningful parts and assigning each part to a class. Semantic segmentation has been used in many applications such as video action and event recognition [9, 12, 13, 17, 40, 52], image search engines [8, 24, 29, 39, 41, 53], image and video coding [10, 11, 25], medical imaging [7, 33], augmented reality [2], autonomous robot navigation [37] and autonomous driving [19].

Many semantic segmentation systems have been developed using different techniques, such as thresholding [5, 6], edge detection [20], region growing [44], graph partitioning [26, 43, 60], sparse coding [62] and Convolutional Neural Networks (CNNs) [4, 21, 32, 38, 50, 59, 61]. Thresholding-based methods [5, 6] segment the image by thresholding the pixel values and grouping similar pixels together; their main difficulty is selecting an appropriate threshold that yields a good segmentation. Edge detection methods [20] detect region edges inside the image and then recognize image regions from the detected edges; however, they struggle with blurred edges and overlapping objects. Region growing methods [44] rely on the assumption that neighboring pixels within one region have similar values; the difficulty lies in quantifying the similarity between two adjacent pixels, in addition to the fact that an object is generally composed of several connected regions. Graph partitioning methods [26, 43, 60] group the graph nodes into two or more partitions based on certain criteria; both the grouping criterion and the choice of the number of segments influence the quality of the segmentation. Finally, sparse coding [62] was introduced as a high-level representation of image regions, from which the regions are identified; nonetheless, sparse coding methods do not always succeed in extracting discriminative region features.

Over the past few years, CNNs [4, 21, 32, 38, 50, 59, 61] have made great progress in semantic segmentation due to their high capacity for learning from data. As a result, many CNN variants have been developed, such as the Fully Convolutional Network (FCN) [38], the deep fully convolutional encoder-decoder architecture for semantic pixel-wise segmentation (SegNet) [4], the Wide Residual Network (Wide-ResNet) [58] and the Fully Convolutional DenseNet (FC-DenseNet) [32]. Recently, thanks to its powerful architecture, FC-DenseNet has given very promising results compared with state-of-the-art methods on many semantic segmentation benchmarks.

In this paper, we propose an encoder-decoder method called Reinforced Multiscale fully convolutional DenseNet (RM-DenseNet), in which we increase the width of the network by integrating Wider Dense Blocks (WDBs). A WDB widens the classical dense block by unrolling a set of dense blocks (DBs) connected recurrently: each DB takes as input not only the output of the previous DB but also the initial input of the WDB. These recurrent connections extend the contextual field of view and increase the depth of our network without increasing the number of parameters. They also allow the network to go back to an earlier time step and pick up information that might otherwise have been forgotten. In addition, this structure emulates the human visual system and helps strengthen the extraction of the target features. Moreover, inspired by [3], we add a MultiScale Convolutional (MSConv) layer after the last DB of the decoder part. The MSConv layer runs three parallel convolutional layers with different kernels (1×1, 3×3 and 5×5), which aggregates the prediction over different sizes of spatial context and makes the method more flexible and powerful. Our RM-DenseNet has been evaluated on two semantic segmentation benchmarks, the CamVid [18] and Cityscapes [22] datasets, and has given better results than state-of-the-art methods.

The remainder of our paper is organized as follows. We review the recent semantic segmentation works in Section 2. In Section 3, we detail our proposed approach. Then, the experimental results are presented for the two semantic segmentation datasets in Section 4. Finally, we conclude this paper and give some future directions in Section 5.

2 Related works

Due to the importance of the semantic segmentation field, many approaches have been developed. In this section, we focus our study on deep learning methods since they have recently demonstrated substantial success in many applications ranging from image processing to semantic segmentation. Consequently, we present different CNN variants [4, 21, 32, 38, 50, 59, 61] for the semantic segmentation task, since they have shown good performance and given the best segmentation results in recent works (see Table 1). To make this section clear, we classify the CNN methods into two categories. The first category concerns CNN methods that were developed for the classification task and extended to semantic segmentation. The second category groups the encoder-decoder based CNN methods. These methods are composed of two main parts: an encoder and a decoder. The encoder is similar to the architecture of conventional CNN methods, without the fully connected layers or the classification layer, while the decoder maps the low-resolution feature maps of the encoder back to full input resolution for pixel-wise classification.

Table 1 Different semantic segmentation methods

2.1 Classification oriented CNN methods adapted to the semantic segmentation task

As for the first category of CNN methods, image segmentation is conducted using adapted versions of classification-oriented CNN methods [14,15,16]. In [38], Long et al. introduced the Fully Convolutional Network (FCN) method. This method removes the fully connected layers, which give classification scores, and replaces them with convolutional layers with very large receptive fields that capture the global context of the scene and output spatial heat maps. The FCN has been built upon three CNN methods: AlexNet [35] (FCN-AlexNet), VGG-16 [47] (FCN-VGG16) and GoogLeNet [48] (FCN-GoogLeNet). The FCN-AlexNet architecture is illustrated in Fig. 1; a rough code sketch of this fully convolutional conversion is given after Fig. 1. Another method is ReSeg [50], which adapts the ReNet [51] classification method to semantic segmentation. The architecture of ReSeg is composed of four Recurrent Neural Networks (RNNs) that retrieve contextual information by sweeping the image horizontally and vertically in both directions. Then, the last feature map is resized by one or more max-pooling layers. Finally, a soft-max layer predicts the probability distribution over the classes for each pixel.

Fig. 1 Example of Fully Convolutional Networks (FCN) architecture
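
To make this conversion concrete, here is a rough Keras sketch of the fully convolutional idea, assuming a VGG-16 backbone; the head sizes, the single 32× upsampling step and the class count are illustrative assumptions rather than the exact configuration of [38]:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fcn(num_classes=21, input_shape=(224, 224, 3)):
    # Classification backbone without its fully connected layers.
    backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                           input_shape=input_shape)
    x = backbone.output
    # The former fully connected layers become convolutions with large
    # receptive fields, so the network outputs spatial heat maps instead
    # of a single classification score vector.
    x = layers.Conv2D(4096, 7, padding='same', activation='relu')(x)
    x = layers.Conv2D(4096, 1, activation='relu')(x)
    scores = layers.Conv2D(num_classes, 1)(x)   # coarse per-pixel class scores
    # Upsample the coarse heat map back to the input resolution.
    up = layers.Conv2DTranspose(num_classes, 64, strides=32,
                                padding='same')(scores)
    return Model(backbone.input, up)

model = build_fcn()
print(model.output_shape)  # (None, 224, 224, 21)
```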

Besides, the Pyramid Scene Parsing Network (PSPNet) was introduced in [61] as an extension of the ResNet [31] classification method. PSPNet uses a pretrained ResNet to get the feature map from the last convolutional layer. Then, to obtain representations of different sub-regions, PSPNet applies a pyramid parsing module. Up-sampling and concatenation layers follow the pyramid parsing module to form the final feature representation, and the final pixel-wise prediction is obtained by a convolutional layer. PSPNet thereby produces additional contextual information.
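
As an illustration, the following is a hedged Keras sketch of such a pyramid parsing module, assuming static spatial dimensions divisible by each bin size; the bin sizes and branch widths are assumptions, not the exact values of [61]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    """PSPNet-style pyramid parsing: pool the feature map into several
    grids of sub-regions, embed each, upsample and concatenate."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    branches = [x]
    for b in bin_sizes:
        # Pool the feature map into a b x b grid of sub-regions.
        p = layers.AveragePooling2D(pool_size=(h // b, w // b))(x)
        p = layers.Conv2D(c // len(bin_sizes), 1, use_bias=False)(p)
        # Upsample each sub-region representation back to the input size.
        p = layers.UpSampling2D(size=(h // b, w // b),
                                interpolation='bilinear')(p)
        branches.append(p)
    # Concatenate the original map with the multi-scale context branches.
    return layers.Concatenate()(branches)

inp = tf.keras.Input((60, 60, 512))   # e.g. a ResNet final feature map
out = pyramid_pooling(inp)
print(out.shape)                      # (None, 60, 60, 1024)
```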

In addition, in [54], an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) network named Bi-LSTM is proposed. This method combines a deep CNN with two separate LSTM networks, which it exploits in order to learn hierarchical visual-language embeddings. It is a deep network that exploits both future and history context information. Furthermore, Cheng et al. [57] improved the Bi-LSTM method by adding multi-task learning: it keeps the advantages of the Bi-LSTM model for learning hierarchical visual-language embeddings and employs multi-task learning to increase its generality. Then, in [55], a regularized deep neural network (RE-DNN) is proposed. This method studies the highly non-linear semantic correlation between text and image. It includes visual, textual and joint models for visual semantic representation learning, textual semantic representation learning and cross-modal mapping, respectively. It is composed of five layers divided into three parts: one third for the image modality, one third for the text modality and the last third for multimodal joint modeling. Furthermore, Cheng et al. [56] proposed a CNN-based framework to exploit multimodal video representations for action recognition. This framework contains four modules: a spatial CNN, a temporal CNN, an acoustical CNN and a fusion layer. The fusion layer combines both early fusion and late fusion with a Neural Network and a Support Vector Machine, and is added on top of the CNNs to learn a joint video representation. This method takes the audio information into consideration through sophisticated fusion methods.

2.2 Encoder-decoder architecture based methods

Despite their success, these extensions of conventional CNN methods have trouble learning to decode low-resolution feature maps into pixel-wise predictions for segmentation. That is why encoder-decoder architectures have been proposed in many recent CNN methods. In particular, SegNet [4] is an example of an encoder-decoder model (see Fig. 2) composed of two symmetric parts, where the decoder is an exact mirror of the encoder. The encoder is composed of 13 convolutional layers inspired by the VGG-16 [47] method, and a corresponding 13-layer decoder maps the low-resolution feature maps of the encoder back to full resolution. The final decoder output is fed to a soft-max classifier that produces class probabilities for each pixel independently.

Fig. 2 Example of SegNet architecture

Following the same encoder-decoder architecture, Jégou et al. proposed the FC-DenseNet [32] method, which transforms the existing DenseNet [27] classification model into a fully convolutional one. FC-DenseNet is composed of 11 dense blocks (DBs): five DBs in the encoder part, one DB in the bottleneck (the layer between the encoder and the decoder) and five DBs in the decoder part. Each layer of a DB is composed of Batch Normalization (BN), a Rectified Linear Unit (ReLU) and a 3×3 convolutional layer (see Fig. 3a), and the DB integrates direct connections from any layer to all subsequent layers. In the encoder part, each DB is followed by a Transition Down (TD) transformation composed of BN, ReLU, a 1×1 convolutional layer and a 2×2 max pooling operation (see Fig. 3b). In the decoder part, each DB is preceded by a Transition Up (TU) transformation composed of a 3×3 transposed convolution with stride 2 (see Fig. 3c). The transposed convolution upsamples the previous feature maps, which are then concatenated with the ones coming from the skip connection to form the input of a new DB. Finally, a 1×1 convolutional layer followed by a Softmax classifier gives the per-class distribution at each pixel. This method has achieved good results in semantic segmentation on the CamVid [18] and Gatech [45] datasets. A minimal code sketch of these building blocks is given after Fig. 3.

Fig. 3 Building blocks of RM-DenseNet: a Layer used in the model, b Transition Down (TD) layer and c Transition Up (TU) layer
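
The following minimal Keras sketch renders these building blocks as described above; the growth rate and the number of layers per DB are illustrative assumptions (the dropout that our TD adds in Section 3.1 is omitted here, following the description above):

```python
import tensorflow as tf
from tensorflow.keras import layers

def db_layer(x, growth_rate=16):
    """One DB layer: BN -> ReLU -> 3x3 convolution (Fig. 3a)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    return layers.Conv2D(growth_rate, 3, padding='same')(y)

def dense_block(x, n_layers=4, growth_rate=16):
    """Each layer is directly connected to all subsequent layers; the block
    returns the n_layers * growth_rate newly produced feature maps."""
    new_maps = []
    for _ in range(n_layers):
        y = db_layer(x, growth_rate)
        new_maps.append(y)
        x = layers.Concatenate()([x, y])   # direct connections to successors
    return layers.Concatenate()(new_maps)

def transition_down(x):
    """TD: BN -> ReLU -> 1x1 convolution -> 2x2 max pooling (Fig. 3b)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(int(x.shape[-1]), 1)(y)
    return layers.MaxPooling2D(2)(y)

def transition_up(x):
    """TU: 3x3 transposed convolution with stride 2 (Fig. 3c)."""
    return layers.Conv2DTranspose(int(x.shape[-1]), 3, strides=2,
                                  padding='same')(x)
```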

Table 1 summarizes the architecture as well as the main contribution of each CNN-based method. The majority of the presented methods are built on the VGG-16 [47] architecture. However, among the reported methods, FC-DenseNet [32] has given the best results on many image segmentation benchmarks, which is what encouraged us to build our proposed CNN method upon FC-DenseNet.

3 Proposed approach

Our proposed method presents a new CNN architecture built upon the successful FC-DenseNet [32] method, augmented with WDBs and an MSConv layer. The WDB improves the classical dense block by increasing the width of the DBs in the encoder part through a recurrent structure. The recurrent connections inside the WDB are added to emulate the human visual system and to integrate context information with a fixed number of parameters. In addition, our method applies an MSConv layer, inspired by [3], after the last DB of the decoder part to aggregate the prediction over different sizes of spatial context. This architecture leads to a new semantic segmentation method called Reinforced MultiScale fully convolutional DenseNet (RM-DenseNet) (see Fig. 4).

Fig. 4 Diagram of our Reinforced MultiScale fully convolutional DenseNet

3.1 Reinforced MultiScale fully convolutional DenseNet architecture

Our RM-DenseNet follows the FC-DenseNet pipeline with six WDBs. The architecture is built from 269 convolutional layers: one convolutional layer at the input, 162 layers in the encoder part, 61 layers in the bottleneck, 43 layers in the decoder part, as well as one MSConv layer and one convolutional layer at the end. First, the input image is passed through a standard convolutional layer with a 3×3 receptive field. Then, 5 WDBs are applied, each of which contains one convolutional layer required for the summation operation and 4 DBs connected to each other with recurrent connections (see Table 2). As shown in Fig. 3, each WDB is followed by a TD composed of BN, ReLU, a 1×1 convolutional layer, dropout with p = 0.2 and a max pooling of size 2×2. A bottleneck with one WDB is placed between the encoder and the decoder. Afterwards, 5 DBs are used in the decoder part, each of which is preceded by a TU composed of a 3×3 transposed convolution with stride 2 to compensate for the pooling operation (see Fig. 3). In order to perform model averaging over several scales, an MSConv layer is applied after the last DB. Finally, a convolutional layer with a 1×1 receptive field and a soft-max layer provide the per-class distribution at each pixel.

Table 2 Wider dense block parameters

3.2 Wider dense block

As shown in Fig. 5, each WDB is composed of one convolutional layer (needed for the summation operation) and four DBs arranged in a recurrent structure (see Fig. 5b). The structure is unfolded for t time steps, where t = 0 represents a standard feed-forward connection, denoted in Fig. 5 by dashed lines. For the three other time steps, the input of the DB is the summation of the initial input of the WDB and the output of the previous DB, represented in Fig. 5 by dotted lines. In order to perform this summation, the initial WDB input and the output of the previous DB must have the same dimensions. As in the traditional DB [32], the output of a DB with n layers is n×k feature maps (where k is the number of applied filters). To bring the initial WDB input to the same size, a convolutional layer is applied to it, which also outputs n×k feature maps (see Fig. 5b). A minimal code sketch of the WDB is given after Fig. 5.

Fig. 5 Wider Dense Block architecture
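
To make the weight sharing explicit, here is a minimal sketch of a WDB as a custom Keras layer, assuming n = 4 layers per DB and an illustrative growth rate k = 16; the same BN and convolution weights are reused at every time step, so unrolling adds depth but no parameters:

```python
import tensorflow as tf
from tensorflow.keras import layers

class WiderDenseBlock(tf.keras.layers.Layer):
    """Sketch of a WDB: one dense block unrolled for t_steps time steps
    with shared weights; layer sizes are illustrative assumptions."""

    def __init__(self, n_layers=4, growth_rate=16, t_steps=4, **kwargs):
        super().__init__(**kwargs)
        out_ch = n_layers * growth_rate
        # Projection so the initial WDB input matches the DB output size
        # (needed for the summation).
        self.project = layers.Conv2D(out_ch, 1, padding='same')
        # One set of DB layers, shared across all time steps.
        self.bns = [layers.BatchNormalization() for _ in range(n_layers)]
        self.convs = [layers.Conv2D(growth_rate, 3, padding='same')
                      for _ in range(n_layers)]
        self.t_steps = t_steps

    def _dense_block(self, x, training):
        feats = []
        for bn, conv in zip(self.bns, self.convs):
            y = conv(tf.nn.relu(bn(x, training=training)))
            feats.append(y)
            x = tf.concat([x, y], axis=-1)   # dense connectivity
        return tf.concat(feats, axis=-1)     # n_layers * growth_rate maps

    def call(self, inputs, training=False):
        x0 = self.project(inputs)                    # resized initial input
        out = self._dense_block(x0, training)        # t = 0: feed forward
        for _ in range(self.t_steps - 1):            # t > 0: recurrence
            out = self._dense_block(x0 + out, training)
        return out

inp = tf.keras.Input((56, 56, 48))
out = WiderDenseBlock(t_steps=4)(inp)   # 4 time steps, shared weights
print(out.shape)                        # (None, 56, 56, 64)
```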

The optimal WDB width has been determined experimentally as four (see Tables 3 and 6). Increasing the width through a recurrent structure gives our network several advantages. Firstly, the integration of recurrent connections makes the network more aware of context information [23, 36], since not only the initial state but also the previous states are stored in the internal memory. Secondly, the network is able to take previous processing into account, which can be interpreted as a ”previous time step”, to pick up information that might otherwise have been forgotten. This approach explicitly reuses the earlier processing, which handled the data at a lower field of view, and combines it with a wider one; it is implicitly a multiscale processing that relies on similar processing structures using the same parameters. In addition, it requires no additional parameters, which helps to avoid training difficulties such as over-fitting: the network becomes deeper with no additional parameters thanks to weight sharing. Finally, the recurrent structure strengthens the extraction of the target features and improves the segmentation performance.

Table 3 mIoU of FC-DenseNet with different time steps of WDB on CamVid dataset

3.3 MultiScale convolutional layer

Our network is also boosted by an MSConv layer inspired by [3] (see Fig. 6). The MSConv layer performs three parallel convolutions with 1×1, 3×3 and 5×5 receptive fields, in contrast to the FC-DenseNet [32] method, which uses only a single 1×1 kernel. This leads to three different feature maps, which are concatenated into one. These three parallel convolutional layers allow our model to aggregate the predictions at different scales while giving a single prediction output, which makes the network more flexible, lets it infer more information and improves the segmentation accuracy. A short sketch of this layer is given after Fig. 6.

Fig. 6 MultiScale convolutional layer
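
A short Keras sketch of the MSConv layer follows; the number of filters per branch is an assumption, and the trailing 1×1 convolution and soft-max mirror the pipeline of Section 3.1:

```python
import tensorflow as tf
from tensorflow.keras import layers

def msconv(x, filters_per_branch=16):
    """MSConv: three parallel convolutions with 1x1, 3x3 and 5x5 receptive
    fields whose outputs are concatenated into a single feature map."""
    branches = [layers.Conv2D(filters_per_branch, k, padding='same')(x)
                for k in (1, 3, 5)]
    # Aggregate the predictions computed at three spatial context sizes.
    return layers.Concatenate()(branches)

# The aggregated maps then feed a 1x1 convolution and a soft-max layer to
# produce the per-pixel class distribution, as in Section 3.1.
inp = tf.keras.Input((224, 224, 64))
logits = layers.Conv2D(11, 1)(msconv(inp))   # e.g. 11 CamVid classes
probs = layers.Softmax()(logits)
```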

4 Experimental results

This section provides an experimental study of our RM-DenseNet. The proposed method was initialized with HeUniform [30] and trained with RMSprop [49] with an initial learning rate of 0.001. It was evaluated on two semantic segmentation datasets: CamVid [18] and Cityscapes [22]. To evaluate the methods on these two datasets, the Mean Intersection over Union (mIoU) metric is used. The IoU measures the similarity between the predicted region and the ground-truth region of an object present in the image, and the mIoU is simply its average over all classes. For a given class c, predictions (pi) and targets (ti), the IoU is defined by:

$$ IoU(c) = \frac{{\sum}_{i}(p_{i}==c \land t_{i}==c)}{{\sum}_{i}(p_{i}==c \lor t_{i}==c)} $$
(1)

where ∧ is a logical and operation and ∨ is a logical or operation. The IoU is computed by summing over all the pixels i of the dataset. Our RM-DenseNet method was implemented using the publicly available TensorFlow Python API [1].
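
For clarity, here is a minimal NumPy sketch of Eq. (1) and its average over classes, assuming integer label maps; skipping classes absent from both prediction and ground truth is a common convention we adopt in this sketch, not a detail stated above:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Eq. (1) computed per class over all pixels i of the dataset,
    then averaged over the classes (mIoU)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, 3))   # (1/1 + 1/2 + 1/2) / 3 = 0.666...
```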

4.1 CamVid dataset

One of the most commonly used semantic segmentation datasets is the Cambridge-driving Labeled Video Database (CamVid) [18], which provides 32 semantic classes. However, in order to compare our system with recent methods [4, 32, 34, 38, 42, 50, 59], only 11 classes have been used in our experiments: sky, building, pole, road, sidewalk, tree, sign, fence, car, pedestrian and cyclist. This dataset contains 701 annotated frames: 367 for training, 233 for testing and 101 for validation. The evaluation of the methods on this dataset is done on the validation set. The size of each frame is 360 × 480. Figure 7 shows samples from the CamVid dataset. Our RM-DenseNet method was trained with image crops of 224 × 224, and our model runs at about 180 ms per image on a GPU.

Fig. 7 Samples from CamVid dataset

For our RM-DenseNet architecture, the appropriate width was determined experimentally in order to reach the best mIoU. We ran our method with three different widths: two, four and six. As can be seen from Table 3, the mIoU of our method saturates when the WDB reaches 4 time steps, with an mIoU of 69.13%. Six time steps gave a lower result than four while being more costly in time, so we did not go further for this dataset. Besides, in comparison with the FC-DenseNet [32] method, making the architecture wider by adding WDBs with four time steps increased the segmentation accuracy by 2.2%.

To further improve the performance of our method, we added an MSConv layer after the last DB, which performs three parallel convolutions with different kernel sizes. The impact of this layer can be seen in Table 4: it gives an mIoU gain of 1.21% over the standard FC-DenseNet [32] and reaches 68.11%. As a result, the WDB and the MSConv layer together contribute significantly to our RM-DenseNet method, which outperforms the FC-DenseNet method [32] by 2.69%.

Table 4 The contribution of WDB and MSConv layer on CamVid evaluation set

Figure 8 shows the mIoU of our method as the number of training steps varies from 1 to 240000. The maximum mIoU score was reached at 225985 steps with a batch size of 256.

Fig. 8 Testing curve for CamVid evaluation set per step number

Table 5 gives the mIoU scores of our method compared with other efficient methods in the literature. In terms of mIoU, ENet [42] gave a lower result than the other methods. FCN-8 [38], an extension of the classification-oriented VGG-16 [47] method, also failed to give acceptable segmentation results, which can be explained by the fact that its spatial invariance does not take useful contextual information into account. Moreover, ReSeg [50], which takes advantage of RNNs, gives low results, below 59%. SegNet [4], which uses a VGG-16 classification-based model with an encoder-decoder design and is substantially deeper than the methods mentioned previously, provided 60.1% mIoU. Despite the improvement obtained by using Bayesian filtering within the Bayesian SegNet [34] method, its result is still limited compared to our RM-DenseNet because of its speed degradation problem. Dilation [59], which incorporates long spatio-temporal regularization into the output of FCN-8 to boost its performance, gave a promising result with 65.3% mIoU. Among the state-of-the-art methods, FC-DenseNet [32], which is based on the DenseNet [27] classification method, gave the best mIoU score (66.9%). That is why our RM-DenseNet method follows the same architecture while using WDBs and integrating an MSConv layer. Our method improves the mIoU score by a substantial margin of 2.69% compared to the FC-DenseNet method and reaches 69.6%. In terms of per-class IoU, our RM-DenseNet outperforms all the other methods on all the classes except for the class ”sign”, where the Dilation [59] method performs better thanks to its dilated convolution operator, which expands the receptive field without losing resolution or coverage. This further proves the effectiveness of our RM-DenseNet.

Table 5 Results on CamVid evaluation set

4.2 Cityscapes dataset

The Cityscapes dataset [22] consists of 5000 images split into three sets: 2975 images for training, 500 for validation and 1525 for testing. The evaluation of the methods on this dataset is done on the validation set. The images have a high resolution of 2048 × 1024 and belong to 19 classes. Figure 9 shows samples from the Cityscapes dataset. Our model runs at about 3 s per image on a GPU due to the large size of the images.

Fig. 9 Samples from Cityscapes dataset

In order to determine the optimal number of time steps for the WDB, we ran our method with two, four and six time steps. The best segmentation accuracy is obtained when the WDB is run with four time steps, which gives an mIoU of 79.50% (see Table 6). Besides, adding the WDBs to the standard FC-DenseNet method significantly increased the mIoU, by 1.18%, which further confirms the contribution of the WDB.

Table 6 mIoU of FC-DenseNet with different time steps of WDB on CityScapes dataset

Table 7 illustrates the impact of both the WDB and the MSConv layer. Similarly to the integration of the WDB into the FC-DenseNet architecture, adding the MSConv layer improved the mIoU by 0.28%. These experimental results confirm the importance of using a wider network as well as the MSConv layer.

Table 7 The contribution of WDB and MSConv layer on CityScapes evaluation set

In order to obtain the best number of training steps for this dataset, our method was run with step numbers ranging from 1 to 500000 (Fig. 10). The optimal mIoU was reached at 431425 steps with a batch size of 256.

Fig. 10 Testing curve for Cityscapes per step number

Table 8 reports a comparative study between our method and the state-of-the-art methods. As on the CamVid dataset, ENet [42] and FCN-8 [38] gave weak results. Moreover, the Dilation [59] method gave a 67.1% mIoU score. Furthermore, different ResNet [31] based models such as DeepLab [21], Wide-ResNet [58] and PSPNet [61] gave 70.4%, 78.4% and 80.2% respectively. Our RM-DenseNet method outperforms all state-of-the-art methods and gives 80.3%, thanks to the use of the WDB and the MSConv layer. This result confirms once more the strengths of our method. In terms of per-class IoU, our RM-DenseNet outperforms all the other methods on all the classes except for the class ”traffic sign”, where the PSPNet [61] method gave a better result due to its capability to embed difficult scenery context features. As a result, our RM-DenseNet has proven its effectiveness and good performance on the Cityscapes dataset.

Table 8 Results on CityScapes dataset

5 Discussion

Our RM-DenseNet method has given very promising results on the two datasets. These results confirm the robustness of our architecture, which includes the WDBs and the MSConv layer. Thanks to the recurrent connectivity inside the WDB, our architecture gains several advantages. Firstly, the recurrent structure allows our network to go back to an earlier time step and pick up information that might otherwise have been forgotten. Secondly, the network becomes deeper with no additional parameters thanks to weight sharing. In addition, the model can grow in width with the same number of parameters, which avoids excessive system complexity and reduces time consumption. Moreover, it enables better accumulation and extraction of information. The results obtained in Section 4 prove the benefit of adding recurrent connections (see Tables 4 and 7).

Moreover, the RM-DenseNet architecture has been enriched with the MSConv layer, which has led to a significant improvement in segmentation accuracy. The MSConv layer aggregates information from three parallel convolutions with different kernel sizes in order to collect different spatial contexts, and it gives our network more flexibility to infer more information. Together, the WDB and the MSConv layer have provided a large gain for our RM-DenseNet method and helped to overcome several limitations of other CNN methods (see Tables 4 and 7).

Compared with recent state-of-the-art methods, our method emulates the visual system of the human brain by integrating recurrent connections, which are abundant in the visual system, inside the wider dense blocks. Besides, it follows a very deep CNN architecture, which has proven its success in recent methods such as SegNet [4], Wide-ResNet [58], PSPNet [61] and FC-DenseNet [32]. Therefore, our RM-DenseNet has significantly improved the segmentation accuracy compared to all reported methods on both the CamVid and Cityscapes datasets (see Tables 5 and 8).

Furthermore, the computational cost of our architecture has been quantified by measuring the processing time per image. For the CamVid dataset, each image requires about 180 ms, since the resolution of the images in this dataset is small; however, it takes 3 seconds per Cityscapes image, since these are large-scale. Besides, as can be seen from Table 5, our RM-DenseNet, despite its deeper and wider architecture, has fewer parameters than all the other methods and is very comparable to FC-DenseNet [32]. This further encourages the use of our model.

The reported measures evaluate inference time on the original architecture. When such an architecture is deployed in a production system, specific hardware optimization is applied to speed up inference. For example, the TensorRT library from NVIDIA applies network pruning and operator fusion. Such an approach is very appropriate for networks based on dense blocks, since the numerous connections can be pruned very efficiently, strongly reducing the number of operations and thus increasing processing speed. However, we do not evaluate such optimized networks, since the optimization is application and hardware dependent, so we rely on the measures of the original architecture.

6 Conclusion and future work

In this paper, we have presented a Reinforced MultiScale fully convolutional DenseNet method which uses wider dense blocks and a multiscale convolutional layer. The wider dense blocks improve the classical dense blocks by increasing the width of the DBs in the encoder part through a recurrent structure. With this recurrent structure, the depth of our network is increased without increasing the number of parameters. Moreover, a multiscale convolutional layer is integrated after the last dense block in order to give a richer contextual prediction and to improve the results. Our method has been experimentally validated on two semantic segmentation benchmarks and has shown good results. In future work, we plan to improve our RM-DenseNet method by further optimizing its architecture.