1 Introduction

Floor plans are widely used in architectural design and construction applications [1]. These drawings typically show a top-down view of the architectural layout of a building. Floor plans can broadly be divided into two categories: simple brochure type (SBT) and complex architectural type (CAT) [2]. Figure 1 shows examples of these two types.

Fig. 1 Examples of floor plan layouts: a simple brochure type (SBT) drawing and b complex architectural type (CAT) drawing

It is laborious and time-consuming for humans to manually extract information such as room area, number of windows, and baseboard length from floor plans. Therefore, automated floor plan analysis has been actively studied over the last few decades [3]. An essential task in automated floor plan analysis is to segment a floor plan into various regions (e.g., bedroom, living room) with correct labels. The semantic segmentation results can be used in applications such as three-dimensional (3D) modeling and construction cost estimation. However, the large quantity of heterogeneous information in a floor plan makes semantic segmentation a challenging task.

The early works on computer-aided floor plan analysis were based on traditional image analysis techniques such as line detection and region-growing segmentation [4]. Most of these techniques assumed that the images are represented in a vector graphics format. Recently, with the advent of deep neural networks (DNNs), DNN-based floor plan analysis has become very popular [5]. Popular DNN architectures such as FCN [6], U-Net [7], and DeepLab [8] have been used for floor plan analysis [5]. However, these networks were developed mostly for natural images and may therefore not be well suited to floor plan images.

The objective of this paper is to propose a deep learning network, henceforth referred to as the FloorNet, which is tailored for robust semantic segmentation of both SBT and CAT floor plans. The FloorNet is an extension of our previous work [9]. It is designed as a multi-task network that has an Encoder to extract the hierarchical features from the floor plan image. These hierarchical features are processed by a room boundary decoder (RBD) and a room type decoder (RTD) to recognize the room-boundary and room-type pixels, respectively. The major contributions of this paper are as follows:

(1) An enhanced multiscale room boundary attention model (MRBAM) is proposed in the FloorNet. It refines the room-type pixels by suppressing noise near the room boundaries, resulting in improved performance.

(2) Improved RBD and RTD modules are developed in the FloorNet by replacing linear-interpolation upsampling with upconvolution, whose learned filters recover spatial details more effectively than fixed interpolation. Since CAT floor plans are more complex, the CNN encoder is also improved by adopting a deeper backbone (i.e., DenseNet) to extract features efficiently.

(3) To the best of the authors' knowledge, this is the first comprehensive study on the automatic analysis of both SBT and CAT floor plan images. Experimental results show that the proposed FloorNet provides superior segmentation performance for both types of floor plans.

The organization of the paper is as follows. Section 2 presents a review of the floorplan segmentation literature. The proposed technique is presented in Sect. 3. The performance of the proposed technique is evaluated and compared with existing techniques in Sect. 4. Section 5 presents a discussion on the architecture of the proposed network, limitations of this study and future research works, followed by the conclusions in Sect. 6.

2 Related works

In this section, we present a review of literature on automated segmentation methods for floor plan images.

2.1 Traditional floor plan analysis

Macé et al. [10] studied floor plan analysis in which the walls, represented by thick lines, are first extracted from the components of the vector graphic by coupling the Hough transform [11] with image vectorization. The rooms are then segmented by recursive decomposition of the wall borders until convex-shaped regions are found. However, the reported room segmentation accuracy is low, and the detected rooms are not labeled.

Ahmed et al. [12] used an idea similar to [10] to detect rooms and introduced new ideas for wall detection. Thick and medium lines are detected as walls, and thin lines are considered parts of symbols (e.g., windows). Doors and windows are detected from the symbols using speeded up robust features (SURF) [13]. The rooms are detected using a similar idea as [10] and are labeled using the text. Experimental results with SBT floor plans show a performance improvement over [10], but the technique may not perform well on complex architectural floor plans, where not every line is a wall or a symbol.

The contextual relationships between floor plan elements are usually not explicitly encoded in vector floor plans [5]. For example, doors and windows are usually embedded between wall segments, and the kitchen is generally near the dining room. Analyzing a floor plan without considering these contextual relationships is error-prone, as each element constrains the others. Deep learning-based floor plan analysis has become popular in the last decade and has achieved state-of-the-art performance. A review of DNN-based floor plan analysis methods is presented in the next section.

2.2 Deep learning based floor plan analysis

Most DNN-based floor plan analysis methods are built on convolutional neural networks (CNNs), which have become very popular in image analysis, segmentation, and classification.

Dodge et al. [14] were among the earliest to use deep learning-based methods to analyze SBT floor plans. In their proposed pipeline, an FCN [6] is used to segment the wall pixels. Yamasaki et al. [15] also presented an end-to-end FCN to analyze apartment floor plans. Compared to the work of Dodge et al. [14], a total of 12 different room-type classes can be detected as the output of their semantic segmentation model.

Yang et al. [16] proposed a U-Net based technique for semantic segmentation of walls and doors in CAT floor plans. Experimental results demonstrated the superiority of a CNN-based approach in handling complex drawings. In Jang et al. [17], a DarkNet53-based encoder-decoder (DED) network was used to segment walls and doors in CAT floor plans. In this network, the final average pooling layer, fully connected layer, and softmax layer of DarkNet53 were removed because only two object classes (wall and door) were considered in their model.

Attention mechanisms have been successfully used in various computer vision tasks, such as facial expression recognition [18], saliency detection [19], and crowd counting [20]. Typically, the operation selects the most useful features for classification and outputs the final features by weighting the target maps with the attention maps.

Zeng et al. [21] presented a method to recognize diverse floor plan elements using a deep multitask neural network where VGG is used to extract the features from the input image. The room boundary and the room type predictions are treated as different tasks in their network. A room-boundary-guided attention (RBGA) mechanism for floor plan analysis is implemented using a spatial contextual module to explore the spatial relations between the boundary and the room elements (henceforth referred to as the RBGA-CNN technique). It has been shown that the RBGA mechanism improves the overall accuracy by approximately 4%.

3 Proposed technique

Recently, we reported a CNN-based network with an efficient boundary attention aggregated model (BAAM-CNN) [9]. This model showed promising results on SBT images. The FloorNet proposed in this paper is an improvement of that work, in which various modules of [9] have been enhanced and a thorough performance evaluation is conducted on both SBT and CAT datasets.

In this section, the proposed FloorNet is presented. The schematic of the overall architecture is shown in Fig. 2. The architecture consists of five modules: (i) CNN encoder, (ii) Room boundary decoder (RBD), (iii) Room type decoder (RTD), (iv) Multiscale room boundary attention model (MRBAM), and (v) Floorplan classification (FC). The details of these modules are presented in the following.

Fig. 2 Schematic of the proposed FloorNet architecture

3.1 CNN encoder

The floor plan analysis starts with the CNN encoder, which includes five convolution blocks (as shown in Fig. 2). The purpose of the encoder module is to generate feature maps from an input floor plan image, which are then used by the subsequent modules for semantic segmentation. In this work, we have used VGG16 [22], ResNet34 [23], and DenseNet121 [24] as candidates for the encoder backbone for extracting floor plan features. The relative performance of these architectures is presented in Sect. 4. Table 1 shows the size of the encoder feature maps E1-E5 corresponding to the VGG16, ResNet34, and DenseNet121 architectures, and identifies the layers from which these feature maps are obtained.

Table 1 The size of feature maps E1-E5 (assuming an input image with size 512 × 512) in Fig. 2. Rows 3, 5 and 7 show the layers from which these feature maps are obtained from the VGG16, ResNet34 and DenseNet121 architectures

The input images are resized to 512 × 512 pixels and fed into the encoder. The five consecutive convolution blocks generate the feature maps E1, E2, E3, E4, and E5 (the size and depth of these feature maps are shown in Table 1).
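
As an illustration of how the hierarchical features E1-E5 can be tapped from an off-the-shelf backbone, the following is a minimal PyTorch sketch assuming the torchvision VGG16 backbone with taps placed after each max-pooling layer; the actual tap layers used in the FloorNet are those listed in Table 1.

```python
# A minimal sketch of multiscale feature extraction, assuming the torchvision
# VGG16 backbone with E1-E5 taken after each max-pool layer (the tap points
# actually used in the FloorNet are the layers listed in Table 1).
import torch
import torchvision

class VGG16Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = torchvision.models.vgg16(weights=None).features
        # Indices of the five max-pool layers in torchvision's VGG16 "features".
        self.tap_indices = {4, 9, 16, 23, 30}

    def forward(self, x):
        taps = []
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in self.tap_indices:
                taps.append(x)   # E1 ... E5, each half the spatial size of the previous
        return taps

encoder = VGG16Encoder()
image = torch.randn(1, 3, 512, 512)          # resized floor plan image
e1, e2, e3, e4, e5 = encoder(image)
print([tuple(t.shape) for t in (e1, e2, e3, e4, e5)])
# -> spatial sizes 256, 128, 64, 32 and 16 for a 512 x 512 input
```

The same pattern applies to ResNet34 or DenseNet121 by tapping the outputs of their residual or dense blocks instead; the deeper DenseNet121 backbone is the one that gives the best results in Sect. 4.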

The feature maps E1-E5 are shared by two parallel branches, i.e., the RBD and the RTD modules which are discussed next.

3.2 Room-boundary decoder (RBD)

After the feature maps E1-E5 are extracted by the CNN encoder, the room boundary decoder uses these features to predict the room boundaries. The function of the RBD module is to generate feature maps for the room boundary. Based on the output feature maps, the classification module will classify the boundary pixels into three classes: background, wall, and door/window.

Figure 3 shows the schematic diagram of the RBD unit. The Feature-1 input refers to the features coming from the CNN encoder (E1, E2, E3, E4) and Feature-2 input refers to the intermediate learned features (B4, B3, B2) coming from the preceding RBD unit (except for the first RBD unit for which Feature-2 is the feature map E5 from the CNN encoder). The size of the RBD unit feature maps B1-B4 is shown in Table 2.

Fig. 3 Schematic of the room boundary decoder unit. The "+" denotes elementwise addition. \(C_{2}\) equals \(2C_{1}\) for the B4 layer and \(C_{1}\) for the other layers in the decoder. \(C_{o}\) equals \(C_{1}\) for the B4 layer and \(C_{1}/2\) for the other layers in the decoder. BN denotes batch normalization. (k, s) refers to (kernel size, stride value)

Table 2 The size of various feature maps B1-B4, R1-R4, M1-M4 and C1-C3 in Fig. 2. Note that Nc denotes the total number of segmentation classes

As shown in Fig. 3, Feature-2 of an RBD unit is always half the size of Feature-1 in both dimensions. Therefore, Feature-2 is upsampled before being added to the convolution output of Feature-1. The sum of the filtered Feature-1 and the upsampled Feature-2 passes through another convolution layer to learn the fused features, and a batch normalization layer (denoted as "BN") is used to stabilize the learning process. Note that UpConv2D denotes upconvolution with filters of size 4 × 4.
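
The following is a minimal PyTorch sketch of one RBD unit following the structure of Fig. 3; the stride-2 transposed convolution used for UpConv2D, the ReLU activation, and the channel widths in the example are assumptions made for illustration.

```python
# A minimal sketch of one RBD unit (Fig. 3). The stride-2 transposed convolution
# used for UpConv2D, the ReLU activation, and the channel widths in the example
# are assumptions for illustration.
import torch
import torch.nn as nn

class RBDUnit(nn.Module):
    def __init__(self, c1, c2, c_out):
        super().__init__()
        # Filter Feature-1 (the encoder feature E1-E4) to c_out channels.
        self.conv1 = nn.Conv2d(c1, c_out, kernel_size=3, padding=1)
        # UpConv2D: 4 x 4 transposed convolution that doubles the spatial size of
        # Feature-2 (the preceding RBD output, or E5 for the first unit).
        self.upconv = nn.ConvTranspose2d(c2, c_out, kernel_size=4, stride=2, padding=1)
        # Learn the fused feature, then stabilize training with batch normalization.
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, feature1, feature2):
        fused = self.conv1(feature1) + self.upconv(feature2)   # elementwise "+"
        return torch.relu(self.bn(self.conv2(fused)))

# Example: the first RBD unit fuses E4 (32 x 32) with E5 (16 x 16) to produce B4 (32 x 32).
e4 = torch.randn(1, 512, 32, 32)
e5 = torch.randn(1, 512, 16, 16)
b4 = RBDUnit(c1=512, c2=512, c_out=512)(e4, e5)
print(tuple(b4.shape))   # (1, 512, 32, 32)
```

Stacking four such units, each taking the next shallower encoder feature as Feature-1, yields the feature maps B4 down to B1.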

3.3 Room-type decoder (RTD)

The function of the room type decoder is to predict the room type. The architecture of the RTD is similar to that of the RBD and includes four RTD units. The schematic of an RTD unit is similar to that of the RBD unit shown in Fig. 3. However, unlike the RBD unit, the bottom input of an RTD unit (as shown in Fig. 2) comes from the CNN encoder and the top input comes from the preceding MRBAM unit (except for the first RTD unit, whose top input comes from the CNN encoder).

3.4 Multiscale room boundary attention model (MRBAM)

The function of the MRBAM module is to combine the RBD and RTD features and perform the semantic segmentation (i.e., to predict the room type of each pixel). As shown in Fig. 2, the MRBAM has four identical units. An MRBAM unit takes the feature maps from the room boundary and room type decoders as inputs. There are four different levels in the MRBAM that process the room boundary features and room type features at different scales. Figure 4a shows the schematic of an MRBAM unit, which has three inputs. The top input (\(T_{b}\)) is the room boundary feature map coming from a room-boundary decoder and the bottom input (\(T_{r}\)) is the room type feature map from a room-type decoder. The middle input (\(T_{p}\)) is the intermediate feature map coming from the preceding MRBAM unit (for the first MRBAM unit, this input comes from the CNN encoder).

Fig. 4 Schematic diagrams of a the MRBAM unit and b the Conv Unit. \(T_{b}\) is the room-boundary feature, \(T_{p}\) is the output feature of the preceding MRBAM unit, and \(T_{r}\) is the room-type feature. \(N\) is the number of features (which is 1 in this work). The weights \(w_{1}, w_{2}, w_{3}, w_{4}, w_{5}\) are applied to the five feature maps before the concatenation operation

The three inputs (\(T_{b}, T_{p}, T_{r}\)) pass through several modules (e.g., the Conv Unit) and are finally concatenated at the Concat module. There are five inputs to the Concat module, which produces the output \(T_{c}(N, W, H, 5C)\), where N is the number of feature maps, W is the width, H is the height, and C is the number of channels. The details of each of these five inputs are presented below.

(a) The first input to the Concat module is \(w_{1} T_{b} (N,W,H,C)\).

(b) The second input to the Concat module is \(w_{2} T_{b2}\), where

    $$ T_{b2} = Conv\;Unit\{ T_{b} (N,W,H,C) \otimes T_{r} (N,W,H,C)\} $$
    (1)

Note that \(\otimes\) is the elementwise multiplication operator. Figure 4b shows the schematic of the Conv Unit block, which stacks \(n\) \(3\times 3\) convolutions and one \(1\times 1\) convolution; here \(n = 2\). An example of the \(T_{b2}\) visualization is shown in Fig. 5a. The grayscale feature maps are obtained by averaging the feature maps across the depth and then normalizing between 0 and 255.

Fig. 5 An example of gray-scale visualization of feature transformation in the MRBAM unit. a \(T_{b2}\), b \(T_{r1}\), and c \(T_{out}\) refer to the feature maps denoted in Fig. 4a

(c) The third input to the Concat module is \(w_{3} T_{p1}\), where

    $$ T_{p1} = Up\;Conv\;2d\{ T_{p} (N,W/2,H/2,C)\} $$
    (2)
(d) The fourth input to the Concat module is \(w_{4} T_{r5}\), where

    $$ T_{r5} = Conv\;2d\{ T_{r3} \otimes T_{b2} \} $$
    (3)
    $$ T_{r3} = BRB\;Conv\{ T_{r1} \otimes T_{b2} \} $$
    (4)
    $$ T_{r1} = Conv\;Unit\{ T_{r} (N,W,H,C)\} $$
    (5)

An example of the \(T_{r1}\) visualization is shown in Fig. 5b. Note that BRBConv(·) is a boundary-refinement-block convolutional layer used to refine the room boundary features. Its kernels are square matrices of size \(M \times M\), where M is an odd integer; the kernel size in the BRBConv is one quarter of the input feature size. A kernel example with M = 17 is shown in Fig. 6. The matrix elements are ones along the horizontal, vertical, and diagonal directions through the center, and the center element is four.

Fig. 6 Schematic diagram of the BRB kernel for M = 17

(e) The fifth input to the Concat module is \(w_{5} T_{r} (N,W,H,C)\).

The five inputs are concatenated by an MRBAM unit. In this work, we use the input weights \([w_{1}, w_{2}, w_{3}, w_{4}, w_{5}] = [1, 1, 1, 7, 1]\), which give the best performance. The output \(T_{c}(N, W, H, 5C)\) of the concatenation layer is passed through a convolutional layer to reduce the depth from \(5C\) to \(C\). An example of the \(T_{out}\) visualization is shown in Fig. 5c.
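
To make the data flow of Eqs. (1)-(5) concrete, the following is a minimal PyTorch sketch of one MRBAM unit. The ReLU activations, the depthwise implementation of BRBConv with a fixed directional kernel, and the exact convolution hyperparameters are assumptions for illustration; only the operations named in the equations, the weighted five-way concatenation, and the BRB kernel pattern of Fig. 6 come from the text above.

```python
# A minimal sketch of one MRBAM unit (Fig. 4a, Eqs. (1)-(5)). The ReLU activations,
# the depthwise implementation of BRBConv with a fixed directional kernel, and the
# convolution hyperparameters are assumptions; the operations named in the equations,
# the weights [1, 1, 1, 7, 1], and the BRB kernel pattern of Fig. 6 follow the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

def brb_kernel(m: int) -> torch.Tensor:
    """M x M kernel with ones along the horizontal, vertical and both diagonal
    lines through the centre, and 4 at the centre element (Fig. 6)."""
    k = torch.zeros(m, m)
    c = m // 2
    k[c, :] = 1.0                       # horizontal line
    k[:, c] = 1.0                       # vertical line
    idx = torch.arange(m)
    k[idx, idx] = 1.0                   # main diagonal
    k[idx, m - 1 - idx] = 1.0           # anti-diagonal
    k[c, c] = 4.0                       # centre element
    return k

class ConvUnit(nn.Module):
    """Fig. 4b: n (= 2) 3 x 3 convolutions followed by one 1 x 1 convolution."""
    def __init__(self, channels, n=2):
        super().__init__()
        layers = []
        for _ in range(n):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, channels, 1))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class MRBAMUnit(nn.Module):
    def __init__(self, channels, feature_size, weights=(1, 1, 1, 7, 1)):
        super().__init__()
        self.w = weights
        self.channels = channels
        self.conv_unit_b = ConvUnit(channels)      # produces T_b2, Eq. (1)
        self.conv_unit_r = ConvUnit(channels)      # produces T_r1, Eq. (5)
        self.upconv_p = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.conv_r5 = nn.Conv2d(channels, channels, 3, padding=1)
        self.reduce = nn.Conv2d(5 * channels, channels, 1)   # 5C -> C after Concat
        # BRBConv: fixed depthwise kernel whose size M is roughly one quarter of the
        # feature size, rounded to an odd integer (M = 17 for a 64 x 64 feature map).
        m = feature_size // 4 + (1 - (feature_size // 4) % 2)
        self.register_buffer("brb", brb_kernel(m).expand(channels, 1, m, m).clone())
        self.brb_pad = m // 2

    def forward(self, t_b, t_p, t_r):
        t_b2 = self.conv_unit_b(t_b * t_r)                       # Eq. (1)
        t_p1 = self.upconv_p(t_p)                                # Eq. (2)
        t_r1 = self.conv_unit_r(t_r)                             # Eq. (5)
        t_r3 = F.conv2d(t_r1 * t_b2, self.brb,                   # Eq. (4), BRBConv
                        padding=self.brb_pad, groups=self.channels)
        t_r5 = self.conv_r5(t_r3 * t_b2)                         # Eq. (3)
        t_c = torch.cat([self.w[0] * t_b, self.w[1] * t_b2, self.w[2] * t_p1,
                         self.w[3] * t_r5, self.w[4] * t_r], dim=1)
        return self.reduce(t_c)                                  # T_out with depth C

# Example: fuse 64 x 64 boundary/room-type features with the 32 x 32 preceding output.
unit = MRBAMUnit(channels=256, feature_size=64)
t_out = unit(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32),
             torch.randn(1, 256, 64, 64))
print(tuple(t_out.shape))   # (1, 256, 64, 64)
```

For a 64 × 64 input feature map, the sketch uses M = 17, matching the example in Fig. 6.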

3.5 Floorplan classification (FC)

The function of the FC module is to predict the final floor plan semantic segmentation result based on the feature maps from the RBD and MRBAM modules.

In the FC module, the inputs B1 and M1 pass through the U modules, resulting in the outputs C1 and C2, respectively. As shown in Table 2, C1 has the same size as the original input image, and its depth is 3, corresponding to the prediction of the background, wall, and door/window classes. Similarly, C2 has the same size as the original input image, and its depth is (Nc-2), corresponding to the prediction of all segmentation classes excluding wall and door/window.

The C1 and C2 represent the probability values of different pixel classes. For each pixel location, C1 provides the probability of {background, wall, door/window} classes. On the other hand, for each pixel location, C2 provides the probability of 7 (for R3D dataset) or 9 (for CAFP dataset) classes. The details about these datasets are presented in Sect. 4. Note that C2 does not provide the probabilities of the window/door and wall classes as these probabilities are provided by C1.

The Room Merging (RM) module combines C1 and C2 and generates the final semantic segmentation result for a pixel at location (x, y) using the following approach.

1. Consider the C1 and C2 values at the (x, y) location.

2. Consider C1 first. If the pixel is of type door/window or wall, it is classified as door/window or wall, respectively. If the pixel is of type background, its class is determined by the C2 values.
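
A minimal sketch of this merging rule is shown below, assuming C1 and C2 are per-pixel probability maps and assuming illustrative class indices (0 = background, 1 = wall, 2 = door/window); the actual label encoding of the datasets may differ.

```python
# A minimal sketch of the Room Merging rule, assuming C1 and C2 are per-pixel
# probability maps of shape (3, H, W) and (Nc-2, H, W), and assuming the
# illustrative class indices 0 = background, 1 = wall, 2 = door/window.
import numpy as np

def merge_rooms(c1: np.ndarray, c2: np.ndarray) -> np.ndarray:
    boundary = c1.argmax(axis=0)        # 0: background, 1: wall, 2: door/window
    room = c2.argmax(axis=0) + 3        # room-type labels placed after the boundary classes
    # Wall and door/window come from C1; background pixels take their label from C2.
    return np.where(boundary > 0, boundary, room)

c1 = np.random.rand(3, 512, 512)        # boundary probabilities
c2 = np.random.rand(7, 512, 512)        # e.g. 7 room-type classes for the R3D dataset
labels = merge_rooms(c1, c2)
print(labels.shape, labels.min(), labels.max())
```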

4 Experiments

4.1 Datasets

In this paper, two datasets are used to evaluate the performance of the proposed technique: (1) the R3D dataset [21] and (2) the complex architecture type floor plan (CAFP) dataset. Figure 1a shows an image example from the R3D dataset and Fig. 1b shows an image from the CAFP dataset. Both datasets have pixel-wise ground truth labels for training, validation, and testing. The R3D dataset has 232 images, each of size 512 × 512 pixels. In the CAFP dataset, a total of 80 floor plan images, each of size 3400 × 2200 pixels, have been obtained from local house-builder collaborators and are manually annotated to generate pixel-wise ground truth images. The CAFP dataset is expanded eightfold by augmentation: (i) the original image, (ii) rotations of the original image by 90°, 180°, and 270°, and (iii) up-down flips of the four images from (i) and (ii). The augmented CAFP dataset therefore includes 640 images.
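
The eightfold augmentation can be reproduced with a few lines of NumPy, as in the following sketch (the array shape used in the example is illustrative).

```python
# A minimal NumPy sketch of the eightfold augmentation: the original image, its
# rotations by 90, 180 and 270 degrees, and an up-down flip of each of the four.
import numpy as np

def augment_eightfold(image: np.ndarray) -> list:
    rotations = [np.rot90(image, k, axes=(0, 1)) for k in range(4)]   # 0, 90, 180, 270 degrees
    flips = [np.flipud(r) for r in rotations]                         # up-down flips
    return rotations + flips

plan = np.random.randint(0, 256, size=(2200, 3400, 3), dtype=np.uint8)  # one CAFP-sized image
augmented = augment_eightfold(plan)
print(len(augmented))   # 8
```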

4.2 Network training

As shown in Fig. 2, there are 5 modules in the proposed schematic. Each of these five modules includes CNNs that require training. The weights of the CNNs in the five modules are updated (during training) to minimize the overall loss function of the whole network.

The proposed technique has two tasks, i.e., room boundary prediction and room type prediction. The contributions of the two tasks for this network are balanced by the following weighted loss function:

$$Loss={w}_{b}{L}_{b}+{w}_{r}{L}_{r}$$
(6)

where \({L}_{b}\) is the loss function for boundary prediction and \({L}_{r}\) is the loss function for room type prediction. The weights are calculated as follows:

$${w}_{b}=\frac{{N}_{r}}{{N}_{b}+{N}_{r}} \, \text{and} \, {w}_{r}=\frac{{N}_{b}}{{N}_{b}+{N}_{r}}$$
(7)

where \(N_{b}\) and \(N_{r}\) are the numbers of boundary pixels and room pixels, respectively. In Eq. (6), the loss function for a specific task (i.e., \(L_{b}\) or \(L_{r}\)) is defined by:

$$ L_{task} = - \mathop \sum \limits_{i = 1}^{N_{c}} \frac{N - N_{i}}{\mathop \sum \nolimits_{j = 1}^{N_{c}} \left( N - N_{j} \right)} \, y_{i} \log \left( p_{i} \right) $$
(8)

where \(N\) is the total number of ground-truth pixels, \(N_{i}\) is the number of ground-truth pixels of class \(i\), \(N_{c}\) is the number of classes for the task, \(y_{i}\) is the label for class \(i\), and \(p_{i}\) is the predicted probability of class \(i\).
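
The following is a minimal PyTorch sketch of the weighted loss in Eqs. (6)-(8), assuming integer label maps and logits of shape (B, Nc, H, W); computing the class weights from the batch at hand and averaging the weighted cross-entropy over pixels are implementation assumptions.

```python
# A minimal sketch of the weighted multi-task loss in Eqs. (6)-(8), assuming logits
# of shape (B, Nc, H, W) and integer label maps; computing the class weights from the
# batch at hand and averaging over pixels are implementation assumptions.
import torch
import torch.nn.functional as F

def balanced_task_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    num_classes = logits.shape[1]
    counts = torch.bincount(target.flatten(), minlength=num_classes).float()   # N_i per class
    total = counts.sum()                                                        # N
    weights = (total - counts) / (total - counts).sum()                         # Eq. (8) weights
    return F.cross_entropy(logits, target, weight=weights)

def floorplan_loss(boundary_logits, boundary_gt, room_logits, room_gt):
    n_b = (boundary_gt > 0).sum().float()   # number of boundary (wall, door/window) pixels
    n_r = (room_gt > 0).sum().float()       # number of room pixels
    w_b, w_r = n_r / (n_b + n_r), n_b / (n_b + n_r)                             # Eq. (7)
    return w_b * balanced_task_loss(boundary_logits, boundary_gt) \
        + w_r * balanced_task_loss(room_logits, room_gt)                        # Eq. (6)

# Example with random tensors (3 boundary classes, 7 room-type classes).
b_logits, r_logits = torch.randn(1, 3, 64, 64), torch.randn(1, 7, 64, 64)
b_gt, r_gt = torch.randint(0, 3, (1, 64, 64)), torch.randint(0, 7, (1, 64, 64))
print(floorplan_loss(b_logits, b_gt, r_logits, r_gt))
```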

The proposed network is trained on Google Colab with a high-RAM GPU runtime. The Adam optimizer is used, and the network is trained for 210 epochs with a batch size of 1. Table 3 shows the dataset setup for the R3D and the augmented CAFP datasets in the training, validation, and testing stages. As shown in Table 3, each floor plan in R3D is segmented into 9 categories. Since the CAFP floor plans contain more information than the R3D floor plans, the CAFP floor plans are segmented into 11 categories.

Table 3 The dataset for performance evaluation and the classification categories

4.3 Performance metrics

In this paper, we use the Intersection over Union (IoU) as the metric to evaluate the semantic segmentation performance. The IoU of class \(i\) is defined as follows:

$${IoU}_{i}=\frac{{S}_{I}}{{S}_{U}}$$
(9)

where \(S_{I}\) is the intersection area of the predicted segmentation and the ground truth for class \(i\), and \(S_{U}\) is the union area of the predicted segmentation and the ground truth for class \(i\). As the semantic segmentation involves more than two classes, the mean IoU (mIoU), defined below, is used as the overall performance metric.

$$mIoU=\frac{\sum_{i}{IoU}_{i}}{{N}_{c}}$$
(10)
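
For reference, a minimal NumPy sketch of the per-class IoU and mIoU computation from integer-labelled prediction and ground-truth masks is given below.

```python
# A minimal NumPy sketch of per-class IoU (Eq. (9)) and mIoU (Eq. (10)) computed
# from integer-labelled prediction and ground-truth masks.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()   # S_I for class c
        union = np.logical_or(pred == c, gt == c).sum()           # S_U for class c
        ious.append(intersection / union if union > 0 else 0.0)
    return float(np.mean(ious))

pred = np.random.randint(0, 9, size=(512, 512))
gt = np.random.randint(0, 9, size=(512, 512))
print(mean_iou(pred, gt, num_classes=9))
```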

4.4 Performance evaluation

The VGGNet, ResNet and DenseNet are widely used in the literature as CNN backbones for extracting features. Table 4 shows the performance of the proposed FloorNet using VGG16, ResNet34 and DenseNet121 as the encoder module. It is observed that for the R3D dataset, the mIoU of the DenseNet121-based network is 69%, which is 10% higher than that of the VGG16 network. For the CAFP dataset, the mIoU of the DenseNet121 network is 60%, which is 9% higher than that of the VGG16 network.

Table 4 Performance of the proposed FloorNet (in Fig. 2) with VGG16, ResNet34 and DenseNet121 models in the CNN encoder module

Figures 7 and 8 show a visual comparison of the floor plan recognition results produced by our method on the R3D and CAFP datasets, respectively. The figures show that the DenseNet121-based network performs better than the VGG16-based and ResNet34-based networks, as its predictions contain less noise in large spaces.

Fig. 7 Visual comparison of floor plan segmentation results produced by the proposed method for an image from the R3D dataset: a original image, b ground truth, c prediction of the VGG16-based network, d prediction of the ResNet34-based network, e prediction of the DenseNet121-based network

Fig. 8 Visual comparison of floor plan segmentation results produced by the proposed method for an image from the CAFP dataset: a original image, b ground truth, c prediction of the VGG16-based network, d prediction of the ResNet34-based network, e prediction of the DenseNet121-based network

Table 5 shows the performance comparison between the RBGA-CNN, the DED, and the proposed DenseNet121-based network for the R3D and CAFP datasets. For the R3D dataset, the mIoU of the proposed network is 24% and 15% higher than that of the DED and the RBGA-CNN, respectively. For the CAFP dataset, the proposed technique also performs better than the DED and RBGA-CNN methods.

Table 5 Performance comparison of the proposed technique with the state-of-the-art techniques DED [17] and RBGA-CNN [21]. The last row shows the performance of the proposed technique using the DenseNet121 encoder

Note that, owing to the attention mechanism, the RBGA-CNN and the proposed network provide a significant performance improvement over the DED model. Figure 9 shows the training loss for the RBGA-CNN and the proposed network. Although both techniques use the attention mechanism, the proposed technique achieves a lower loss during training.

Fig. 9 Training loss of the RBGA-CNN and the proposed DenseNet121-based technique for the a R3D dataset and b CAFP dataset

Note that all experimental evaluations were performed in the Google Colab high-RAM GPU environment. The inference time of the proposed FloorNet is approximately 65–75 ms per image, indicating a relatively low computational requirement at test time.

5 Discussion

In this section, an ablation study of various modules proposed in the FloorNet is presented. We then discuss the reasons why the MRBAM in the FloorNet is beneficial for the room type prediction. Finally, a few limitations of the FloorNet and future works are discussed.

5.1 Analysis on the decoder and attention modules

As mentioned earlier, the objective of this work is to propose a CNN-based network that is robust for semantic segmentation of both SBT and CAT floor plan types. We first reported the BAAM-CNN in [9] for the semantic segmentation of SBT floor plans by improving the RBGA-CNN network [21]. In this paper, we propose the FloorNet by further enhancing different modules of the BAAM-CNN. In this section, we present a detailed ablation study to show the improvements caused by these modifications.

Table 6 shows the experimental results of the ablation study. FloorNet-a is a variant of FloorNet where VGG in the Encoder module of RBGA-CNN is replaced by the ResNet. FloorNet-b (i.e., BAAM-CNN) is an improved version of FloorNet-a where the RBGA attention module is upgraded to BAAM. FloorNet-c is an improved version of FloorNet-b where ResNet in the Encoder module is replaced by the DenseNet. FloorNet-d is an improved version of FloorNet-c where the upsampling is done by using the UpConv2d operation instead of using the linear interpolation (see Fig. 3). The proposed FloorNet includes all modifications on the CNN Encoder, RBD, RTD and the attention module. The results show that the mIoU of the network is enhanced when the new encoder, the improved decoders and attention module are introduced.

Table 6 Ablation study on the improvements offered by the enhancements proposed in various modules based on the CAFP dataset. Note that the superscript v1 refers to the modules in [9, 21] and v2 refers to the version presented in Sect. 3.2 and 3.3 of this paper

5.2 Analysis of the attention mechanism

The ablation study results (in Table 6) show that the attention module is beneficial for the floor plan segmentation. This section presents a qualitative analysis of the reasons.

Figure 10 shows an example (from the CAFP dataset) of gray-scale visualization of the feature transformation in the fourth MRBAM unit (the MRBAM consists of four such units, as shown in Fig. 2). The grayscale feature maps are obtained by averaging the feature maps across the depth and then normalizing between 0 and 255. Figure 10a shows the well-learned room boundary (RB) feature map that is one input to this MRBAM unit. Figure 10b shows the room type (RT) feature map that is another input. In each room (e.g., the red box), the pixels in the center area differ from the pixels near the boundaries, indicating an inconsistent prediction of the room type. The MRBAM fuses the feature maps from the two tasks (i.e., RB prediction and RT prediction), e.g., Fig. 10a and b, through element-wise multiplication. The directional kernels are then used to process the fused features, since the room boundaries in a floor plan are not only horizontal or vertical (see Fig. 10c). As shown in Fig. 10d, the well-predicted room boundary helps suppress the noise in the room-type pixels near the room boundaries, resulting in a uniform and improved room type feature map.

Fig. 10 An example of gray-scale visualization of feature transformation in an MRBAM unit. a Tb, b Tr, c Tr3, and d Tout refer to the feature maps denoted in Fig. 4a

5.3 Limitations and future works

The limitations and future works of this study are summarized below.

First, floor plans, especially CAT floor plans, typically contain a large quantity of heterogeneous information. Resizing the input floor plan image to 512 × 512 for the CNN reduces the resolution of the floor plan elements. A thorough investigation of the input image size might yield better performance; however, a high-resolution input may lead to memory issues and a high computational cost. Recently, vision transformers (ViTs) have emerged as a promising method for computer vision tasks with high computational efficiency and scalability [25]. The patch-based scheme of ViTs may help mitigate the memory and computational requirements, so floor plan segmentation may be improved by using ViTs with high-resolution inputs.

Second, the MRBAM module is developed without considering the contextual information among channels. Prior studies [26,27,28] found that channel contextual information (CCI) can significantly improve semantic segmentation performance. A CCI module could be developed to fine-tune the feature maps from the CNN encoder.

6 Conclusions

In this paper, an efficient technique, namely the FloorNet, is proposed by developing a multiscale room boundary attention model (MRBAM). The FloorNet starts with an enhanced encoder based on DenseNet121. The output feature maps of the encoder are shared by two parallel branches, i.e., room boundary prediction and room type prediction. The MRBAM combines the room boundary features and the room type features at different scales. Each MRBAM unit uses the well-predicted room boundary features to fine-tune the room type features. The learned feature of an MRBAM unit is passed to the next-level convolution layer, through which the room type prediction is improved by the attention mechanism. The proposed technique is evaluated on two types (SBT and CAT) of floor plan images. Experimental results show that the proposed technique achieves superior performance compared to state-of-the-art methods for both floor plan types.