1 Introduction

Freehand sketching is one of the most intuitive and convenient communication ways, which has been popular since ancient times. In recent years, people has given rise to the creations of sketches by this way, with the popularity of touch-screen devices (e.g. tablets and smartphones) and the emergence of drawing programs. Obviously different from natural images, sketches are composed of several strokes drawn by humans instead of being captured by cameras. Accordingly, the contents of sketches have the sparsity and present multiple levels of abstraction and different drawing styles. At present, the studies related to sketches mainly include sketch recognition [1,2,3], sketch-based image retrieval [4,5,6], sketch-based 3D object retrieval [7, 8], sketch semantic segmentation [9, 10], sketch caption [11] and sketch synthesis [12, 13]. Among them, sketch semantic segmentation is the understanding for a sketch from the fine-grained perspective and part perspective, which is of great help to analyze the content of the sketch and plays an important role in the realization of other sketch tasks. Sketch semantic segmentation facilitates the discovery of the abstract drawing mechanism and the development of human vision perceptron. Besides, sketch semantic segmentation can automatically and efficiently label part or whole semantics to provide users with a friendly human-computer interaction manner in the computer-aided design system [14, 15]. As a preprocessing step, sketch semantic segmentation can also assist users to accomplish fine-grained retrieval and to improve the performance of sketch tasks. Figure 1 shows the application of sketch semantic segmentation to fine-grained image or 3D retrieval.

Fig. 1
figure 1

The applications of sketch semantic segmentation. Humans can retrieve some images and 3D objects according to the labeling details or parts. One possible solution is that we use sketch semantic segmentation method to obtain the important concerned semantics and then improve the ranking of images with the concerned semantics in the retrieval results

Early sketch semantic segmentation mainly depends on hand-crafted features and traditional complicated models such as radial basis functions [16], graph models [17] and conditional random field (CRF) [18]. These methods handle with large varieties in the appearances of sketches difficultly, with the limited representative power and high computational cost. Consequently, researchers begin to introduce deep neural network into sketch semantic segmentation task. According to the specific formats of sketch inputs, these methods are divided into three categories including sequence-based methods, image-based methods and point-based methods. The sequence-based methods [19,20,21] regard a sketch as a stroke sequence to encode the sketch based on a sequence model such as a recurrent neural network (RNN). The image-based methods [22, 23] directly transform a sketch semantic segmentation issue into an image semantic segmentation issue, which is solved by establishing convolution neural network (CNN) combined with sketch characteristics. This kind of methods treat a sketch as a static image, so they discard stroke information. The point-based methods [10, 24, 25] regard a sketch as a set of points and generate a point-wise feature map for semantic segmentation by continuously aggregating local features and establishing an encoder–decoder structure based on these features. For convenience, we call the component that can continuously aggregate local features as a local feature aggregation module. Obviously, the local feature aggregation module is significant to solve the sketch semantic segmentation issue. However, the common local feature aggregation modules perceive the surrounding regions of the sampling points based on Euclidean distance, which ignores the internal structures of strokes and the continuity of strokes. Therefore, we design a novel local feature aggregation (NLFA) module to perceive the surrounding points by defining “stroke distance”. In addition, most common local feature aggregation modules lose the relative position information when searching the surrounding points of the sampling points, which obviously violates the important idea of extracting local features by CNN. Accordingly, we capture the relative position information and stroke information to strengthen the related semantics. Compared with the common local feature aggregation modules, the encoded information in the proposed module has richer contents, which are obviously helpful to understand semantics.

Generally, the relationship between segments can describe a sketch partly. However, common sketch semantic segmentation methods pay little attention to the relationship between segments. Thus, we design a segment-level self-attention (SLSA) module based on the idea of multi-head self-attention by encoding the contents and positions of segment features. The SLSA module can discover the relationship between segments automatically for capturing the internal structure of a sketch.

Further, we propose a sketch semantic segmentation method using the NLFA module and the SLSA module. Specifically, the method is based on a similar encoder–decoder structure with two sub-branches named point-level sub-branch and segment-level sub-branch. In the encoding part, the NLFA module is exploited to extract informative local features. In the decoding part, a global multi-scale mechanism is exploited to aggregate the local features and several intermediate-stage features as global features. What’s more, we only perform a dimension reduction operation to label each point in the point-level sub-branch while we establish and enhance the segment-level relationship in the segment-level sub-branch. Then, the point-level sub-branch and the segment-level sub-branch consume global features to generate two sketch semantic segmentation maps. Finally, the two maps are fused to obtain the final segmentation result.

In short, the main contributions of this paper could be summarized as the following threefold. (1) We develop a NLFA module to capture local features of the sampling points adequately. Compared with existing local feature aggregation modules, it is a unique local module that fully considers the position information of the sampling points and the internal structures of strokes. (2) We design a SLSA module to establish and enhance the relationship between segments. Both the local and global position information of each segment are encoded as the position embedding vectors to describe the internal structures of sketches more precisely. (3) We construct a similar encoder–decoder structure with two sub-branches and propose a sketch semantic segmentation method based on the structure. The proposed method realizes a global multi-scale mechanism for the first time, which fully integrates the sketch features of several intermediate stages. Most importantly, the proposed method labels the semantics of each point from a point-level perspective and a segment-level perspective simultaneously, which is rarely involved in the previous sketch semantic segmentation methods.

The rest of this article is organized as follows: The related work is discussed in Sect. 2. The proposed sketch semantic segmentation method is described in Sect. 3; Sect. 4 states the experimental operations and results; Sect. 5 summarizes the full texts and future prospects.

All the significant abbreviations and the corresponding full names used in this work are listed in Table 1.

Table 1 The abbreviations and the corresponding full names

2 Related work

Sketch semantic segmentation is essentially to classify each point of a sketch. Labeling each point as different semantics brings benefits to many applications such as sketch-based image retrieval [4,5,6], sketch caption [11] and sketch scene segmentation [9]. Many researchers focus on this research. Sun et al. [17] propose a novel sketch segmentation method by using the low-level perception based on the proximity and the high-level knowledge based on the past experience. Huang et al. [26] develop a data-driven approach by designing a mixed integer programming and using 3D template models. Schneider et al. [18] utilize a heuristic method to establish the graph structures of different strokes and calculate a semantic labeling map based on CRF.

At present, most scholars study sketch semantic segmentation based on deep learning. Wu et al. [19] transform the sketch semantic segmentation issue into a sequence-to-sequence generation issue and convert the sequence of strokes into the corresponding semantic labels based on RNN. Li et al. [22] design an hourglass-shaped CNN combined with the post-processing steps of multi-label graph cuts to improve the segmentation results. Among the deep learning methods, the point-based methods are becoming more and more popular because of their low computational cost and high segmentation accuracy. Wang et al. [24] directly take the sampling points as the input and propose a multi-column point convolutional neural network. The network uses multiple columns with different convolution kernel sizes to better capture the sketch structure. Based on graph convolution neural network, Yang et al. [10] obtain a semantic segmentation map by constructing a static graph convolution unit, a dynamic graph convolution unit and a mix pooling module. Our method is a sketch semantic segmentation method with points as the input using NLFA and SLSA. Compared with the above point-based methods, the proposed method can achieve more abundant sketch features and take into account the two-dimensional spatial distributions of sketches and the internal structures of strokes.

Sketch semantic segmentation methods are closely related to image semantic segmentation methods and partly follow the development of image semantic segmentation methods. At present, most of image semantic segmentation methods are based on deep neural network. Long et al. [27] transfer a classification network into an end-to-end full convolution network to generate a pixel-wise output for semantic segmentation. Vijay et al. [28] propose a deep convolution encoder–decoder network SegNet. The innovation of SegNet is that the decoder uses pooling indices to deal with nonlinear up-sampling operations. Chen et al. [29] develop a DeepLab system by using an atrous algorithm and a fully connected CRF to solve two technical problems of image labeling including signal down-sampling and spatial invariance. Afterward, they also develop some variants of DeepLab including deeplabv2 [30], deeplabv3 [31] and deeplabv3+ [32]. Ronneberger et al. [33] design a more elegant u-shaped framework U-Net for biomedical image segmentation. The framework captures the context by a contracting path and realizes precise localization with a symmetric extending path. Similar to the DeepLab system, some scholars improve U-Net and propose attention U-Net [34], U-Net++ [35] and R2U-Net [36]. Zhang et al. [37] investigate a densely connected neural architecture search framework, which can directly search the optimal structure to represent the multi-scale visual information over a large-scale target dataset. Although the above methods have achieved the impressive performance in the field of image semantic segmentation, for the task of sketch semantic segmentation, these methods generally have the disadvantages of more parameters and higher computational cost. Moreover, for the sparsity of sketches, a large number of blank spaces can be ignored [24], so it is not effective to directly generalize image semantic segmentation methods to sketch field.

3 Methodology

3.1 Overview of the proposed method

Given the sampling points of a sketch, sketch semantic segmentation aims to classify these sampling points so that the points of the same type correspond to the same semantic part of the sketch. To accomplish this task, we propose a sketch semantic segmentation method using a NLFA module and a SLSA module. Figure 2 shows the pipeline of the proposed method with a similar encoder–decoder structure. A sketch is converted into a point set using the farthest point sampling method. The encoding part is used to extract deep features of the point set, and the decoding part generates a semantic segmentation map according to these features. Specifically, the NLFA module determines the surrounding regions of the sampling points using Euclidean distance or “stroke distance” randomly, so that the proposed method can better adapt to the two-dimensional spatial distributions of sketches and the internal structures of strokes. Then, the NLFA module extracts rich local features such as absolute position information, relative position information and stroke information, which give better play to the performance of point-based modality. In general, the encoding part is composed of four NLFA modules which correspond to the first four stages of the similar encoder–decoder structure while the decoding part is composed of five stages. The outputs of four intermediate stages are concatenated to feed into two different sub-branches for the final semantic segmentation result. There are two loss functions at the ends of the sub-branches totally, which are combined together in the training process. The joint loss is expressed in Eq. (1).

$$\begin{aligned} L = L_\mathrm{point-level} + L_\mathrm{segment-level} \end{aligned}$$
(1)

\({L_\mathrm{point-level}}\) is a cross-entropy loss for learning point-level semantics at the point-level sub-branch. The inputs of the loss are a predicted semantic segmentation map and the ground-truth semantic segmentation map with [B, N, C] dimension, where B is batch size, N is the number of sampling points, and C is the number of semantic labels. \({L_\mathrm{point-level}}\) is expressed in Eq. (2).

$$\begin{aligned} L_\mathrm{point-level} = -\sum \limits _{i = 1}^B{\sum \limits _{j = 1}^N{\sum \limits _{k = 1}^C{y_{i,j,k} \log (\widehat{y}_{i,j,k})}}} \end{aligned}$$
(2)

where \({y_{i,j}}\) is a one-hot vector. \({y_{i, j, k}}\) is equal to 1 when the label of \({y_{i, j}}\) is k while all other elements of \({y_{i, j}}\) are 0. \({L_{segment-level}}\) is also a cross-entropy loss to learn the semantics of each segment. Its input is the output of the SLSA module to enhance and learn the relationship between segments. Similar to the point-level loss, \({L_\mathrm{segment-level}}\) is defined in Eq. (3).

$$\begin{aligned} L_\mathrm{segment-level} = -\sum \limits _{i = 1}^B{\sum \limits _{j = 1}^S{\sum \limits _{k = 1}^C{y_{i,j,k} \log (\widehat{y}_{i,j,k})}}} \end{aligned}$$
(3)

where S represents the number of segments. From the above discussion, it can be seen that the proposed method is mainly realized by a NLFA module, a SLSA module and a similar encoder–decoder structure based on the two modules. We will detail these two modules and the structure in the following texts.

Fig. 2
figure 2

The pipeline of the propose method. The green rectangles are the encoded features, and the orange rectangles are the decoded features. The NLFA module indicates a novel local feature aggregation module, and the SLSA module indicates a segment-level self-attention module. MLP is a multi-layer perception, and the dotted lines are concatenation operations. PL output is a point-level sketch semantic segmentation map, and SL output is a segment-level sketch semantic segmentation map. FPS is the farthest point sampling method

3.2 Novel local feature aggregation module

Local feature aggregation module is an important component of a sketch semantic segmentation method with points as the input. The internal process of NLFA module is shown in Fig. 3. For convenience, we ignore the parameter batch size B. It can be seen from Fig. 3 that the input point \({p_i}\) uses a perception procedure P to perceive the surrounding regions of \({p_i}\) using different dilated rates r. Based on the position relationship between the surrounding points and the input point \({p_i}\), a feature calculation procedure C is used to encode the sketch features with wealthy information. Finally, the encoding procedure is completed by multi-layer perceptron and max-pooling operation. Thus, a NLFA module mainly includes a region perception procedure P and a feature calculation procedure C. The NLFA module acts on all center points and encodes local features by acquiring several neighboring points around the center points.

Fig. 3
figure 3

The internal process of encoding sketch features by a NLFA module. P represents a perception procedure for points, which perceives the surrounding points of the sampling point \({p_i}\) by using Euclidean distance or “stroke distance” combined with the K-nearest neighbor algorithm. C represents a feature calculation procedure, which is utilized to collect and calculate distance information, angle information, stroke information, etc. MLP stands for a multi-layer perceptron. MP stands for a max-pooling operation. \({\oplus }\) is a concatenation operation. r is a dilated rate to realize a multi-scale technology, where r is set to 0 and 1. N \({\times }\) indicates that the process in the red dotted box is performed N times due to N points

3.2.1 Region perception procedure

The region perception procedure plays a crucial role when the sampling points perceive their surrounding regions in a local feature aggregation module. The common region perception procedure based on Euclidean distance combined with K-nearest neighbor algorithm is shown in Fig. 4a.

Fig. 4
figure 4

The comparison between the common region perception procedure and the proposed region perception procedure. The black and red points are the sampling points, and the dotted lines are the perception regions. a shows a perception region using the common procedure, while b shows a perception region using the proposed procedure, where k = 4. The common procedure perceives the surrounding region according to the two-dimensional spatial distribution around the center point. The proposed procedure perceives the surrounding region along the corresponding stroke of the center point

In Fig. 4a, \({p_i}\)(\({x_i}\), \({y_i}\)) is a center point, and \({p_{ij}}\)(\({x_{ij}}\), \({y_{ij}}\)) represents any surrounding point of \({p_i}\); then, the surrounding region \({P_{ik}}\)={\({p_{i1}}\), \({p_{i2}}\),..., \({p_{ik}}\)} of \({p_{i}}\) is calculated by Eq. (4).

$$\begin{aligned} P_{ik} = \mathop {\hbox {argmin}}_{p_{i1},p_{i2},...,p_{ik}}\sqrt{(x_i-x_{ij})^2+(y_i-y_{ij})^2} \end{aligned}$$
(4)

where argmin represents gaining the k points corresponding to the k shortest distances. Although the surrounding regions obtained by the above perception procedure conform to the two-dimensional spatial distributions of sketches, it ignores the stroke structures of sketches. Therefore, we propose to obtain the surrounding regions based on “stroke distance” combined with the K-nearest neighbor algorithm. Specifically, “stroke distance” is defined in Eq. (5).

$$\begin{aligned} P_{ik} = \mathop {\hbox {argmin}}_{p_{i1},p_{i2},...,p_{ik}}\left\{ \begin{array}{lr} \sqrt{(x_i-x_{ij})^2+(y_i-y_{ij})^2},s_i=s_{ij} \\ m \times \sqrt{(x_i-x_{ij})^2+(y_i-y_{ij})^2},s_i \ne s_{ij} \\ \end{array} \right. \end{aligned}$$
(5)

where \({p_i}\)(\({x_i}\), \({y_i}\)) is a center point, \({p_{ij}}\)(\({x_{ij}}\), \({y_{ij}}\)) is any point around \({p_i}\), \({s_i}\) and \({s_{ij}}\) represent the two strokes corresponding to \({p_i}\) and \({p_{ij}}\), m is a scaling factor and m satisfies Eq. (6).

$$\begin{aligned} m \times \sqrt{(x_i-x_{ij})^2+(y_i-y_{ij})^2} \ge \sqrt{w^2+h^2} \end{aligned}$$
(6)

where w and h are the width and height of the sketch. Equations (5) and (6) indicate that the “stroke distance” is equal to the traditional Euclidean distance when \({p_i}\) and \({p_{ij}}\) belong to the same stroke while the “stroke distance” is increased evidently when they do not belong to the same stroke. In other words, the NLFA module can perceive the surrounding regions along the direction of a stroke, which makes our method consider the continuity of strokes and is conducive to perceive the stroke structure. The proposed perception procedure is shown in Fig. 4b. The NLFA module randomly selects either one of these two perception procedures, which leads to taking into account the two-dimensional spatial distributions and the continuity of strokes simultaneously.

To enlarge the perception region of each sampling point, we introduce multi-scale technology based on the idea of dilated convolution [38]. Assuming that any point \({p_i}\) in the NLFA module has several perception regions with different dilated rates r, the perception region with r can be expressed as { \({p_{ij}}\) \({\vert }\) \({p_{ij}}\) \({\in }\) \({P_{ik\times (r+1)}}\), j % (r+1) ==0 }, that is, the perception region evenly obtains the first k points closest to the sampling point with the step of r+1. The realization of the above multi-scale perception regions is only suitable for testing, but the perception regions with different dilated rates r are more effective to obtain k surrounding points by a random selecting method during training. Figure 5 describes the perception regions using the common selecting method and the random selecting method.

Fig. 5
figure 5

The perception regions using different selecting methods. \({p_i}\) is a center point, {\({p_{i1},p_{i2},...,p_{in}}\)} is the n points closest to \({p_i}\). For ease of understanding, n, k, r are set to 9, 3, 3 in order. a shows a perception region using the common selecting method. b shows a perception region using the random selecting method. The perception region in b is much larger than the perception region in a

In Fig. 5a, {\({p_{i1}}\), \({p_{i5}}\), \({p_{i9}}\)} are the selected points and the perception region is within the orange dotted line roughly. In Fig. 5b, three points are randomly selected from {\({p_{i1}}\), \({p_{i2}}\),..., \({p_{i9}}\)} and the perception region can cover the entire region within the purple dotted line roughly after multiple iterations.

3.2.2 Feature calculation procedure

Feature calculation procedure is employed to handle with all kinds of sketch information. The common feature calculation procedures mainly collect the coordinates of points and the sketch features of the intermediate stages [24, 25]. However, these collected contents are not sufficient for a sketch, which greatly hinders the performance of sketch semantic segmentation. Obviously, the feature calculation procedure essentially plays the same role as the convolution kernel of CNN. However, due to the discreteness of the sampling points, the common feature calculation procedures discard some crucial elements when searching and determining the surrounding points. The distance information between the sampling points and their surrounding points, and the angle information are implicitly encoded using a traditional convolution kernel. Figure 6 illustrates the encoded contents. Thus, the feature calculation procedure of the NLFA module should encode the distance information and the angle information to ensure the integrity of the sketch features. Assuming that the absolute coordinate of the sampling point \({p_i}\) is (\({x_i}\), \({y_i}\)), and the absolute coordinate of its any surrounding point \({p_{ij}}\) is (\({x_{ij}}\), \({y_{ij}}\)). The distance \({dist_{ij}}\) between the sampling point \({p_i}\) and \({p_{ij}}\) can be expressed in Eq. (7).

Fig. 6
figure 6

The implicitly encoded contents using a traditional convolution kernel. The orange region is a traditional convolution kernel of 3\({\times }\)3 size, and {\({w_1, w_2,..., w_9}\)} are the weights of the kernel. The green region is a region of 5\({\times }\)5 size. It can be clearly observed that the distances between the center point \({p_i}\) and {\({p_{i1},p_{i2},...,p_{i8}}\)} are {1, \({\sqrt{2}}\), 1,\({\sqrt{2}}\), 1, \({\sqrt{2}}\), 1, \({\sqrt{2}}\)}, and the angles between \({p_i}\) and {\({p_{i1},p_{i2},...,p_{i8}}\)} are { \({0^{\circ }}\), \({45^{\circ }}\),..., \({135^{\circ }}\) }

$$\begin{aligned} \hbox {dist}_{ij} = \sqrt{(x_i-x_{ij})^2+(y_i-y_{ij})^2} \end{aligned}$$
(7)

Similarly, the angle information of \({p_i}\) and \({p_{ij}}\) is calculated by Eq. (8).

$$\begin{aligned} \theta _{ij} = \hbox {arctan} \left( \frac{y_{ij}-y_{i}}{x_{ij}-x_i} \right) \end{aligned}$$
(8)

In Eqs. (7) and (8), the coordinates of all points are in the Cartesian coordinate system. The proposed feature calculation procedure ensures that the sketch features do not ignore the crucial context information, which is conductive to learn the complex geometric patterns. Further, the detailed feature calculation procedure of the NLFA module is expressed in Eq. (9).

$$\begin{aligned} F_{ij} = \delta \{ \mathcal {M} \{ p_i \oplus (p_{ij}-p_i) \oplus f_i \oplus (f_{ij}-f_i) \oplus dist_{ij} \oplus \theta _{ij} \oplus s_{ij} \} \} \end{aligned}$$
(9)

where \({f_i}\) and \({f_{ij}}\) represent the intermediate-stage features, \({s_{ij}}\) represents the stroke information, \({\oplus }\) is a concatenation operation, \({\mathcal {M}}\) is a multi-layer perceptron, and \({\delta }\) indicates the ReLU activation function. From Eq. (9), it can be concluded that the NLFA module spans rich context information. Compared with the common local feature aggregation modules, we supplement our module with the more informative contexts, involving the intermediate-stage features after normalization, distance information, angle information and stroke information according to the encoding idea of convolutional kernel, the basic properties of point format data, and the stroke structure of a sketch. And we define “stroke distance”, which enables the obtained local regions to extend along the stroke directions and better perceive the stroke structures.

3.3 Segment-level self-attention module

Supposing the inputs of the SLSA module are the sketch features with [B, N, D] dimension, where B is batch size, N represents the number of the sampling points and D indicates the number of channels. We divide several points on the same stroke into one segment. Each sketch is divided into S segments, the initial feature dimension of each segment is [B, S, D\({\times }\)N/S]. The new segment features with the dimension [B, S, \(D'\)] can be computed by feeding the initial segment features into a multi-layer perceptron. We aim to learn the relationship between segments for describing the internal structure of a sketch. As we all know, multi-head self-attention mechanism can be applied to calculate the relationship between any two objects and is widely utilized in the fields of natural language processing and computer vision. The impressive success of multi-head self-attention mechanism is originated from the correlation between any two objects and the long dependence between them. The multi-head self-attention mechanism makes neural networks pay attention to the characteristics of multiple subspaces at the same time. Inspired by the above observations, based on the idea of multi-head self-attention, we design a SLSA module to learn and strengthen the internal structure of a sketch through taking into account segment features and sketch characteristics. The specific calculation method is shown in Fig. 7.

Fig. 7
figure 7

The internal process of the SLSA module. For simplicity, we discard the parameter batch size B. MLP stands for a multi-layer perceptron, and \({F_s}\) stands for a softmax function. T is a matrix deformation. \({d_\textrm{sketch}}\) and \({d_\textrm{stroke}}\) represent the embedding dictionaries

In Fig. 7, given the segment features \({f_s}\)[S, \(D'\)], the query vector \({v_q}\)[H, \(D''\)/H, S] and the key vector \({v_k}\)[H, \(D''\)/H, S] are obtained by Eqs. (10) and (11).

$$\begin{aligned} v_q= & {} \mathcal {T}(\mathcal {M}(f_s)) \end{aligned}$$
(10)
$$\begin{aligned} v_k= & {} \mathcal {T}(\mathcal {M}(f_s)) \end{aligned}$$
(11)

where \({\mathcal {M}}\) stands for a multi-layer perception, its output dimension is D”, and \({\mathcal {T}}\) stands for matrix deformation. Both \({v_q}\) and \({v_k}\) have H subspaces. Simultaneously, the position information of segments is processed by Eqs. (12) and (13).

$$\begin{aligned} e_\textrm{stroke}= & {} \mathcal {T}(\mathcal {S}(p_\textrm{stroke}, d_\textrm{stroke})) \end{aligned}$$
(12)
$$\begin{aligned} e_\textrm{sketch}= & {} \mathcal {T}(\mathcal {S}(p_\textrm{sketch}, d_\textrm{sketch})) \end{aligned}$$
(13)

where \({\mathcal {S}}\) is a query procedure of the related embedding dictionary. \({p_\textrm{stroke}}\) and \({p_\textrm{sketch}}\) are two one-hot vectors, which represent the position of one segment in the corresponding stroke and that in the corresponding sketch, respectively. The embedding dictionaries \({d_\textrm{stroke}}\) and \({d_\textrm{sketch}}\) are queried according to \({p_\textrm{stroke}}\) and \({p_\textrm{sketch}}\), then a deformation operation is performed on the query results to obtain the position embedding vectors \({e_\textrm{stroke}}\) and \({e_\textrm{sketch}}\). The position information of a segment in the corresponding stroke \({p_\textrm{stroke}}\) is local relative to the position information of the stroke in the corresponding sketch \({p_\textrm{sketch}}\). Thus, the existence of \({p_\textrm{stroke}}\) and \({p_\textrm{sketch}}\) enables the relationship between segments to capture the local position information in the corresponding stroke and the global position information in the corresponding sketch, which makes the relationship more accurate. Furthermore, we need to encode the position features and content features of segments, respectively. The relevant equations are expressed in Eqs. (14) and (15).

$$\begin{aligned} f_{p}= & {}\, (e_\textrm{stroke}+e_\textrm{sketch}) \otimes v_q \end{aligned}$$
(14)
$$\begin{aligned} f_{c}= & {} \, \mathcal {T}(v_q) \otimes v_k \end{aligned}$$
(15)

where \({\otimes }\) represents a matrix multiplication, and + is an element-wise addition. Finally, the calculation process of the SLSA module is expressed in Eq. (16).

$$\begin{aligned} f_{s}' = \mathcal {T}(v_v \otimes \hbox {softmax}(f_p + f_c)) \end{aligned}$$
(16)

where \({f_s'}\) represents segment features after using the SLSA module. The SLSA module can focus on sketch features of multiple subspaces and calculate the correlation of any two segments through matrix operations and softmax function for establishing the relationship between segments. Then, the relationship is further enhanced by capturing crucial local and global position information of segments, which is better for learning the internal structure of a sketch more precisely. The SLSA module provides a basis for sketch semantic segmentation from a segment perspective.

3.4 Similar encoder–decoder structure

The proposed method is mainly realized by constructing a similar encoder–decoder structure. The structure has two significant characteristics. (1) It is a similar encoder–decoder structure rather than a real encoder–decoder structure. (2) It employs a global multi-scale mechanism, that is, sketch features of several intermediate stages are aggregated to generate the final semantic segmentation map.

3.4.1 Similar encoder–decoder structure

Figure 2 illustrates this similar encoder–decoder structure, which is composed of nine stages. The first four stages are the encoding stages using the NLFA module, and the last five stages are the decoding stages using nonlinear transformation primarily. Compared with the common encoder–decoder structure, our structure does not perform the down-sampling and up-sampling operations on the sampling points. The number of the sampling points remains unchanged, which retains the features of the significant points and further ensures the performance of the proposed method. Similar to the common encoder–decoder structure, the number of feature channels in all stages changes from increasing to decreasing, which illustrates that we enrich deep features at the front stages and then label each point after decoding according to the deep features.

3.4.2 Global multi-scale mechanism

The NLFA modules of the encoding stages realize a multi-scale mechanism based on the idea of dilated convolution. However, the multi-scale mechanism is limited to local regions, and the global context information of different stages is ignored in the process of encoding and decoding features. From the global perspective, the lack of long dependence will lead to the feature differences of points from the same category. Consequently, we establish a global multi-scale mechanism to aggregate the outputs of four intermediate stages using skip connections. The four intermediate stages contain the last encoding stage and the front three decoding stages, and the details are shown in Fig. 2. It can be observed from Fig. 2 that the features used to generate the final semantic segmentation map come from the features of four stages with different depths, which can describe multi-scale and global semantic information to enhance the capability for semantic analysis.

4 Experiments

4.1 Experimental settings

4.1.1 Dataset

We evaluate the proposed method on two widely used benchmarks named SPG [20] and SketchSeg-150K [21]. The sketches in SPG and SketchSeg-150K are created by unprofessional painters, which presents the practical significance. SPG dataset has a total of 16,000 sketches spanning 20 categories. SketchSeg-150K dataset has 150,000 sketches spanning 20 categories. Compared with SPG dataset, SketchSeg-150K has fewer semantic labels for each category.

4.1.2 Evaluation metrics

At present, almost all sketch semantic segmentation methods use P-metric and C-metric [26] as the standardized metrics. P-metric is a point-based metric that describes the proportion of the correctly marked points in sketches. C-metric is a component-based metric that describes the proportion of the correctly marked strokes in sketches. The two evaluation metrics are complementary and focus on strokes with different lengths.

4.1.3 Implementation details

As shown in Fig. 2, the model of the proposed method is mainly composed of four encoding stages and five decoding stages. In the encoding stages, there are four NLFA modules which utilize K-nearest neighbor algorithm and multi-layer perceptron. In the K-nearest neighbor algorithm, K is equal to 32. In the decoding stages, there are four multi-layer perceptrons and two sub-branches. The parameters of the above multi-layer perceptrons are listed in Table 2 where the two numbers in each bracket represent the input dimension and the output dimension.

Table 2 The parameters of all multi-layer perceptrons in the proposed method

Our method is implemented by Pytorch framework. We employ Adam algorithm to train the model of the proposed method for 100 epochs where the initial learning rate is 0.001 and the learning rate is updated to 0.65 times of the original learning rate every 10 epochs. The batch size is 20, every 2 points are divided into a segment on SPG, and every 4 points are divided into a segment on SketchSeg-150K. We compare the proposed method with the baseline methods on a server configured with GTX 1080 GPU and 32 G memory.

4.2 Comparative results and analysis

4.2.1 Segmentation accuracy and analysis

To evaluate the effectiveness of the proposed method, we conduct extensive experiments to compare the proposed method with the existing representative baseline methods. The baseline methods mainly fall into three categories. (1) The sequence-based methods include SPGSeg [20] and SketchSegNet+ [21]. (2) The image-based methods include FastSeg [22] and U-Net [34]. (3) The point-based methods include SketchGNN [10] and PointMLP [39]. U-Net and PointMLP are an image semantic segmentation method and a point cloud semantic segmentation method, respectively. Given their adaptability to sketch semantic segmentation, they are added to the comparative experiment. The original SketchGNN is trained for each category separately. For fairness, we use sketches with all categories to train SketchGNN together, which is same as other baseline methods in the comparative experiment. The experimental results of all the methods on SPG are shown in Table 3. As can be observed, the proposed method achieves the average accuracy of 90.6% in P-metric, and the average accuracy of 86.6% in C-metric, which outperforms other baseline methods. Specifically, the proposed method exceeds the second best PointMLP by 3.8% in P-metric and 4.8% in C-metric, respectively. In addition to the average accuracy, Table 3 also shows the segmentation accuracy of each category. It can be observed that the proposed method can achieve the highest accuracy in almost all categories, which shows the high applicability to most categories of sketches. Besides, the proposed method has great advantages in the categories of alarm clock, ant, backpack, calculator crab and ice cream.

Table 3 Semantic segmentation accuracy (%) on SPG dataset in P-metric (P) and C-metric (C)

The experimental results of the baseline methods on SketchSeg-150K are shown in Table 4. It should be noted that the segmentation accuracy of SPGSeg is not shown in Table 4 because we did not find the original experimental results and its source code. As listed in Table 4, the proposed method obtains the average accuracy of 96.3% in P-metric and the average accuracy of 95.0% in C-metric, which outperforms all baseline methods. In terms of the average accuracy, the proposed method exceeds the second best FastSeg by 1.3% in P-metric, and 3.0% in C-metric, respectively. Compared with other baseline methods, the performance of the proposed method in C-metric is obviously superior to that in P-metric. This superiority is mainly rooted in the fact that the integration of strokes and the internal structures of sketches are adequately considered in the NLFA module and the SLSA module, respectively. From the above discussion, it can be concluded that the proposed method has great advantages no matter in P-metric or C-metric, which fully verifies the effectiveness.

Table 4 Semantic segmentation accuracy (%) on SketchSeg-150K dataset in P-metric (P) and C-metric (C)

4.2.2 Parameter size and calculation complexity

The parameter size and calculation complexity are also the important indicators to reflect the advantage and disadvantage of a method. The parameter sizes and calculation complexity (FLOPs) using different semantic segmentation methods are shown in Table 5. According to Table 5, the proposed method has a parameter size of 0.7 M, and its computational complexity is about 0.3 G in FLOPs, which fully demonstrates that the proposed method has the advantages of the high average accuracy, low calculation complexity and small parameter size. The sketch applications mainly occur on portable devices such as mobile phones and tablets. The parameter size less than 1 M is very friendly to be applied on the devices.

Table 5 The parameter sizes and computational complexity using different semantic segmentation methods

4.3 Qualitative results and analysis

We will display intuitively some visual semantic segmentation results in the qualitative experiment. Three open source methods (U-Net, PointMLP, SketchGNN) and the proposed method are taken part in the experiment. The specific segmentation results are shown in Fig. 8. Compared with other methods, the semantic segmentation results of our method are closer to the level of humans in most cases. Results of the traditional image semantic segmentation method U-Net are unsatisfactory, because stroke information is not considered in the encoding process. PointMLP ignores the relationship between segments, resulting in the inferior performance. The original SketchGNN is trained for each category. Thus, SketchGNN loses desired results in this qualitative experiment. Our method captures rich sketch information and then determines the semantic segmentation results of sketches from a point-level perspective and a segment-level perspective, which brings the great performance improvement.

Fig. 8
figure 8

Some visual sketch semantic segmentation results. Different colors represent different semantics

4.4 Ablation analysis

The proposed method mainly involves two crucial modules named a NLFA module and a SLSA module. Thus, we evaluate the impact of the NLFA module and the SLSA module on sketch semantic segmentation results. Both SPG and SketchSeg-150K are important semantic segmentation datasets. There are 3–7 labels per category in SPG, while 2–4 labels per category in SketchSeg-150K. Therefore, the evaluation of the proposed method on SPG is representative and presents more challenges. In ablation analysis, we conduct relevant experiments on SPG dataset.

4.4.1 The study of the novel local feature aggregation module

The NLFA module encodes a large number of useful information related to sketch characteristics. Theoretically, there are three kinds of information that are important factors for the proposed method to achieve high segmentation accuracy, including distance information, angle information and stroke information. In order to evaluate the important impact of these information on the performance, we build three kinds of local feature aggregation modules for the proposed method. Besides, to verify the benefits of “stroke distance”, we also build two kinds of local feature aggregation modules. Five methods using different local feature aggregation modules are as follows:

  1. (1)

    NLFA w/o distance This method is based on the local feature aggregation module that encodes all information in Eq. (9) except the distances between the sampling points and the surrounding points.

  2. (2)

    NLFA w/o angle This method is based on the local feature aggregation module that encodes all information in Eq. (9) except the angles between the sampling points and the surrounding points.

  3. (3)

    NLFA w/o stroke This method is based on the local feature aggregation module that encodes all information in Eq. (9) except the corresponding stroke ID.

  4. (4)

    NLFA w/o SD This method is based on the local feature aggregation module that encodes all information in Eq. (9) and only utilizes Euclidean distance.

  5. (5)

    NLFA The proposed method is based on the local feature aggregation module that encodes all information in Eq. (9).

The above methods will use Euclidean distance or “stroke distance” randomly unless noted otherwise. Table 6 shows semantic segmentation results of these five methods on SPG.

As demonstrated in Table 6, distance information, angle information and stroke information improve the average accuracy of the proposed method by 0.6%, 0.9% and 4.3% in P-metric, respectively, while they improve the average accuracy by 0.8%, 1.5% and 6.1% in C-metric, respectively. It can be concluded that these three kinds of information are of great help to enhance the performance, and stroke information plays the most important role in the sketch semantic segmentation task. Stroke information represents the internal structure of a sketch partly, which has a great impact on semantic analysis, so the performance gap between with and without stroke information can be as high as 6.1% in C-metric. In addition, according to Table 6, “stroke distance” improves the average accuracy by 0.9% in P-metric, and 1.3% in C-metric, respectively. Results illustrate that the NLFA module based on “stroke distance” is more effective for the performance enhancement.

Table 6 Semantic segmentation accuracy (%) of the five methods on SPG dataset in P-metric (P) and C-metric (C)

4.4.2 The study of the segment-level self-attention module

The SLSA module is combined with a cross-entropy loss \({L_\mathrm{segment-level}}\) to establish and strengthen the relationship between segments, so as to improve the understanding for sketch semantics. In order to evaluate the importance of the proposed module, we will test the proposed method without and with the SLSA module, respectively. Table 7 shows the semantic segmentation results on SPG dataset.

Table 7 Semantic segmentation accuracy (%) without and with the SLSA module on SPG dataset in P-metric (P) and C-metric (C)

As shown in Table 7, the SLSA module can enhance the performance of sketch semantic segmentation. The reason for this result is that the SLSA module establishes the segment-to-segment relationship and provides a new segment-level perspective to label each point in a sketch. In Table 7, the SLSA module improves the average accuracy by 1.2% in P-metric, and 1.8% in C-metric, respectively. The positive effect of the SLSA module in C-metric is better than that in P-metric slightly.

4.4.3 The impact of the number of points in each segment on the performance

The number of points in each segment can affect the performance of the proposed method. Therefore, we study the cases that each segment contains 2, 4, 8, 16 and 32 points, respectively. Figure 9 shows the average accuracy using 2, 4, 8, 16 and 32 points on SPG and SketchSeg-150K. It can be observed that the impact of the number of points is limited when the number of points is between 2 and 16, but the average accuracy will decline sharply when the number of points reaches 32. This result illustrates that more points in each segment cannot obtain the better performance. The reason for this result is that the points from different strokes will divide into the same segment when each segment contains too many points, which has an adverse effect on establishing the relationship between segments. In fact, the segments composed of 2 points on SPG can obtain the highest average accuracy of 90.6% in P-metric and 86.6% in C-metric, respectively, while the segments composed of 4 points on SketchSeg-150K can obtain the highest average accuracy of 96.3% in P-metric and 95.0% in C-metric, respectively.

Fig. 9
figure 9

The impact of the number of points in each segment on the average accuracy. The horizontal axes represent the number of points, and the vertical axes represent the average accuracy (%). a is the average accuracy on SPG, and b is the average accuracy on SketchSeg-150K

4.4.4 The impact of the number of sampling points on the performance

The number of sampling points is an important hyper-parameter, which can directly affect the final segmentation result. Theoretically, the reasonable number of sampling points can retain the crucial details of a sketch and keep low computational complexity. Based on this principle, we choose 256 sampling points. Table 8 shows the impact of the number of sampling points on the segmentation accuracy. Generally speaking, within a certain range, the greater N value is set, the more points can be captured by the receptive field, so the proposed method presents the higher segmentation accuracy. This is the reason why the performance of sketch semantic segmentation will gradually improve as N increases in Table 8 when N is less than 512. When N=512, due to the excessive number of points, the region obtained under the fixed receptive field is relatively small compared with the whole sketch, so the captured information tends to be local and the global perception is unsatisfactory. Due to the lack of global perception, the performance of sketch semantic segmentation is degraded. In fact, when we increase the receptive field and set the K value to 48, P-metric and C-metric is 90.5% and 86.9%, respectively, which is basically the same as the performance using \({N}=256\). Therefore, we do not continue to increase the number of N points, because too large N value will lead to a decrease in computing efficiency and segmentation accuracy simultaneously.

Table 8 The impact of the number of sampling points on the segmentation accuracy (%)

4.5 A simple application

Sketch semantic segmentation is the understanding about sketch semantics. Therefore, theoretically, sketch semantic segmentation can improve the performance of sketch recognition. In this section, we show the application of sketch semantic segmentation to the sketch recognition task. The sketch recognition network we used is Sketch-a-Net [40], which is representative and widely used in the sketch field. The sketch recognition method combined with the proposed sketch semantic segmentation method can be divided into four stages. (1) In the training phase, Sketch-a-Net is used to train on the SPG dataset [20] to obtain a standard sketch recognition model. (2) In the segmentation stage, the proposed sketch semantic segmentation method is used to obtain the semantic parts of the input sketch. Certainly, the noise semantics will be removed. (3) The stage of reassigning values for the neurons in the last fully connected layer. Each neuron corresponds to a sketch class, and we take the neuron with the largest score. If the sketch class corresponding to the neuron does not contain any segmented semantic part, the score of the neuron will be reduced to the lowest score. Repeat step (3) until the neuron with the largest score contains a segmented semantic part. (4) In the prediction phase, a sketch class is obtained through the neuron with the largest score in the last fully connected layer. We show the recognition accuracy of Sketch-a-Net and Sketch-a-Net combined with the proposed sketch semantic segmentation method (Sketch-a-Net + SG) in Table 9. It can be seen from Table 9 that the performance of Sketch-a-Net is significantly improved. Specifically, the recognition accuracy is improved by 8.4%. In fact, the performance improvement comes from the revision of predicted sketch classes by sketch semantics. The original Sketch-a-Net does not check the semantics of the candidate sketch classes, while Sketch-a-Net + SG automatically eliminates the sketch classes with incorrect semantics, thus avoiding unnecessary errors. Although Sketch-a-Net + SG is more complex than the original Sketch-a-Net, it can achieve significant performance improvement. Figure 10 shows the qualitative results of Sketch-a-Net and Sketch-a-Net + SG. In Fig. 10, the first line shows the wrong results of Sketch-a-Net, while the second line shows the wrong results of Sketch-a-Net + SG. Obviously, compared with Sketch-a-Net, the wrong results of Sketch-a-Net + SG is easier to understand because it is difficult for humans to recognize sketches of the second line.

Table 9 Comparison of the accuracy (%) of two sketch recognition methods
Fig. 10
figure 10

The qualitative results of Sketch-a-Net and Sketch-a-Net + SG. The sketches of columns 1 to 7 are as follows: airplane, alarm clock, ambulance, backpack, basket, butterfly and duck, respectively

5 Discussion

In this paper, we propose a sketch semantic segmentation method using novel local feature aggregation and segment-level self-attention, which has great advantages in segmentation accuracy, parameter size and computational complexity. The proposed method captures the internal structures of sketches by encoding informative point-level features, perceiving the stroke structures, and building the relationship between segments, thus obtaining the state-of-the-art semantic segmentation accuracy. The input of the proposed method is a point set and a segment set. The MLP for processing these data is much less than the convolutional neural network for processing images in terms of parameter size and computational complexity. This advantage is more conducive to the application of the proposed method to portable devices. Besides, the proposed method aims at tackling sketch semantic segmentation task. Although it cannot directly be used for sketch instance-level or sketch scene segmentation tasks, it can be extended for sketch instance-level or sketch scene segmentation tasks by adding a post-processing step based on clustering algorithm or metric learning [41]. Simultaneously, our approach exists some limitations. (1) We divide n points into a segment. The value of n is determined by a verification set, and some segments may not conform to the stroke structure. (2) The proposed method still has unsatisfactory segmentation results on some sketches as shown in Fig. 11, including sketches with special angles, sketches with large differences in appearance from the same category. We deem that the above issues can be solved by better representations [42] and GAN-based methods [43].

Fig. 11
figure 11

Some unsatisfactory results on special sketches. Sketches with special angles and special appearances show unsatisfactory segmentation results

6 Conclusion

We propose a sketch semantic segmentation method, which is realized by a NLFA module, a SLSA module and a similar encoder–decoder structure based on these two modules. The NLFA module uses Euclidean distance or “stroke distance” randomly when the sampling points search the surrounding regions, which makes the perception regions conform to the two-dimensional spatial distributions and the stroke structures of sketches. And the NLFA module encodes rich information such as distance information, angle information and stroke information, which enhances the semantic discrimination ability of sketch features. Simultaneously, the SLSA module is designed to enhance the relationship between segments and to learn the internal structures of sketches. Finally, the similar encoder–decoder structure integrates the global features of several intermediate stages to enrich semantic contents. Extensive experiments show that the proposed method achieves state-of-the-art performance in both P-metric and C-metric. In the future, we will explore the unsupervised sketch semantic segmentation method, which is more in line with the actual application when a sketch lacks a large number of semantic labels.