1 Introduction

In recent years, image retrieval has become a highly relevant discipline in computer science, owing mainly to advances in imaging technology that have made capturing and storing images easy. In a content-based image retrieval (CBIR) system, an image is required as input. This image should express what the user is looking for, but the user frequently does not have an appropriate image for that purpose. Furthermore, the absence of such a query image is commonly the reason for the search [9]. An easy way to express the user query is a line-based hand-drawing, a sketch, leading to sketch-based image retrieval (SBIR). In fact, a sketch is the natural way to make a query in applications like CAD or 3D model retrieval [12].

A few authors have addressed image retrieval based on sketches. Some of these works are the Edge Histogram Descriptor (EHD) [26], Image Retrieval by Elastic Matching [8], Angular Partitioning of Abstract Images [5], the Structure Tensor [9], and the Histogram of Edge Local Orientations [23]. Recently, Eitz et al. [11] presented results applying the Bag of Features (BoF) approach to the SBIR problem. Some of these methods are described in the next section.

The main contribution of this work is a novel local method based on detecting keyshapes. Our method takes into account structural information by means of the keyshapes, and local information by means of local descriptors that are also proposed in this work. Furthermore, based on the strokes' information, we present a novel strategy for detecting keyshapes that avoids the efficiency issues suffered by methods such as those based on the Hough Transform [29].

Our experimental results show a slight improvement in retrieval effectiveness with respect to the state of the art. More importantly, because our method is based on a different feature (the structure of the objects) from those used by current methods, combining it with a current leading method yields a significant improvement in retrieval effectiveness.

In this paper, we show that combining our approach with the Bag of Features (BoF) approach proposed by Eitz et al. [11] achieves a significant improvement, which is validated by a statistical test. Specifically, the combined method increases retrieval effectiveness by almost 22 % relative to that reported for the BoF approach.

The rest of this paper is organized as follows. Section 2 describes current methods for SBIR. Section 3 describes the proposed method in detail. Section 4 presents the experimental evaluation. Finally, Section 5 presents conclusions.

2 Related work

SBIR approaches can be classified as global or local techniques. Interesting approaches falling into the global category are the Edge Histogram Descriptor (EHD) [26], Angular Partitioning of Abstract Images (APAI) [5], and the Histogram of Edge Local Orientations (HELO) [23].

The Edge Histogram Descriptor (EHD) was proposed in the visual part of MPEG-7 [21] and was improved by Sun Won et al. [26]. The goal is to get a local distribution of five types of edges (vertical, horizontal, diagonal 45°, diagonal 135°, and no direction) from local regions of the image. The concatenation of the local distributions composes the final descriptor.

Another important work on SBIR was presented by Chalechale et al. [5]. This approach is based on angular partitioning of abstract images (APAI). The angular spatial distribution of pixels in the abstract image is the key concept for the feature extraction stage. The method divides the abstract image into eight angular partitions or slices. Then, it uses the number of edge points falling in each slice to make up a feature vector.

The Histogram of Edge Local Orientations (HELO), proposed by Saavedra and Bustos [23], showed an improvement in retrieval performance for SBIR. HELO computes a K-bin histogram based on local edge orientations. To get the HELO feature vector, the sketch is first divided into a W×W grid. Second, an edge orientation is estimated for each cell in the grid. Third, a 72-bin histogram (K = 72) is computed from the cell orientations. Finally, the Manhattan distance is used to measure dissimilarity between histograms.
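As an illustration, the HELO pipeline described above can be written compactly. In this sketch, `gx` and `gy` are precomputed gradient maps, and the doubled-angle averaging used to estimate each cell's dominant orientation is one common estimator, not necessarily the exact one used in [23]:

```python
import numpy as np

def helo_descriptor(edge_map, gx, gy, W=25, K=72):
    """HELO-style descriptor sketch: one dominant orientation per cell
    of a W x W grid over a boolean edge map, accumulated into a K-bin
    histogram. The doubled-angle (squared gradient) average is a common
    way to estimate a dominant local orientation."""
    H, Wd = edge_map.shape
    hist = np.zeros(K)
    ys = np.linspace(0, H, W + 1).astype(int)
    xs = np.linspace(0, Wd, W + 1).astype(int)
    for i in range(W):
        for j in range(W):
            cell = edge_map[ys[i]:ys[i+1], xs[j]:xs[j+1]]
            if not cell.any():
                continue  # empty cells contribute nothing
            cgx = gx[ys[i]:ys[i+1], xs[j]:xs[j+1]][cell]
            cgy = gy[ys[i]:ys[i+1], xs[j]:xs[j+1]][cell]
            # doubled-angle trick avoids cancellation of opposite gradients
            angle2 = np.arctan2((2 * cgx * cgy).sum(),
                                (cgx**2 - cgy**2).sum())
            theta = (angle2 / 2.0) % np.pi  # dominant orientation in [0, pi)
            hist[int(theta / np.pi * K) % K] += 1
    s = hist.sum()
    return hist / s if s > 0 else hist

def helo_distance(h1, h2):
    return np.abs(h1 - h2).sum()  # Manhattan distance, as in the paper
```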

Local techniques commonly represent a sketch by a set of feature vectors. Although they are slower than global techniques, they may exploit local properties and structural information that can lead to better retrieval performance. Relevant approaches falling into this category are Shape Context [1] and STELA [24]. In addition, local techniques are characterized by supporting partial matching and by working very well when objects are partially occluded.

Shape Context, proposed by Belongie et al. [1], is a local approach for measuring similarity between shapes. In Shape Context, sketches are represented by a set of points sampled randomly from the stroke points. The sampled points are then used as reference to compute a set of shape feature vectors. The shape context feature vector describes the distribution of the rest of the sampled points with respect to a given point by a log-polar histogram.

The Structure Local Approach (STELA), proposed by Saavedra et al. [24], was originally applied to sketch-based 3D object retrieval. STELA transforms a sketch into a set of keyshapes forming the whole sketch, using straight lines as keyshapes. Each keyshape is then used as a reference to compute a local feature vector. The final step is to match the feature vectors from two sketches. To this end, STELA applies the Hungarian method using the χ² statistic as the cost function.

Another kind of technique is the Bag of Features (BoF) approach, which uses local descriptors to obtain a codebook by means of a learning process. Recently, Eitz et al. [11] showed that this technique outperforms the known methods. The problem with BoF is that it does not support partial matching because localization information is lost.

In the context of the BoF approach, Hu et al. [15, 16] proposed to use a Gradient Field (GF) image over which local descriptors are computed. The local descriptors are obtained by a variation of the HOG approach [7]. The local descriptors computed over the complete image database are clustered to obtain a codebook of approximately 1000 codewords. Although the authors show good results on a small database, computing the GF is a time-consuming process that requires solving a sparse system of linear equations where the number of unknowns is of the order of the size of the input image.

In addition, a method for dealing with large databases was proposed by Cao et al. [4]. It is based on the Chamfer distance [2, 25] to measure dissimilarity between a sketch and an image. This approach does not show how to deal with position or scale variations, although the authors present an interesting indexing method based on the inverted index structure.

Another approach to the SBIR problem is based on converting the input sketch into a regular image with color and texture [6, 10]. This conversion process is known as image montage. After applying the montage process, the SBIR problem is reduced to the classical CBIR problem, in which the example image required as input is the result of the montage process. Montage-based image retrieval is an expensive process, owing mainly to the additional step of converting the sketch into a regular image.

In this work we propose a novel local method for retrieving images given a simple sketch as input. Unlike STELA, our proposal detects many types of keyshapes and is applied in the context of image retrieval. These keyshapes may be lines, arcs, and elliptical shapes, to mention a few. We show that our approach, based on keyshapes, in combination with the BoF approach significantly outperforms the state-of-the-art approaches.

3 Keyshape based approach

Sketches are characterized by representing the structural components of an object instead of color or texture information. For instance, when a person is asked to make a simple drawing of a teapot, he or she will probably draw three components: the body, the spout, and the handle, like one of the pictures depicted in Fig. 1. In addition, the absence of color and texture information in sketches may make the retrieval process difficult. It also means that techniques designed to work on regular images do not work appropriately with sketches. Therefore, in this section we present a novel method for retrieving images using a sketch as the query. Our proposal is characterized principally by exploiting the structural information provided by sketches.

Fig. 1

Examples of hand drawings of a teapot

Definition 1

The object structure is the distribution of the parts that compose such an object.

From Definition 1, we extract two relevant terms: (1) the components of an object, and (2) the distribution of the components.

  1. Component: The components of an object are difficult to define because they exist at diverse scales. However, we define a component from a geometric perspective. Therefore, a component is a simple geometric shape like an arc, an ellipse, a circle, a triangle, a rectangle, a square, or simply a straight line. In this way, the teapot shown in Fig. 1 may be decomposed into an arc representing the handle, two arcs representing the spout, and an ellipse representing the body.

  2. Distribution: The distribution of the components describes the spatial relationship between them. Using the teapot example again, the distribution should point out that the handle and the spout are located on the sides of the body, one on the left and the other on the right.

To our knowledge, the methods proposed for sketch-based image retrieval do not exploit the structural property of sketches. The current methods are based on edge point distributions or edge point orientations, which do not appropriately represent the structural components of the objects in the image. Furthermore, the interest point approach [27] from the computer vision field does not represent components at the semantic level we define here. Moreover, the local region around keypoints may not be discriminative enough, since sketches are simple line-based drawings.

The main contribution of this work is to propose a novel method for sketch-based image retrieval that takes advantage of the structural property of objects appearing in an image. Representing sketches by means of their structural components brings up the following advantages:

  • Structural representation allows methods to represent objects at a higher semantic level, which is reflected in increased retrieval effectiveness.

  • Structural representation allows methods to handle a smaller number of components than an interest point approach. This leads to a more efficient matching step.

Our proposal deals with the structural property of objects by detecting simple geometric shapes that we call keyshapes, which represent the structural components of objects. In addition, we propose two local descriptors computed over each detected keyshape. These descriptors represent the spatial distribution of the structural components. In short, our proposal exploits the structural property of an object by describing the distribution of the parts composing it.

In addition, because our method is based on a different feature (the structure) from those used by current methods, combining it with a method that has shown good results should lead to a significant improvement in retrieval effectiveness. In this paper, we experimentally demonstrate this property of our proposal.

Definition 2

A keyshape is a simple geometric shape that, in conjunction with other simple geometric shapes, composes a more complex object. Examples of keyshapes are a circle, an ellipse, a square, and a line, among others.

As shown in Fig. 2, our technique consists of three stages: (1) keyshape detection, which detects simple shapes in an input image; (2) local descriptor computation, which locally represents the spatial relationship between a reference shape and the others; and (3) matching, which computes a cost value after establishing a relation between two sets of local descriptors. The following sections describe each of these stages in detail.

Fig. 2

A graphical representation of the stages involved in our SBIR approach

3.1 Keyshapes detection

In this section, we describe the first stage of our proposal. This stage mainly involves obtaining a sketch-like representation, in particular for test images, which are not sketches themselves, and detecting a set of keyshapes that will be used for computing local descriptors.

3.1.1 Sketch-like representation

We detect keyshapes in both the input sketch (the query) and the images from the database (test images). Because test images are regular color images, comparing a sketch with a test image requires transforming the test image into a sketch-like representation. A simple way to carry out this transformation is an edge detection procedure; to this end, we use the Canny operator [3]. We prefer the Canny method over other methods like the Berkeley boundary detector [20] for two reasons: (1) Canny allows us to get edge pixels accurately, and (2) computing Canny is much faster than the Berkeley approach. Figure 3 depicts an image and its edge map computed using the well-known Canny approach with σ = 0.3.

Fig. 3

Simple Canny edges of a test image

A simple edge detection method produces a chaotic edge image. Many of the edge pixels do not provide relevant structural information, as we note in Fig. 3. Furthermore, many of them may result from noise, which may degrade keyshape detection and, consequently, retrieval effectiveness. To solve this problem, we apply the Canny operator in a multiscale manner. Each scale is computed by applying the Canny operator over a downsampled image; at each scale, the image is downsampled by a factor of 0.5 with respect to the previous scale. We apply the downsampling process iteratively until the size of the resulting image is less than 200 pixels in one of its dimensions. In our experiments, the downsampling stops after approximately three iterations. We show an example of this process in Fig. 4.

Fig. 4

Edge image produced by a multiscale Canny
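The stopping rule of the downsampling loop can be sketched as follows. The edge detector itself would come from an off-the-shelf Canny implementation; the function below only computes the image size at each scale:

```python
def multiscale_sizes(height, width, factor=0.5, min_side=200):
    """Image sizes at each scale of the multiscale Canny scheme:
    downsample by `factor` until one dimension of the result drops
    below `min_side`. The edge detector is applied at each of these
    sizes; only the coarsest edge map is kept."""
    sizes = [(height, width)]
    while min(sizes[-1]) >= min_side:
        h, w = sizes[-1]
        sizes.append((int(h * factor), int(w * factor)))
    return sizes
```

For a 1600×1200 image this yields three downsampling iterations, which matches the behavior observed in our experiments.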

In the same vein, we apply the downsampling approach to the query. However, because a query is already a sketch, we apply a thinning operation [14] instead of the Canny operator. The result of this approach is shown in Fig. 5.

Fig. 5

Edge images produced by a multiscale approach over a sketch

After the multiscale stage we keep only the edge map of the third scale, where noise has been reduced considerably. In addition, small details have been deleted, leaving high-scale edges that represent the object at a coarser level.

Having a sketch-like image for both the test image and the query, the next step is to obtain an abstract representation that allows us to detect simple shapes easily. A good approach for getting such an abstract representation is to decompose the sketch-like image into strokes.

Definition 3

A stroke is a set of pixels produced when a user makes a line-based hand-drawing between a “pen down” and a “pen up” event.

Considering that the underlying images are produced in an offline environment, information about the real strokes is not available. To face this problem, we propose to approximate real strokes by edge links.

3.1.2 Edge links

An edge link is a sequence of edge pixels starting and ending in a branching or terminal point. In Fig. 6, branching points are shown enclosed by circles and terminal points are shown enclosed by squares. We call the branching or terminal points simply breakpoints.

Fig. 6

Branching and terminal points for two sketch images

A simple algorithm for computing edge links traces neighboring edge pixels from one breakpoint until another breakpoint is reached. Given a binary image as input, the output of the algorithm is a set of edge links. The complete description is presented in Algorithm 1. A nice implementation of this algorithm is provided by Kovesi [17].

The edge link computation described in Algorithm 1 returns a set of edge links that approximate real strokes. For this reason, henceforth we call these edge links simply “strokes”. In Fig. 7, the image of Fig. 6 is depicted with its corresponding strokes, shown in different colors.

Fig. 7

Strokes approximated by edge links. Different colors indicate different strokes.
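A minimal version of the tracing in Algorithm 1 might look as follows. This is a simplified sketch (pure NumPy, 8-connectivity); closed loops containing no breakpoint are ignored here:

```python
import numpy as np

N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbors(img, p):
    r, c = p
    return [(r + dr, c + dc) for dr, dc in N8
            if 0 <= r + dr < img.shape[0] and 0 <= c + dc < img.shape[1]
            and img[r + dr, c + dc]]

def edge_links(img):
    """Trace edge links in a binary edge image: each link is a pixel
    sequence running from one breakpoint (terminal or branching point)
    to another breakpoint."""
    pts = list(zip(*np.nonzero(img)))
    deg = {p: len(neighbors(img, p)) for p in pts}
    breakpoints = [p for p in pts if deg[p] != 2]
    used = set()  # directed pixel-pair edges already traversed
    links = []
    for bp in breakpoints:
        for nb in neighbors(img, bp):
            if (bp, nb) in used:
                continue
            link = [bp]
            prev, cur = bp, nb
            used.add((bp, nb)); used.add((nb, bp))
            while True:
                link.append(cur)
                if deg[cur] != 2:  # reached another breakpoint
                    break
                nxt = [q for q in neighbors(img, cur) if q != prev][0]
                used.add((cur, nxt)); used.add((nxt, cur))
                prev, cur = cur, nxt
            links.append(link)
    return links
```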

3.1.3 Keyshapes

Detecting simple shapes like circles or ellipses in an image may be time-consuming, so we propose an efficient strategy for detecting six types of keyshapes: (1) vertical line, (2) horizontal line, (3) diagonal line (slope = 1), (4) diagonal line (slope = −1), (5) arc, and (6) ellipse (see Fig. 8). The latter also includes circular shapes. It is important to note that as the number of keyshape types increases, the complexity of scaling the method to large databases also increases. For this reason, we decided to keep just six keyshape types. In spite of the small number of classes, we still obtain important structural information from images. Furthermore, we chose these six primitives because they are basic shapes from which more complex shapes are formed. We now describe the complete process for detecting keyshapes.

Fig. 8

The six classes of keyshapes used in this proposal: four types of lines, an arc, and an ellipse

Considering that a stroke may include one or more keyshapes (see Fig. 9), the first step is to divide each stroke S, computed previously, into a set of one or more stroke pieces SP_S. To this end, a stroke S is divided at its inflection points.

Fig. 9

On the left, a synthetic example showing inflection points on a stroke. On the right, straight lines (dashed lines) approximating stroke pieces

Definition 4

An inflection point is a point where the local edge pixels around it are distributed significantly in two directions.

To determine whether a stroke point q is actually an inflection point, we compute the two eigenvalues (λ1, λ2) of the covariance matrix of the points falling in a local region around q. Then, we evaluate the ratio r = max(λ1, λ2)/min(λ1, λ2) and mark q as an inflection point if r is greater than a threshold, which we experimentally set to 10.
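A direct transcription of this test might look as follows. The window half-size `radius` is our assumption, since the paper does not fix the size of the local region:

```python
import numpy as np

def is_inflection_point(stroke, q_idx, radius=5, thresh=10.0):
    """Inflection-point test: eigenvalues of the covariance of the
    stroke points in a local window around q; q is marked when the
    eigenvalue ratio exceeds the threshold (set to 10 in the paper)."""
    lo, hi = max(0, q_idx - radius), min(len(stroke), q_idx + radius + 1)
    pts = np.asarray(stroke[lo:hi], dtype=float)
    cov = np.cov(pts.T)
    l1, l2 = np.linalg.eigvalsh(cov)
    small = max(min(abs(l1), abs(l2)), 1e-12)  # guard degenerate cases
    return max(abs(l1), abs(l2)) / small > thresh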

Evaluating all points of a stroke in search of inflection points is a time-consuming task. To face this problem, instead of applying the inflection point test to all stroke points, we test only a small set of them, which we call the set of maximum deviation points (SoM). We obtain the SoM by approximating a stroke S, defined by a set of points S = {p_i, ⋯, p_j}, by straight lines. To this end, a line L is set between p_i and p_j. Then we look for the point p_k ∈ S (k = i ⋯ j) with the maximum distance δ_M to L. We call p_k a maximum deviation point. If δ_M > μ, we recursively process the resulting substrokes S_1 = {p_i, ⋯, p_k} and S_2 = {p_{k+1}, ⋯, p_j}. Finally, all the maximum deviation points found by this method form the SoM. This procedure is described in Algorithm 2.

Note that Algorithm 2 returns the set of maximum deviation points (SoM) together with their corresponding deviation values. In Fig. 9 we show a synthetic example of the decomposition of a stroke into a set of stroke pieces.
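The recursion of Algorithm 2 can be sketched as a Douglas-Peucker-style split, where the tolerance `mu` is illustrative:

```python
import numpy as np

def max_deviation_points(stroke, mu=2.0):
    """Recursive split of a stroke into straight-line approximations:
    find the point of maximum perpendicular distance to the chord
    p_i--p_j; if it exceeds mu, record it and recurse on both halves.
    Returns the SoM as sorted (index, deviation) pairs."""
    pts = np.asarray(stroke, dtype=float)
    som = []

    def split(i, j):
        if j <= i + 1:
            return
        a, b = pts[i], pts[j]
        ab = b - a
        norm = max(np.hypot(ab[0], ab[1]), 1e-12)
        # perpendicular distance of each interior point to the chord
        d = np.abs(ab[0] * (pts[i+1:j, 1] - a[1])
                   - ab[1] * (pts[i+1:j, 0] - a[0])) / norm
        k = i + 1 + int(np.argmax(d))
        dk = float(d.max())
        if dk > mu:
            som.append((k, dk))
            split(i, k)
            split(k, j)

    split(0, len(pts) - 1)
    return sorted(som)
```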

After the stroke decomposition stage, we get a set of stroke pieces {sp_1, ⋯, sp_K} for a given stroke S. The next task is to classify each of these stroke pieces as one of the six predefined keyshape types. In addition, from the SoM we can obtain the number of straight lines approximating a stroke piece; let N_i be this number for a stroke piece sp_i. This will be valuable information for the subsequent steps.

To determine the occurrence of a keyshape, we start by applying a test that determines whether sp_i is an ellipse. We apply this test only if N_i ≥ 2. To this end, we use the function testEllipse, described below, which returns a fitness value indicating how well sp_i is approximated by an ellipse, together with the parameters of the approximating ellipse. If N_i < 2 or the ellipse fitness value is less than a threshold, the stroke piece sp_i may be composed of arcs or straight lines. In this case, if N_i = 1, a line is detected. Otherwise, we apply a test for detecting arcs using the function testArc, described later, which returns an arc approximation error together with the parameters of the approximating arc. If the arc approximation error is greater than a threshold, we divide sp_i at the point of maximum deviation using the SoM, and the two resulting substrokes are tested recursively for arcs and lines. The keyshape detection algorithms are described in Algorithms 3 and 4.

3.1.4 Detecting ellipses

The testEllipse function takes a stroke piece SP and tries to approximate it by an ellipse. To do so, we use the general ellipse equation as in [30]:

$$ \label{eq:ellipse} Ax^2 + 2Bxy+Cy^2+2Dx + 2Ey + 1 =0 $$
(1)

where the five parameters (x_c, y_c, r_max, r_min, θ), corresponding to the center of the ellipse (x_c, y_c), the maximum and minimum radii (r_max, r_min), and the angle of the major axis, are obtained as follows:

$$x_c = \frac{BE-CD}{W} \\ $$
(2)
$$y_c = \frac{DB-AE}{W} \\ $$
(3)
$$r_{\mathrm{max}} = \sqrt{\frac{-2\cdot det(M)}{W(A+C-R)}}\\ $$
(4)
$$r_{\mathrm{min}} = \sqrt{\frac{-2\cdot det(M)}{W(A+C+R)}}\\ $$
(5)
$$\theta = \frac{1}{2}tan^{-1} \left( \frac{2B}{A-C}\right), $$
(6)

where

$$W=AC-B^2\\ $$
(7)
$$R=\sqrt{(A-C)^2+4B^2}\\ $$
(8)
$$M= \left( \begin{array}{ccc} A & B & D \\ B & C & E \\ D & E & 1 \end{array} \right). $$
(9)

To solve (1) with respect to a stroke piece SP = {p_1, ⋯, p_n}, we take the following five points: \(p_1, p_\frac{n}{4}, p_\frac{n}{2}, p_\frac{3n}{4}, p_{n}\). We then validate the estimated ellipse with respect to the pixels in the image using a fitness function defined by Yao et al. [30].
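The estimation step of testEllipse can be sketched as follows: the five sampled points give five linear equations in (A, B, C, D, E), and (2)-(9) recover the geometric parameters. We use `arctan2` for (6), a numerically safer equivalent of the arctangent of the ratio:

```python
import numpy as np

def fit_ellipse_5pts(points):
    """Solve the general conic Ax^2 + 2Bxy + Cy^2 + 2Dx + 2Ey + 1 = 0
    through five sampled stroke points, then recover the geometric
    parameters (center, radii, orientation)."""
    P = np.asarray(points, dtype=float)
    x, y = P[:, 0], P[:, 1]
    # one linear equation per point: A x^2 + 2B xy + C y^2 + 2D x + 2E y = -1
    Msys = np.column_stack([x**2, 2*x*y, y**2, 2*x, 2*y])
    A, B, C, D, E = np.linalg.solve(Msys, -np.ones(5))
    W = A*C - B**2
    R = np.hypot(A - C, 2*B)
    M = np.array([[A, B, D], [B, C, E], [D, E, 1.0]])
    xc = (B*E - C*D) / W
    yc = (D*B - A*E) / W
    r_max = np.sqrt(-2*np.linalg.det(M) / (W*(A + C - R)))
    r_min = np.sqrt(-2*np.linalg.det(M) / (W*(A + C + R)))
    theta = 0.5*np.arctan2(2*B, A - C)
    return xc, yc, r_max, r_min, theta
```

For a circle (a special ellipse) the two recovered radii coincide.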

The fitness function takes each point p of the approximated ellipse E and looks for the closest edge pixel in the image. Letting d_p be the distance from p to the closest edge pixel on SP, we compute the fitness value as:

$$ fitness_E=\frac{\sum_{\forall p \in E} \frac{1}{exp(\gamma d_{p})}}{|E|}, $$
(10)

where the distance function is the Manhattan distance and γ is a regularization factor that we set to 0.2. The fitness value varies from 0 to 1, where 1 indicates a perfect approximation and 0 a null approximation.
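A direct implementation of the fitness value (10), assuming the approximated ellipse has already been sampled into a point list:

```python
import numpy as np

def ellipse_fitness(ellipse_pts, edge_pts, gamma=0.2):
    """Fitness (10): for each sampled ellipse point, find the closest
    stroke edge pixel under the Manhattan distance (as in the paper)
    and average exp(-gamma * d). A value of 1 means a perfect fit."""
    E = np.asarray(ellipse_pts, dtype=float)
    S = np.asarray(edge_pts, dtype=float)
    # pairwise Manhattan distances (|E| x |S|), then closest per ellipse point
    d = np.abs(E[:, None, :] - S[None, :, :]).sum(axis=2).min(axis=1)
    return float(np.exp(-gamma * d).mean())
```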

3.1.5 Detecting arcs

Since an arc is actually a segment of a circle, the testArc function tries to approximate a circle with five points selected from the underlying stroke piece.

These five points are chosen in the same way as in the case of the testEllipse function. The resulting parameters are the circle center (x c , y c ), and the circle radius r. To evaluate how well a stroke piece SP is approximated by an arc, we compute an approximation error in the following way:

$$ error_A=\displaystyle \sum\limits_{p \in SP} |r -dist(p, (x_{c}, y_{c}))| $$
(11)

where dist is the Euclidean distance.

After detecting ellipses, arcs, or straight lines, we represent each detected keyshape with a set of parameters. In this way, each keyshape is specified as follows:

  • Lines: [x_1, y_1, x_2, y_2, L, ι], where (x_1, y_1) is the initial point and (x_2, y_2) is the final point of the detected line, and L is the line length. In addition, we use ι to indicate the type of the line (horizontal, vertical, diagonal with slope 1, diagonal with slope −1).

  • Arcs: [x_c, y_c, r], where (x_c, y_c) is the center of the estimated circle and r is the corresponding estimated radius.

  • Ellipses: [x_c, y_c, r_max, r_min, θ], the five parameters of an ellipse: center (x_c, y_c), maximum and minimum radii (r_max, r_min), and orientation θ.

Finally, we produce a new stroke representation by means of an image, called the keyshape image, on which each detected keyshape is drawn. The keyshape image may be regarded as a normalized stroke-based representation. Two examples of keyshape images produced by the keyshape detection process are shown in Fig. 10.

Fig. 10

Test images on the left and their keyshape images on the right.

3.2 Local descriptors

In this section we focus on describing a local region around each keyshape, aiming to represent the distribution of keyshapes. We know from Definition 1 that the distribution of the object components is a very important aspect of the structural characterization of the underlying objects. In this stage, we propose two local descriptors for the sketch-based image retrieval problem, characterized by taking into account the spatial information of the keyshapes.

The first descriptor is called the Keyshape Angular Spatial Descriptor (KASD). It divides a local region angularly and computes a local histogram for each slice; the concatenated per-slice histograms form the descriptor. Angular partitioning has been used in the computer vision field for object recognition [22] and for describing line-based hand-drawings in a global manner [5].

The second descriptor is named the Histogram of Keyshape Orientations (HKO). It is a SIFT-like descriptor that computes a local histogram of keyshape orientations in a local region around a reference keyshape.

3.2.1 Local region

We define a local square region with respect to a keyshape (the reference keyshape) in order to compute a local descriptor that characterizes the spatial distribution of the keyshapes around it. The center of the local region coincides with the center of the underlying keyshape. Thus, let k be a keyshape. If k is a line, the local region is centered on the center of that line. If k is an arc, the local region is centered on the center of the circle containing the arc. If k is an ellipse, the region is centered on the ellipse center.

To deal with scale variations, we define the region size depending on the keyshape size. Thus, let k be a keyshape and l the side length of the underlying square region; we compute l as follows.

  • If k is a line then l = length(k) · η.

  • If k is an arc then l = radius(k) · η.

  • If k is an ellipse then l = major_radius(k) · η.

The symbol η is a constant that we set to 3. Figure 11 shows a keyshape with the scope of its corresponding local region.

Fig. 11

The local region around a diagonal-type keyshape (marked with a red filled circle)

3.2.2 Keyshape Angular Spatial Descriptor (KASD)

Let k be a reference keyshape. We divide the local region into angular partitions around k, as shown in Fig. 12a. The number of angular regions or slices (N_SLICES) is fixed (we suggest using four or eight slices). For each slice, we compute a local histogram that represents the local distribution of the keyshape points with respect to the six keyshape types. Consequently, we obtain a 6-bin histogram for each slice. We then concatenate the local histograms of all slices to form a descriptor of size N_SLICES × 6. Finally, we normalize the descriptor to unit length.

Fig. 12

On the left, an angular partitioning descriptor. On the right, a histogram of orientations descriptor

The spatial distribution of keyshapes is represented by the local histograms spread over the slices into which the local region is partitioned.
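A KASD sketch under an assumed encoding of keyshape types as indices 0..5: points of neighboring keyshapes, already restricted to the local region, are binned by angular slice and by type:

```python
import numpy as np

def kasd(ref_center, keyshape_pts, keyshape_types, n_slices=8, n_types=6):
    """KASD sketch: for each keyshape point inside the local region,
    find its angular slice around the reference keyshape's center and
    increment the bin of its keyshape type. The concatenated per-slice
    histograms are normalized to unit length."""
    cx, cy = ref_center
    hist = np.zeros((n_slices, n_types))
    for (x, y), t in zip(keyshape_pts, keyshape_types):
        ang = np.arctan2(y - cy, x - cx) % (2*np.pi)
        s = min(int(ang / (2*np.pi) * n_slices), n_slices - 1)
        hist[s, t] += 1
    v = hist.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```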

3.2.3 Histogram of Keyshape Orientations (HKO)

Let k be a reference keyshape. We divide the local region around k into a 2×2 grid, where each cell of the grid is called a subregion, as shown in Fig. 12b. For each subregion we compute a histogram of orientations. The orientation angle varies between 0 and π and is quantized into 8 bins. The keyshape orientations are computed by approximating local gradients with Sobel masks [13]. The local gradients are computed over the keyshape image, from which many outliers have been removed; therefore, computing orientations over the keyshape image is more robust than computing them over a simple edge map. The final descriptor is the concatenation of the four histograms of orientations: a 32-size descriptor resulting from one 8-bin histogram for each of the four subregions. As before, we normalize the descriptor to unit length.
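An HKO sketch over a local patch of the keyshape image. The Sobel masks are written out explicitly, and only pixels with nonzero gradient magnitude vote:

```python
import numpy as np

def hko(patch):
    """HKO sketch: Sobel gradients over a local region of the keyshape
    image, orientations quantized to 8 bins in [0, pi), one 8-bin
    histogram per cell of a 2x2 grid, concatenated (32 bins) and
    normalized to unit length."""
    p = np.asarray(patch, dtype=float)
    gx = np.zeros_like(p); gy = np.zeros_like(p)
    # Sobel approximations of the partial derivatives (borders left zero)
    gx[1:-1, 1:-1] = (p[:-2, 2:] + 2*p[1:-1, 2:] + p[2:, 2:]
                      - p[:-2, :-2] - 2*p[1:-1, :-2] - p[2:, :-2])
    gy[1:-1, 1:-1] = (p[2:, :-2] + 2*p[2:, 1:-1] + p[2:, 2:]
                      - p[:-2, :-2] - 2*p[:-2, 1:-1] - p[:-2, 2:])
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi  # orientation folded into [0, pi)
    h, w = p.shape
    desc = []
    for rs in (slice(0, h//2), slice(h//2, h)):
        for cs in (slice(0, w//2), slice(w//2, w)):
            bins = np.minimum((ori[rs, cs] / np.pi * 8).astype(int), 7)
            hist = np.bincount(bins[mag[rs, cs] > 0], minlength=8)[:8]
            desc.append(hist)
    v = np.concatenate(desc).astype(float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```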

3.2.4 Combined Descriptor (CD)

To take advantage of both descriptors discussed previously, we propose a combined descriptor composed of two parts: the first corresponds to the Keyshape Angular Spatial Descriptor (KASD) and the second to the Histogram of Keyshape Orientations (HKO). A schematic example of the combined descriptor is depicted in Fig. 13. In this case, we reduce the number of slices for the KASD representation to 4. Thus, the combined descriptor is a 56-size vector, where the first 24 bins correspond to the KASD descriptor and the last 32 bins to the HKO descriptor. As with both individual descriptors, we normalize the combined descriptor to unit length.

Fig. 13

Scheme of the combined descriptor using a KASD descriptor with 24 bins and an HKO descriptor with 32 bins

3.3 Matching

As seen in the previous sections, both a test image and an input sketch are represented by a set of local descriptors whose size depends on the complexity of the underlying image (for instance, see the images in Fig. 10). The next step is to map the descriptors of an input sketch to the descriptors of a test image in order to get a similarity or dissimilarity score. This score allows us to rank the test images in decreasing order of similarity or, equivalently, in increasing order of dissimilarity.

Let S be an input sketch and LD(S) be the set of local descriptors of S. We define LD(S) as:

$$ LD(S)= \displaystyle \bigcup\limits_{t \in \{v,h,d_1,d_{-1},a,e\}} LD_{t}(S) $$
(12)

where LD_v(S) is the set of local descriptors of vertical lines in S. Likewise, LD_h(S) is the set for horizontal lines, \(LD_{d_1}(S)\) for diagonal lines with slope 1, \(LD_{d_{-1}}(S)\) for diagonal lines with slope −1, LD_a(S) for arcs, and LD_e(S) for ellipses.

Further, let I be a test image to be compared with S. As with S, we define LD(I) as:

$$ LD(I)=\displaystyle \bigcup\limits_{t \in \{v,h,d_1,d_{-1},a,e\}} LD_{t}(I) $$
(13)

The matching process is performed between local descriptors corresponding to the same keyshape type. Since the number of local descriptors is much lower than in the case of keypoints like those used by the SIFT or shape context approaches [1, 19], we can solve an instance of the bipartite matching problem with the well-known Hungarian method [18] between LD_t(S) and LD_t(I), for t = v, h, d_1, d_{−1}, a, e. The final match results from the union of the partial matches.

Specifically, the Hungarian method solves an instance of the assignment problem where, given two sets A and B, the goal is to assign objects of B to objects of A, producing a one-to-one relationship between A and B. Each assignment between two objects incurs a cost, known as the assignment cost, and the objective is to find the assignment that minimizes the total assignment cost.

Coming back to our case, A is the set of local descriptors of S (LD(S)) and B is the set of local descriptors of I (LD(I)). The objects are unit vectors representing local descriptors, and the cost function we use is the Manhattan distance.
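As a concrete illustration, the per-type matching can be sketched with SciPy's linear_sum_assignment solver (a modified Jonker-Volgenant algorithm that solves the same assignment problem as the Hungarian method); the toy descriptor vectors and their dimensionality below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_descriptors(lds_sketch, lds_image):
    """Match two sets of local descriptors of one keyshape type by
    solving the assignment problem with a Manhattan (L1) cost."""
    A = np.asarray(lds_sketch, dtype=float)  # shape (m, d), unit vectors
    B = np.asarray(lds_image, dtype=float)   # shape (n, d)
    # Pairwise Manhattan distances: cost[i, j] = ||A[i] - B[j]||_1
    cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    # Each match is (sketch descriptor index, image descriptor index, cost)
    return [(int(i), int(j), float(cost[i, j])) for i, j in zip(rows, cols)]
```

The solver handles rectangular cost matrices, so the two descriptor sets need not have the same size; unassigned descriptors are simply left unmatched.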

Let M(S,I) be the resulting match between S and I from the previous process, defined as follows:

$$ M(S,I) = \{(i_S, j_I, c) \mid (i_S, j_I, c) \textrm{ is a match whose cost is } c,\ i_S \in LD(S) \wedge j_I \in LD(I)\}, $$
(14)

we define the following properties on M:

  • |M(S,I)|: The cardinality of M, i.e., the number of matches between S and I.

  • A(M(S,I)): The average match cost defined as:

    $$ A(S,I)=\frac{1}{|M(S,I)|}\sum\limits_{(i,j,c) \in M(S,I)} c. $$
    (15)

    In the case that no matches occur, A(S,I) = max(|LD(S)|, |LD(I)|).

  • U(M(S,I)): The number of unmatched descriptors. It is defined as follows:

    $$ U(S,I) = \mathrm{max}(|LD(S)|, |LD(I)|) - |M(S,I)| $$
    (16)
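A minimal sketch of how |M(S,I)|, A(S,I), and U(S,I) could be computed from a list of (i, j, c) matches, following Eqs. (15) and (16) and the no-match convention above:

```python
def match_properties(matches, n_sketch, n_image):
    """Compute |M(S,I)|, A(S,I) and U(S,I) from a list of matches
    (i, j, c), where n_sketch = |LD(S)| and n_image = |LD(I)|."""
    m = len(matches)                                    # |M(S,I)|
    if m > 0:
        avg_cost = sum(c for _, _, c in matches) / m    # A(S,I), Eq. (15)
    else:
        avg_cost = max(n_sketch, n_image)               # no-match convention
    unmatched = max(n_sketch, n_image) - m              # U(S,I), Eq. (16)
    return m, avg_cost, unmatched
```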

3.3.1 Match filtering

The matching is followed by a filtering step that aims to discard matches discordant with a pose estimation. To this end, we apply a vote-based approach that uses the spatial information of the matched keyshapes to estimate a geometric transformation. This transformation considers the scale and location parameters of the computed keyshapes. Each match votes for a certain transformation, and the predominant transformation corresponds to the pose estimation. Only the matches that voted for this pose are retained; the remaining matches are discarded. Algorithm 5 shows how to filter a set of matches.

In Algorithm 5, scale(k) gives the scale of a keyshape k: the scale of a line is its length, the scale of an arc is its radius, and the scale of an ellipse is its major radius. In addition, for the quantization step we divide both the x-dimension and the y-dimension by a factor of 3.
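Since Algorithm 5 is not reproduced here, the following is only a hypothetical sketch of the voting idea: each match votes for a quantized (scale ratio, translation) bin, and only the matches in the predominant bin survive. The keyshape record fields (scale, x, y) are assumptions for illustration:

```python
from collections import defaultdict

def filter_matches(matches, keyshapes_s, keyshapes_i, grid=3):
    """Hypothetical sketch of the vote-based match filter: each match
    (i, j, c) votes for a quantized transformation; only matches in
    the predominant bin (the pose estimation) are kept."""
    votes = defaultdict(list)
    for (i, j, c) in matches:
        ks, ki = keyshapes_s[i], keyshapes_i[j]
        # Quantize the scale ratio and the location offset (both the
        # x- and y-dimension divided by a factor of 3).
        s = round(ki["scale"] / ks["scale"], 1)
        dx = (ki["x"] - ks["x"]) // grid
        dy = (ki["y"] - ks["y"]) // grid
        votes[(s, dx, dy)].append((i, j, c))
    # The predominant transformation corresponds to the pose estimation.
    return max(votes.values(), key=len)
```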

3.3.2 Dissimilarity score

After the match filtering step, we need a score that represents the similarity or dissimilarity between the two images. To this end, we define three dissimilarity functions based on the average match cost, the number of matches, and the number of unmatched descriptors.

  1. D 1(S,I) = A(S,I) + β·U(S,I).

  2. D 2(S,I) = 1/|M(S,I)|.

  3. D 3(S,I) = A(S,I)/|M(S,I)|.

Each dissimilarity function can be regarded as a cost function. D 1 and D 3 are based on the average match cost A(·, ·). The number of unmatched descriptors U(·, ·) can also express a cost: as U(·, ·) increases, the resemblance between the compared images decreases. In contrast, as the number of matches |M(·, ·)| increases, the resemblance between the images also increases; therefore, D 2 and D 3 use the number of matches inversely. Finally, in D 1 we weight the number of unmatched descriptors by a factor β = 0.15 to avoid a bias effect due to the different ranges of the two involved functions.
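The three dissimilarity functions can be written directly from the quantities defined above; beta defaults to the value 0.15 used in our experiments:

```python
def dissimilarities(avg_cost, n_matches, unmatched, beta=0.15):
    """Compute D1, D2 and D3 from A(S,I), |M(S,I)| and U(S,I).
    Assumes n_matches > 0 (the no-match case falls back to the
    convention A(S,I) = max(|LD(S)|, |LD(I)|))."""
    d1 = avg_cost + beta * unmatched   # D1: average cost + weighted unmatched
    d2 = 1.0 / n_matches               # D2: inverse number of matches
    d3 = avg_cost / n_matches          # D3: average cost per match
    return d1, d2, d3
```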

To exploit the benefits of all the dissimilarity functions defined above, we compute different rankings combining local descriptors with dissimilarity functions. We define \(r^S_{(D,F)}\) as the resulting ranking for an input sketch S, where D indicates one of the local descriptors (KASD, HKO, or CD) and F is one of the dissimilarity functions (D 1, D 2, or D 3) used to sort the test images.

Let rank(r, Γ) be a function that provides the position where a test image Γ appears in the ranking r. For computing the final rank, we need to determine a new score (the final score) for each test image. This score is computed as follows:

$$ final\_score(\Gamma)=\displaystyle \sum\limits_{\forall r} rank(r,\Gamma). $$
(17)

The final ranking is then formed in increasing order with respect to the new scores: images with low scores appear first. Therefore, if an image appears at the top of all rankings, it will also appear at the top of the combined ranking.
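The rank-sum combination of Eq. (17) can be sketched as follows, representing each ranking as a list of image identifiers ordered from best to worst:

```python
def combine_rankings(rankings):
    """Combine several rankings by summing, for each test image, its
    position in every ranking (Eq. 17); lower final scores rank first."""
    scores = {}
    for r in rankings:                          # r: image ids, best first
        for pos, image in enumerate(r, start=1):
            scores[image] = scores.get(image, 0) + pos
    # Final ranking: increasing order of the accumulated scores.
    return sorted(scores, key=lambda image: scores[image])
```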

In order to get the rankings, we propose to use the following:

  • \(r_{(CD,D_2 )}\): Ranking produced by the combined descriptor and the dissimilarity function D 2.

  • \(r_{(KASD_4, D_3)}\): Ranking produced by the KASD descriptor using 4 partitions and the dissimilarity function D 3.

  • \(r_{(KASD_8, D_1)}\): Ranking produced by the KASD descriptor using 8 partitions and the dissimilarity function D 1.

3.4 Computational complexity

In this section, we discuss the computational complexity of our approach. For the sake of clarity, we divide the discussion into three parts. First, we discuss the complexity of the keyshape detection stage. Second, we discuss the complexity of computing the proposed descriptors. Finally, we discuss the complexity of comparing a test image with an input sketch.

3.4.1 Keyshape detection

This stage involves several algorithms: pre-processing tasks, edgelink detection, inflection point detection, and keyshape classification.

  • Pre-processing: This task consists in analyzing each pixel of an image to obtain an edge map representation. The complexity of this task is O(N), where N is the number of pixels of the underlying image.

  • EdgeLink: In this stage, we trace the edge map pixels to determine potential strokes. Therefore, the complexity of this algorithm is O(E), where E is the number of edge pixels.

  • Set of Maximum Deviation Points: To get the Set of Maximum Deviation points (SoM), the algorithm receives a set of strokes and, by approximating straight lines, determines points of maximum deviation with respect to those lines. To this end, the algorithm evaluates a number of points proportional to the size of E. Therefore, the complexity of getSoM is O(E).

  • Inflection Points: To determine inflection points, the algorithm evaluates each maximum deviation point together with a local region of 25×25 pixels. Since the size of SoM is less than the size of E, the complexity of this algorithm is also bounded by O(E).

  • Detection of Keyshapes: In this stage, each stroke piece is evaluated to determine which keyshape it corresponds to. The algorithm for classifying arcs and lines may have a cost proportional to the number of straight lines that approximate a stroke piece. Therefore, this stage has a complexity of O(NL), where NL is the number of approximating straight lines, which is less than the size of E.

Therefore, the complexity of detecting keyshapes is O(N + E). Since N > E, the complexity is linear with respect to the image size.

3.4.2 Keyshape descriptors

We have presented two descriptors. The first, KASD, processes an image in time O(K²), where K is the number of keyshapes. The second, HKO, runs in time O(K×W), where W is the size of the local region around which a descriptor is computed.

3.4.3 Keyshape matching

We use the Hungarian method to find a set of matches between two sets of keyshapes. This algorithm runs in cubic time with respect to the size of the input [28]. As our input is a set of keyshapes, the algorithm runs in O(K³) time, where K is the number of keyshapes. In practice, the average number of keyshapes computed over our data set is approximately 30, which entails a low computational cost for matching. Experimentally, the time for comparing two sets of keyshapes is approximately 8 ms.

In conclusion, the total cost of comparing two images is O(N + K×W + K³), where N is the number of pixels in the image, K is the number of keyshapes, and W is the size of the local region used for computing local descriptors.

4 Experimental evaluation

4.1 The benchmark

Because sketch-based image retrieval is a young research area, there are only a few benchmarks for comparing SBIR methods. In this regard, Eitz et al. [11] conducted a systematic study to propose a benchmark in this area. We chose this benchmark because it is the only one that takes into account the user's opinion about the relevance of an image with respect to an input sketch. This is very important because the main goal of a retrieval system is, precisely, to compute a ranking close to what the user expects to get.

This benchmark consists of 31 query sketches, each one associated with a set of 40 test images. An example of two query sketches and seven of their associated test images in the data set are shown in Fig. 14.

Fig. 14
figure 14

Two query sketches with seven associated images.

Furthermore, each test image has been ranked by people using a 7-point Likert scale [11], with 1 representing the best rank and 7 the worst. This provides the baseline ranking, which is called the user ranking.

We then compare the ranking of the proposed method against the user ranking. To this end, Eitz et al. propose to use Kendall's correlation τ, which determines how similar two rankings are. The correlation coefficient τ varies from −1 to 1: −1 indicates that one ranking is the reverse of the other, 0 indicates independence between the rankings, and 1 indicates that the two rankings have the same order. Therefore, the closer the correlation value is to 1, the better the proposal, since the resulting ranking is very similar to the one the users expect.

Considering that both the user ranking and the ranking of a proposed method may include tied scores, a variation of Kendall's correlation is used. This variation is denoted by τ b and is defined as:

$$ \tau_b = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}}, $$
(18)

where

  • n c  = Number of concordant pairs.

  • n d  = Number of discordant pairs.

  • n 0 = n(n − 1)/2.

  • n 1 = \(\sum_{i=1}^{t} t_i(t_i-1)/2\).

  • n 2 = \(\sum_{i=1}^{u} u_i(u_i-1)/2\).

  • t i  = Number of tied values in the ith group of ties for the user ranking.

  • u i  = Number of tied values in the ith group of ties for the proposed ranking.

  • t = Number of groups of ties for the user ranking.

  • u = Number of groups of ties for the proposed ranking.
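For reference, τ b need not be implemented by hand: SciPy's kendalltau computes this tie-corrected variant by default. The score vectors below are toy data for illustration, not benchmark values:

```python
from scipy.stats import kendalltau

# scipy.stats.kendalltau computes the tau-b variant by default, which
# corrects for tied scores in either ranking (the n1 and n2 terms above).
user_scores = [1, 2, 2, 3, 4, 5, 7]    # toy Likert-style user scores
method_scores = [1, 3, 2, 3, 5, 4, 7]  # hypothetical method scores
tau_b, p_value = kendalltau(user_scores, method_scores)
```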

4.2 Involved parameters

To make our approach repeatable, we present in Table 1 the values of the parameters used during the experimental evaluation.

Table 1 Values of the parameters involved in our approach used during the evaluation process

4.3 Discussion of the results

In the experiments conducted by Eitz et al. [11], the best correlation value they achieved was 0.277. This result was obtained using a Bag of Features methodology with a variant of the histogram of gradients descriptor. The approach uses a codebook of size 1000, i.e., the codebook is formed by 1000 codewords. In addition, a correlation value under 0.18 is reported for the Shape Context method [1].

Our results show a slight increase in the effectiveness of the retrieval process without requiring a tedious learning stage: we achieve a correlation value of 0.289. A graph showing the correlation values for different SBIR methods is depicted in Fig. 15. Considering that our proposal is based on the structural features of a sketch representation, which have not been taken into account by current methods, our method is well suited for being combined with a leading method to increase retrieval effectiveness.

Fig. 15
figure 15

Kendall’s correlations for SBIR methods. Our proposal (the last bar on the right) outperforms the state of the art: our keyshape-based approach achieves a correlation value of 0.289, which exceeds the correlation achieved by the BoF approach proposed by Eitz et al. (0.277)

To show the benefit of combining our method with a current leading method, we present the results of combining our proposal with the BoF approach proposed by Eitz et al. [11]. For the combination, we follow the same approach used to combine our two proposed descriptors. In Fig. 16 we show that the combined method (Keyshape+BoF) achieves a correlation value of 0.337, an increase of almost 22 % with respect to the correlation value achieved by the BoF approach. To validate these results, we ran a statistical test (t-test) comparing the results produced by the BoF approach with those achieved by our combined proposal. The test gave a p-value of 0.1 %, which indicates that our combined method significantly improves on the BoF approach.

Fig. 16
figure 16

Kendall’s correlations for the approaches: BoF (0.277), Keyshape-based (0.289), and the Keyshape+BoF (0.337)

Furthermore, we present in Figs. 17 and 18 the correlation value achieved for each sketch using the BoF approach, our keyshape-based proposal, and the combined method. Although our approach gains a significant improvement for some queries (for instance, queries 22 and 25), the overall improvement is not significant enough. However, the combined method achieves an improvement in 24 queries with respect to the BoF results. Additionally, the overall effectiveness is improved by almost 22 %.

Fig. 17
figure 17

Correlation values for the first 15 queries using the BoF, Keyshape-based and Keyshape+BoF approaches

Fig. 18
figure 18

Correlation values for the last 16 queries using the BoF, Keyshape-based and Keyshape+BoF approaches

From our results shown in Figs. 17 and 18, we can also note that our method achieves an impressive improvement for queries Q2 and Q25 (see Fig. 19) with respect to the correlation achieved by the BoF approach. For Q2, our keyshape-based approach achieves a correlation value of 0.3984, much better than the correlation (0.1326) achieved by Eitz's approach. For Q25, our keyshape approach achieves a correlation of 0.4373, again higher than the value achieved by Eitz's approach, which in this case is negative (−0.0258). A possible reason is that these queries can be represented by a set of very discriminative keyshapes: Q2 can be represented by a set of arcs and ellipses, while Q25 can be represented by a family of straight lines. In this sense, the keyshape-based descriptors seem very appropriate for representing them.

Fig. 19
figure 19

Sketch queries for which our method achieves a significant improvement with respect to the BoF approach. Q2 appears on the left and Q25 on the right

It is also important to note that some images may not be well represented by keyshapes. In particular, Q30 (see Fig. 20) is the query with the lowest correlation achieved by our approach (−0.0026). This poor result may be explained from the viewpoint of the test images related to this sketch. Although the sketch itself may be appropriately represented by a set of arcs or straight lines, the related images seem to be badly represented by our approach. After analysing these images, we realized that they are cluttered, which may also explain the poor performance of the BoF approach (0.0979). This shows that cluttered images are a problem not only for our approach but also for current approaches; dealing with them remains a challenging problem in sketch-based image retrieval. Additionally, to show the performance of our proposal, we present in Fig. 21 the top five retrieved images for seven input sketches.

Fig. 20
figure 20

A query for which our method achieves the lowest correlation (−0.0026)

Fig. 21
figure 21

Example of SBIR using our proposal after retrieving the first five images from the test database. The first column shows the input sketch (the query) and the next five columns show the first five retrieved images, respectively

Finally, although the benchmark used in this work focuses on evaluating the correlation of the resulting rankings, we also evaluated precision and recall after retrieving the corresponding images for each sketch. In Fig. 22 we show the precision-recall graph comparing the BoF, keyshape-based, and Keyshape+BoF approaches. This evaluation shows that the performance of the keyshape-based proposal and the BoF approach are similar. However, the combination of the two approaches allows us to increase the retrieval effectiveness, achieving a MAP of 0.87. This again shows that our keyshape-based approach exploits features different from those exploited by the BoF approach, which makes the combination feasible.
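As a reference for the precision-recall evaluation, precision and recall at a cutoff k can be computed as follows; this is a minimal generic sketch (retrieved is a ranked list, relevant the set of relevant images for the query), not the benchmark's exact protocol:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall after retrieving the first k images."""
    hits = sum(1 for image in retrieved[:k] if image in relevant)
    return hits / k, hits / len(relevant)
```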

Fig. 22
figure 22

Precision-Recall graphic comparing the BoF, Keyshape-based and Keyshape+BoF approaches

5 Conclusions

In this paper we have presented a novel local method for sketch-based image retrieval. Our method is based on detecting simple shapes called keyshapes, which allows us to obtain a structural representation of images, leading to an improvement in retrieval effectiveness. Our proposal improves on the results of state-of-the-art methods, achieving a correlation value of 0.289.

Furthermore, we analyzed a combined proposal exploiting the results of the BoF approach and those of our keyshape-based approach, showing that our method is complementary to the BoF approach. This is reflected in the results achieved by the combination of both methods: the combined method increases the retrieval effectiveness by almost 22 %. Additionally, we showed that this result is significantly better than those of the state-of-the-art methods.

In addition, we have presented an efficient method for detecting simple shapes in an image. This strategy could be applied in other applications requiring a reduction of image complexity.

Of course, sketch-based image retrieval is still a challenging task. In this vein, our current work focuses on defining a more efficient metric for comparing keyshapes. In addition, we are evaluating our approach on large-scale datasets, as well as extending it to sketch-based 3D model retrieval.