Keywords

1 Introduction

Relational databases enjoy a considerable popularity and have been used for a number of decades now. Their main advantages include durability of data storage, transaction processing, relational model, error handling and the SQL language. However, in distributed systems and in the case of data with different structure so-called NoSQL databases are used much more frequently. For Big Data processing special databases are developed, which work on special file systems supported for instance in Hadoop and dedicated frameworks for fast and parallel processing (e.g. MapReduce). NoSQL databases are different from commonly-used relational databases (RDBMS) in terms of the following aspects: not using the SQL language, not having to follow the ACID model (Atomicity, Consistency, Isolation, Durability), and also having no relationships and tables of a defined structure.

One of the tasks in which a NoSQL database can be used is effective browsing and searching a large number of images. NoSQL databases can successfully store enormous amounts of data including image data. In solving image processing and retrieval problems algorithms from different fields of computational intelligence are used [18, 22, 23, 25], in particular fuzzy systems [14, 15], rough neuro-fuzzy systems [16, 17, 26], evolutionary algorithms [20, 24], swarm intelligence [2729], mathematics [21, 30, 31], decision tree [19] and data mining [32, 33]. One of the most popular and widely spread algorithms used for indexation and image retrieval is the bag-of-words model (BoW) [4, 5], known also as a Bag-of-Features (BoF) or Bag-of-Visual-Words. This algorithm is based on a concept of text search methods within collections of documents. Single words are stored in dictionaries with emphasis on appearing in various documents. The BoW in a similar way creates dictionaries of characteristic features appearing in images. Additionally, the classification process enables during the search to determine what type of image class we are dealing with.

Practical aspects of the BoF algorithm implementation in image classification are rather rare in the literature. There are many modifications of this algorithm which, for example, use various image features [68] or various clustering and classification algorithms [9, 10], but there are no examples of practical applications of this algorithm. Most simulations and experiments are carried out with the use of OpenCV library, Matlab environment or multi-core processors. In practice direct usage of this particular kind of algorithms is connected with considerable computing capacities required when using classification algorithms and having no information concerning using data bases. The possibility of using a NoSQL database allows us to make a quite simple use of a number of computers to store large amount of data and do parallel computing. In most cases parallelism can be successfully carried out on a database which has been properly managed.

The article is divided into a few sections. Section 2 outlines the algorithms of which the whole image storing system. Section 3 presents the results of the experimental research testing the efficiency of the presented algorithms as well as the details connected with the NoSQL database being used for the implementation of this method. The conclusions in the last section present ideas concerning further improvement of the system efficiency.

2 Description of Algorithms

We are considering herein a set of given images \( {\mathbf{I}}_{i} \), where \( i = 1, \ldots ,{\mathbf{I}}_{M} \) and \( M \) is the number of all images. Each image \( {\mathbf{I}}_{i} \) has a class \( c({\mathbf{I}}_{i} ) \) assigned to it, where \( c({\mathbf{I}}_{i} ) \in \varOmega \), \( \varOmega = \{ \omega_{1} , \ldots , \omega_{C} \} \) is a set of all classes and C is the number of all classes. The images \( {\mathbf{I}}_{i} \) make the initial data which will be stored in the NoSQL database and will be used to create a dictionary for the BoF method (see Sect. 2.3). K-means algorithm (see Sect. 2.2) groups characteristic features retrieved from an image concurrently reducing their number and creates words included in the dictionary. The modification of the k-means algorithm which we introduce allows for an automatic selection of the number of groups. Image characteristic features are obtained as a result of the operation of the SURF algorithm. (see Sect. 2.1).

2.1 SURF

SURF (Speeded Up Robust Features) is a robust local feature detector, first presented in [1]. It is partly inspired by the SIFT descriptor [2]. SURF gives description of an image by selecting its characteristic features. First, an integral image and filter approximation of block Hessian determinant is applied. Next, to detect interesting points, a special Hessian-matrix approximation is used. For features, orientation is based on information from circular region around the pixel. Then, a square region aligned to selected orientation is constructed and the SURF descriptor is extracted from it. It uses the sum of the Haar wavelet responses around an interest point. The local feature around the point is described by a 64-number vector \( {\mathbf{x}} = \left[ {x_{1} , \ldots ,x_{64} } \right] \).

2.2 Modified k-Means Algorithm

The k-means algorithm is the most frequently used clustering algorithm used in the BoF. Its only drawback involves having to define the initial number of classes \( c \). In this section we present an automatic selection mechanism of the number of classes during the operation of this algorithm. We have used the growing method used in the Growing Self-Organizing Map (GSOM) algorithm [3]. In that method a cluster is divided when the number of its data exceeds a certain threshold value ϴ. Operation of the said algorithm starts with setting the threshold value ϴ and defining two clusters (\( c = 2 \)). In the subsequent steps the algorithm works as a classic k-means with the only difference being that at the end of each iteration the number of points belonging to each cluster \( \tau_{j} ,\,j = 1, \ldots ,c \) is checked. If the number \( \tau_{j} \) exceeds the threshold already set at ϴ, then another cluster \( c + 1 \) is created. The algorithm is presented below in detail.

Let \( {\mathbf{X}} = {\mathbf{x}}_{1} , \ldots ,{\mathbf{x}}_{n} \) be a set of points in d-dimensional space, and \( {\mathbf{V}} = {\mathbf{v}}_{1} , \ldots ,{\mathbf{v}}_{c} \) be cluster centers, where n is the number of samples, \( {\mathbf{x}}_{i} = \left[ {x_{i1} , \ldots ,x_{id} } \right] \), c is the number of clusters, and \( {\mathbf{v}}_{j} = \left[ {v_{j1} , \ldots ,v_{jd} } \right]. \)

  1. 1.

    Let the number of cluster \( c = 2 \). Determine ϴ.

  2. 2.

    Randomly select c cluster centers \( {\mathbf{v}}_{j} \), \( j = 1, \ldots ,c \), for example:

    $$ v_{ji} = rand\left( {\hbox{min} \left( {x_{ij} } \right),\hbox{max} \left( {x_{ij} } \right)} \right), $$
    (1)

where \( rand(a,b) \) is a random number generated from the interval \( [a;b] \).

  1. 3.

    Calculate the distance \( d_{ij} \) between each data point \( {\mathbf{x}}_{i} \) and cluster centers \( {\mathbf{v}}_{j} \):

    $$ d_{ij} \, = \,\parallel {\mathbf{x}}_{i} - {\mathbf{v}}_{j} \parallel , $$
    (2)

where \( \parallel \cdot \parallel \) is a distance measure between two vectors (e.g. Euclidian or Manhattan distance).

  1. 4.

    Assign the data point \( {\mathbf{x}}_{i} \) to the cluster center \( {\mathbf{v}}_{s} \) whose distance from the cluster center is a minimum of all the cluster centers

    $$ {\mathbf{x}}_{i} \in {\mathbf{v}}_{s} \to d_{is} \le d_{im} , \;m = 1, \ldots ,c $$
    (3)

and increase counter of winnings \( \tau_{s} = \tau_{s} + 1 \).

  1. 5.

    Recalculate the new cluster center using:

    $$ {\mathbf{v}}_{i} = \frac{1}{{c_{i} }}\varSigma_{j = 1}^{{c_{i} }} {\mathbf{x}}_{j} , $$
    (4)

where \( c_{i} \) represents the number of data points in i-th cluster.

  1. 6.

    If in the center \( s \) the number \( \tau_{s} \) is greater than the threshold value ϴ, create a new cluster, \( c: = c + 1 \) and

    $$ {\mathbf{v}}_{c} = {\mathbf{x}}_{rand(j)} , $$
    (5)

where \( rand(j) \) generates a random index of point \( {\mathbf{x}} \) belonging to center \( {\mathbf{v}}_{s} \).

  1. 7.

    Remove clusters for which \( \tau_{s} = 0 \). Refresh the number of clusters c.

  2. 8.

    If no data point was reassigned, then stop; otherwise, repeat starting from step 3.

As a result of the algorithm operation we obtain c clusters with the centers in points \( {\mathbf{v}}_{j} \), \( j = 1, \ldots ,c \).

2.3 The Bag-of-Features Algorithm

The classic Bag-of-Features algorithm used in image classification most frequently uses classifiers (e.g. Support Vector Machine – SVM is used) during the stage when the decision is being made on the image class. The BoF comprises several stages:

  1. 1.

    Generate of characteristic features from images, which are most frequently saved in the form of number vectors.

  2. 2.

    Characteristic features are clustered and obtained clusters are treated as words, which create a dictionary.

  3. 3.

    Words (cluster centers) to which characteristic features of a given image belong make a histogram. Each element in the histogram specifies how many times a given word is present in the histogram.

  4. 4.

    The classifier is learnt to recognise histograms and to assign particular classes to them.

The points listed above show that it is possible to store three groups of data in a database, i.e. image characteristic features, centers of the clusters found, and histograms presenting membership of features in clusters. After a classifier has been learnt, it no longer needs to have access to the database.

A problem occurs when a query is made. The query image needs to have its features assigned to specific clusters in such way so as to create a histogram and to have it classified. This process requires a large number of computations given in formula (2). Thus the data-storing cluster centers need to be taken from the database. We can use this situation to our advantage and instead of using an ordinary query taking required data we can use the database engine in order to do computations and classification concurrently. A complex classifier (e.g. the abovementioned SVM) can be successfully replaced by the majority vote. We present below a BoF algorithm version which consists of two modules (one module – preparing data and the other – the classification process), which has been created in order to be able to use a NoSQL database.

The first of the BoF algorithm modules is supposed to prepare the database by creating a dictionary of characteristic features of sample images (to be used in the system learning process). This is carried out in a few steps presented below.

  1. 1.

    Starting operation of the algorithm generating image characteristic features. In our case we have used a well-known and fast SURF algorithm [1] which provides 64-number vectors \( {\mathbf{x}}_{i} = \left[ {x_{i1} , \ldots ,x_{id} } \right] \) describing the surrounding of a characteristic point, where \( i = 1, \ldots ,L \), L – the total number of all characteristic points, d – the dimension of the vector describing a characteristic point (\( d = 64 \)).

  2. 2.

    Starting operation of the k-means clustering algorithm. We have used the algorithm version presented in Sect. 2.2. As a result we obtain c clusters with the centers in points \( {\mathbf{v}}_{j} \), \( j = 1, \ldots ,c \), which are treated as words in the BoF dictionary.

  3. 3.

    The value of the number of classes \( i \) of cluster j is calculated and defined as \( k_{ji} \). This value is computed by counting the points \( {\mathbf{x}}_{n} \) which belong to the center j provided that \( {\mathbf{x}}_{n} \in {\mathbf{I}} \) and \( c\left( {\mathbf{I}} \right) = \omega_{i} \):

$$ k_{ji} = \varSigma_{n = 1}^{L} \delta_{nj} \left( i \right), \;j = 1, \ldots ,c,\;\;i = 1, \ldots ,C, $$
(6)

Where

$$ \delta_{nj} \left( i \right) = \left\{ {\begin{array}{*{20}c} 1 & { {\text{if }}\;d_{nj} < d_{nm} \,\;{\text{for }}\,\;{\mathbf{x}}_{n} \in {\mathbf{I}}\; \,{\text{and}}\; \,c\left( {\mathbf{I}} \right) = \omega_{i} , \;m = 1, \ldots ,c,\; j \ne m} \\ 0 & {\text{otherwise}} \\ \end{array} } \right. $$
(7)

The variable \( \delta_{nj} \left( i \right) \) is an indicator if a cluster \( {\mathbf{v}}_{j} \) is the closest vector (a winner) for any sample \( {\mathbf{x}}_{n} \) from an image \( {\mathbf{I}} \) and \( c\left( {\mathbf{I}} \right) = \omega_{i} \). Next, the values \( k_{ji} \) are normalised:

$$ k_{ji} = \frac{{k_{ji} }}{{\mathop \sum \nolimits_{j = 1, \ldots ,c} k_{ji} }} $$
(8)
  1. 4.

    Saving the values of centers \( {\mathbf{v}}_{j} \) together with the information about the number of classes \( k_{ji} \) in the database.

The classification process, i.e. the process testing whether a given query image belongs to a particular class, requires that an additional pre-processing module be applied. This module is supposed to:

  1. 1.

    Use a feature extraction algorithm (the SURF algorithm – see Sect. 2.1) on query image \( {\mathbf{I}}_{q} \) in order to obtain values of characteristic features \( {\mathbf{x}}_{i}^{q} \), \( i = 1, \ldots ,L_{q} \), \( L_{q} \) – the number of obtained features.

  2. 2.

    Save points \( {\mathbf{x}}^{q} \) in the database.

  3. 3.

    Assign points \( {\mathbf{x}}^{q} \) to clusters in such way so as to compute values \( k_{i}^{q} \). Computations are carried out as follows:

$$ k_{i}^{q} = \varSigma_{n = 1}^{{L_{q} }} \alpha_{n} (i), \;i = 1, \ldots ,C $$
(9)
$$ \alpha_{n} (i) = \left\{ {\begin{array}{*{20}c} {k_{ji} } & {\text{if }d_{jn}^{q} \le d_{mn}^{q} ,\,\;j \ne m} \\ 0 & {\text{otherwise}} \\ \end{array} } \right. $$
(10)

Where

$$ d_{jn}^{q} = \;\,\parallel {\mathbf{v}}_{j} - {\mathbf{x}}_{n}^{q} \parallel ,\,\;j = 1, \ldots ,c $$
(11)

and

$$ d_{mn}^{q} = \,\;\parallel {\mathbf{v}}_{m} - {\mathbf{x}}_{n}^{q} \parallel , \; m = 1, \ldots ,c. $$
(12)
  1. 4.

    Assigning to class \( c({\mathbf{I}}_{q} ) \) is done by the majority vote checking the maximum value \( k_{i}^{q} \):

$$ c\left( {{\mathbf{I}}_{q} } \right) = {\text{argmax}}_{i = 1, \ldots ,c} \,k_{i}^{q} $$
(13)

The algorithm facilitates an easy implementation by using only one SQL query. The algorithm’s details and experimental research are presented in Sect. 3.

3 Experimental Research

In this section we present the results of the BoF algorithm discussed in Sect. 2.3 together with the modified k-means algorithm (outlined in Sect. 2.2). Practical application of the presented method in image classification with the use of the database needs two modules to be used:

  • the module which is supposed to fill with the use of the Bag-of-Features algorithm the database with the cluster center values and also with the information on which class they belong to, i.e. creating the dictionary. The module creating the dictionary for the database should comprise the following steps:

    • preprocesing, i.e. extracting characteristic features from training images,

    • starting the operation of the k-means algorithm in order to perform clustering of the data and obtaining clusters with image characteristic features,

    • save the above data in the database;

  • the module preparing for query image classification and saving its features in the database. It also comprises a few steps:

    • preprocesing, i.e. extracting characteristic features from a query image,

    • saving data in the database,

    • performing a query which will produce information on the class to which a particular image belongs.

Preprocessing algorithms were implemented in the Java language with the use JavaCV [11] library function. JavaCV is a library which adopts functions available in OpenCV [12] for the Java language needs. The research was performed on the Caltech 101 image database [13]. Six sample categories comprising motorbikes, car sides, revolvers, airplanes, leopards and wrenches were selected. Out of the remaining group of images, 20 % are randomly selected and marked as a set of testing images. During the operation of the SURF algorithm over 100,000 characteristic points are identified in the database for 180 images. The main value which is used to compute classification efficiency is a percentage of correctly classified images.

Three structures are used in data storing: learning image features, testing image features, and for values of the cluster centres. Structures are stored in collections:

  • train_images and eval_images – collections storing the features values of learning and testing images have the following values:

    • imageId – image identifier, file name,

    • classId – image class,

    • pk1, pk2, …, pk64 – values of the feature vector obtained as a result of the SURF algorithm operation.

  • centers_500 – the collection stores cluster centers and \( k_{ji} \) values (see formula (8)):

    • center_id – cluster identification,

    • c1, c2, …, c64 – cluster center values,

    • k1,…,k6 \( k_{ji} \) values for each of the class.

The script performing the query image classification comprises two stages. The centers_temporary collection is created during the first stage and it stores temporary results of computations checking membership of query image features in particular clusters. Particularly, values \( \alpha_{n} \left( i \right) \) are computed here (according to formula (10)) and values \( k_{ji} \) from cluster j that are closest to the sample n are returned for all classes \( i \). The script made for a commonly-used MongoDB database presents as follows:

The next step is to call the mapReduce function, which is supposed to compute for each class the \( k_{i}^{q} \) value (according to formula (9)) and to classify an image by applying the majority vote method according to (13):

Table 1 presents the results of the operation of the algorithm for different ϴ values (the parameter for the k-means method). The other columns present: ϴ threshold value, the number of obtained clusters as a result of the operation of the k-means algorithm, and also the efficiency (given in percentages) of image recognition accuracy for the training and testing groups. As it can be noticed, the results proved best for \( \theta = 50 \) and \( \theta = 100 \). However, as far as the number of obtained clusters is concerned, value \( \theta = 100 \) proves to be the best choice under this research.

Table 1. Percentage efficiency of image classification for the presented algorithm in relation to \( \theta \) threshold value.

4 Conclusions

In this paper we present a Bag-of-Features algorithm which we have developed. This algorithm works on the basis of a NoSQL database. We have presented the algorithm implementation on the example of image storage and classification. We have modified the classic Bag-of-Words algorithm so as to facilitate its implementation in a NoSQL database without compromising its efficiency. The tests which we have carried out show that the algorithm operates correctly, thus proving its efficiency. Although operation of the BoF algorithm cannot be compared to other image-recognition algorithms (e.g. a deep-learning network), it still proves efficient enough and simple in its implementation so that it can be successfully used in applications working on less sophisticated hardware. Better query efficiency can be achieved by means of reducing lengths of the vectors describing particular image features. The methods which reduce the number of dimensions can be used for this purpose and these methods, for instance, include the PCA algorithm, or some other completely different feature generating algorithms. Another possible procedure reducing the number of calculations is the exclusions of those clusters which are the least likely to affect unequivocally image class determination.