
1 Introduction

Image classification research aims to find representations of images that can be used automatically to categorize images into a finite set of classes. Typically, algorithms that classify images require some form of pre-processing of an image prior to classification. This process may involve extracting relevant features and segmenting images into sub-components based on some prior knowledge about their context [1, 2]. In [3] we introduced a new distance function, called the Universal Image Distance (UID), for measuring the distance between two images. The UID first transforms each of the two images into a string of characters from a finite alphabet and then uses the string distance of [4] to give the distance value between the images. According to [4], the distance between two strings x and y is a normalized difference between the complexity of the concatenation xy of the strings and the minimal complexity of each of x and y. By the complexity of a string x we mean the Lempel-Ziv complexity [5]. In [6] we presented a serial algorithm that converts images into feature vectors in which the \(i^{th}\) dimension is a feature measuring the UID distance between the image and the \(i^{th}\) feature category. One advantage of the UID is that it can compare two images of different sizes, so the prototypes that represent the different feature categories may be relatively small. For instance, a prototype of the airplane category can be a small image of an airplane over a simple background such as blue sky.

In this paper we introduce a parallel distributed algorithm based on the serial algorithm of [6]. Compared to [6], the current version offers a very large acceleration in processing speed, which allows us to test the algorithm on more challenging image classification tasks: on a standard graphics processing unit (GPU) it improves execution speed relative to [6] by more than three orders of magnitude. The algorithm converts an input image into a labeled case; doing this for a corpus of images, each labeled by its class, yields a data set that can be used to train any 'off-the-shelf' supervised or unsupervised learning algorithm. After describing our method in detail, we report on the speed and accuracy that it achieves.

It is noteworthy that our process for converting an image into a finite-dimensional feature vector is very straightforward and does not involve any domain knowledge or image analysis expertise. Compared to other image classification algorithms that extract features through sophisticated mathematical analysis, for instance by analyzing texture or checking for special properties of an image, our approach is basic and universal: it rests on the complexity of the 'raw' string representation of an image. We extract features automatically simply by computing distances from a set of prototype images that are selected once, at the first stage.

The algorithm presented here is designed with scalable distributed computation as its main aim. Building on recent ideas [7], we designed it to take advantage of the relatively cheap and massively parallel processors that are ubiquitous in today's technology. Our method extracts image features that are unbiased in the sense that they employ no heuristics, in contrast to other common image-processing techniques [1, 2]. The features we extract are based on information implicit in the image, obtained via the complexity-based UID distance, which is an information-theoretic measure.

2 Distance

The UID distance function [3] is based on the LZ-complexity of a string. The definition of this complexity follows [4, 5]: let S, Q and R be strings of characters defined over the alphabet \(\mathcal{A}\). Denote by l(S) the length of S, by S(i) the \(i^{th}\) element of S, and by S(i, j) the substring of S consisting of the characters of S between positions i and j (inclusive). An extension \(R=SQ\) of S is reproducible from S (denoted \(S\rightarrow R\)) if there exists an integer \(p\le l(S)\) such that \(Q(k)=R(p+k-1)\) for \(k=1,\ldots ,l(Q)\). For example, \(aacgt\rightarrow aacgtcgtcg\) with \(p=3\), and \(aacgt\rightarrow aacgtac\) with \(p=2\). R is obtained from S (the seed) by first copying all of S and then copying, in a sequential manner, l(Q) elements starting at the \(p^{th}\) location of S in order to obtain the Q part of R.

A string S is producible from its prefix S(1, j) (denoted \(S(1,j)\Rightarrow S\)) if \(S(1,j)\rightarrow S(1,l(S)-1)\). For example, \(aacgt\Rightarrow aacgtac\) and \(aacgt\Rightarrow aacgtacc\), both with pointer \(p=2\). A production adds an extra, possibly 'different', character at the end of the copying process, which is not permitted in a reproduction.

Any string S can be built using a production process where at its \(i^{th}\) step we have the production \(S(1,h_{i-1})\Rightarrow S(1,h_{i})\), where \(h_{i}\) is the location of a character at the \(i^{th}\) step. (Note that \(S(1,0)\Rightarrow S(1,1)\).) An m-step production process of S results in a parsing of S in which \(H(S)=S(1,h_{1})\cdot S(h_{1}+1,h_{2})\cdots S(h_{m-1}+1,h_{m})\) is called the history of S and \(H_{i}(S)=S(h_{i-1}+1,h_{i})\) is called the \(i^{th}\) component of H(S). For example, for \(S=aacgtacc\) we have \(H(S)=a\cdot ac\cdot g\cdot t\cdot acc\) as a history of S. If \(S(1,h_{i})\) is not reproducible from \(S(1,h_{i-1})\) then the component \(H_{i}(S)\) is called exhaustive, meaning that the copying process cannot be continued and the component must be halted with a single-character innovation. A history is called exhaustive if each of its components is exhaustive. Every string S has a unique exhaustive history [5]. Let us denote by \(c_{H}(S)\) the number of components in a history of S. The LZ-complexity of S is \(c(S)=\min \left\{ c_{H}(S)\right\} \) where the minimum is over all histories of S. It can be shown that \(c(S)=c_{E}(S)\), where \(c_{E}(S)\) is the number of components in the exhaustive history of S.
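For concreteness, the following Python sketch computes c(S) by constructing the exhaustive history; the substring test implements the reproducibility check. This is a straightforward serial rendering for exposition, not the parallel procedure of Sect. 3.

```python
def lz_complexity(s: str) -> int:
    """Number of components in the exhaustive history of s (LZ76)."""
    n = len(s)
    if n == 0:
        return 0
    c = 1  # the first component is always the single character s[0]
    i = 1  # start position of the current component
    while i < n:
        k = 0
        # Extend the component while s[i:i+k+1] is reproducible from the
        # preceding characters, i.e. occurs inside s[:i+k]; the copy may
        # overlap the component itself, as in aacgt -> aacgtcgtcg.
        while i + k < n and s[i:i + k + 1] in s[:i + k]:
            k += 1
        c += 1          # close the component: k copied chars + 1 innovation
        i += k + 1
    return c

# The example above: H(aacgtacc) = a.ac.g.t.acc, hence c = 5.
assert lz_complexity("aacgtacc") == 5
```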

(Pseudocode listing a)

A distance for strings based on the LZ-complexity was introduced in [4] and is defined as follows: given two strings X and Y over a finite alphabet, denote by XY their concatenation, and define

$$ d(X,Y):=\max \left\{ c(XY)-c(X),c(YX)-c(Y)\right\} $$

(see several normalized versions of d in [4]). In [3, 6] we have found that the following normalized distance

$$\begin{aligned} \mathtt{d}(X,Y):=\frac{c(XY)-\min \left\{ c(X),c(Y)\right\} }{\max \left\{ c(X),c(Y)\right\} } \end{aligned}$$
(1)

is useful for image classification.
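In code, (1) composes directly with the complexity function sketched above (again a minimal serial version; for identical strings the distance is close to 0, and for unrelated strings it is close to 1):

```python
def uid_distance(x: str, y: str) -> float:
    """Normalized UID distance of Eq. (1)."""
    cx, cy = lz_complexity(x), lz_complexity(y)
    return (lz_complexity(x + y) - min(cx, cy)) / max(cx, cy)
```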

In [7] we introduced a parallel distributed processing algorithm (LZMP) for computing the complexity c(X) of a string X. Let us denote by \(\mathtt{dp}(X,Y)\) the distance between X and Y where the complexity c is computed by the LZMP algorithm. Thus (1) is now represented by its parallel counterpart

$$\begin{aligned} \mathtt{dp}(X,Y,a,b):=\frac{LZMP(XY)-\min \left\{ a,b\right\} }{\max \left\{ a,b\right\} } \end{aligned}$$
(2)

where a and b are the LZ-complexity values of the strings X and Y, respectively; for efficiency they are pre-computed, as seen for instance in Procedure DMat, step 2(IV).
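The gain is that each complexity is computed once per string rather than once per pair. The following serial sketch mirrors this pairwise computation, with lz_complexity standing in for the parallel LZMP:

```python
def distance_matrix(xs, ys):
    """UID distances between every pair from two string lists (cf. DMat).
    Complexities are precomputed once, as in step 2(IV)."""
    a = [lz_complexity(x) for x in xs]
    b = [lz_complexity(y) for y in ys]
    return [[(lz_complexity(x + y) - min(ai, bj)) / max(ai, bj)
             for y, bj in zip(ys, b)]
            for x, ai in zip(xs, a)]
```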

3 The Algorithm

We describe the parallel algorithm for image feature extraction, starting with a listing of several procedures, followed by the main part, which is split into several sub-algorithms. Procedure LZMP computes the LZ-complexity of a given string; it runs in parallel over the symbols that comprise the string. The procedure appears in [7] and we enclose it here for completeness. Procedure VLZMP computes the LZ-complexity of a set of input strings in parallel (a simple CPU analogue is sketched after the listings below). Procedure DMat computes the UID distance of every pair of strings from two given input lists, in parallel. The variable \(i_{p,q}\) denotes an index variable of the computing block \(B_{p,q}\) (each block has its own local memory and set of variables). The main algorithm is split into sub-algorithms (as in [6]), numbered 2 to 4, where the letter P denotes a parallel computing algorithm. Algorithm 2P selects the prototype images (its serial version consists of Algorithms 1 and 2 of [6]).

(Pseudocode listings b, c and d)
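As a rough CPU analogue of Procedure VLZMP, the complexities of a list of strings can be computed in parallel with a process pool; the pool workers merely imitate the GPU computing blocks \(B_{p,q}\), so this is a sketch rather than the enclosed procedure:

```python
from multiprocessing import Pool

def vlzmp(strings, workers=6):
    """LZ-complexities of many strings in parallel (cf. Procedure VLZMP).
    Requires lz_complexity (defined above) at module level."""
    with Pool(workers) as pool:
        return pool.map(lz_complexity, strings)
```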

Algorithm 3P computes the set of cases (feature vectors) for the images in the input corpus. The algorithm utilizes a number of computing blocks, which begin to run in parallel in step 12. Steps 6 to 11, which run serially, are responsible for converting the input images into strings of symbols (a serial sketch of this conversion follows the listings below).

(Pseudocode listings e and f)
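The sketch below shows the serial skeleton of these two stages: serializing an RGB image into a string over a finite alphabet, then taking UID distances (uid_distance from Sect. 2) to the prototype strings as feature values. The eight-level quantization, the mean aggregation over a category's prototypes, and the omission of the sub-image split of step 7 are all simplifying assumptions of ours, not the exact choices made in Algorithm 3P.

```python
import numpy as np

ALPHABET = "abcdefgh"  # 8 quantization levels; the level count is our assumption

def image_to_string(img: np.ndarray) -> str:
    """Serialize an RGB image (H x W x 3, uint8) into a symbol string
    (cf. the serial steps 6 to 11; the quantization is illustrative)."""
    levels = (img.astype(np.uint16) * len(ALPHABET)) // 256
    return "".join(ALPHABET[v] for v in levels.ravel())

def feature_vector(img_str, categories):
    """One feature per category: mean UID distance (Eq. (1)) between the
    image string and that category's list of prototype strings."""
    return [sum(uid_distance(img_str, p) for p in protos) / len(protos)
            for protos in categories]
```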

Algorithm 4 is identical to that of [6] and we present it for completeness. It takes the training cases produced by Algorithm 3P and applies any off-the-shelf supervised learning algorithm to produce a classifier.

(Pseudocode listing g)
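As a concrete stand-in for the off-the-shelf learner, the cases can be fed to any scikit-learn classifier; the decision tree below plays the role that WEKA's J48 plays in Sect. 4 (this pairing, and the use of scikit-learn itself, are our substitutions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def train_and_evaluate(cases, labels):
    """Algorithm 4 in miniature: ten-fold cross-validation accuracy of a
    classifier trained on the (feature vector, label) cases of Algorithm 3P."""
    return cross_val_score(DecisionTreeClassifier(), cases, labels, cv=10)
```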

4 Results

The following hardware was used: a 2.8 GHz AMD Phenom II X6 1055T processor with \(n=6\) cores and a Tesla K20C board with a single GK110 GPU from NVIDIA. This GPU is based on the Kepler architecture (compute capability 3.5). We used CUDA release 6.0 under Ubuntu Linux (kernel 2.6.38-11-generic).

We tested the algorithm on several two-category image classification problems obtained from the CALTECH-101 corpus [1]. Due to lack of space, we present one such problem, whose categories are airplane and ketch (yacht). We chose 10 prototypes of each category simply by collecting small images of airplanes and boats. The prototypes of airplane are of size \(150\times 70\) pixels and the prototypes of ketch are of size \(150\times 130\) pixels. Figure 1 shows a few examples of such prototypes.

Fig. 1. Three prototypes from category airplane

Fig. 2. Input images from categories airplane and ketch and their respective sub-images

The corpus of input images consists of 74 images of airplanes of size \(420\times 200\) and 100 images of yachts of size \(300\times 300\). It takes 345 seconds for Algorithm 3P to produce the 174 cases from the image corpus. Figure 2 displays two examples of input images, one from category airplane and one from ketch, and their corresponding divisions into sub-images of size \(150\times 150\) (obtained in Algorithm 3P, step 7). Note that the algorithm permits the size of the prototypes to differ and the size (or number) of the sub-images to differ from one feature category to another. We ran four learning algorithms, multi-layer perceptrons, decision trees (J48), naive Bayes and lazy IB1, with ten-fold cross validation on the 174 input images. Table 1 presents the accuracy results versus the baseline algorithm (rules.ZeroR), which classifies based on the prior class probability. The configuration parameter values of the learning algorithms used in WEKA [8] are displayed under the accuracy results. As can be seen, the J48 decision tree learner achieves the highest accuracy, \(96.54\,\%\) (relative to the baseline accuracy of \(57.52\,\%\)).

Table 1. Classification results for the airplane vs. ketch problem

Next, we considered the more challenging problem of recognizing different image textures. We obtained the 1000 images of the Texture Database [2], which has 25 categories of various types of real textures, for instance glass, water and wood, with 40 images of size \(640\times 480\) per category. We chose the categories themselves as feature categories and selected five small prototypes of size \(150\times 150\) from each one without using Algorithm 2P (just picking parts of images at random to serve as prototypes). It takes about 25 hours for Algorithm 3P to produce the 1000 cases from the image corpus. We ran the following classification learning algorithms: lazy IB1, decision trees (J48), multi-layer perceptrons, naive Bayes and random forest. Ten-fold cross validation accuracy results are displayed in Table 2 (parameter settings are displayed under the accuracy results). As shown, the best accuracy, \(70.73\,\%\), is achieved by the random forest algorithm; this is 17.6 times better than the baseline ZeroR classification rule.

Table 2. Classification results for the texture problem

Considering how little effort and expertise are needed in our approach to image feature extraction, we believe that the results are impressive, and that the approach can serve well in settings where very little domain knowledge is available, or as a starting point from which additional analysis and specialized feature extraction can proceed.

5 Conclusions

In this paper we introduce a new parallel processing algorithm for image feature extraction. Given an input corpus of raw RGB images, the algorithm computes feature vectors (cases) that represent the images, together with their associated classification target labels. Using these cases, any standard supervised or unsupervised learning algorithm can learn to classify or cluster the images in the database. A main advantage of our approach is that it requires no image or data analysis of any kind: aside from picking a few small image prototypes once at the start, the procedure is automatic and applies to any set of images. It can therefore be very useful in settings with little domain knowledge, or as a starting point for more specialized image data analysis. Our experiments indicate that the algorithm yields relatively high accuracies on image texture classification problems.