1 Introduction

Object recognition in images is a central problem in learning visual categories and then distinguishing different instances of those categories. Almost all computer vision tasks depend largely on the ability to recognize objects, items, faces and scenes in an image. Visual recognition has many potential applications involving artificial intelligence and information retrieval; content-based image search, video data mining and object identification for mobile robots are some examples. Vision analysts categorize recognition into two classes, specific and generic, as shown in Fig. 1. The presented work focuses on recognition of generic object categories. Humans have a remarkable ability to generalize over objects, even when the objects differ and are not exactly the same. Feature extraction is the key step in object recognition using computer vision: first, useful features are extracted from the images and then matched against known reference features. Classifiers perform the matching and classification job. The extracted features can be hand crafted or machine learned. Hand crafted features are designed using the subject knowledge of the algorithm designer, e.g., sliding window based, contour based, graph based, fuzzy based or content based [26].

  • To evaluate classifiers, namely k-NN, Decision Tree and Random Forest, for the same feature set, and to compare the results with CNN based feature classifiers.

Fig. 1 Types of object recognition considered by vision analysts

This paper proposes a novel strategy for generic object recognition using Scale Invariant Feature Transform (SIFT) features [21] and Oriented FAST and Rotated BRIEF (ORB) features [31]. The integrated technique extracts the most robust features from the image. The SIFT descriptor uses 128 elements per key point and the ORB descriptor uses 32 elements per key point, which requires large memory for storing features and entails high computational complexity. Therefore, K-means clustering [15] and Locality Preserving Projection (LPP) [13] are used to reduce the complexity of the integrated technique.

This paper consists of seven sections. Section 1 introduces the present work. Related work is reviewed in Section 2. Feature extraction techniques are discussed in Section 3. Section 4 presents feature dimensionality reduction techniques. A novel framework for object recognition using ORB and SIFT feature extraction techniques is proposed in Section 5. Section 6 presents experimental results obtained with the proposed framework. Finally, Section 7 gives the concluding remarks.

2 Related work

There is a vast volume of literature on feature extraction and classification techniques; a few approaches related to the present work are discussed in this section. Early on, Ballard [2] described a technique for circle detection and recognition in images using Hough transforms. Ullman [30] used basic visual operators such as color, shape and texture, integrated in different ways, for the detection of complicated objects. Peterson and Gibson [25] highlighted the role of familiarity in object recognition. Minut and Mahadevan [20] proposed a visual attention model based on reinforcement learning over an image. Belongie et al. [3] proposed the shape context, which captures the spatial distribution of all other points relative to each point on the shape. This semi-local representation allowed point-to-point correspondences to be established between different features of an object even under mild deformations. Elnagar and Alhajj [9] described an accurate partitioning path for the separation and restoration of connected numerals. Yu and Shi [32] proposed a segmentation technique based on partial detection instead of global shape descriptors. Lowe [19] described SIFT (Scale Invariant Feature Transform) for extracting a large collection of feature vectors that are invariant to image translation, scaling and rotation, and robust across a substantial range of affine deformations, added noise and illumination changes. The key locations of an object are defined as the maxima and minima of the DoG (Difference of Gaussians) function. Mori et al. [22] discussed the connection between the shapes of objects and recognized these shapes by finding an aligning transformation that establishes correspondence between points of two shapes. Leordeanu et al. [17] presented a different approach, encoding relations between all pairs of edges. Nadernejad et al. [23] reviewed and compared different edge detection techniques for object recognition, i.e., the Marr-Hildreth edge detector, the Canny edge detector, threshold and Boolean function based edge detection, and color and vector angle based edge detection. Ferrari et al. [11] presented an object recognition approach that fully integrates the pairwise constraints provided by shape matching. Toshev et al. [29] proposed a figure-ground segmentation technique for extracting image regions that match the global boundary structure of the model, described using the chordiogram. Soltanshahi et al. [27] used SIFT for Content Based Image Retrieval (CBIR) and considered two approaches for the SIFT descriptor: in the first, they clustered the SIFT descriptors into 16 clusters, and in the second, they focused on registering the descriptors according to their orientations. Alhassan and Alfaki [1] introduced a color and texture fusion based technique for CBIR. Peizhong et al. [24] extracted texture and color based features from images; texture features such as LBP (Local Binary Pattern) were mostly used, while among color based features, CIF (Color Information Feature) was used. Recently, a CNN based solution was suggested in [18], where machine learned features were used for object detection and classification; comparable results were obtained for hand crafted and machine learned features.

3 Feature extraction techniques

Feature extraction is the most important phase of the proposed object recognition system. In this phase, significant features are extracted for object category classification. In the present paper, two feature extraction techniques, namely ORB and SIFT, are considered for object recognition. These techniques are briefly discussed in the following sub-sections.

3.1 ORB (Oriented FAST and Rotated BRIEF)

ORB is implemented in “OpenCV” and uses the FAST (Features from Accelerated Segment Test) key point detector together with the binary BRIEF descriptor. ORB extracts fewer, but better, features from an image. Its computational cost is also lower than that of SIFT and SURF; the ORB authors report it to be an order of magnitude faster than SURF and two orders of magnitude faster than SIFT [31]. ORB first applies the FAST key point detector, which detects a large number of key points, and then uses the Harris corner measure to select good features among those key points. The extracted features generate better results and are less sensitive to noise. The centroid of an image patch can be calculated from the patch moments in ORB, using eq. (1), where I(x, y) denotes the intensity at pixel (x, y):

$$ {m}_{pq}={\sum}_{x,y}{x}^p{y}^q I\left(x,y\right) $$
(1)

The orientation of a corner is calculated from the intensity centroid of the image patch, given by eq. (2):

$$ C=\left(\frac{m_{10}}{m_{00}},\frac{m_{01}}{m_{00}}\right) $$
(2)

The angle of the vector from the center of the patch to the centroid is given by eq. (3):

$$ \theta = \operatorname{atan2}\left(\frac{m_{01}}{m_{00}},\frac{m_{10}}{m_{00}}\right)= \operatorname{atan2}\left({m}_{01},{m}_{10}\right) $$
(3)
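To make eqs. (1)-(3) concrete, the following minimal numpy sketch (an illustration, not OpenCV's optimized implementation; the function name and the assumption of a square grayscale patch are ours) computes the patch moments about the patch center, the intensity centroid and the orientation angle:

```python
import numpy as np

def patch_orientation(patch):
    """Intensity centroid and orientation of a grayscale patch.
    Coordinates are taken relative to the patch center, as in ORB."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0  # center the coordinate frame
    ys -= (h - 1) / 2.0
    m00 = patch.sum()                  # eq. (1), p = q = 0
    m10 = (xs * patch).sum()           # eq. (1), p = 1, q = 0
    m01 = (ys * patch).sum()           # eq. (1), p = 0, q = 1
    centroid = (m10 / m00, m01 / m00)  # eq. (2)
    theta = np.arctan2(m01, m10)       # eq. (3)
    return centroid, theta
```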

ORB uses oFAST (FAST key points augmented with this orientation) and rBRIEF (rotation-aware BRIEF): the sampling pattern used by the BRIEF descriptor is steered (rotated) according to the key point's orientation angle.
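As a usage illustration, a few lines of OpenCV (Python) extract ORB key points and their 32-byte descriptors; the file name and the nfeatures value below are placeholders:

```python
import cv2

# Read a grayscale image ("image.jpg" is a placeholder path).
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# ORB detector: oFAST key points + rBRIEF descriptors, capped at 500 features.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)

# descriptors has shape (num_keypoints, 32): 32 bytes (256 bits) per key point.
print(len(keypoints), descriptors.shape)
```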

3.2 SIFT (Scale Invariant Feature Transform)

The SIFT detector extracts a set of features from an image in a way that is stable under changes in illumination and viewpoint, along with other imaging conditions. The SIFT descriptor characterizes the local neighborhood of each detected key point.

The major stages in SIFT are as follows (a brief OpenCV illustration is given after the list):

  • Detector

    Scale-space extrema detection

    Key point localization and filtering

  • Descriptor

    Orientation assignment

    Feature description

  • Similarity

    Feature matching
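For illustration, these stages are wrapped in a single call in OpenCV (Python, assuming version 4.4 or later, where SIFT is part of the main module); the file name is a placeholder:

```python
import cv2

# Read a grayscale image ("image.jpg" is a placeholder path).
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT detector + descriptor: DoG extrema detection, key point localization,
# orientation assignment and feature description in one call.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# descriptors has shape (num_keypoints, 128): one 128-D vector per key point.
print(len(keypoints), descriptors.shape)
```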

4 Feature dimensions reduction techniques

The following techniques are used to reduce the dimensions of the feature vector, since a large number of features have been extracted using the integrated technique.

4.1 K-means clustering

In the K-means clustering algorithm, the Euclidean distance (or a max-min measure) is used to calculate the distance between the centroid of a cluster and an object. Given a set of n descriptors, the K-means algorithm clusters them into K clusters while minimizing the intra-cluster variance, as described in eq. (4):

$$ \underset{S}{\arg \min }{\sum}_{i=1}^K{\sum}_{x_j\in {S}_i}{\left\Vert {x}_j-{\mu}_i\right\Vert}^2 $$
(4)

where S = {S1, S2, …, SK} are the clusters (K = 64 in the present work) and μi is the mean of all points xj in Si.

The entire process can be described as follows (a minimal code sketch is given after the steps):

  1. Step 1:

    Randomly select 64 instances from the set of n descriptors as the initial cluster centers.

  2. Step 2:

    Assign every instance to its closest center.

  3. Step 3:

    Update each cluster center as the mean of the instances assigned to it.

  4. Step 4:

    Repeat Steps 2 and 3 until the assignments stabilize, finally obtaining 64 clusters.

  5. Step 5:

    Calculate the mean of each cluster to serve as its representative feature value.
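A minimal scikit-learn sketch of this procedure is given below. The function name is ours, and the final aggregation follows one plausible reading of Step 5 (collapsing each cluster mean to a scalar so that every image yields a 64-dimensional vector), since the paper does not fully specify that step:

```python
import numpy as np
from sklearn.cluster import KMeans

def descriptors_to_fixed_vector(descriptors, k=64):
    """Cluster one image's (n, d) descriptor array into k clusters and
    return a k-dimensional vector, one value per cluster (Steps 1-5).
    Assumes the image yields at least k descriptors (n >= 64)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
    # km.cluster_centers_ has shape (k, d); average each center over its
    # d elements to obtain a single representative value per cluster.
    return km.cluster_centers_.mean(axis=1)
```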

4.2 Locality preserving projection

LPP (Locality Preserving Projection) is utilized to reduce a higher dimensional space to a lower dimensional space for information storage. LPP operates on the local neighborhood structure of the data set, keeping the desired information and discarding the undesired data. Unlike PCA (Principal Component Analysis), which preserves the global structure of the data, LPP preserves local neighborhood information. LPP for dimensionality reduction works in three stages, as follows:

  • First, an adjacency graph is built over the data points \( {x}_i\in {\mathbb{R}}^d \), i = 1, …, N, which form the nodes of an undirected graph.

  • Weights are assigned to the graph edges, represented by the similarity matrix [pij]:

    $$ {p}_{ij}=\left\{\begin{array}{ll}{e}^{-{\left\Vert {x}_i-{x}_j\right\Vert}^2/t},& \mathrm{if}\ {x}_i\in kNN\left({x}_j\right)\ \mathrm{or}\ {x}_j\in kNN\left({x}_i\right)\\ {}0,& \mathrm{otherwise}\end{array}\right. $$
  • The eigenvectors and eigenvalues of the following generalized eigenvalue problem are computed:

    $$ XL{X}^T\sigma =\lambda XD{X}^T\sigma $$

where X = ⟨x1| … |xN⟩, D is the diagonal matrix whose entries are the row (or column) sums of P, i.e., \( {d}_{ii}=\sum \limits_j{p}_{ij} \), L = D − P is the Laplacian matrix, and σ is the projection vector.
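LPP is not part of scikit-learn, so the following is a minimal numpy/scipy sketch of the three stages above, under our own parameter choices (k nearest neighbors, heat-kernel width t, and a small regularizer for numerical stability). Rows of X are data points here, so the paper's XLX^Tσ = λXDX^Tσ appears as X^T L X σ = λ X^T D X σ:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def lpp(X, n_components=8, k=5, t=1.0):
    """Return a (d, n_components) LPP projection matrix for X of shape (N, d)."""
    # Stage 1: symmetric kNN adjacency graph over the data points.
    knn = kneighbors_graph(X, k, mode="connectivity").toarray()
    adjacency = (knn + knn.T) > 0
    # Stage 2: heat-kernel weights p_ij on the graph edges.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.where(adjacency, np.exp(-sq_dists / t), 0.0)
    D = np.diag(P.sum(axis=1))
    L = D - P  # graph Laplacian
    # Stage 3: generalized eigenproblem; smallest eigenvalues give the projection.
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])  # regularized for stability
    eigvals, eigvecs = eigh(A, B)
    return eigvecs[:, :n_components]

# Usage sketch: project 64-D feature vectors down to 8 dimensions.
# Z = X @ lpp(X, n_components=8)
```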

5 Proposed integrated system for object recognition

The proposed object recognition system consists of several phases, namely image acquisition, pre-processing, feature extraction, classifier prediction and, lastly, object recognition. A block diagram of the proposed framework is depicted in Fig. 2. Image acquisition is the process of acquiring an image from the environment from any source (a camera or an already available database) and converting the acquired image into digital form. The digital image then undergoes the preprocessing stage, a preliminary phase of an object recognition system. The most crucial stage is feature extraction, which extracts different features (hand crafted or machine learned) from the preprocessed image, yielding quantitative information of interest. This phase produces a set of features that can be used to distinctively characterize the shapes present in the image. The extracted features are global and local in nature and aim to detect the objects present in the image; global features include geometrical shape, texture, size, color, etc. After feature extraction, the last stage is to train the classifier using these features and then test the images for accurate classification. In the proposed work, image acquisition and preprocessing are not required, as the public Caltech 101 dataset is used for experimentation. Preliminary features are extracted from the input image to recognize an object using geometrical shape features (stage 3). The SIFT and ORB feature extraction methodologies are used for extracting image feature descriptors. Further, the K-means clustering algorithm is used to generate ‘K’ clusters from each descriptor array. Lastly, LPP is used to reduce the dimensions of the feature vector.

Fig. 2 Block diagram of the object recognition system

5.1 Proposed algorithm

Input: Query Image.

  1. Step 1:

    Extract feature descriptors from the images of the training dataset using the ORB and SIFT feature extraction methods.

  2. Step 2:

    Use the K-means clustering algorithm to generate 64 clusters for every descriptor array. Then compute the mean of every cluster to obtain a 64-dimensional feature vector for each descriptor array.

  3. Step 3:

    Use the LPP dimensionality reduction algorithm to reduce the 64-dimensional feature vectors to 8, 16 and 32 components.

  4. Step 4:

    Integrate both feature vectors, i.e., ORB and SIFT, and store the combined feature vector in the database.

  5. Step 5:

    Train the proposed system using a combination of ORB and SIFT feature vectors to obtain the classifier model.

  6. Step 6:

    Test the trained classifier model by inputting a query object image and extracting the ORB and SIFT features of the query image.

  7. Step 7:

    Predict the similarity between the query image features and the trained dataset using the classifier model (a code sketch of Steps 4-7 follows).
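Steps 4-7 can be sketched as follows, assuming the reduced SIFT and ORB vectors and the class labels from the earlier steps are already available as arrays (all variable names here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed inputs from Steps 1-3: per-image reduced feature vectors, e.g. of
# shape (num_images, 8) each after LPP, plus integer class labels y_*.
X_train = np.hstack([sift_train, orb_train])  # Step 4: 8 + 8 = 16-D per image
X_test = np.hstack([sift_test, orb_test])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                     # Step 5: train the classifier

y_pred = clf.predict(X_test)                  # Steps 6-7: classify query images
print("accuracy:", (y_pred == y_test).mean())
```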

6 Dataset and experimental results

This section presents the experimental results obtained using the proposed object recognition system. The Caltech 101 dataset, consisting of 9197 samples, has been considered for the experimental work [10]. We have taken 8000 samples (100 classes) from the entire dataset; these cover 100 generic object categories. A few samples from this dataset are shown in Fig. 3. The entire dataset is divided into two parts, namely a training dataset and a testing dataset: 70% of the data is used for training and the remaining 30% for testing.

Fig. 3 A few samples of the dataset

All object images of the dataset are tested using ORB and SIFT features. Each image produces a descriptor array of different length due to the different sizes of the images, so the K-means (K = 64) clustering algorithm is applied to make the dimensions of the feature set uniform. Further, the LPP dimensionality reduction method is used to reduce the dimensions to 8-, 16- and 32-dimensional feature vectors. The ORB and SIFT features are first extracted independently, and finally the integration of ORB and SIFT features is used to improve the performance of the proposed system. In the classification phase, three different classifiers are considered, namely k-NN, Decision Tree and Random Forest. These classifier algorithms have different advantages and disadvantages. High-bias, low-variance classifiers, e.g., Naive Bayes (Cestnik et al. [6]), are more suitable for smaller training sets, while low-bias, high-variance classifiers, e.g., k-NN [7], are beneficial when the training set is large. Decision trees [28] are easy to interpret and explain; they are non-parametric and can handle feature interactions. Random forests [5] are popular for classification problems as they are fast and scalable, and they do not require the parameter tuning that SVMs do.

The performance of the proposed system is measured using four parameters: TPR (True Positive Rate), FPR (False Positive Rate), Precision Rate and AUC (Area Under the Curve). TPR is the rate at which positive instances are correctly classified; a high recall indicates that the class is correctly recognized. FPR is the rate at which negative instances are incorrectly classified as positive. Precision is the ratio of correctly classified positive instances to the total number of instances classified as positive by the classifier [8]. All the experimental results are tabulated in Table 1.
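In terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), these measures are defined as:

$$ TPR=\frac{TP}{TP+ FN},\kern2em FPR=\frac{FP}{FP+ TN},\kern2em \mathrm{Precision}=\frac{TP}{TP+ FP} $$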

Table 1 Experimental Results Based on SIFT and ORB on Caltech 101

The precision comparison for all the classifiers is illustrated in Fig. 4. The best accuracy, 85.42%, is obtained for the SIFT+ORB (8 + 8) feature set with Random Forest, at a low dimensionality of 16 features.

Fig. 4 Precision comparison of proposed features

A precision rate of 76.9% is obtained using SIFT feature descriptors and the Random Forest classifier, as shown in Fig. 5. A precision rate of 69.8% is achieved using ORB feature descriptors and the Random Forest classifier, as shown in Fig. 6. The combination of these two feature descriptors, SIFT and ORB, has also been tested.

Fig. 5 Experimental results based on SIFT features

Fig. 6 Experimental results based on ORB features

Using this combination, the best precision rate of 85.6% is achieved using 16 feature descriptors (8 from SIFT and 8 from ORB), as depicted in Fig. 7. Figure 7 shows a decrease in AUC compared with Figs. 5 and 6, because the AUC depends on the quality of the features and the number of components in the feature vector used for recognition.

Fig. 7 Experimental results based on SIFT+ORB features

6.1 Comparison with other techniques based on hand crafted features

Table 2 shows the results for multi-class classification on the Caltech 101 dataset. Our approach outperforms the existing state-of-the-art techniques for the same setup. Different algorithms utilizing shape matching, texture descriptors, visual similarity and pyramid matching are compared with the proposed technique. The best results on the Caltech 101 dataset among these (75.67%) are currently reported by Huang et al. [14], based on visual similarity. The experimental setup of Grauman and Darrell [12] and Zhang et al. [33] is replicated, i.e., training is done on 30 images per class and the number of test images is limited to 50 per class. The results demonstrate that the SIFT8 features outperform all the existing state-of-the-art hand crafted algorithms, while ORB8 performs better than all but the visual similarity descriptors. The results for SIFT8 and ORB8 are encouraging, as they improve the classification accuracy remarkably.

Table 2 Comparison with existing hand crafted feature based state-of-the-art work

6.2 Comparison with other techniques based on CNN features

Nowadays, hand crafted features are being replaced by CNN based machine learned features, which are extracted automatically rather than designed using human knowledge. Although these have proved to be robust, their consistency and effectiveness still need to be tested thoroughly. Liu et al. experimented with CNN architectures such as ResNet, VGG19 and AlexNet, with different classifiers and datasets, to establish their effectiveness for object recognition. We have compared the results of the proposed features with those CNN based features using the decision tree classifier. The accuracy obtained for the decision tree classifier is tabulated in Table 3.

Table 3 Accuracy obtained for decision tree classifier

The best accuracy obtained with the decision tree classifier uses the SIFT16 + ORB16 features, i.e., 81.13%. Compared with the CNN based classifier results, this is better than the ResNet and VGG19 results, as illustrated in Table 4.

Table 4 Comparison with existing CNN feature based state-of-the-art work

Only AlexNet achieves a higher accuracy, 85.2%. However, if Random Forest is used instead of the decision tree, the accuracy achieved, 85.42%, is comparable with AlexNet. Hence, the efficiency of hand crafted features is well established. Further, it would be interesting to examine whether the integration of hand crafted and CNN based features could improve the results further, but this is out of the scope of the present work and can be experimented with in future.

7 Conclusion

Computer vision and image processing is a domain where much advancement is still possible in feature extraction and classification; one direction is to improve recognition accuracy. In this paper, a novel integrated technique using ORB and SIFT features is proposed for generic object recognition. Further, the feature dimension is reduced by K-means clustering and the LPP approach. It is demonstrated that the proposed technique performs better than existing techniques for object recognition. Each feature is first evaluated individually, as it is of interest to assess the features separately, and then their integration is analyzed. This evaluation is performed through thorough experimental work. The performance of the proposed methodology is assessed by considering various parameters, namely True Positive Rate (TPR), False Positive Rate (FPR), Precision Rate and Area Under the Curve (AUC). The authors have achieved an efficient precision rate and accuracy of 85.6% and 85.42%, respectively, for 8000 samples from 100 classes of generic objects. The presented work is compared with many state-of-the-art algorithms, and encouraging findings are obtained for the proposed classifier when compared with CNN feature based classifiers.