1 Introduction

Deep learning has led to major advances in the field of Artificial Intelligence [1]. Although deep learning is powerful, it requires a huge amount of training data and takes a long time to train a model from scratch. As mentioned by the authors of the VGG model, it took them almost 3 weeks to train the model. Transfer learning comes as a solution: learned parameters (weights) from a pre-trained model are borrowed and used to initialize a new model, so that the model gains a head start. Transfer learning helps achieve better results within just a few minutes of training [2].

Convolutional neural networks (CNNs) are the basic architecture for capturing the major features of an image, learning the unique features that identify a class [3]. Various kinds of convolution operations exist, such as transposed convolution, dilated convolution, depth-wise separable convolution and spatially separable convolution, each with a distinct function and capable of extracting significant features [4]. Max-pooling layers follow the convolution layers to reduce the dimension of the data. The data is then flattened and passed on to a fully connected neural network before the final prediction is made. A generalized model of the CNN architecture is shown in Fig. 1.

Fig. 1. Generalized CNN Architecture
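
The convolution, max-pooling and flattening stages described above can be traced with simple shape arithmetic. The following is a minimal sketch, not the network used in this work: the 32x32 RGB input, the single 3*3 convolution with 16 filters and the function names are all hypothetical choices made for illustration.

```python
def conv2d_shape(h, w, kernel, filters, stride=1, padding=0):
    """Output (height, width, channels) of a square-kernel convolution layer."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    return out_h, out_w, filters


def maxpool_shape(h, w, c, pool=2, stride=2):
    """Max-pooling shrinks height and width but keeps the channel count."""
    out_h = (h - pool) // stride + 1
    out_w = (w - pool) // stride + 1
    return out_h, out_w, c


def flatten_size(h, w, c):
    """Length of the vector handed to the fully connected layers."""
    return h * w * c


# Hypothetical toy input: a 32x32 RGB image.
h, w, c = 32, 32, 3
h, w, c = conv2d_shape(h, w, kernel=3, filters=16, padding=1)  # size preserved
h, w, c = maxpool_shape(h, w, c)                               # size halved
features = flatten_size(h, w, c)
print((h, w, c), features)
```

With padding 1, the 3*3 convolution keeps the spatial size, so only the pooling layer reduces the dimension of the data before flattening.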

2 Literature Review

A graph-based feature fusion methodology for image retrieval is proposed in which outliers are eliminated using a Three Degree Binary Graph (TDBG), a greedy algorithm; the methodology was tested on publicly available datasets [5]. A method to classify natural scenes exploits the text and characters present in them, since text-based segmentation methods are more effective. Its detector, combined with 4 different kinds of headers, is able to classify at both character and text level [6]. A graph-based method called Graph-based Reasoning Attention Pooling with Curriculum Design (GRAP-CD), which attends even to local structural regions of the image through hard samples, is proposed for content-based image retrieval [7]. A thorough analysis of various text identification methods, such as convolutional neural networks, maximally stable extremal regions and LSTM, for text and character recognition from a scene is presented [8].

A recurrent-network-based architecture for scene text classification is proposed in which fixed-width, rotated bounding boxes with multiple ratios are utilized; the proposed sequential regions are later analyzed further for textual lines [9]. Mask R-CNN has achieved immense success in the field of object detection, yet it fails considerably when multiple object instances are present and in cases involving large text. To overcome this, an MLP-based decoder that detects and proposes compact masks for multiple instances based on shape is proposed. The method shows a significant improvement on five benchmark datasets [10]. Detecting text in a scene has drawn huge attention from researchers, but success has so far been limited to horizontally and vertically oriented text. To detect arbitrarily oriented and curved text, a combined architecture based on a Proposal Feature Attention Module (PFAM) and a One-to-Many Training Scheme (OTMS) is designed, which eliminates ambiguity and detects effective features based on the proposals [11].

To promote comparative diagnostic reading in medical imaging, a neural-network-based architecture is proposed that detects and classifies normal and abnormal features separately from images. It classifies the images based on the semantic components present and the generated synthesized combined vector [12]. To diagnose lung cancer from CT images, the pre-trained models VGG16 and ResNet are used to fetch the nearest images for patients. Furthermore, the features fetched from VGG16 are passed on to a K-nearest neighbor algorithm to get the most relatable image [13]. Fuzzy C-means clustering, an unsupervised method, is used to segment MRI images based on spatial information. The method is able to locate the clusters even in the presence of noise, without affecting the underlying correlation [14]. With the advancement of cloud computing and cloud storage, data is encrypted to ensure security. A method called Similarity Image Matching (SESIM), which performs effective image matching on encrypted data, is proposed [15].

3 Deep Learning Model

3.1 VGG16 – OxfordNet

The Visual Geometry Group at Oxford developed VGG16, a convolutional neural network model that was first runner-up in the ILSVRC (ImageNet) challenge of 2014 [16]. It is one of the simplest yet most effective architectures proposed for extracting features from an image. The architecture consists of several blocks of 3*3 convolutional layers followed by max-pooling layers, with the depth increasing gradually from 64 to 128, 256 and 512. At the top of the stack, the data is passed to a series of fully connected (dense) layers before the final prediction is made [17]. The architecture of VGG16 is shown in Fig. 2.

Fig. 2. Architecture of VGG16
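
To make the block structure concrete, the sketch below follows the feature-map shape through the convolutional stack. It assumes the standard VGG16 layout, which is not spelled out in the text above: a 224x224 input, five blocks with 2, 2, 3, 3 and 3 convolutional layers, a fifth block that stays at depth 512, and three dense layers on top. The names `VGG16_BLOCKS` and `vgg16_feature_shapes` are illustrative only.

```python
# (number of 3*3 conv layers, output depth) per block.
# Assumption: standard VGG16 layout with a 224x224 input image.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_feature_shapes(size=224):
    """Return the (height, width, depth) after each conv block's max-pool."""
    shapes = []
    for _n_convs, depth in VGG16_BLOCKS:
        # 3*3 convs with padding 1 keep the spatial size;
        # the 2*2 max-pool that closes each block halves it.
        size //= 2
        shapes.append((size, size, depth))
    return shapes


shapes = vgg16_feature_shapes()
h, w, d = shapes[-1]
flattened = h * w * d  # input length of the first dense layer
n_weight_layers = sum(n for n, _ in VGG16_BLOCKS) + 3  # conv layers + 3 dense

for shape in shapes:
    print(shape)
print(flattened, n_weight_layers)
```

Under these assumptions the count of weight layers comes to 16, which is where the name VGG16 comes from.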

3.2 Inceptionv3 – GoogLeNet

The model that won the 2014 ILSVRC challenge was InceptionNet. Inception-v3, also referred to as the second generation of Inception, is an architecture proposed by the authors to improve the efficiency of the classifier while reducing the computational complexity of the model. The major architectural change is the introduction of 3 different kinds of inception blocks built on factorized convolutions: a single 5*5 conv is replaced by two 3*3 convs, and a 3*3 conv is replaced by 1*3 and 3*1 convs. The idea was to make the architecture not only deeper but also wide enough to capture spread-out features [18]. The authors also added a batch normalization layer to the auxiliary classifiers, along with label smoothing. The building blocks of the Inception model are shown in Fig. 3.

Fig. 3. Architecture Blocks of Inception-v3
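
The computational benefit of the factorized blocks can be checked by counting weights. The sketch below compares each original convolution with its factorized replacement. The channel count of 64 is a hypothetical value chosen for illustration, biases are ignored, and the helper names are not taken from the original work.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


C = 64  # hypothetical channel count, kept equal for input and output

# A single 5*5 conv versus two stacked 3*3 convs (same receptive field).
single_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)
saving_5x5 = 1 - two_3x3 / single_5x5

# A single 3*3 conv versus a 1*3 conv plus a 3*1 conv.
single_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)
saving_3x3 = 1 - asymmetric / single_3x3

print(f"5*5 -> two 3*3: {saving_5x5:.0%} fewer weights")
print(f"3*3 -> 1*3 + 3*1: {saving_3x3:.0%} fewer weights")
```

The relative savings do not depend on the channel count, since it cancels out in both ratios: 18 weights instead of 25 per channel pair in the first case, and 6 instead of 9 in the second.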

3.3 ResNet-50

ResNet is the architecture that won the 2015 ILSVRC image classification challenge. ResNet was mainly designed to address the persistent problem of exploding/vanishing gradients in deeper networks [19]. The issue is addressed by adding residual blocks to the architecture, each of which introduces a skip connection between layers. To further reduce the complexity of the model, 1*1, 3*3, 1*1 conv blocks were added as sandwich layers. The architecture of ResNet-50 is shown in Fig. 4.

Fig. 4. Architecture of ResNet-50
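
Both ideas, the skip connection and the 1*1, 3*3, 1*1 sandwich, can be illustrated in a few lines. This is a sketch under stated assumptions: the residual block operates on a plain list of numbers rather than a feature map, and the channel sizes 256 and 64 are assumed for illustration rather than taken from the text above.

```python
def residual_block(x, transform):
    """Skip connection: the block outputs F(x) + x instead of F(x) alone."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a square-kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# If the learned transform contributes nothing, the input passes through
# unchanged, so stacking more blocks cannot erase the signal.
identity_out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0 for _ in v])

# Assumed channel sizes: 256 -> 64 -> 64 -> 256 for the sandwich block.
bottleneck = (
    conv_weights(1, 256, 64)     # 1*1 conv squeezes the depth
    + conv_weights(3, 64, 64)    # 3*3 conv works on the thin tensor
    + conv_weights(1, 64, 256)   # 1*1 conv restores the depth
)
plain = 2 * conv_weights(3, 256, 256)  # two full-depth 3*3 convs

print(identity_out)
print(bottleneck, plain)
```

With these assumed sizes the sandwich block needs roughly 17 times fewer weights than two full-depth 3*3 convolutions, which is the reduction in complexity referred to above.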

4 The Proposed Ensemble Model

For the proposed ensemble approach, the pre-trained VGG16, Inceptionv3 and ResNet50 models, with their top layers included, are used as feature extractors. The feature vectors are merged by a merging layer, followed by two fully connected layers of 1000 neurons each, before the final prediction is made. Before building the final ensemble, a VGG16+Inceptionv3 model was used to analyze the behavior of various feature merging techniques. The merging layers Add, Concat, Max, Min and Subtract were analyzed, and on that basis adding the features together was chosen as the most suitable technique for the final ensemble model. The outline of the proposed model is shown in Fig. 5.

Fig. 5. Architecture of proposed ensemble model
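
The five merging techniques can be sketched on plain Python lists. This is a minimal illustration, not the implementation used in this work: it assumes that every extractor produces a feature vector of the same length, uses hypothetical 3-element vectors in place of real model outputs, and restricts Subtract to exactly two inputs. The name `merge_features` is illustrative.

```python
from functools import reduce

# Element-wise operations applied position by position across the vectors.
ELEMENTWISE_OPS = {
    "add": lambda a, b: a + b,
    "max": max,
    "min": min,
    "subtract": lambda a, b: a - b,
}


def merge_features(vectors, mode="add"):
    """Merge feature vectors from several extractors into one vector."""
    if mode == "concat":
        # Concatenation keeps every value, so the output grows in length.
        return [value for vector in vectors for value in vector]
    if mode not in ELEMENTWISE_OPS:
        raise ValueError(f"unknown merge mode: {mode}")
    if len({len(vector) for vector in vectors}) != 1:
        raise ValueError("element-wise merging needs equal-length vectors")
    if mode == "subtract" and len(vectors) != 2:
        raise ValueError("subtract is defined here for exactly two vectors")
    op = ELEMENTWISE_OPS[mode]
    return [reduce(op, column) for column in zip(*vectors)]


# Hypothetical outputs of two extractors for the same image.
features_a = [0.5, 0.25, 0.0]
features_b = [0.25, 0.5, 0.25]

for mode in ("add", "concat", "max", "min", "subtract"):
    print(mode, merge_features([features_a, features_b], mode))
```

Add, Max, Min and Subtract keep the vector length unchanged, whereas Concat doubles it for two inputs, so the fully connected layers that follow see a larger input only in the Concat case.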

5 Results

The Caltech-101 and Caltech-256 object category datasets are chosen for the study and analysis of the proposed ensemble model. Caltech-256 is one of the largest object category datasets after ImageNet, with around 30,000 images [20]. The different merging techniques were analyzed on the Caltech-101 and Caltech-256 datasets using 2 pre-trained models. The results in Table 1 show that ADD, i.e. adding the feature vectors together, increases the weight of a detected feature and results in better performance. The performance of the models is measured and compared in terms of accuracy (%).

Table 1. Analysis of various feature merging techniques on Caltech-101 and Caltech-256 dataset on Vgg16+Inceptionv3 Model

The selected merging technique is applied in the proposed ensemble model, which is evaluated in the 15-Train and 30-Train settings, as opposed to deep learning papers in which the authors choose to use 60% of the data for training. A comparative analysis of the Caltech-101 and Caltech-256 results against previous work is tabulated in Table 2 and Table 3, respectively. The results show that the proposed ensemble model, which combines 3 different pre-trained models and merges their features, outperforms all the previous work compared.
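
The evaluation protocol can be sketched as follows, assuming that 15-Train and 30-Train denote 15 and 30 training images per class, with the remaining images of each class held out for testing, as is customary for the Caltech benchmarks. The function names, the fixed seed and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def per_class_split(samples, n_train, seed=0):
    """Pick n_train items per class for training; the rest go to testing.

    samples is a list of (item, label) pairs.
    """
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = list(by_class[label])
        rng.shuffle(items)
        train += [(item, label) for item in items[:n_train]]
        test += [(item, label) for item in items[n_train:]]
    return train, test


def accuracy_percent(y_true, y_pred):
    """Share of correct predictions, expressed in percent."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Toy data: 2 classes with 40 images each.
samples = [(f"img_{label}_{i}", label) for label in ("a", "b") for i in range(40)]
train, test = per_class_split(samples, n_train=15)
print(len(train), len(test))
```

Unlike a 60% split, the number of training images here is fixed per class, so the training set stays small regardless of how many images a class contains.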

Table 2. Comparative analysis of Proposed Ensemble Model on Caltech-101 result with previous work
Table 3. Comparative analysis of Proposed Ensemble Model on Caltech-256 result with previous work

6 Conclusion

The proposed ensemble model outperforms state-of-the-art techniques such as VGG16, Inception and ResNet50. The study also justifies the choice of these models for merging, as each model has its own unique nature and methodology for extracting features from an image. The model performs well even with less training data, in comparison to other research work in which 60% of the data is used to train the model. Merging the features gives the model additional discriminating ability, which increases the accuracy beyond that of the individual models. The research also supports the view that, with the aid of transfer learning, innovative, efficient and simple models can be designed. However, the performance of the proposed model fluctuates on validation data, so stabilizing the model remains a challenge.