1 Introduction

Dogs are among the most commonly adopted household pets, and security personnel also prefer specific breeds for security work. According to ASPCA statistics [1], nearly 1.6 million dogs are adopted every year. Dogs are also among the most genetically diverse animals on earth, and the intra-class differences among dog breeds can be larger than the inter-class differences. Identifying a dog's breed is therefore difficult, particularly for new pet enthusiasts. Recent advances in deep learning, especially Convolutional Neural Networks (CNNs), have been reported to surpass human capabilities in object identification. A CNN-based approach is therefore well suited to reducing the difficulty of dog breed classification.

Deep learning is a rapidly growing field with continuous progress in domains such as image recognition, speech recognition, object detection, and data generation. Much work remains, however, before it can handle more complex problems. With the advent of CNN architectures such as Xception, ResNet, and DenseNet, and with the help of Transfer Learning [2], improving the performance of classification models has recently become considerably easier.

This paper aims to classify different breeds of dogs with higher accuracy than the existing dog breed identifiers in the literature. To develop the classifier, we have used the Kaggle Dog Breed Identification dataset [3]. The goal was to build a generalized model that predicts the dog breed accurately regardless of class. We have experimented with different state-of-the-art deep CNN models such as DenseNet201, Xception, ResNet50, and VGG19. We have also modified the Xception model architecture to improve the overall classification accuracy. In our experiment, the modified-Xception model has obtained the highest overall accuracy, 87.40%, on the Kaggle Dog Breed Identification dataset, which exceeds the accuracy of the models previously proposed for this dataset in the literature. The main contributions of this article are as follows: (i) we propose a modified-Xception model to identify different dog breeds; (ii) we compare the adopted method with various pre-trained models; (iii) we present a comparative performance analysis between similar previous works and the proposed approach.

The rest of this article is structured as follows: Sect. 2 gives a quick overview of related work on dog breed recognition, and Sect. 3 explains our proposed methodology in detail. The experimental results and analysis are presented in Sect. 4, and the article ends with the conclusions and future work in Sect. 5.

2 Literature Review

Over the past decade, many researchers have tried to construct a dog breed image classifier, in most cases using different CNN architectures. Mulligan et al. [3] have used the Kaggle Dog Breed Identification dataset and experimented with Xception followed by a Multi-Layer Perceptron (MLP), but obtained a very low overall accuracy of 54.80%. There is still room to increase this accuracy by using more neurons and fine-tuning [4]. Shi et al. [5] have used the same Kaggle Dog Breed Identification dataset to classify the different dog breeds. They have applied various pre-trained CNN models, among which DenseNet161 has achieved the best overall accuracy of 85.64%. Kim et al. [6] have also used the same dataset to develop their dog breed classifier. They have applied data augmentation and obtained their highest overall classification accuracy of 83.22% using the ResNet152 model.

Sinnot et al. [7] have used the Stanford Dog dataset to identify different breeds of dogs. They have also used data augmentation and obtained their best classification performance with the VGGNet model. Their image classification model has achieved an overall accuracy of 85% for 50 classes, but this drops to 63% for 120 dog breeds. Ráduly et al. [8] have also used the Stanford Dog dataset, with data augmentation and hyperparameter tuning. They have applied the Inception-ResNet-v2 model, which has achieved an overall accuracy of 90.69%. However, due to its large number of weights, it is computationally quite expensive.

Zou et al. [9] have contributed a new dataset named Tsinghua for dog breed classification. They have removed similar images by computing the structural similarity (SSIM) between images. They have then applied three different deep neural networks: PMG, TBMSL-Net, and WS-DAN. In their experiment, WS-DAN has provided superior classification performance over the other models. It has achieved around 86.04% overall accuracy for eighty classes, but this accuracy falls to 58.14% on the Stanford Dog dataset.

Liu et al. [10] have used the Columbia dataset for dog breed classification, but they have taken a traditional machine learning approach instead of using a CNN. They have used SIFT feature descriptors and the SVM algorithm in their experiment and obtained only 67% overall accuracy. Borwarnginn et al. [11] have also experimented on the same Columbia dataset. They have implemented the NASNet model, which has achieved 89.92% overall accuracy. LaRow et al. [12] have used the same dataset to classify the different dog breeds. As data pre-processing, they have extracted facial key points from the dog images; they have then used a 17-layer CNN architecture for feature extraction and an SVM model for classification. Their CNN-SVM approach, however, has achieved a very low overall accuracy of 52%. Table 1 summarizes the classification performance of the various methodologies applied in the literature for dog breed classification.

Table 1 Comparative analysis of the methodologies applied in the literature for dog breed classification

3 Methodology

This section describes in detail our proposed approach to building a CNN-based dog breed classifier. The main steps of our methodology are data pre-processing, feature extraction, model training, and prediction on new images.

3.1 Dataset Description

We have experimented on the Kaggle Dog Breed Identification dataset [3], which is collected from Kaggle. This dataset contains 10,222 training images in 120 classes and 10,357 testing images. Each image is in RGB format, and the image sizes vary. Figure 1 plots the class-wise data size, where the x-axis and y-axis denote the dog breed name and the number of images in that class, respectively. The Scottish Deerhound breed contains the most images, 126, whereas the Briard and Eskimo dog breeds contain the fewest, 66.

Fig. 1 Graphical representation of class-wise data size

3.2 Data Pre-processing

In the Kaggle Dog Breed dataset, the available test set is not labeled, so it cannot be used directly to evaluate classification performance. To tackle this problem, we have split the training set in a ratio of 80:20 for training and testing purposes, respectively. To mitigate overfitting, we have also used various data augmentation techniques. We have applied a width-shift-range of 0.25, a height-shift-range of 0.25, a zoom-range of 0.2, and horizontal-shift, and generated many images from each image of the training set. Moreover, we have resized all images to 224 × 224 × 3 for our experiment.
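The split and the augmentation ranges above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `split_indices` and `sample_augmentation`, the fixed seed, and the purely random (non-stratified) shuffle are hypothetical choices, and the "horizontal-shift" setting is read here as a random horizontal flip, which the text does not confirm. Interpreting a zoom-range of 0.2 as a zoom factor in [0.8, 1.2] is likewise an assumption.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into a training and a held-out part."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # seed is a hypothetical choice
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]

def sample_augmentation(rng, width=224, height=224):
    """Draw one random set of augmentation parameters from the stated ranges."""
    return {
        "dx": rng.uniform(-0.25, 0.25) * width,    # width shift, in pixels
        "dy": rng.uniform(-0.25, 0.25) * height,   # height shift, in pixels
        "zoom": rng.uniform(0.8, 1.2),             # assumed meaning of zoom-range 0.2
        "flip": rng.random() < 0.5,                # assumed horizontal flip
    }

# 80:20 split of the 10,222 labeled training images.
train_idx, test_idx = split_indices(10222)
print(len(train_idx), len(test_idx))  # 8177 2045

# One random parameter set that an augmentation step could apply to an image.
params = sample_augmentation(random.Random(1))
print(sorted(params))
```

Because each call to `sample_augmentation` draws fresh parameters, many different variants can be generated from a single training image.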

3.3 Convolutional Neural Network (CNN)

A CNN is a neural network specialized for image classification. It is inspired by the visual cortex of the animal brain in how it recognizes and processes images. CNNs consist of several building blocks, such as convolutional layers, pooling layers, activation functions, and fully connected layers. Figure 2 depicts the basic architecture of a CNN model.

Fig. 2 Basic architecture of a CNN model

The convolutional layer uses the convolution operation to extract features from an image. Equation 1 expresses the convolution operation mathematically.

$$(X*K)(i, j) = \sum_{p}\sum_{q}K(p, q)X(i-p,j-q)$$
(1)

where K is the kernel and X denotes the inputs.
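As an illustration of Eq. 1, the following minimal sketch evaluates the double sum directly on nested Python lists. The function name `convolve2d_valid` and the restriction to output positions where every index of X lies inside the input (no padding) are assumptions made for this example, not details taken from the text.

```python
def convolve2d_valid(X, K):
    """Evaluate Eq. 1 at every (i, j) for which all X(i - p, j - q) exist."""
    H, W = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    out = []
    for i in range(kh - 1, H):
        row = []
        for j in range(kw - 1, W):
            total = 0
            for p in range(kh):          # sum over kernel rows
                for q in range(kw):      # sum over kernel columns
                    total += K[p][q] * X[i - p][j - q]
            row.append(total)
        out.append(row)
    return out

X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]
print(convolve2d_valid(X, K))  # [[4, 4], [4, 4]]
```

With this kernel each output equals X(i, j) − X(i−1, j−1), which is 4 everywhere in the toy input. Note that many deep learning frameworks compute cross-correlation (the same sum without the index reversal); since the kernel weights are learned, the two are interchangeable in practice.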

Pooling layers reduce the dimensions of the feature maps, which decreases the number of trainable parameters in the subsequent layers and lowers the computation time. Activation functions introduce non-linearity between the inputs and the outputs. In our methodology, we have used two activation functions: Leaky ReLU and Softmax [13]. We have applied Leaky ReLU in the hidden layers and Softmax in the output layer. The Leaky ReLU and Softmax functions are expressed mathematically by Eqs. 2 and 3, respectively.

$$\mathrm{LeakyReLU}(m) = \begin{cases} am, & \text{if } m \le 0 \\ m, & \text{if } m > 0 \end{cases}$$
(2)
$$\mathrm{softmax}(z)_{i} = \frac{\exp(z_{i})}{\sum_{j=1}^{n}\exp(z_{j})}, \quad \text{for } i = 1, \ldots, n \text{ and } z = (z_{1}, \ldots, z_{n}) \in \mathbb{R}^{n}$$
(3)

where m is the input of Leaky ReLU function, a = 0.01 and \({z}_{i}\) is the ith element of the input vector z for Softmax function.
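The two activation functions of Eqs. 2 and 3 can be written in a few lines of Python. This is a minimal sketch for scalar and list inputs; the function names are illustrative, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of Eq. 3 but leaves its result unchanged.

```python
import math

def leaky_relu(m, a=0.01):
    """Eq. 2: pass positive inputs through, scale the rest by a."""
    return m if m > 0 else a * m

def softmax(z):
    """Eq. 3: turn a real vector into probabilities that sum to 1."""
    shift = max(z)                              # stability shift, cancels in the ratio
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(leaky_relu(3.0))    # 3.0
print(leaky_relu(-2.0))   # -0.02
probs = softmax([1.0, 2.0, 3.0])
print(probs)              # increasing probabilities, largest for the last entry
```

The small negative-side slope a = 0.01 is what keeps the output, and hence the gradient, from being exactly zero for negative inputs.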

3.4 Modified-Xception Model

In this paper, we have applied various deep CNN models such as ResNet50, VGG16, DenseNet201, and Xception, and have modified the Xception model by replacing its top layers with one Global Average Pooling layer, three consecutive Dense layers, each with 50% dropout, and finally one Softmax layer. The dropout layers are added to reduce overfitting during model training. Moreover, we have used the Leaky ReLU activation function rather than ReLU [13] in the hidden layers. ReLU is the most popular activation function, but it outputs zero for negative inputs, so its gradient is also zero there. As a result, it can cause vanishing gradient problems [14] during model training. Leaky ReLU mitigates this problem by producing a small non-zero output for negative values, which can improve the classification performance of the CNN model. Figure 3 depicts the architecture of our modified-Xception model.

Fig. 3 Model architecture of our modified-Xception model
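A head of this kind could be sketched in TensorFlow/Keras roughly as follows. This is not the authors' code: the Dense layer widths in `HIDDEN_UNITS` are hypothetical because the text does not state them, the function name is illustrative, and `weights=None` is used only so that the sketch builds without downloading anything (transfer learning would pass `"imagenet"` instead). The optimizer and loss follow the settings given in Sect. 3.5.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 120                  # dog breeds in the dataset
HIDDEN_UNITS = (1024, 512, 256)    # hypothetical widths, not given in the text

def build_modified_xception(weights=None):
    """Xception backbone without its top, followed by the new classification head."""
    base = tf.keras.applications.Xception(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    x = layers.GlobalAveragePooling2D()(base.output)
    for units in HIDDEN_UNITS:
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU(0.01)(x)   # slope a = 0.01 for negative inputs
        x = layers.Dropout(0.5)(x)      # 50% dropout after each Dense layer
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs=base.input, outputs=outputs)

model = build_modified_xception()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
print(model.output_shape)
```

Global Average Pooling collapses each feature map of the backbone to a single value, so the head receives a fixed-length vector regardless of the spatial size of the last convolutional output.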

3.5 Experimental Setup

To classify the different dog breeds, we have trained our modified-Xception model along with the pre-trained models for up to 100 epochs. The models are compiled with the Adam optimizer, a learning rate of 0.0001, and the categorical cross-entropy loss function. Training for too many epochs often causes overfitting. To overcome this problem, we have used the Early Stopping algorithm [15], which halts training when the generalization error stops decreasing. Moreover, we have reduced the learning rate during training to avoid stagnation in the learning phase. Throughout the experiment, we have worked on Google Colaboratory, using the TensorFlow framework for model training and data pre-processing, and Matplotlib and seaborn for data visualization.
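The interplay of early stopping and learning-rate reduction can be illustrated with a small framework-free simulation. Everything numeric here except the initial learning rate of 0.0001 is a hypothetical choice: the patience values, the reduction factor, the made-up loss sequence, and the decision to lower the learning rate when the validation loss plateaus are assumptions, since the text does not specify them.

```python
def simulate_training(val_losses, lr=1e-4, stop_patience=3,
                      lr_patience=2, lr_factor=0.5):
    """Return (last epoch run, best validation loss, final learning rate)."""
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best % lr_patience == 0:
                lr *= lr_factor                  # plateau: lower the learning rate
            if epochs_since_best >= stop_patience:
                return epoch, best, lr           # early stop
    return len(val_losses), best, lr

# Made-up validation losses: improvement stalls after epoch 3.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stopped_at, best_loss, final_lr = simulate_training(losses)
print(stopped_at, best_loss)  # 6 0.7
```

In this toy run the learning rate is halved once after two epochs without improvement, and training stops at epoch 6, so the later loss of 0.6 is never reached. This shows the trade-off that the patience setting controls.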

4 Results and Discussions

Here, we present our experimental outcomes and compare the proposed methodology with previous works on the Kaggle Dog Breed Identification dataset. We have applied different deep CNN models such as ResNet50, VGG16, DenseNet201, and Xception, and proposed a modified-Xception model. Figure 4 graphically represents our experimental results.

Fig. 4 Graphical representation of the overall accuracy using deep CNN models

In our experiment, the proposed modified-Xception model has achieved the highest overall classification accuracy, 87.40%, whereas the original Xception model has achieved the second highest, 84.15%. The overall classification accuracy is computed using Eq. 4.

$$\mathrm{accuracy }= \frac{\mathrm{A}+\mathrm{C}}{\mathrm{A}+\mathrm{B}+\mathrm{C}+\mathrm{D}}$$
(4)

where A is the number of items whose true labels are positive and that are classified as positive (true positives); B is the number of items whose true labels are negative but that are predicted as positive (false positives); C is the number of items whose true labels are negative and that are correctly classified as negative (true negatives); and D is the number of items whose true labels are positive but that are classified as negative (false negatives).
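Equation 4 and its multi-class counterpart can be computed as in the sketch below. The function names and all counts are made-up illustrations, not results from the experiment; the multi-class version assumes the usual definition of overall accuracy as the sum of the confusion-matrix diagonal divided by the total number of items.

```python
def binary_accuracy(A, B, C, D):
    """Eq. 4: correctly classified items (A + C) over all items."""
    return (A + C) / (A + B + C + D)

def overall_accuracy(confusion):
    """Multi-class accuracy: diagonal of the confusion matrix over its total."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Toy counts for illustration only.
print(binary_accuracy(A=40, B=5, C=45, D=10))   # 0.85

toy_confusion = [[8, 1, 1],
                 [2, 7, 1],
                 [0, 1, 9]]   # rows: true class, columns: predicted class
print(overall_accuracy(toy_confusion))          # 0.8
```

For a 120-class problem such as this one, the same `overall_accuracy` computation applies to a 120 × 120 confusion matrix.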

Figure 5 depicts the confusion matrix achieved by the proposed modified-Xception model. To evaluate our model, Table 2 presents a comparative performance analysis of the proposed methodology and the existing approaches for classifying dog breed images on the Kaggle Dog Breed Identification dataset. It can be observed that, on this dataset, our proposed model has achieved better accuracy than the others.

Fig. 5 Confusion matrix achieved by the modified-Xception model

Table 2 Comparative performance analysis between the proposed methodology and the previous approaches in the literature on the Kaggle Dog Breed Identification dataset

Our methodology also recognizes new images quickly: the proposed modified-Xception model takes around 1.56 ms to recognize a single image.

5 Conclusion

This article briefly investigates the ability of CNN models to classify different dog breeds. We have shown the usefulness of Transfer Learning and fine-tuning techniques for image classification tasks, and demonstrated the importance of proper data augmentation and pre-processing for improving classification accuracy. Our proposed modified-Xception model has achieved the highest overall accuracy, 87.40%, on the Kaggle Dog Breed Identification dataset. In the future, we intend to work on a very deep Densely Connected Neural Network and on ensemble network models for further improvements. We hope this work will encourage researchers to pursue further developments in dog breed identification.