1 Introduction

Image classification is used in various applications, such as security, educational, and promotional systems. In recent years, much research has been done to design automated systems that extract fundamental features from images. A convolutional neural network (CNN) is an effective method for image classification that uses convolutional, pooling, and fully connected layers for the learning process [11]. CNNs have redefined the state of the art in many real-world applications, such as facial recognition, image classification, human pose estimation, and semantic segmentation [19].

The training of machine learning models for image processing consumes many resources [10]. Despite the computational power currently available in personal computers, especially in graphics processing units (GPUs) [5], this power is still limited for computer vision workloads, mainly CNNs, as this class of application usually relies on heavy computations over massive datasets. Therefore, parallel computing on the GPU is traditionally used to run the training process in a feasible time [2]. Acquiring this hardware entails risks of under- and over-utilization, hardware depreciation, and failures, as well as costs related to maintenance, energy, and human resources [2]. All major cloud providers, Amazon AWS [16], Microsoft Azure [1], and Google Cloud [3], offer solutions that can aid in this process and make machine learning more accessible: the researcher/developer does not take the risk of acquiring hardware that can quickly become obsolete and only pays for the resources actually used [10]. There are also free alternatives in the cloud, namely Google Colaboratory (commonly referred to as “Google Colab” or just “Colab”) [6] and Paperspace Gradient [14], which are the ones used in this work. Computational resources are not the only requirement for training machine learning models; in-depth knowledge of applied mathematics and deep learning libraries is also required [10]. Libraries such as ImageAI simplify this challenge: ImageAI is a Python library built to empower developers, researchers, and students to build applications and systems with self-contained deep learning and computer vision capabilities using simple and few lines of code [13].
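To illustrate the "few lines of code" claim, a minimal training sketch is shown below. It assumes the IdenProf folder layout (train/test subfolders) and ImageAI's custom classification trainer; class and argument names follow the ImageAI documentation but vary slightly between library versions, so this is an illustrative sketch rather than the exact script used in this study.

```python
# Minimal sketch: training an image classifier with ImageAI on a dataset laid out
# as idenprof/train/<class> and idenprof/test/<class>.
# API names follow the ImageAI docs; they differ slightly between versions.
from imageai.Classification.Custom import ClassificationModelTrainer

trainer = ClassificationModelTrainer()
trainer.setModelTypeAsResNet50()        # the other bundled architectures have analogous setters
trainer.setDataDirectory("idenprof")    # root folder containing train/ and test/
trainer.trainModel(num_experiments=25,  # 25 epochs, as used in this study
                   batch_size=32)
```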

This study aims to compare some of the algorithms provided by ImageAI (ResNet50, MobileNetV2, InceptionV3, and DenseNet121) and two more (EfficientNetB0 and NASNetMobile), which were made available through custom code, on the IdenProf [12] dataset and two customizations of it with different train and test sizes, in order to recognize professions in images, using Paperspace Gradient and Google Colab as research environments. This document is organized into five sections. In Sect. 2, we discuss related work in the area. Section 3 presents the method applied in the study. Section 4 shows the obtained results. Finally, in Sect. 5, conclusions are drawn, followed by future work guidelines.

2 Related Work

This paper focuses on the topic of computer vision and its technology. It is easy for humans to describe and understand the objects that we see in the world. Our visual system can perceive a three-dimensional structure with enough information, such as the objects’ shape, appearance, and color. However, this is not easy for a computer [20]. Researchers in this field try to mimic how human vision works using computers. This is not an easy task, and the literature on artificial visual processing is usually categorized into visual processing algorithms, which attempt to recreate human vision, and classifiers, which model human decision techniques [4].

Computer vision is a vast research field where mathematics, geometry, and physics are applied [20]. Some tasks are commonly accomplished with computer vision, such as object detection, recognition, and classification; this paper focuses on classification. Image classification was once a task that required domain expertise and the use of problem-specific models. Much of this has changed with the emergence of deep learning as a general-purpose modeling technique for predictive tasks in computer vision. Both the machine learning literature and image classification contests are now dominated by deep learning models that often do not require domain expertise, since such models identify and extract features automatically, eliminating the need for feature engineering [9].

Libraries like ImageAI allow us to train and generate image classification models with CNNs without extensive knowledge of the networks’ inner workings, while still taking advantage of their potential, as these algorithms are broadly used in computer vision. However, the computational requirements for training models with these algorithms are high. In our experimentation, we used ResNet50 [7], DenseNet121 [8], and InceptionV3 [17], as these are also used by the majority of studies related to image object recognition, and we also used MobileNetV2 [15], NASNetMobile [21], and EfficientNetB0 [18]. CNNs provide high accuracy, mainly because the number of features increases dramatically. Research on computer vision relies on validation accuracy and on improving that accuracy. Studies also show that if we keep training a model with thousands of pictures, we can run into overfitting issues; there is a balance that needs to be respected when training models [10]. Last but not least, as shown by some studies, the quality of the datasets impacts the creation and training of a model.

Setting up an on-premises solution to research and build models from machine learning algorithms can be expensive, and it is not the only available option. All major cloud providers offer services for the same purpose with access to huge computing capacity. Some of these offerings are Amazon SageMaker [16], Microsoft Azure Machine Learning [1], and Google Cloud AI infrastructure [3], as well as free options like Google Colab [6] or Paperspace Gradient [14]. These last two platforms were used in the experiment.

3 Experimental Setup

This project’s architecture consists of Docker containers provided by Paperspace Gradient, which supplied the necessary infrastructure for the code developed in Jupyter Notebooks and the base storage. Google Colab was also used (and integrated with Google Drive) in the final work for collecting graphics and metrics from TensorBoard logs.
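For reference, a typical way to inspect such logs in Colab is sketched below, assuming the TensorBoard event files were copied to a Google Drive folder; the log directory path is illustrative, not the one used in this study.

```python
# Sketch: viewing TensorBoard logs inside a Colab notebook, with the logs stored on Google Drive.
from google.colab import drive

drive.mount('/content/drive')          # makes Drive available under /content/drive

# Notebook magics to start TensorBoard in the notebook (path is illustrative).
%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/tensorboard_logs
```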

The free containers supplied by Paperspace Gradient include a dedicated NVIDIA Maxwell GPU with 8 GB of GPU memory, 30 GB of RAM, 8 vCPUs, and 5 GB of storage space. These resources are all free but with a session limit of 6 h. For Google Colab, the type of GPU available varies over time, often including NVIDIA K80s, T4s, P4s, and P100s. The standard available RAM in Colab is 12 GB. It should be noted that these resources are shared between users of the platform, so the available capacity varies over time.
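Because the assigned GPU varies from session to session, it can be checked at the start of each run; one common way is shown below.

```python
# Check which GPU (if any) the current session was assigned.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))

# Alternatively, from a notebook cell:
# !nvidia-smi
```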

3.1 Datasets

The author of the ImageAI [13] Python library also created the IdenProf [12] dataset, which was used as the base for this study; we also created two custom datasets from it. The IdenProf dataset contains 11,000 images spanning ten categories of professions. Each profession category consists of 1100 images, 900 of which are used for training and the remaining 200 for testing. Our custom datasets consist of the same 1100 images per category as the base dataset, but with different training and test sizes: 800 and 300 for one dataset, and 1000 and 100 for the other. These datasets will be referenced as DS100, DS200, and DS300, matching their respective test sizes, from now on.
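As an illustration of how such custom splits can be derived from the base dataset, the sketch below copies a chosen number of images per category into new train/test folders. Folder names, file patterns, and the helper function are assumptions matching the description above, not the exact script used in the study.

```python
# Sketch: deriving a custom train/test split (e.g. DS300 = 800 train / 300 test per
# category) from the base IdenProf layout idenprof/train/<class> and idenprof/test/<class>.
# Paths, counts, and the function name are illustrative assumptions.
import shutil
from pathlib import Path

def make_split(base_dir: str, out_dir: str, train_size: int) -> None:
    base, out = Path(base_dir), Path(out_dir)
    for category in sorted((base / "train").iterdir()):
        # Gather all 1100 images of this category from the original train and test folders.
        images = sorted(category.glob("*")) + sorted((base / "test" / category.name).glob("*"))
        for i, img in enumerate(images):
            split = "train" if i < train_size else "test"
            dest = out / split / category.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy(img, dest / img.name)

make_split("idenprof", "idenprof_ds300", train_size=800)   # DS300: 800 train / 300 test
```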

The images in the dataset have a resolution of 224 × 224 pixels and represent subjects dressed in uniforms of their respective professions. The distribution of represented subjects is as follows: 19.4% female and 80.6% male; 91.1% white and 8.9% dark-skinned.

Regarding the acquisition of the dataset images, the dataset author describes the process as follows: “The images in the dataset were obtained from Google Image search. The images were searched and collected based on the 15 most populated countries in the world. The dataset does not comply with EU GDPR has the individuals whose images were contained were not explicitly contacted for consent” [12].

3.2 Parameters

The ImageAI library provides several algorithms that can be used for image classification, namely ResNet50, DenseNet121, MobileNetV2, and InceptionV3. Two other algorithms not present in ImageAI were also used, namely NASNetMobile and EfficientNetB0. All algorithms were used to train models from the three datasets for 25 epochs with three different image batch sizes: 16, 32, and 64.
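The custom code used for the two architectures not bundled with ImageAI is not reproduced in the paper; a plausible minimal version built directly on tf.keras.applications is sketched below. The classification head, hyperparameters, and directory layout are assumptions, not the exact setup used in the study.

```python
# Sketch: training an architecture not provided by ImageAI (EfficientNetB0 or
# NASNetMobile) with plain Keras, mirroring the 25-epoch setup described above.
# Head, optimizer, and paths are illustrative assumptions.
import tensorflow as tf

IMG_SIZE, BATCH_SIZE, EPOCHS, NUM_CLASSES = (224, 224), 32, 25, 10

train_ds = tf.keras.utils.image_dataset_from_directory(
    "idenprof/train", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "idenprof/test", image_size=IMG_SIZE, batch_size=BATCH_SIZE)

base = tf.keras.applications.EfficientNetB0(        # or tf.keras.applications.NASNetMobile
    include_top=False, weights=None, input_shape=(*IMG_SIZE, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS,
          callbacks=[tf.keras.callbacks.TensorBoard(log_dir="logs")])
```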

A preliminary test using a batch size of 128 was also considered but, due to resource limitations and time constraints, was not carried out. Table 1 summarizes the setup parameters.

Table 1 Setup parameters

4 Results and Analysis

From the facts gathered, according to Table 2, InceptionV3 was the algorithm that lost the fewest trainable parameters, followed by ResNet50; all other algorithms had losses above 1%. Both algorithms also have more parameters overall, ResNet50 being the one with the most, which translates into larger model sizes. MobileNetV2 was the algorithm that lost the most trainable parameters; it was also the one with the fewest parameters, which translates into smaller models. A final observation across all algorithms is that, as the number of parameters grows, the model size grows proportionally.

Table 2 Algorithms facts collected

As we can observe in Table 3, related to the training times, the fastest algorithm on dataset DS100 was MobileNetV2 for batch sizes 16 and 64; however, for a batch size of 32, it was EfficientNetB0. For dataset DS200, MobileNetV2 was fastest for batch size 16, EfficientNetB0 for batch size 32, and NASNetMobile for batch size 64. For dataset DS300, EfficientNetB0 was fastest at batch size 16 but slowest at batch size 32, where InceptionV3 was fastest, as it also was at batch size 64.

Table 3 Training times facts

Regarding other facts gathered, for dataset DS100 there was a reduction in training time from batch size 16 to batch size 64. For dataset DS200, that only happened with NASNetMobile and EfficientNetB0; from batch size 16 to batch size 32, there was an overall reduction, excluding MobileNetV2. In dataset DS300, only NASNetMobile had a time reduction from batch size 16 to batch size 64, and all others had their time increased. For DenseNet121 and InceptionV3, as the batch size increased, so did the training time. From these facts, the best training time obtained across all datasets and all batch sizes was for InceptionV3 on dataset DS200 with a batch size of 32.

According to Tables 4 and 5, for all datasets and all batch sizes, the training accuracy is slightly better than the validation accuracy. For the NASNetMobile algorithm, this difference is slight only for a batch size of 16; for the other batch sizes, the validation accuracy is much lower, and the same behavior occurs with MobileNetV2 for dataset DS100 at a batch size of 32 and for the remaining datasets at batch size 64. Also, as expected behavior for all algorithms across all datasets and batch sizes, training accuracy tends to increase as the batch size increases, while validation accuracy tends to decrease, although for some batch sizes and algorithms this is not always the case. Analyzing Table 4, another result was gathered: for all algorithms without exception, higher accuracies were achieved on dataset DS100 than on DS200, and these were greater than on DS300.

Table 4 Train accuracy on datasets across batch sizes
Table 5 Validation accuracy across batch sizes and datasets

Table 6 is relative to the training loss. This is a metric worth analyzing, as it can indicate how good the predictive model is: the lower the loss, the better the predictions. Observing these values, we see some similarities between algorithms. In datasets DS200 and DS300, all algorithms excluding EfficientNetB0 decreased their loss as the batch size increased; EfficientNetB0 decreased from batch size 16 to 32 and increased slightly from 32 to 64, still remaining lower than at batch size 16. EfficientNetB0 and MobileNetV2 decreased their loss on dataset DS100 as the batch size increased, while all other algorithms shared the same behavior, decreasing from batch size 16 to 32 and increasing from 32 to 64. NASNetMobile had the lowest loss in all datasets: in DS100 this was at batch size 32, and for the others at batch size 64.

Table 6 Train loss across batch sizes and datasets

Observing the values in Table 7, relative to the validation loss, we see some similarities between algorithms, and MobileNetV2 and NASNetMobile stand out as having much higher losses than the rest in all datasets for the majority of batch sizes. For EfficientNetB0, the loss behavior per batch size was the same across the datasets, with the loss increasing as the batch size increased. ResNet50 performed the same way on datasets DS100 and DS200, with the loss decreasing from batch size 16 to 32 but increasing at 64, while for dataset DS300 the loss decreased as the batch size increased. DenseNet121 and InceptionV3 had the same behavior as ResNet50 on dataset DS100; on dataset DS200, InceptionV3 maintained that behavior, while DenseNet121 increased its loss as the batch size increased on both DS200 and DS300. The lowest loss was obtained with DenseNet121 on dataset DS100 with a batch size of 64, while for the remaining datasets InceptionV3 had the lowest loss with a batch size of 32.

Table 7 Validation loss across batch sizes and datasets

Table 8 represents the difference between the training and validation losses. This is a fundamental metric to pay attention to, as it lets us check whether the trained models might be overfitting or underfitting. Ideally, this difference should be zero, or as close to it as possible, but usually some overfitting occurs. At first glance, NASNetMobile stands out as having a much larger difference than the rest for all datasets, which indicates a case of overfitting. MobileNetV2 also generated overfitted models, for batch size 32 in DS100 and batch sizes 32 and 64 in DS300; for batch size 16 in all datasets, however, this algorithm trained some of the best-adjusted models. Overall, as the batch size increases, so does the overfitting of the generated models. The least overfitted models for datasets DS100 and DS300 were trained with MobileNetV2 with a batch size of 16; for DS200, it was ResNet50, also with a batch size of 16.

Table 8 Difference between validation loss and training loss across batch sizes and datasets
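As an illustration of how this metric can be obtained, and continuing the hedged Keras sketch from Sect. 3.2, the final-epoch gap can be read directly from the training history returned by Keras.

```python
# Sketch (continuing the Keras example from Sect. 3.2): reading the
# validation-minus-training loss gap used in Table 8 from the fit history.
history = model.fit(train_ds, validation_data=val_ds, epochs=25)

final_train_loss = history.history["loss"][-1]
final_val_loss = history.history["val_loss"][-1]
gap = final_val_loss - final_train_loss  # ~0: well fitted; large positive: overfitting
print(f"train loss {final_train_loss:.4f} | val loss {final_val_loss:.4f} | gap {gap:.4f}")
```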

5 Conclusions and Future Work

After testing the six algorithms and observing the results, we can conclude that all algorithms tend to achieve higher accuracy as the dataset’s training size increases (and, consequently, its test size decreases). Another conclusion is that the accuracies achieved across algorithms and batch sizes are similar, with only slight differences. There is no single best algorithm; they all tend to adapt to the datasets.

For the three datasets, the algorithm that achieved the highest accuracy within 25 epochs was InceptionV3. Excluding NASNetMobile and MobileNetV2, which presented overfitting, all the other algorithms had lower validation/training loss differences, meaning slightly less overfitting than InceptionV3. Considering both accuracy and the validation/training loss difference, DenseNet121 appears to be the better algorithm for the datasets in this study.

For EfficientNetB0, in all datasets, accuracy decreased slightly as the batch size increased; the same happened with ResNet50 and DenseNet121. This result leads us to believe that increasing the batch size is not beneficial for these algorithms. MobileNetV2, contrary to the previous ones, seems to increase accuracy slightly, which suggests that increasing the batch size might benefit this algorithm. Given the smaller size of the models it generates, this algorithm might be worth exploring in situations where limited resources are available.

As cloud platforms allow developers, scientists, and researchers to use free resources for machine learning, despite their limitations, these solutions proved worth exploring, as it would be unfeasible to carry out this study on typical household computers. Paperspace Gradient proved to be a powerful and versatile platform for carrying out these tests.

5.1 Future Work

Our tests showed a slight tendency toward more accurate models when trained on datasets with more extensive training sets. Another tendency was that larger batch sizes reduce both training time and accuracy. Similar tests could be done on different datasets and on adaptations of the same dataset to validate and improve this study, testing larger batch sizes and more extensive datasets to see whether these findings still hold.