1 Introduction

A popular trend on the global tourist map is culinary tourism [1], sometimes known as food tourism. Dishes frequently reflect the traits of a locality and its culture. Many distinct aspects shape a culinary culture, including geography, religion, and climate [2]. For instance, olive oil and herbs are frequently used in Mediterranean cooking [3], while steak and cutlets are staples of Western European cuisine, where cheese and wine are also well-known elements of sauces. Rice is the staple food in Eastern Asia and Indochina [4], where fish, shrimp, and soybean are typically used to make sauces.

Many tourists are willing to pay to sample distinctive local delicacies [5], so food tourism offers numerous opportunities for economic growth [6]. To enhance the culinary tourist experience, this paper concentrates on creating a method to identify food. The system's design is based on deep learning, using convolutional neural networks (CNNs) and the transfer learning method [7, 8].

The system concentrates on Vietnamese culinary culture with nine well-known traditional dish classes [9]: Banh-Chung, Bun, Banh-Mi, Com-Tam, Banh-Tet, Banh-Trang, Pho, Banh-Xeo, and Goi-Cuon. The system is easily configurable to accommodate additional classes [10] and delicacies from other nations.

The primary contributions of this paper are as follows. First, a novel CNN for food recognition [11] is created using EfficientNet [7] and the transfer learning method. Second, the dish dataset [12], the trained model, and the learned parameters are made available to researchers and scientists so that they can be reused and extended with more dish classes. Finally, a mobile application is built to tackle the dish identification problem; it also provides helpful information about each dish.

The remainder of the paper is organised as follows. Section 2 presents a thorough analysis of related techniques. Section 3 presents the materials and the proposed procedure for classifying food into nine categories. Section 4 presents the development of a software platform to enhance the tourist experience, together with the experimental findings. Section 5 concludes the paper.

2 Review of Similar Methodology

Food or dish classification is a fascinating problem with numerous practical applications. It can also be used to examine the calories in foods and determine the best meal, in terms of calories, for a person. Several works address the issue. Chen et al. [14] proposed a Chinese food recognition [13] and value assessment system built on multi-class support vector machines (SVM) and AdaBoost. Kawano and Yanai [15] created an SVM-based mechanism for recognising meals on the go; the system can measure calories and nutrition as well as identify dishes. Another food image recognition system was created by Yanai and Kawano [16] using a pre-trained and fine-tuned deep convolutional neural network.

Yadav and Chand [17] created another automated method for classifying food images using the VGG network. Martinel et al. [4] presented wide-slice residual networks to identify food based on residual learning. A new CNN known as a personalised classifier [18] for food recognition was proposed by Horiguchi et al. Zahisham et al. [19] created a different way of identifying food using the ResNet-50 model. The above-mentioned techniques have a few major downsides. SVM and other classical approaches have limited effectiveness because the dish identification problem requires various learning tools [20] and sources. Other CNN-related techniques are more effective but need a large amount of training data. Additionally, some techniques trained on intricate CNN models, such as VGG, consume a lot of system resources. Due to their high efficacy, CNN models are the major focus of this paper, with particular attention to performance appropriate for mobile applications. As a result, the EfficientNet model and the transfer learning method are used to create the dish identification system.

3 Components and Techniques

A. Components

The UEH-VDR collection of 7848 food photos was compiled from various online sources. Each image is in RGB colour mode and comes in different sizes. The picture database is divided into three sections: a training set of 6273 pictures (about 80% of the data, sampled at random), a validation set of 780 pictures (about 10%), and a test set of 795 photos (about 10%).
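The random 80/10/10 split described above can be sketched as follows; the function name, seed, and file names are illustrative, not the authors' actual code, so the exact set sizes differ slightly from those reported:

```python
import random

def split_dataset(paths, train=0.8, val=0.1, seed=42):
    """Shuffle image paths and split them roughly 80/10/10."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    n = len(paths)
    n_train = int(n * train)
    n_val = int(n * val)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# Hypothetical file names standing in for the 7848 UEH-VDR images
train_set, val_set, test_set = split_dataset(
    [f"img_{i}.jpg" for i in range(7848)])
```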

Vietnamese cuisine has a wide variety of foods, yet the data may be categorised into a number of fundamental groups. For illustration, numerous variations of pho, including pho bo, pho ga, pho chay, and others, are included in the Pho group. To keep things simple, they are all referred to as Pho, a name international visitors from all over the world are familiar with. The same applies to Banh-Mi, another well-known Vietnamese meal. Nine well-known and traditional groups, Banh-Chung, Bun, Banh-Mi, Com-Tam, Banh-Tet, Banh-Trang, Pho, Banh-Xeo, and Goi-Cuon, were employed in the study.

It should be highlighted that the dish names are given only in Vietnamese, because translations might affect their popularity or meaning. For tourists, learning about a culture through the names of regional foods is also interesting. Part of the UEH-VDR collection is shown in Fig. 1. Figure 2 illustrates the number of photos in each class for the training, validation, and test sets.

Fig. 1
A portion of the dataset of Vietnamese cuisine images, including Goi cuon, Com tam, Banh xeo, Banh mi, Pho, Bun, and Banh tet

Fig. 2
Number of photos in each class of the training, validation, and test sets; the largest class, Banh Mi, has roughly 1100 training and 150 validation images

B. Techniques

(1) EfficientNet-B0: EfficientNet is an optimised family of baseline convolutional neural networks created by uniformly scaling network width, depth, and input resolution across all variants. Figure 3 displays EfficientNet-B0, a member of the EfficientNet [7] family of designs. Layers in the EfficientNet-B0 configuration include:

Fig. 3
EfficientNet-B0 structure, where K × K is the dimensions of the filter, S is the stride, B is the feature maps, and h × w is the size of the image

The conv layer (Conv2D) is a 2-D convolution layer: it creates a kernel that is convolved with the output of the previous layer to generate its output. In the depthwise convolution layer, each input channel receives its own single convolutional filter.

Using the batch mean and variance, the BatchNormalization layer normalises its input components. It then determines the normalised [21] activation by:

$$ \hat{x}_{i} = \frac{{x_{i} - \mu_{B} }}{{\sqrt {\sigma_{B}^{2} + \epsilon } }}, $$
(1)

where xi is an input element, µB and σB2 are the batch mean and variance, and ε is a small constant for numerical stability.
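Eq. (1) can be checked with a minimal NumPy sketch (illustrative only; the sample batch is made up):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalise a batch per Eq. (1): subtract the batch mean
    and divide by the square root of the batch variance plus eps."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

batch = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
normed = batch_normalize(batch)  # each column now has mean ~0, std ~1
```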

The Swish layer applies the following smooth, non-monotonic activation function to its input:

$$ swish\left( x \right) = \frac{x}{{1 + \exp \left( { - x} \right)}} $$
(2)
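Eq. (2) is equivalent to x · sigmoid(x); a minimal sketch with illustrative sample values:

```python
import numpy as np

def swish(x):
    """Swish activation, Eq. (2): x / (1 + exp(-x)) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

# Smooth and non-monotonic: it dips slightly below zero for negative inputs
values = swish(np.array([-5.0, -1.0, 0.0, 1.0, 5.0]))
```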

The squeeze-and-excitation (SE) layer applies a distinct learned weight to each channel rather than weighting all channels equally.

In addition to performing well and consistently on ImageNet, EfficientNet-B0 can also be applied satisfactorily to other datasets.
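The per-channel reweighting of the SE layer can be sketched in NumPy as follows; the weight matrices, feature-map size, and reduction ratio are illustrative assumptions, not values from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(fmap, w_reduce, w_expand):
    """Squeeze-and-excitation on an (H, W, C) feature map:
    squeeze = global average pool per channel,
    excite  = two small dense layers producing per-channel weights in (0, 1)."""
    squeezed = fmap.mean(axis=(0, 1))                         # shape (C,)
    weights = sigmoid(relu(squeezed @ w_reduce) @ w_expand)   # shape (C,)
    return fmap * weights                                     # rescale each channel

C, r = 8, 4  # assumed channel count and reduction ratio
rng = np.random.default_rng(0)
fmap = rng.standard_normal((5, 5, C))
out = squeeze_excite(fmap,
                     rng.standard_normal((C, C // r)),
                     rng.standard_normal((C // r, C)))
```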

(2) EfficientNet-B0 transfer learning

Any machine-learning model can apply the transfer learning technique, but deep learning is where it first gained popularity. A convolutional neural network is trained on a particular dataset in order to extract characteristics, and the model can then forecast outcomes based on the learnt attributes. For CNNs, a lot of data must be used to train the model in order to increase accuracy, and gathering that much data presents certain practical challenges. Additionally, if anything in the collected data is lacking, the data is still not trustworthy enough to be used with CNN models. The transfer learning method is the answer to these problems. An illustration of the transfer learning approach is shown in Fig. 4.

Fig. 4
A CNN model transfer learning approach: without transfer learning, features are learnt from scratch for each task; with transfer learning, a pre-trained CNN is adapted to the new task and data

Transfer learning allows a model to keep its optimal parameters after being trained on a well-known dataset such as ImageNet. The subsequent learning task reuses the previously learnt characteristics, so the model's overall accuracy increases.

Transfer learning is used in this project with EfficientNet-B0 trained on ImageNet. Figure 5 shows the suggested model's architecture utilising EfficientNet-B0 and transfer learning. The design of the suggested model is essentially the same as that of EfficientNet-B0.
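Using the layers named in this paper (EfficientNet-B0 with GlobalAveragePooling2D, BatchNormalization, Dropout, and Dense added on top), the model can be sketched in Keras; the input size, dropout rate, and optimiser settings are assumptions, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pre-trained EfficientNet-B0 backbone with ImageNet weights, frozen so
# the learnt features are transferred rather than retrained.
base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False

# Classification head; the layer choices follow the paper, sizes are assumed.
model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.BatchNormalization(),
    layers.Dropout(0.2),                     # assumed dropout rate
    layers.Dense(9, activation="softmax"),   # nine dish classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```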

Fig. 5
Architecture of EfficientNet-B0

Fig. 6
Process of accuracy and loss training: accuracy increases and loss decreases over the epochs for both the training and validation sets

The Dense layer is fully connected; in other words, each of its neurons receives information from every neuron in the layer before it. The BatchNormalization layer normalises input components using the batch mean and variance before estimating the standardised activation.

C. Error metrics

The following metrics are employed to gauge how well the dish classification performs and how it compares with related work:

$$ accuracy = \frac{{Tr_{p} + Tr_{n} }}{{Fa_{n} + Fa_{p} + Tr_{p} + Tr_{n} }}, $$
(3)
$$ precision = \frac{{Tr_{p} }}{{Tr_{p} + Fa_{p} }}, $$
(4)
$$ recall = \frac{{Tr_{p} }}{{Tr_{p} + Fa_{n} }}, $$
(5)
$$ F1\;score = \frac{{2Tr_{p} }}{{2Tr_{p} + Fa_{p} + Fa_{n} }}, $$
(6)

where, in turn, Trp, Trn, Fap, and Fan stand for True Positive, True Negative, False Positive, and False Negative. Additionally, the area under the Receiver Operating Characteristic (ROC) curve (AUC) is employed. AUC is evaluated as the total area enclosed by the ROC curve and the horizontal axis.

The ROC curve is plotted from the TPR and FPR at various threshold values:

$$ TPR = \frac{{Tr_{p} }}{{Tr_{p} + Fa_{n} }}, $$
(7)
$$ FPR = \frac{{Fa_{p} }}{{Fa_{p} + Tr_{n} }} $$
(8)

Additionally considered is Cohen's Kappa (Kappa score). The Kappa rating is determined by:

$$ k = \frac{{p_{0} - p_{e} }}{{1 - p_{e} }}, $$
(9)

where pe stands for the expected (chance) agreement and p0 refers to the observed (empirical) agreement.
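Eqs. (3)-(6) and (9) can be computed directly from the confusion-matrix counts. A binary-case sketch follows; the example counts are illustrative, not taken from the paper:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and Cohen's kappa
    from binary confusion-matrix counts, per Eqs. (3)-(6) and (9)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed agreement p0 vs chance agreement pe
    p0 = accuracy
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (p0 - pe) / (1 - pe)
    return accuracy, precision, recall, f1, kappa

acc, prec, rec, f1, kappa = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```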

4 Experimental Results and Discussion

A. Model training and configuration

The proposed system is developed on the Google Colab platform using the training, validation, and test sets in the ratio 8:1:1. The number of epochs is set to 30. Categorical cross-entropy is employed as the loss function:

$$ Loss = - \sum\limits_{c = 1}^{M} {y_{o,c} \ln \left( {p_{o,c} } \right),} $$
(10)

where c is a class label, po,c denotes the predicted probability that observation o belongs to class c, yo,c is a binary indicator (0 or 1) that equals 1 if class c is the correct label for observation o, and M denotes the number of classes.
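A minimal NumPy sketch of Eq. (10), averaged over observations; the one-hot label and predicted probabilities are illustrative:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """Eq. (10): -sum over classes of y_{o,c} * ln(p_{o,c}),
    averaged over the observations in the batch."""
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[0.0, 1.0, 0.0]])        # one-hot: true class is index 1
y_pred = np.array([[0.1, 0.8, 0.1]])        # predicted class probabilities
loss = categorical_cross_entropy(y_true, y_pred)  # equals -ln(0.8)
```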

The model is trained using the Adam optimisation technique [23] with the Keras framework's default configuration.

All of the models analysed in the study (VGG-16, ResNet-50, and EfficientNet-B0) are trained with the above-mentioned settings. The accuracy and loss training graphs are displayed in Fig. 6.

Fig. 7
The confusion matrix for the nine classes (Banh Chung, Banh Mi, Banh Tet, Banh Trang, Bun, Pho, Goi Cuon, Com Tam, and Banh Xeo); the highlighted diagonal values are 0.96, 0.97, 0.98, 0.92, 0.92, 0.85, 0.86, 0.96, and 0.81

B. Outcomes and hypotheses

The trained model is exported and run on the test dataset to gauge how well the suggested strategy works. Figure 7 displays the resulting confusion matrix.

Fig. 8
The UEH-VDR application: showing the recipe (right), identifying the food and showing its properties (middle), and synchronising an input photo (left)

The error metrics are then assessed from the confusion matrix. The suggested method's accuracy, recall, precision, AUC, Kappa score, and F1 score are shown in Table 1 together with the values for VGG-16, EfficientNet-B0, and ResNet-50. As can be seen, EfficientNet-B0 based on transfer learning produced the greatest results across all metrics.

Table 1 Efficiency evaluation of the CNN model
C. Mobile application development

To apply the findings to the recognition of Vietnamese dishes, the trained model is used to create an application called Vietnamese Dish Recognition (VDR). The application was created with the Flutter framework, so the VDR app may run on many operating systems, such as Android and iOS, from a single source code base. The Android app can be downloaded from the Google Play Store at https://play.google.com/store/apps/details?id=com.ueh.vdr. The application will eventually be released on the Apple App Store.

The application's user interface is depicted in Fig. 8. Users can upload images from local storage or take new ones with the integrated camera. The software identifies the dish as one of the nine available classes and displays some helpful information, such as the dish's name, its pronunciation in Vietnamese, an explanation of its ingredients, preparation instructions, and locations where it may be eaten.

5 Conclusion

This paper has put forth a new approach to dish recognition that combines the EfficientNet model with transfer learning. A number of significant layers, including BatchNormalization, GlobalAveragePooling2D, Dropout, and Dense, are added to EfficientNet-B0. Transfer learning was then applied to take advantage of the knowledge gained throughout the ImageNet pretraining procedure. The suggested strategy delivered excellent performance and accuracy for dish recognition. Additionally, a mobile application was created that uses the trained model to support food tourism. While the suggested approach operates successfully and consistently, some shortcomings must be addressed in future work, including expanding the number of food categories and integrating dish calorie evaluation.