1 Introduction

Food intake monitoring helps a person keep track of the details of the food consumed, especially those with overweight problems and health issues. The conventional approach relies on self-reporting, which is less suitable in the modern era [1]. With the advancement of technology, the development of automatic, objective, accurate, robust, and reliable food intake detection systems has been initiated. Research has been carried out to complete such systems, from tasks as simple as food detection to tasks as complicated as volume estimation [2], specifically for dietary monitoring applications.

A single sensing approach might not be enough for comprehensive automatic food intake monitoring; however, it could provide an important input for perfecting the system. Food intake can be detected by capturing several events, such as hand-to-mouth (HtM) movement, biting, chewing, and swallowing. Among these, chewing shows good potential because it occurs repeatedly in sequence. The sensors commonly used for food intake monitoring based on chewing activity are acoustic [3], piezoelectric [4], electromyography (EMG) [5], and accelerometer [6] sensors. The sensors are placed at the preferred location either by using a wearable device or by direct attachment.

During food intake, chewing produces a sequential movement that involves the masseter, the temporalis, the medial pterygoid, and the lateral pterygoid muscles. The temporalis movement has captured researchers' attention because it is capable of providing significant chewing signals. The sensors used to capture chewing activity based on the temporalis movement are piezoelectric, EMG, and accelerometer sensors. Piezoelectric sensors have been used by [4] and [7] for chewing detection based on the temporalis muscle movement, where the sensors were attached to the temporalis muscle by using medical tape and a wearable device, respectively. EMG sensors have been used to capture the temporalis muscle movement through an eyeglass wearable device with dry EMG electrodes [8] and stainless steel dry electrodes [5]. Accelerometers have been used to capture the temporalis movement by [9] and [6], by attaching the sensor to a headband and an eyeglass.

However, a disadvantage of the EMG and piezoelectric sensors is that they require direct attachment to the skin, which might not be suitable for everyone, such as people with allergies. For the accelerometer, even though direct attachment to the skin is not required, the signals are affected by physical activity. Non-contact sensors have also been proposed by researchers, such as a photoplethysmography (PPG) sensor [10], a photo sensor [11], a proximity sensor [12, 13], and a Doppler sonar sensor [14]. Most of the listed studies that proposed a non-contact sensor captured the chewing activity by sensing jaw movement [12, 13, 14]. These distance-based sensors are placed around the neck area by using a necklace-based wearable device, whose performance might be affected by physical activity.

In a controlled or laboratory environment, the temporalis movement-based detection methods discussed above give F1-scores of 80% [8], 91.5% [6], 94.2% [5], and 99.85% [4], while the non-contact-based detection methods give an F1-score of 91.9% [12] and an accuracy of 91.4% [14].

This study uses a new non-contact approach to capture the movement of the temporalis muscle during food consumption. The proposed method is based on a proximity sensor attached to the temple of an eyeglass using a 3D printed housing. The objective is to provide a new option for detecting chewing activity in food intake monitoring or diet monitoring applications. The chewing count estimate and the chewing rate were also extracted. The analyses were then extended to study the possibility of relating or differentiating the chewing count and chewing rate for different food hardness. The labeling of chewing, chewing episodes, and chewing count is based on self-reporting, where the subject is required to push a pushbutton.

2 Hardware Design

In this study, the chewing activity was captured based on the movement of the temporalis muscle during food consumption. A wearable sensor was designed using a Proximity 9 Click board from Mikro-Elektronika (MikroE), which is equipped with a VCNL4040 proximity sensor (by Vishay Semiconductors). The sensor combines a proximity sensor and an ambient light sensor and is capable of measuring ambient light, white light, and proximity within a range of 20 cm. For this study, the proximity function was used, and the sensor was mounted at the right temple of an eyeglass using a 3D printed housing. Generally, the sensor works by continuously emitting infrared light, and the amount of reflected light is used to measure the distance. The sensor can detect an object within a range of 20 cm; however, the measurement is qualitative, reflecting the change in the object distance from the last reading rather than a quantitative distance reading. In this study, the temporalis muscle is the target object. During chewing, the temporalis muscle movement changes the distance between the sensor and the muscle, and the collective distance changes, which represent the chewing pattern, are captured by the sensor.

The data from the sensor were transferred to a microcontroller (Arduino Uno board) at a sampling rate of 50 Hz. Besides the chewing sensor and the microcontroller, two pushbuttons were included in the chewing detection system. The pushbuttons are used for validation purposes, providing separate labeling or ground truth for eating activity and chewing activity, respectively. Figure 1 shows the implementation of the wearable sensor for capturing the temporalis muscle movement.

Fig. 1. Chewing detection system: (a) the proximity sensor in the 3D printed housing, (b) attachment of the wearable sensor to the eyeglass, (c) temporalis muscle position.

3 Methodology

3.1 Data Collection

The summary of the methodology used in this study is shown in Fig. 2. A total of ten sets of data were taken from a single subject. The data were collected in a controlled environment where only eating and resting activities were considered. For the eating activity, the subject was required to eat three test foods with different hardness. The foods were banana, apple, and carrot, which represent soft, medium, and hard food, respectively, as used by [7]. The portion for each intake was one spoonful, equivalent to 9 g. The relation between one spoonful and 9 g of test food was established by cutting the test food into small pieces, placing them in a measuring spoon, and weighing them using a food scale. All test foods gave a weight of 9 g. The test food was served to the subject in a cylindrical shape with the same thickness (approximately 15 mm), diameter (approximately 27 mm), and weight (9 g). Figure 3 shows the test food preparation.

Fig. 2. Summary of methodology

Fig. 3. Test food portion preparation and measurement: (a) one spoonful of test food being weighed, and a cylindrical test food being (b) weighed and (c) measured.

The subject performed a total of ten sets of the same activity sequence: resting for 15 s, eating carrot for about 90 s, eating banana for 30 s, and eating apple for 30 s, with 30 s of resting between food intakes and 15 s of resting after the last food was taken. Each set of data takes 240 s, so a total of 2400 s is required to complete the data collection for ten sets. The sequence for a set of activities is shown in Fig. 4. While performing the activities, the subject was required to label the chewing and chewing episodes using the pushbuttons.

Fig. 4. The sequence of activity for each set of data
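
As a quick consistency check on the timing described above, the following sketch encodes one recording set as a list of (activity, duration) pairs and derives the set length, the total recording time, and the start and end of each chewing episode. The names `SEQUENCE`, `set_duration`, and `episode_bounds` are hypothetical, and the 90 s carrot block is treated as exact even though the text gives it as approximate.

```python
# Illustrative only: names are hypothetical; durations (in seconds) come from the text.
FS_HZ = 50   # sampling rate stated in the hardware section
N_SETS = 10  # number of recorded sets

SEQUENCE = [
    ("rest", 15), ("carrot", 90), ("rest", 30),
    ("banana", 30), ("rest", 30), ("apple", 30), ("rest", 15),
]


def set_duration(sequence):
    """Total duration of one set in seconds."""
    return sum(duration for _, duration in sequence)


def episode_bounds(sequence):
    """Map each food to its (start_s, end_s) interval within the set."""
    bounds, t = {}, 0
    for name, duration in sequence:
        if name != "rest":
            bounds[name] = (t, t + duration)
        t += duration
    return bounds


if __name__ == "__main__":
    per_set = set_duration(SEQUENCE)
    # Seconds per set, seconds for all sets, samples per set
    print(per_set, per_set * N_SETS, per_set * FS_HZ)
    print(episode_bounds(SEQUENCE))
```

Under these assumptions, one set spans 240 s, which corresponds to 12000 samples at 50 Hz.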

3.2 Chewing Detection

Data Pre-processing

In this stage, the raw sensor signal is prepared for use in the classification stage. A suitable pre-processing method helps to emphasize the desired signal by removing unwanted signals or noise. To determine the method, the raw data are first observed using a time, frequency, or time-frequency representation, or a combination of these. Then, depending on the application, pre-processing methods such as normalization, detrending, smoothing, down-sampling, and filtering can be used to obtain the desired signal.

Based on the observation of the raw data shown in Fig. 5 (a) and the spectrogram shown in Fig. 6, normalization and a bandpass filter are used to pre-process the signals. Normalization is required to eliminate the amplitude variation caused by factors such as the position of the sensor and the distance between the sensor and the temporalis muscle. Several normalization methods can be used, such as z-score normalization, minimum-to-maximum normalization, and signal-to-median normalization. Additionally, the researchers in [15] suggest that normalization could improve the classification accuracy. The z-score normalization given in (1) is used. Next, the bandpass filter is used to preserve the desired signal. The lower cut-off frequency, fc1, is used to remove the DC component, and the upper cut-off frequency, fc2, is used to keep the signal within the range of the chewing frequency. Previous studies have defined the chewing frequency range as 0.94 Hz to 2.17 Hz [16], 1.25 Hz to 2.5 Hz [17], and 0.5 Hz to 2.5 Hz [9]. The time-frequency distribution was then observed using the spectrogram in Fig. 6, which shows a high power spectral density in the range of 3 Hz to 5 Hz. Hence, this study analyzes the chewing classification using an fc1 of 0.5 Hz and fc2 values ranging from 2.5 Hz to 10 Hz. Additionally, the effect of setting fc2 to 15 Hz and 20 Hz was also observed.

$$ z = \frac{{x - \bar{x}}}{S} $$
(1)

where x is the sample data, \( \bar{x} \) is the mean of the sample, and S is the standard deviation of the sample.
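
The two pre-processing steps can be illustrated with a short NumPy sketch. It is not the authors' implementation: the text does not state the filter design, so a zero-phase brick-wall FFT mask stands in for the bandpass filter, and the use of the sample standard deviation (`ddof=1`) in the z-score is an assumption. The function names `zscore` and `fft_bandpass` are hypothetical.

```python
import numpy as np


def zscore(x):
    """Equation (1): subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)  # ddof=1 (sample std) is an assumption


def fft_bandpass(x, fs, fc1, fc2):
    """Zero every FFT bin outside [fc1, fc2] Hz (brick-wall stand-in filter)."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < fc1) | (freqs > fc2)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


if __name__ == "__main__":
    fs = 50  # Hz, sampling rate stated in the hardware section
    t = np.arange(10 * fs) / fs
    # Synthetic stand-in signal: DC offset + 1.5 Hz "chewing" + 12 Hz noise
    raw = 5.0 + 2.0 * np.sin(2 * np.pi * 1.5 * t) + np.sin(2 * np.pi * 12.0 * t)
    filtered = fft_bandpass(zscore(raw), fs, fc1=0.5, fc2=2.5)
    print(filtered[:5])
```

With fc1 = 0.5 Hz the DC offset is removed, and changing `fc2` reproduces the range of upper cut-off frequencies explored in this study. A causal or IIR design would behave differently near the band edges.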

Fig. 5. An example of the proximity signal: (a) raw, (b) normalized, (c) filtered

Fig. 6. Time-frequency representation (spectrogram) of the raw signal

Segmentation and Feature Extraction

To extract the features, the pre-processed signal is segmented using an appropriate window type, window size or length, and overlap length. The selection of the segmentation parameters depends on the signal application, where segmentation is used to capture the energy envelope of the signal. For chewing detection, there is no standard window size [18]; however, the most commonly used time resolutions are 3 s [3, 4, 19], 4 s [20], 5 s [12, 21,22,23], 15 s [24], and 30 s [17].

This study extracts features from the time domain (TD), frequency domain (FD), and time-frequency domain (TFD). In TD and FD, the features are extracted directly with a segmentation setting of 3 s and 50% overlap. For TFD, each TD segment is multiplied by a Hanning window and converted to TFD using the Fast Fourier Transform (FFT). Since the FFT is applied to each windowed segment, the process is known as the short-time Fourier transform (STFT). The STFT is usually represented by a spectrogram, from which the STFT magnitude values and the power spectral density (PSD) from 0 to half of the sampling frequency can be obtained. The features extracted for chewing classification are based on the significant features applied by [12,13,14, 18, 19]. A total of 40 features were extracted, comprising 10 features from TD, 3 features from FD, and 27 features from TFD. The list of extracted features is shown in Table 1.

Table 1. Extracted features
Table 2. Classifier and its performance for variation of the upper cutoff frequency
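
The segmentation and STFT steps described in this subsection can be sketched as follows, assuming the 50 Hz sampling rate, 3 s windows, and 50% overlap given in the text. The function names `segment` and `stft_magnitude` are hypothetical, the 40 individual features are not reproduced, and the handling of a trailing partial window (dropped here) is an assumption.

```python
import numpy as np

FS_HZ = 50     # sampling rate from the hardware section
WIN_S = 3.0    # window length in seconds
OVERLAP = 0.5  # 50% overlap between consecutive windows


def segment(signal, fs=FS_HZ, win_s=WIN_S, overlap=OVERLAP):
    """Split a 1-D signal into overlapping windows, one window per row."""
    x = np.asarray(signal, dtype=float)
    win = int(round(win_s * fs))
    hop = int(round(win * (1.0 - overlap)))
    if len(x) < win:
        return np.empty((0, win))
    n_win = 1 + (len(x) - win) // hop  # a trailing partial window is dropped
    return np.stack([x[i * hop:i * hop + win] for i in range(n_win)])


def stft_magnitude(segments, fs=FS_HZ):
    """Apply a Hanning window to each segment and return (freqs, |FFT|)."""
    win = segments.shape[1]
    windowed = segments * np.hanning(win)
    magnitude = np.abs(np.fft.rfft(windowed, axis=1))
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)  # 0 Hz up to fs / 2
    return freqs, magnitude


if __name__ == "__main__":
    t = np.arange(240 * FS_HZ) / FS_HZ
    demo = np.sin(2 * np.pi * 2.0 * t)  # synthetic 2 Hz stand-in for chewing
    segs = segment(demo)
    freqs, mag = stft_magnitude(segs)
    print(segs.shape, mag.shape, freqs[np.argmax(mag[0])])
```

With these settings each window holds 150 samples, so the frequency resolution of the STFT is 1/3 Hz and the spectrum spans 0 Hz to 25 Hz.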

Classification and Evaluation

The final stage is to classify each candidate segment as either chewing or non-chewing. The classification model must first be trained and validated, which requires a reference label for each segment. In this study, self-reported labeling using pushbuttons is used to indicate the chewing episodes (food intake) and the chewing activity. For the chewing episodes, the subject is required to push the chewing episode pushbutton to mark the starting and ending points of each food consumption. For the chewing label, the subject is required to push the chew label pushbutton during each chew. Only the chewing label signal is used to compute the reference labels for training and validation. The chewing label is first segmented using the same segmentation setting as the chewing signal. Then, the average of each segmented chew label is taken, and the segment is labeled as chew (C = +1) if the average is more than 0; otherwise, it is labeled as non-chew (C = −1). An example of the chewing label and chewing count used as ground truth is given in Fig. 7.

Fig. 7. Ground truth: (a) labeling of chewing and chewing episodes, (b) chewing count
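
The rule for turning the pushbutton signal into one reference label per window can be sketched as below. This is a minimal illustration that assumes the chew pushbutton is recorded as 0 when released and as a positive value when pressed. The function name `window_labels` and its parameters are hypothetical.

```python
def window_labels(chew_label, win, hop):
    """Return +1 (chew) or -1 (non-chew) for every window of the label signal.

    chew_label: sequence of pushbutton samples (assumed 0 = released, >0 = pressed)
    win, hop:   window length and step in samples (e.g. 150 and 75 for 3 s, 50%
                overlap at 50 Hz)
    """
    labels = []
    for start in range(0, len(chew_label) - win + 1, hop):
        window = chew_label[start:start + win]
        mean = sum(window) / win
        labels.append(1 if mean > 0 else -1)  # rule stated in the text
    return labels


if __name__ == "__main__":
    # Synthetic label: 6 s of rest, a 0.2 s button press, then rest again
    demo = [0] * 300 + [1] * 10 + [0] * 140
    print(window_labels(demo, win=150, hop=75))
```

Because the threshold is an average above 0, a single button press anywhere inside a window is enough to mark the whole window as chewing, which makes the labels sensitive to press timing.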

All chewing features and labels are fed to the Classification Learner app in MATLAB 2020 (MathWorks, Inc.) for classification and evaluation. The classifier was trained using the k-fold cross-validation (CV) method, which divides the data into training and testing partitions according to the parameter “k”. This study used a cumulative duration-based evaluation, in which the ten individual sets of data (240 s each) were combined to form a single dataset (2400 s). The dataset was partitioned according to the duration of an individual set (240 s), which leads to “k” being set to 10. The validation is repeated “k” times, and the final evaluation is the average of the evaluated performance metrics. The classifier performance is based on the accuracy and F1-score, calculated using (2), (3), (4), and (5), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

$$ Precision = \frac{TP}{TP + FP} $$
(2)
$$ Recall = \frac{TP}{TP + FN} $$
(3)
$$ F_{1} \,score = \frac{2 *Precision *Recall}{Precision + Recall} $$
(4)
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
(5)
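
Equations (2) to (5) translate directly into code. The sketch below counts the confusion-matrix entries from ±1 labels and evaluates the four metrics. The function names are hypothetical and the example labels are made up for illustration, not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` as the chew class (C = +1)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Evaluate equations (2)-(5); assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Made-up labels for illustration only
    y_true = [1, 1, 1, -1, -1, 1, -1, -1]
    y_pred = [1, 1, -1, -1, 1, 1, -1, -1]
    print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

In a k-fold setting these metrics would be computed once per fold and then averaged, as described above.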

3.3 Chewing Analysis

Chewing Count

In the chewing count analysis, the pre-processed signals with upper cut-off frequencies, fc2, of 2.5 Hz, 5 Hz, and 6 Hz are used to estimate the chew count. The selection of fc2 was based on the results of the classification stage. The 5 Hz and 6 Hz values were selected because they give the highest classifier performance compared to the other frequencies. Meanwhile, even though 2.5 Hz gives the lowest accuracy, it was chosen because researchers agree that the chewing frequency is below 2.5 Hz. To further analyze the effect of fc2 on the chew count estimation, fc2 values of 2.3 Hz and 2.4 Hz were also analyzed. The signal was segmented into 240 s windows, and the number of peaks was computed for each segment. For fc2 of 2.3 Hz, 2.4 Hz, and 2.5 Hz, only peaks that lie within the labeled chewing episodes (refer to Fig. 7 (a)) and have a value greater than 0 are counted. For 5 Hz and 6 Hz, an additional restriction of a minimum peak prominence of 0.33 and 0.35, respectively, was applied. The minimum peak prominence values were determined by trial and error to give the smallest total error.
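
The peak-counting rule used for fc2 of 2.3 Hz to 2.5 Hz can be sketched in plain Python as below. A peak is taken here to be a sample strictly greater than both neighbours, which is an assumption about the peak definition (flat-topped peaks are ignored). The minimum peak prominence restriction applied for 5 Hz and 6 Hz is not implemented in this sketch. The function name `count_chews` is hypothetical.

```python
def count_chews(signal, episode_mask):
    """Count positive local maxima that fall inside labeled chewing episodes.

    signal:       filtered, normalized proximity samples
    episode_mask: same length as signal; truthy where the chewing-episode
                  pushbutton marks food intake
    """
    count = 0
    for i in range(1, len(signal) - 1):
        is_peak = signal[i - 1] < signal[i] > signal[i + 1]  # strict local maximum
        if is_peak and signal[i] > 0 and episode_mask[i]:
            count += 1
    return count


if __name__ == "__main__":
    # Synthetic stand-in: a repeating 0, 1, 0, -1 pattern has one peak per 4 samples
    demo_signal = [0, 1, 0, -1] * 5
    demo_mask = [True] * 10 + [False] * 10  # episode covers the first half only
    print(count_chews(demo_signal, demo_mask))
```

Restricting the count to the episode mask is what allows a separate chew count to be reported for each test food within a 240 s set.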

This study used test foods with different hardness and chewing times. Hence, instead of estimating the chew count for the whole segmented window, the chewing counts were estimated for each chewing episode. For this analysis, the number of peaks counted from the chewing signals was compared with the number of peaks counted from the chewing label, as in Fig. 7 (b). The capability of the chewing detection approach in estimating the chewing counts was evaluated using the percentage of error for each chewing episode and for the total chewing in each set of data. The percentage of absolute error for an individual chewing episode and the mean percentage of absolute error over the chewing episodes were calculated using (6) and (7), respectively. In addition, the effect of food hardness on the chewing rate was observed. The chewing rate is calculated using (8).

$$ \left| \%\,Error\left( n \right) \right| = \left| \frac{C_{Act}\left( n \right) - C_{Est}\left( n \right)}{C_{Act}\left( n \right)} \right| \times 100 $$
(6)
$$ \overline{\left| \%\,Error \right|} = \frac{1}{M}\sum\limits_{n = 1}^{M} \left| \frac{C_{Act}\left( n \right) - C_{Est}\left( n \right)}{C_{Act}\left( n \right)} \right| \times 100 $$
(7)
$$ C_{R}\left( n \right) = \frac{C_{Act}\left( n \right)}{C_{T}\left( n \right)} $$
(8)

where \( C_{Act} \) is the actual chew count based on the chew count label, \( C_{Est} \) is the estimated chew count, M is the number of chewing episodes, \( C_{T} \) is the time taken for each chewing episode, \( C_{R} \) is the chewing rate or chewing frequency, and n is the index of the respective chewing episode.
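
The error measures and the chewing rate can be written as a few small functions. This is a hedged sketch: the function names are hypothetical, the example numbers are made up rather than taken from the tables, and the chewing rate is computed as count divided by duration in seconds so that it comes out in chews per second (Hz), the unit in which the chewing rates are reported in the results.

```python
def abs_pct_error(actual, estimated):
    """Absolute percentage error for one chewing episode."""
    return abs((actual - estimated) / actual) * 100


def mean_abs_pct_error(actuals, estimates):
    """Mean of the per-episode absolute percentage errors over M episodes."""
    errors = [abs_pct_error(a, e) for a, e in zip(actuals, estimates)]
    return sum(errors) / len(errors)


def chewing_rate(count, duration_s):
    """Chews per second (Hz) for one episode; duration is assumed to be in seconds."""
    return count / duration_s


if __name__ == "__main__":
    # Made-up counts for three episodes, for illustration only
    actual = [100, 50, 40]
    estimated = [90, 55, 40]
    print(mean_abs_pct_error(actual, estimated))
    print(chewing_rate(180, 90))
```

Note that averaging per-episode errors, as in (7), weights every episode equally, whereas an error computed on summed counts is dominated by the episodes with the most chews.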

4 Results

4.1 Chewing Detection

The chewing detection approach used in this study was first evaluated based on its capability to classify the chewing activity. The upper cut-off frequency, fc2, was varied in the range of 2.5 Hz to 20 Hz. The accuracy and F1-score were computed for fc2 values of 2.5 Hz, 3 Hz, 3.5 Hz, 4 Hz, 5 Hz, 5.5 Hz, 6 Hz, 6.5 Hz, 7 Hz, 8 Hz, 9 Hz, 10 Hz, 15 Hz, and 20 Hz, and the results are shown in Table 2 along with the classifier used. The classifier performances were then plotted and compared in Fig. 8. Based on the results, the accuracy and F1-score follow essentially the same trend. Comparing the performance, 2.5 Hz gives the lowest accuracy of 92.6% using the Medium Gaussian Support Vector Machine (SVM), while 6 Hz gives the highest accuracy of 97.4% using the Quadratic SVM classifier. As fc2 increases further, the accuracy decreases at a roughly constant rate and remains at approximately 97%.

Table 3. Mean absolute error of chewing count estimation
Fig. 8. Classifier performance for different upper cut-off frequencies of the bandpass filter

4.2 Chew Count Estimation

The chew count estimation is based solely on the number of peaks in the sensor signal. Signals with different upper cut-off frequencies were considered, and only peaks within the labeled chewing episodes were counted. In each chewing episode, the number of peaks was counted and compared with the count from the chew count algorithm, which was developed based on the automatic chew label. Each segmented window consists of three chewing episodes corresponding to the three test foods.

The average chew count estimation and its absolute error compared to the actual chew count (based on the chew label) are shown in Table 3. The absolute error for each test food is plotted in Fig. 9. Based on the results, the mean absolute error of the total chew count estimation is less than 4.0% for fc2 of 2.3 Hz to 2.5 Hz, compared to 11.77% and 12.11% for fc2 of 5 Hz and 6 Hz, respectively. Among the three lower frequencies, 2.3 Hz gives the smallest error of 5% when chewing the banana, 2.4 Hz gives an error of 6.41% when chewing the apple, and 2.5 Hz gives the smallest error of 2.9% when chewing the carrot. However, based on the total error, 2.4 Hz gives the smallest error of 2.69%, compared to 3.90% and 3.21% for 2.3 Hz and 2.5 Hz, respectively.

Fig. 9. The absolute error of the chewing count for different upper cut-off frequencies

Since fc2 equal to 2.4 Hz provides the smallest error in chewing count estimation, the details of the chewing count estimation, including the mean, standard deviation, percentage of error, and absolute error for each chewing episode of the dataset, are shown in Table 5, while the distribution of the mean absolute error is shown in Fig. 10. The collected dataset consisted of 2635 chews from 30 chewing episodes, of which 1713 were from chewing carrot, 429 from chewing banana, and 520 from chewing apple. The mean chewing counts over the 10 sets of data were 171.3, 42.20, and 52, with standard deviations of 19.36, 10.10, and 12.95, for carrot, banana, and apple, respectively. Next, the chewing counts were evaluated based on the sum of the chewing counts of all 10 sets of data, and the results are shown in Table 4. The percentage of error based on the sum of the chewing count estimation for carrot, banana, apple, and the total is 0.94%, 6.72%, 2.99%, and 0.76%, respectively. Considering the sum of the chewing counts by test food, carrot requires the most chewing, followed by apple and banana, with estimated chewing counts of 1713, 520, and 429, respectively. Hence, there is a possibility that food hardness is related to the chewing count.

Table 4. Percentage of error based on total chewing count for fc2 equal to 2.4 Hz
Fig. 10. Distribution of the mean absolute error of the chewing count for fc2 equal to 2.4 Hz

Table 5. The details of the chew count estimation in a dataset for fc2 equal to 2.4 Hz

The chew count analysis was extended to study the possibility of differentiating food hardness by the chewing rate. The chewing rates for fc2 equal to 2.4 Hz are given in Table 6, while Fig. 11 presents the chewing rate of the 10 sets of data according to food type. The chewing rate for all food types was in the range of 1.7 Hz to 2.3 Hz. From the graph, there is no significant pattern that could differentiate food hardness according to the chewing rate. The chewing label data, or ground truth data, of chewing count, chewing time, and chewing rate are given in Table 6, while the details of the chewing label data are given in Table 7.

Fig. 11. The chewing rate based on food type for fc2 equal to 2.4 Hz

Table 6. The chewing rate for fc2 equal to 2.4 Hz
Table 7. The chewing label data

5 Discussions

This work presented a new approach to chewing detection and classification based on a proximity sensor. The proposed approach aims to support the development of non-contact chewing detection that could ensure the user's comfort while maintaining reliability. A VCNL4040 proximity sensor on a click board manufactured by Mikro-Elektronika (MikroE) was used. The sensor was placed in a 3D printed housing, which was then attached to the temple of an eyeglass near the temporalis muscle to capture the temporalis muscle movement. The labeling or ground truth of chewing, chewing episodes, and chewing count is based on self-reporting using pushbuttons.

For chewing classification, several upper cut-off frequencies of the bandpass filter were used with a segmentation window of 3 s. Based on the results, fc2 equal to 6 Hz gives the highest accuracy of 97.4% using the Quadratic SVM classifier, compared to fc2 equal to 2.5 Hz, which gives an accuracy of 92.6% using the Medium Gaussian Support Vector Machine (SVM). Even though previous studies agree that the chewing frequency is below 2.5 Hz, in this study an fc2 of 2.5 Hz does not give an accuracy comparable to that of 6 Hz. As this study only considered chewing food and resting, the signal noise due to motion artifacts could be neglected.

The chewing signals were then further analyzed in terms of chewing count and chewing rate. The signal was segmented into windows of 240 s, each corresponding to one set of data. For chewing count estimation, only the peak count feature was extracted, and the estimation is based on the chewing episodes and the total chewing in each segmented window. The fc2 values were selected based on the classification accuracy. According to the chew count estimation results, fc2 of 2.4 Hz gives the smallest total absolute error of 2.69% compared to the other fc2 values. This total absolute error is comparable to or even smaller than those of previous studies, namely 8.09% ± 7.16% [25], 10.4% ± 7.0% [21], 9.66% [26], 3.83% [27], and 12.2% [9], which used a peak detection algorithm, a histogram-peak detection algorithm, a multiple regression model, a multivariate regression model, and the maximum frequency component (MFC), respectively.

Even though fc2 of 6 Hz gives the highest accuracy in the classification stage, it does not give a low percentage of error in chewing count estimation. Additionally, 6 Hz required a restriction during peak extraction, whereas 2.4 Hz did not. Focusing on fc2 of 2.4 Hz and 6 Hz, and referring to the classification and chewing count estimation results, it can be inferred that the chewing frequency is in the range of 2.5 Hz. However, 2.5 Hz does not give good accuracy in the classification stage because the labeling of the chewing signal is based on self-reporting (using a pushbutton). The chewing signal and the chewing label do not tally, due to delays in pushing the pushbutton during data collection. The unsynchronized data and labels have a greater effect when a shorter segmentation window is used, as the chewing data are wrongly labeled. This is consistent with the fact that the chewing classification stage used a shorter window of 3 s, compared to 240 s for the chewing count estimation.

Next, the relation between food hardness and the chewing count and chewing rate was analyzed. Based on the work done, the total chewing count could be used to differentiate food hardness. However, the chewing rate does not show an obvious pattern when chewing food with different hardness.

6 Conclusion

The new approach to chewing detection was analyzed. The proposed system achieved a high accuracy of 97.4% and an F1-score of 97.6% for chewing detection using fc2 equal to 6 Hz in its bandpass filter. When fc2 is set to 2.5 Hz, the accuracy drops to 92.6%; however, the mean absolute percentage error of the chewing count is 3.21%, compared to 12.11% for 6 Hz. Changing fc2 to 2.4 Hz, with the aim of finding the optimal fc2, further improves the percentage of error to 2.69%. The results relating the chewing count to food hardness show potential and could be investigated further. Overall, the results suggest that the proposed approach could be used to characterize chewing activity. However, the labeling method needs further modification, either by using manual labeling or by improving the current self-reporting method. In addition, more data will be collected from different subjects to demonstrate the effectiveness of the system.