1 Introduction

Chewing is one of the main ways in which we perceive food texture [25]. While there is a huge variety of products with different food textures (e.g. crispy products like potato chips, or hard and moist products like apples and cucumbers), some textures are generally perceived as more pleasant and desirable than others [11]. It is clear from several studies that food texture and structure are becoming more important in understanding eating behavior, especially in food intake regulation and weight management [1, 17, 29, 31]. Taking food structure into account in dietary advice may support the prevention and dietary treatment of overweight and obesity [6, 9, 14]. However, these effects still need to be supported by longer-term studies outside the laboratory. Current dietary assessment methods, such as diaries and recalls, rely on memory and do not provide detailed information about the texture of foods [7].

Currently, there is a strong effort toward creating automated tools for dietary monitoring, in the context of understanding and preventing obesity and eating disorders [22, 27]. One of the first approaches was to automatically detect chewing based on the audio captured by an in-ear microphone [4]. Such audio signals have also been used to extract information such as the food type [3, 23]. There are also alternative approaches for identifying food types; for example, photos that people take with their mobile phones can be used to detect food-relevant images and subsequently to perform image segmentation and identify the food type [10, 13, 18]. Alternatively, user input can be requested when an automatic eating detection system detects eating activity [5].

Identifying food-content–relevant information such as food type from audio signals is commonly formulated as a multi-class problem where each class is a different food type from a pre-determined list [2, 4]. The selection of food types usually aims at creating a diverse set with different textures, such as crispy and non-crispy food types; crispiness, however, is only one of the attributes of texture. In the early work of [19], a set of texture reference scales was introduced that includes multiple attributes. Table 1 presents three groups of attributes: attributes related to surface attributes and springiness, attributes assessed during mastication (chewing), and those assessed during manual manipulation. Of the three groups, the attributes assessed during mastication are the ones of interest to this work. The more recent work of [30] presents a more extensive and complete review of the state of knowledge on food texture. According to that work, it is commonly accepted that “texture is the sensory and functional manifestation of the structural, mechanical and surface properties of foods detected through the senses of vision, hearing, touch and kinesthetics”. As a result, no single-modality sensor (such as a microphone) can completely identify texture.

Table 1. Food-texture attributes as presented and organized in [19], and their correspondence with the food attributes used in this work.

It is worth mentioning that there are additional, non-medical fields where research in understanding human eating behavior is also useful. More specifically, in the field of food design and engineering, how people perceive certain attributes of food (such as crispiness) has been found to correlate with freshness (in particular for apples [12, 24]), which in turn is the most important factor in consumers’ choices [16, 26]. Thus, providing tools that can objectively measure such food attributes can not only help to assess eating behavior and food intake, but can also help food technologists design food with more desirable and pleasant characteristics.

In this work we use an in-ear microphone that is part of the wearable prototype sensor developed in the context of the SPLENDID project, and focus on the audio captured during chewing in order to recognize three attributes of food texture across a variety of food types. The three attributes are crispiness, wetness, and chewiness. Each corresponds to one of the attributes “assessed during mastication” according to [19] (Table 1). The attributes “related to surface attributes and springiness” can also be loosely mapped to these three attributes. The attributes “assessed during manual manipulation” are not of interest to this work (they are included in the table only for completeness).

We propose an algorithm for recognizing each of the three food-texture attributes of this work (i.e. crispiness, wetness, and chewiness) from a single chew, based on three individual binary SVMs (one per attribute). We also propose a modified version of the algorithm that operates on entire chewing bouts. We evaluate generalization both to new subjects and to new food types.

2 Attribute Recognition Algorithms

The algorithms presented in this work require audio with a sampling rate of at least 8 kHz. The original audio recordings of the dataset used in this work (see Sect. 3) were sampled at 48 kHz. We have experimented with down-sampled versions of the signal, in particular 2, 4, 8, 16, and 32 kHz, as well as the original 48 kHz. By repeating the experiments presented in this paper for all sampling frequencies, we have observed that down-sampling as low as 8 kHz does not cause any noticeable drop in recognition effectiveness, whereas down-sampling below 8 kHz does. In all the following, we use the 8 kHz down-sampled versions of the audio signals.

We also apply a high-pass Butterworth filter to the down-sampled signal to remove low-frequency components. The filter is of ninth order with a cut-off frequency at approximately 20 Hz. We propose two algorithms: the first recognizes attributes for each chew individually, while the second operates on entire chewing bouts.
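This pre-processing step can be sketched as follows (a minimal Python/SciPy sketch; only the filter order and cut-off are given in the text, so the use of second-order sections is an implementation choice made here for numerical stability):

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 8000  # Hz, the down-sampled rate used throughout this work

def highpass(x, fs=FS, order=9, cutoff=20.0):
    """Ninth-order Butterworth high-pass at ~20 Hz, applied in
    second-order sections (SOS) for numerical stability at such a
    low relative cut-off."""
    sos = butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, x)
```

For example, applying `highpass` to a signal with a DC offset removes the offset while leaving components well above 20 Hz (such as chewing sounds) essentially untouched.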

2.1 Chew-Level Algorithm

A feature vector is first extracted from each chew; note that start and stop time-moment annotations for chews have been made available through manual annotation based on acoustic and visual inspection of the captured audio signals (in a fully automated application, a chew-detection algorithm such as [22] can be used to obtain them). The extracted features consist of signal energy in 11 log-scale frequency bands based on time-varying spectrum (TVS) estimation, fractal dimension (FD), condition number (CN) of the auto-correlation matrix, and higher-order statistics (i.e. third- and fourth-order moments). These features have been used in [22] for chewing detection and are thus a good starting point for food-texture–attribute recognition. Since the audio sampling frequency in [22] is only 2 kHz (compared to 8 kHz in this work), we have added two more bands to the TVS, namely 2 to 4 and 4 to 8 kHz. It is also worth noting that each chew has a different duration (average of 0.56 s and standard deviation of 0.15 s in the data-set used in this work), in contrast to the fixed-length windows often used in signal processing. The features we have selected, however, are invariant to the length of the audio segment from which they are extracted.
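A sketch of such a length-invariant feature vector is given below. It is illustrative only: the exact band edges, autocorrelation order, and FD variant of [22] are not specified here, so log-spaced band edges, 10 autocorrelation lags, and the Katz fractal dimension are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import kurtosis, skew

def band_energies(x, fs=8000, n_bands=11, f_min=20.0):
    """Relative signal energy in log-spaced frequency bands (band edges
    here are an illustrative log scale from f_min to fs/2, not the exact
    edges of [22]); normalization makes the result length-invariant."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    edges = np.geomspace(f_min, fs / 2, n_bands + 1)
    e = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                  for lo, hi in zip(edges[:-1], edges[1:])])
    return e / e.sum()

def katz_fd(x):
    """Katz fractal dimension (one common FD estimator, used here as a
    stand-in for the FD feature of [22])."""
    dists = np.abs(np.diff(x))
    L, d, n = dists.sum(), np.max(np.abs(x - x[0])), len(dists)
    return np.log10(n) / (np.log10(n) + np.log10(d / L))

def chew_features(x, fs=8000, p=10):
    """One fixed-length feature vector per chew: 11 band energies, FD,
    condition number of the p x p autocorrelation matrix, and the
    third- and fourth-order moments (skewness, kurtosis)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) - 1 + p] / len(x)
    cn = np.linalg.cond(toeplitz(r))
    return np.concatenate([band_energies(x, fs),
                           [katz_fd(x), np.log10(cn), skew(x), kurtosis(x)]])
```

Because the band energies are normalized and the remaining features are scale statistics, the vector has the same length and comparable magnitudes for chews of any duration.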

Before the classification stage each feature is standardized by subtracting the mean and then dividing by the standard deviation (the mean and standard deviation of each feature are estimated over the available training set for each case). The multi-label classifier we use is an array of three binary SVMs; each SVM is related to one of the three food-texture attributes: crispiness, wetness, and chewiness. We use a radial-basis function (RBF) kernel. Parameters C and \(\gamma \) are selected automatically using 5-fold cross-validation on the training set. The optimal values are chosen using Bayesian optimization [15]; care to escape local minima of the objective function is also taken using a threshold for the standard deviation of the posterior objective function [8, 28].
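The classification stage can be sketched with scikit-learn as below. A plain grid search stands in for the Bayesian hyper-parameter optimization used in this work, and the grid values are assumptions of the sketch; standardization is fitted on the training set only, as described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ATTRIBUTES = ("crispiness", "wetness", "chewiness")

def train_attribute_svms(X, Y):
    """Train one binary RBF-SVM per texture attribute. X is
    (n_chews, n_features); Y is (n_chews, 3) with 0/1 labels per
    attribute. C and gamma are selected by 5-fold CV on the training
    set (grid search here, Bayesian optimization in the paper)."""
    grid = {"svc__C": np.logspace(-1, 2, 4),
            "svc__gamma": np.logspace(-3, 0, 4)}
    models = {}
    for j, attr in enumerate(ATTRIBUTES):
        pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        models[attr] = GridSearchCV(pipe, grid, cv=5).fit(X, Y[:, j])
    return models
```

Each fitted model then classifies any new chew-level feature vector independently for its attribute.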

As a result, each chew can be classified individually for each of the three attributes. As the food type does not change within a bout, we can obtain one decision per bout based on the chews that belong to it. One way to do so is to use majority voting across the chews of a bout (for each attribute). Another way is to consider only the first n chews of a bout, since processing of the food in the mouth transforms it into a wet bolus, thus altering the attributes of the food in its unprocessed state. All these evaluation methods are presented in Sect. 4.
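The bout-level aggregation described above amounts to the following sketch (the tie-breaking rule toward the positive class is an assumption of this sketch):

```python
import numpy as np

def bout_decision(chew_preds, n=None):
    """Majority vote over the per-chew binary decisions of one bout.
    If n is given, only the first n chews are considered, since later
    chews reflect the increasingly processed (wet) bolus rather than
    the food's unprocessed state. Ties go to the positive class."""
    votes = np.asarray(chew_preds)[:n]
    return int(votes.mean() >= 0.5)
```

For a given attribute, this is applied once per bout to the sequence of per-chew SVM outputs.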

2.2 Chewing-Bout–Level Algorithm

Chewing-bout–level detection shares the same pre-processing steps with chew-level detection: the audio signal is down-sampled and the same high-pass filter is applied to remove low-frequency components.

Bout segments are then obtained based on the chews that belong to each bout. A bout audio segment starts at the start time of its first chew and stops at the stop time of its last chew. The average bout duration in the data-set used in this work is 15.22 s with a standard deviation of 10.7 s. Since bouts are significantly longer than chews and also contain non-chewing sounds between chews (see Fig. 1), we do not extract the features directly over the entire bout duration. Instead, we obtain overlapping windows of 0.5 s length and 0.1 s step from each bout and extract the features from each window separately. Thus, we obtain one list of feature vectors per bout; each list contains, in general, a different number of vectors.
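The windowing step can be sketched as follows (only full windows are kept; what to do with a trailing partial window is not specified in the text, so dropping it is an assumption):

```python
import numpy as np

def bout_windows(x, fs=8000, win_s=0.5, step_s=0.1):
    """Split one bout segment into overlapping windows of 0.5 s length
    with a 0.1 s step; each window is later mapped to one feature
    vector. Trailing samples that do not fill a window are dropped."""
    win, step = int(win_s * fs), int(step_s * fs)
    n = (len(x) - win) // step + 1
    return [x[i * step:i * step + win] for i in range(max(n, 0))]
```

A 15 s bout at 8 kHz thus yields on the order of 150 windows, each of 4,000 samples.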

Fig. 1. An example of a chewing bout. The first four chews are marked by the gray rectangles. When entire chewing bouts are used, the audio between two successive chews is also used.

We then use a bag-of-words (BoW) approach to obtain a single feature vector of fixed length for each bout. In particular, given a set of bouts, we pool all of the feature vectors from each bout and use k-means to select a set of k centroid vectors. Once we obtain the k centroid vectors, we can transform any new bout into a feature vector of fixed length (equal to k) by computing the normalized histogram of the bout’s feature vectors against the set of k centroid vectors. Each feature vector is assigned to exactly one of the k centroids (i.e. hard assignment).
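The codebook construction and histogram encoding can be sketched as follows (the codebook size k = 32 is an assumption of this sketch; the text does not state the value used):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(train_bout_features, k=32):
    """Learn k centroid vectors with k-means over the pooled
    window-level feature vectors of all training bouts."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        np.vstack(train_bout_features))

def bout_bow(window_features, codebook):
    """Hard-assign each window's feature vector to its nearest centroid
    and return the normalized histogram: one fixed-length (k) feature
    vector per bout, regardless of the bout's duration."""
    labels = codebook.predict(np.asarray(window_features))
    hist = np.bincount(labels, minlength=codebook.n_clusters)
    return hist / hist.sum()
```

Note that `fit_codebook` must be run on training bouts only (per left-out subject or food type), so that the centroids never see test data.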

Using the BoW approach offers several advantages. It allows us to use the same features as the chew-level algorithm (Sect. 2.1) on short windows (of similar duration to chews). In addition, it allows us to extract feature vectors of fixed length from audio segments of (highly) varying length. Finally, it also handles the non-chewing sounds that occur between successive chews within each bout: window-based feature vectors that correspond to such in-between audio segments will likely be similar to each other and will be clustered together; the corresponding cluster centers are equally present in signals of different food types, and the SVM models are expected to learn to ignore them.

The BoW features are then standardized similarly to Sect. 2.1. Classification is performed in exactly the same way as in chew-level recognition: we use an array of three binary SVM classifiers with RBF kernel and hyper-parameter selection using Bayesian optimization.

3 Data-Set and Evaluation Metrics

The data-set used to evaluate the proposed algorithms has been collected at Wageningen University, Netherlands, in the context of the EU-funded SPLENDID project, and is the same data-set as the one we use in [21]. The recording apparatus is an in-ear microphone (model FG-23329-D65, manufactured by Knowles) connected by wire to a computer audio interface. The sensor housing and the recordings were done by CSEM S.A. [22]. In this work, the first version of the SPLENDID sensor is used; more details about this version and subsequent versions of the sensor can be found in [5]. In total, 21 subjects were enrolled in the data collection trials; however, signals from only 9 could be used in this work due to problems with data acquisition (such as incorrect sensor placement or corrupted audio due to hardware/software malfunction). Each subject consumed a variety of food types (a complete list can be found in [5]).

We have selected 9 different food types whose attributes we can clearly annotate. Not all 9 subjects have consumed all 9 food types. Table 2 lists these food types along with the attribute values we have assigned to them. This data-set of 9 participants and 9 food types is the one we use to evaluate this work.

Table 2. List of food types and their attributes for the evaluation data-set.

For this evaluation data-set, we have manually created ground truth at the chew and chewing-bout levels (with start and stop time-moments) based on the available experimental logs as well as audio and visual inspection of the captured signals. It contains 4,989 chews with a total duration of 46.31 min, belonging to 238 chewing bouts; the total duration of the bouts is almost 1 h and is greater than the total duration of the chews because bouts also contain the audio segments between each successive pair of chews. Tables 3 and 4 show duration statistics for the chews and chewing bouts.

Table 3. Statistics of chews duration across the food types for the evaluation data-set.
Table 4. Statistics of bouts duration across the food types for the evaluation data-set.

We evaluate each food attribute, namely crispiness, wetness, and chewiness, as a binary classification problem. We regard crispy, wet, and chewy as the positive classes, and non-crispy, dry, and non-chewy as the negative ones, respectively. To account for class imbalance, which is particularly large for chewiness, we calculate weighted accuracy as

$$\begin{aligned} \frac{w \cdot TP + TN}{w \cdot (TP + FN) + TN + FP} \end{aligned}$$
(1)

where \(w = (TN + FP) / (TP + FN)\) is the ratio of priors.
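Equation (1) can be written directly from the counts of the confusion matrix:

```python
def weighted_accuracy(tp, tn, fp, fn):
    """Weighted accuracy of Eq. (1): true positives are re-weighted by
    the ratio of class priors w = (TN + FP) / (TP + FN), so that a
    trivial majority-class classifier scores 0.5 regardless of the
    imbalance between the positive and negative class."""
    w = (tn + fp) / (tp + fn)
    return (w * tp + tn) / (w * (tp + fn) + tn + fp)
```

The last property is what makes the metric suitable for the strongly imbalanced chewiness labels: an always-negative classifier on a 10/90 split obtains 0.5, not 0.9.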

4 Evaluation and Results

The chew-level algorithm can also be modified to operate at the bout level by taking into account each bout's chews. We explore how the bout-level decision is affected by the number of chews that are taken into account. We train models both in the typical leave-one-subject-out (LOSO) fashion as well as in leave-one-food-type-out (LOFTO) fashion. In both types of experiments we use the entire available training data (from the non–left-out subjects or food types) to both obtain the BoW centroid vectors and train the SVM classifiers.
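Both evaluation schemes are instances of leave-one-group-out cross-validation, differing only in what is used as the group label. A minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def held_out_splits(groups):
    """Train/test index pairs for leave-one-group-out evaluation.
    Pass per-sample subject ids for LOSO, or per-sample food-type ids
    for LOFTO; one fold is produced per distinct group, with that
    group entirely held out of training (including BoW codebook
    fitting and hyper-parameter selection)."""
    X = np.zeros((len(groups), 1))  # placeholder; only indices are needed
    return list(LeaveOneGroupOut().split(X, groups=groups))
```

For each fold, the training indices feed both the k-means codebook and the SVM training, so no information from the held-out subject or food type leaks into the models.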

Table 5 presents per-chew classification results for each of the three attributes; the BoW centroids, SVM models, and SVM hyper-parameters have been trained in LOSO fashion. Crispiness is the attribute that the algorithm identifies most effectively. The majority-voting approach consistently improves results by 2 to \(5\%\).

While it is reasonable to assume that the food type as such does not change during a bout, the attributes of the food within the mouth do change, as the food is ground and lubricated during oral processing (chewing). Given that we are interested in identifying the attributes of the food in its unprocessed form, we can consider only the first few chews of each bout in the majority-voting stage, during which the food type’s original attributes are still retained to some degree. Figure 2 shows the recognition effectiveness for each attribute when considering only the first n chews of each bout, for \(n=1\) to 20. Recognition of crispiness exhibits high effectiveness; however, the highest (weighted) accuracy is obtained by considering the first 6 to 10 chews. Almost the same range, 5 to 10 chews, seems to be the best choice for recognizing wetness as well. Chewiness results are somewhat different, since considering only the first few chews yields erratic effectiveness; the situation improves as more chews are taken into account.

Table 5. LOSO results for chew-level recognition.
Fig. 2. Weighted accuracy for each attribute for the LOSO and LOFTO experiments.

Table 6 presents results for the LOFTO experiments. Crispiness is again the most easily recognizable attribute; however, wetness does not generalize well across different food types. Majority voting improves results for crispiness and chewiness only. Looking at Fig. 2, the highest effectiveness for wetness is achieved by considering only a few chews (6 to 7).

Table 6. LOFTO results for chew-level recognition.

Table 7 presents LOSO and LOFTO results for the bout-level algorithm. Comparing these with the chew-level results, we can see that the bout-level algorithm achieves comparable results for LOSO: slightly lower weighted accuracy for crispiness (but still quite high) and almost the same for wetness. Recognition accuracy for chewiness is worse.

On the other hand, the bout-level algorithm seems to be able to better generalize across different food types. Looking at the results of the LOFTO experiment, the algorithm achieves only slightly lower weighted accuracy (compared to the chew-level algorithm with majority voting) for crispiness and improves significantly for wetness and chewiness.

Table 7. Results for bout-level recognition.

5 Conclusions

In this work we have proposed two algorithms for automatically and objectively recognizing three attributes of food texture from audio signals captured by an in-ear microphone. The algorithms combine feature extraction and binary SVMs, and operate on both single chews and entire chewing bouts. We have examined their ability to generalize not only to new subjects (LOSO) but also to new food types (LOFTO). With these algorithms, the SPLENDID sensor was able to recognize three important food-texture attributes affecting eating behavior. In particular, crispiness was recognized with a weighted accuracy of at least 0.9 across all experiments. Results for recognizing wetness and chewiness are promising, but there is a large margin for improvement: introducing more suitable features could possibly improve the overall effectiveness of the proposed algorithms. Including more food types in the training set could also improve recognition; however, certain food types with attributes that are not easy to annotate (e.g. food types that are neither completely dry nor wet) might not be suitable for a crisp-label classification problem and may thus require alternative methods. As a result, this type of digital device will make it possible to further study objective exposure to different food textures in relation to eating behavior and longer-term outcomes, such as weight change, in a real-life setting.