1 Introduction

The increased prevalence of overweight and obesity has become a major factor in public spending in countries around the world [6]. Studies estimate that 57.8% of the world population will be overweight or obese by 2030 if current trends continue [6]. Obesity is commonly associated with several metabolic dysfunctions, such as insulin resistance [2, 36], metabolic syndrome [29, 35], increased blood glucose [1], dyslipidemia and hypertension, and with the development of other diseases such as type 2 diabetes, cardiovascular diseases [11, 20] and atherosclerosis [1].

The World Health Organization (WHO) has defined obesity as “an abnormal or excessive fat accumulation that presents a risk to health” [13]. Currently, the diagnosis of overweight and obesity is based on the body mass index (BMI). The WHO (2004) [37] proposed cut-off points for defining underweight, normal weight, overweight and the different degrees of obesity. One limitation of BMI is that it cannot discriminate between fatty tissue and muscle tissue: it tends to produce false negatives in people with a high percentage of body fat but a normal BMI, and false positives in people with a high BMI and a large amount of muscle tissue [3, 23]. For this reason, the concomitant use of body fat percentage (BFP) and BMI is recommended for the diagnosis of obesity.

The BFP can be calculated with several methods, among them bioelectric impedance and the Siri formula [33], which uses two, four or seven different skinfolds as variables; in this research the Siri formula with two skinfolds was used to compute the BFP [10]. Currently, there are no established limits for abnormal BFP, mainly because of the limited data available around the world. Numerous studies have evaluated the relationship of overweight and obesity with BMI through skinfold thickness, finding a high, directly proportional correlation between BMI, BFP and skinfold thickness [4, 26].

Machine learning techniques have already been used to classify overweight and obesity [7]. Certain studies have used k-means to differentiate overweight and obese subjects from normal subjects using biochemical variables [21]. Other studies have used k-means to detect overweight populations, based on anthropometric measures such as waist and hip circumference [9] and on indicators of comorbidity such as diabetes, depression and atherosclerosis.

The aim of this work is to evaluate the k-means clustering algorithm using anthropometric measures to classify subjects with obesity and abnormal BFP. A database of 1053 subjects with anthropometric measurements (weight, height, arm circumferences, flexed arm circumferences, waist circumference, hip circumference, thigh circumferences, calf circumferences, triceps skinfolds, subscapular skinfolds, suprailiac skinfolds, abdominal skinfolds, thigh skinfolds, calf skinfolds, diameter of humerus and diameter of femur) was used. The following section explains the database and the k-means method used in this investigation. Sections 3 and 4 present the results and the discussion. Finally, Sect. 5 presents the conclusions and proposals for future work.

2 Methodology

2.1 Database

Between 2004 and 2012 [16], 1053 adult men and women (male = 308) from the Capital District of Venezuela were recruited at the Nutritional Assessment Laboratory of the Simón Bolívar University. The following anthropometric measurements were taken on each subject: height, weight, arm circumferences, flexed arm circumferences, waist circumference, hip circumference, thigh circumferences, calf circumferences, triceps skinfolds, subscapular skinfolds, suprailiac skinfolds, abdominal skinfolds, thigh skinfolds, calf skinfolds, humerus diameters and femur diameters.

The diagnosis of overweight was made using the WHO guidelines, which state that an overweight person has a BMI greater than or equal to 25. Within the group of overweight subjects, 23 participants had a BMI greater than or equal to 30, indicating obesity [28]. Overweight and obese subjects were placed in the same group for this study, since the goal was to classify subjects with dysfunctional weight values.
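The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholds quoted in the text; the function names are hypothetical, and placing every subject with a BMI below 25 in the "normal" group is an assumption made to mirror the two-group design of the study.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def weight_group(bmi_value: float) -> str:
    """Two-group label: BMI >= 25 is overweight/obese, anything lower is 'normal'.

    Assumption: subjects below 25 are not split further (e.g. underweight).
    """
    return "overweight/obese" if bmi_value >= 25.0 else "normal"


def is_obese(bmi_value: float) -> bool:
    """Obesity threshold (BMI >= 30), a subset of the overweight/obese group."""
    return bmi_value >= 30.0


if __name__ == "__main__":
    # Hypothetical subjects, not taken from the study database.
    for weight, height in [(60.0, 1.70), (80.0, 1.70), (95.0, 1.70)]:
        value = bmi(weight, height)
        print(f"BMI {value:.1f}: {weight_group(value)}, obese={is_obese(value)}")
```

Keeping the obesity check separate from the group label reflects the text: obese subjects are counted, but they are not treated as a third class.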

Since there are no established limits for abnormal BFP, the diagnosis of abnormal BFP was made according to [8, 18, 27], which established cut-off points of BFP < 25% for men and BFP < 30% for women as the limits of normality; values at or above these limits are considered abnormal BFP.
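As a minimal sketch of this rule, the snippet below applies the sex-specific cut-offs and, for context, includes the widely used Siri equation that converts whole-body density into BFP. The text does not state how body density was derived from the two skinfolds, so density is taken here as an input; the names `siri_bfp`, `has_abnormal_bfp` and `BFP_CUTOFF` are hypothetical.

```python
# Cut-offs quoted in the text: BFP >= 25% (men) or >= 30% (women) is abnormal.
BFP_CUTOFF = {"male": 25.0, "female": 30.0}


def siri_bfp(body_density: float) -> float:
    """Siri equation: body fat percentage from whole-body density in g/cm^3.

    Assumption: the density estimate is supplied by the caller; the
    skinfold-to-density step is not specified in the text.
    """
    if body_density <= 0:
        raise ValueError("body density must be positive")
    return 495.0 / body_density - 450.0


def has_abnormal_bfp(bfp: float, sex: str) -> bool:
    """True when the BFP is at or above the cut-off for the given sex."""
    return bfp >= BFP_CUTOFF[sex]


if __name__ == "__main__":
    # Hypothetical density value, for illustration only.
    bfp = siri_bfp(1.05)
    print(f"BFP {bfp:.1f}%: abnormal for a man? {has_abnormal_bfp(bfp, 'male')}")
```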

All the procedures carried out in the study were in accordance with the ethical standards of the Bioethics Committee of the Simón Bolívar University and the 1964 Declaration of Helsinki and its subsequent amendments or comparable ethical standards. All subjects agreed to take part in the study by signing an informed consent form. Table 1 shows the characteristics of the dataset used, describing each anthropometric variable by its mean and standard deviation for both normal and overweight subjects, while Table 2 shows the characteristics of the normal and abnormal BFP subjects.

Table 1. Characteristics of the anthropometric variables for obesity and overweight.
Table 2. Characteristics of the anthropometric variables for body fat percentage.

2.2 k-means Implementation

k-means [15] is a method that divides n observations into k clusters. In the k-means algorithm, each observation is allocated to the cluster with the nearest centroid according to a distance function; then the centroid of each cluster is recalculated. This process is repeated until the centroids no longer change between consecutive steps, at which point the final clusters are established.
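The assignment and update steps just described can be sketched for a single variable as follows. This is a generic illustration of the algorithm (often called Lloyd's algorithm), not the implementation used in the study; the function name `kmeans_1d`, the random initialization from the observations and the handling of empty clusters are assumptions.

```python
import random


def kmeans_1d(values, k=2, max_iter=100, seed=0):
    """k-means on one variable. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k observations picked at random.
    centroids = rng.sample(list(values), k)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            for x in values
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_centroids == centroids:  # centroids unchanged: converged
            break
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical waist circumferences in cm, for illustration only.
    waist = [70.0, 72.5, 68.0, 95.0, 101.0, 98.5]
    labels, centroids = kmeans_1d(waist, k=2)
    print(labels, centroids)
```

Cluster indices are arbitrary: which cluster ends up holding the larger values depends on the initialization, so the clusters have to be matched to the real classes before any comparison.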

In this study k-means was applied to each anthropometric measurement as a separate variable. Height and weight were excluded for the obesity and overweight diagnosis because BMI is computed from them, and the triceps and subscapular skinfolds were excluded for the abnormal BFP diagnosis because the Siri formula uses them. The number of groups was set to two (k = 2) to assess the ability of each variable to separate obese/overweight from normal weight subjects, and normal from abnormal BFP subjects. The squared Euclidean distance was used to measure the distance between each observation and the centroids, and the process was repeated 10 times to reduce the risk of converging to a local minimum. The silhouette coefficient (SC) was used to assess the assignment of the data to the respective clusters [32].
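To make the silhouette coefficient concrete, the sketch below computes the mean silhouette for one variable given the cluster labels. It assumes the squared Euclidean distance is also used inside the silhouette, which the text does not state, and it scores singleton clusters as 0 by convention; the function name is hypothetical.

```python
def silhouette_coefficient(values, labels):
    """Mean silhouette over all observations of a single variable."""
    clusters = {}
    for x, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    def sum_sq_dist(x, members):
        # Sum of squared Euclidean distances from x to every member.
        return sum((x - m) ** 2 for m in members)

    scores = []
    for x, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with one member
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from x to itself is 0, so only the count changes).
        a = sum_sq_dist(x, own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum_sq_dist(x, members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical values and labels, for illustration only.
    print(silhouette_coefficient([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1]))
```

Values close to 1 mean that observations lie much nearer to their own cluster than to the other one; negative values suggest that observations would fit the other cluster better.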

2.3 Metrics Calculation

The confusion matrix [12, 31] is a table that contrasts the real classification with the classification made by the clustering model. Table 3 shows an example of a confusion matrix: the columns (\(Class_1\), ..., \(Class_n\)) represent the k-means classification and the rows represent the real classification. The numbers on the main diagonal (\(A_{11}, ...,A_{nn}\)) are the instances correctly classified by the k-means method, and n is the total number of classes. In this study, the objective is to separate obese subjects from normal weight subjects, and normal BFP subjects from abnormal BFP subjects; consequently, the number of classes is two (n = 2).

Table 3. Confusion Matrix.
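A confusion matrix with this layout (rows for the real class, columns for the class assigned by clustering) can be built as in the following sketch. It assumes that the cluster indices have already been matched to the real classes and that classes are encoded as integers starting at 0; the function name and the example labels are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=2):
    """Entry [i][j] counts instances of real class i assigned to class j."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for real, assigned in zip(true_labels, predicted_labels):
        matrix[real][assigned] += 1
    return matrix


if __name__ == "__main__":
    # Hypothetical labels: 0 = normal weight, 1 = overweight/obese.
    real = [0, 0, 0, 1, 1]
    assigned = [0, 0, 1, 1, 0]
    for row in confusion_matrix(real, assigned):
        print(row)
```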

The accuracy (Acc) [31] is the ratio of correctly classified instances to the total number of instances. Equation 1 gives the expression for accuracy, where \(A_{ij}\) is the number of instances of real \(Class_i\) assigned to \(Class_j\), for \(i = 1,...,n\) and \(j = 1,...,n\), and n is the total number of classes.

$$\begin{aligned} Acc = \frac{\sum _{i=1}^{n} {A_{ii}}}{\sum _{i=1}^{n} {\sum _{j=1}^{n} A_{ij}}} \end{aligned}$$
(1)

The precision (\(P_i\)) [31] of a \(Class_i\) (see Eq. 2) is the ratio of the correctly classified instances of \(Class_i\) (\(A_{ii}\), the true positives) to the total number of instances assigned to \(Class_i\) (\(\sum_{j} A_{ji}\)). In this study the precision reported is the average of the per-class precisions.

$$\begin{aligned} P_i = \frac{A_{ii}}{\sum _{j=1}^{n} A_{ji}} \end{aligned}$$
(2)

The recall (\(R_i\)) [12] of a \(Class_i\) (see Eq. 3) is the ratio of the correctly classified instances of \(Class_i\) (\(A_{ii}\)) to the total number of instances that have \(Class_i\) as the true label (\(\sum_{j} A_{ij}\)). The recall reported in this study is the average of the per-class recalls.

$$\begin{aligned} R_i = \frac{A_{ii}}{\sum _{j=1}^{n} A_{ij}} \end{aligned}$$
(3)
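Equations 1 to 3 translate directly into code. The sketch below follows the same convention (rows are real classes, columns are assigned classes) and averages precision and recall over the classes, as reported in this study. The function names and the example matrix are hypothetical, and returning 0 for an empty row or column is an assumption.

```python
def accuracy(matrix):
    """Eq. 1: sum of the main diagonal over the sum of all entries."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def precision(matrix, i):
    """Eq. 2: A_ii over the sum of column i (instances assigned to class i)."""
    column_total = sum(row[i] for row in matrix)
    return matrix[i][i] / column_total if column_total else 0.0


def recall(matrix, i):
    """Eq. 3: A_ii over the sum of row i (instances whose real class is i)."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0


def class_average(metric, matrix):
    """Unweighted mean of a per-class metric over all classes."""
    n = len(matrix)
    return sum(metric(matrix, i) for i in range(n)) / n


if __name__ == "__main__":
    # Hypothetical 2 x 2 confusion matrix, not taken from Tables 4 or 5.
    example = [[80, 10],
               [5, 5]]
    print("Acc:", accuracy(example))
    print("P:", class_average(precision, example))
    print("R:", class_average(recall, example))
```

With unbalanced classes, as in this database, the averaged recall can be noticeably lower than the accuracy, because the smaller class weighs as much as the larger one in the average.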

2.4 Statistical Analysis

To determine the differences between the two groups, the non-parametric paired Wilcoxon test was used, and a p-value \(\le 5\%\) was considered statistically significant [22].
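For readers unfamiliar with the test, the following sketch implements a paired Wilcoxon signed-rank test using only the standard library. It is not the procedure used in the study: it assumes a two-sided test, drops zero differences, uses a normal approximation with a tie correction and no continuity correction, and is therefore only indicative for small samples. In practice a statistics package (for example `scipy.stats.wilcoxon`) would normally be used instead.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test. Returns (w_plus, w_minus, p_value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        group_size = end - start + 1
        tie_term += group_size ** 3 - group_size
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of the null distribution of w_plus.
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, w_minus, p_value


if __name__ == "__main__":
    # Hypothetical paired measurements, for illustration only.
    first = [2.0, 4.0, 6.0, 8.0, 10.0]
    second = [1.0, 2.0, 3.0, 4.0, 5.0]
    w_plus, w_minus, p = wilcoxon_signed_rank(first, second)
    print(w_plus, w_minus, p, "significant" if p <= 0.05 else "not significant")
```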

3 Results

Table 1 reports the anthropometric measurements of the normal weight and overweight/obese subjects. The database consists of 1053 subjects, of whom 83.86% belong to the normal weight group and 16.14% are overweight/obese. The classification of overweight/obesity was made according to the WHO: all subjects with \(BMI\ge 25\) were classified as overweight. Of the subjects in the overweight/obesity group, 13.5% have \(BMI\ge 30\), indicating obesity. Table 2 reports the anthropometric measurements of the normal and abnormal BFP subjects. The classification of the abnormal BFP group was made according to [18, 27], which established as abnormal \(BFP\ge 25\) in men and \(BFP\ge 30\) in women. Of the subjects in the database, 9.88% present an abnormal BFP and 90.12% have a normal BFP; the subjects with abnormal BFP have a \(BMI\ge 25\), indicating that they also belong to the overweight/obesity group.

Tables 4 and 5 show the confusion matrices of the variables with the best performance in the unsupervised k-means clustering for the overweight/obesity and abnormal BFP classifications, respectively. In addition, the silhouette coefficient (SC), accuracy (Acc), precision (P) and recall (R) for the overweight/obesity and abnormal BFP diagnoses are reported for \(k=2\) in Table 6 and Table 7, respectively. Figure 1 shows the assignment of individuals to clusters for \(k=2\), using the anthropometric measurements. The character X represents the centroid of each cluster.

4 Discussion

Table 1 shows the descriptive and anthropometric measurements of the normal weight and the overweight/obese subjects. All parameters showed significant differences between the groups, except for age and height. All skinfolds showed higher values in overweight/obese subjects than in normal weight subjects. Table 2, in turn, shows the descriptive and anthropometric measurements of the normal and abnormal BFP subjects. All parameters showed significant differences between the groups, except for age, height and epicondylar humerus diameter. All skinfolds showed higher values in abnormal BFP subjects than in normal BFP subjects. These findings are expected, since obese subjects and subjects with higher BFP tend to have a thicker adipose panniculus than subjects with normal weight and normal BFP [17, 34].

The k-means clustering method (Table 4) is capable of separating obese subjects from normal weight subjects with the following anthropometric measures: right arm circumference, left arm circumference, right subscapular skinfold, left subscapular skinfold, waist circumference and hip circumference, with an \(Acc \ge 0.78\), a \(P \ge 0.78\), and an \(R \ge 0.68\). Likewise, k-means clustering is capable of separating subjects with normal and abnormal BFP (Table 5) with the following anthropometric measures: right arm circumference, left arm circumference, waist circumference, hip circumference, and suprailiac and abdominal skinfolds, with an \(Acc \ge 0.73\), a \(P \ge 0.73\), and an \(R \ge 0.64\). These acceptable levels of accuracy and precision indicate that the method is capable of classifying subjects with the two conditions. The slightly lower recall values indicate that the method identifies cases with the condition but also produces a number of false negatives. The silhouette coefficient (SC) is greater than 0.5 in all cases, indicating that, for each of the parameters, subjects were on average well matched to their assigned cluster. For the parameters with the best Acc, P and R values, \(SC \ge 0.55\) (Table 6 and Table 7).

Table 4. Confusion matrix of the unsupervised k-means classification for the variables with the best performance in the prediction of obesity and overweight

Figure 1 shows that subjects with high skinfold values were located in cluster 2 (red) and subjects with lower skinfold values in cluster 1 (blue). Furthermore, cluster 1 contains the highest percentage of normal weight and normal BFP subjects, and cluster 2 contains the highest percentage of overweight/obese and abnormal BFP subjects. This may be because overweight subjects have a thicker adipose panniculus and a higher BFP than subjects with normal weight and normal BFP [19, 34]. The same pattern is observed for waist and hip circumference, where the method places the subjects with the largest hip and waist circumferences in cluster 2, the group with the highest percentage of overweight/obese and abnormal BFP subjects. It should be noted that waist circumference is strongly related to abdominal obesity and, in particular, is used today as a risk factor for diseases such as cardiovascular disease and diabetes [5, 39].

Fig. 1. Instance assignment (circles) to clusters for k = 2, using anthropometric parameters. Red circles belong to cluster 1 and blue circles to cluster 2. Character X represents the cluster centroids. (Color figure online)

Table 5. Confusion matrix of the unsupervised k-means classification for the variables with the best performance in the prediction of abnormal BFP
Table 6. Silhouette coefficient (SC), accuracy (Acc), precision (P) and recall (R) from k-means algorithm for overweight/obesity classification.
Table 7. Silhouette coefficient (SC), accuracy (Acc), precision (P) and recall (R) from k-means algorithm for abnormal BFP classification.

In the case of overweight/obesity diagnosis, the arm circumference values show the best Acc, P and R (0.79, 0.84 and 0.71) compared to all other measures. The subjects with the largest arm circumference were placed in cluster 2, which is the group with the highest percentage of overweight/obese subjects (91.76%). This result indicates a strong relationship between high arm circumferences and high BMI values, corroborating some studies [5, 24]. In the case of abnormal BFP diagnosis, the right abdominal and suprailiac skinfolds show the best Acc, P and R (0.77, 0.68 and 0.64) compared to all other measures. The subjects with the largest suprailiac and abdominal skinfolds were placed in cluster 2 (right skinfold 90.40% and left skinfold 93.30%). These results are in agreement with studies that correlate high body fat percentage with high abdominal fat accumulation [25, 30], especially in subjects with insulin resistance and a high risk of developing type 2 diabetes [14, 38].

5 Conclusions

The findings of this research suggest that the k-means method applied to anthropometric measurements can classify overweight/obese subjects and subjects with abnormal body fat percentage. The best anthropometric measurements for classifying overweight and obesity in this research were arm circumferences, subscapular skinfolds, waist circumference and hip circumference. The best anthropometric measurements for classifying abnormal BFP subjects were arm circumferences, waist circumference, hip circumference, and suprailiac and abdominal skinfolds.

Other machine learning techniques, such as fully connected neural networks and support vector machines, will be studied in the future to assess the relationship between BMI, BFP and anthropometric measurements. Such techniques would make it possible to evaluate how much influence each anthropometric variable has on the classification, and to extract a spectrum showing which groups of subjects are more vulnerable to abnormal BFP.