Introduction

Age-at-death estimation from skeletal remains is crucial in forensic anthropology. However, it is problematic because inter-individual variation creates disparities between biological age and chronological age [1]. Several age estimation methods have been developed and continue to be evaluated on many skeletal collections from different populations. These methods depend on the observation of macroscopic morphological changes in skeletal remains, which usually introduces a subjectivity problem that affects the accuracy and reliability of the method itself. Therefore, research on methods for age-at-death estimation from skeletal remains is still a developing science [2,3,4,5].

One of the most studied skeletal regions for age assessment is the pubic symphysis [3]. The changes in its articular surface, due to maturation and later degenerative processes, make the pubic symphysis an important target for age estimation [6,7,8]. However, most of the methods based on the pubic symphysis have been developed on North American collections made up of individuals of African and European descent. Due to the variability between distinct populations and the consequent age mimicry phenomenon [9,10,11], these methods must be tested and their performance evaluated before they are used on a population different from the one on which they were developed.

Another restrictive factor is the limited number of reference collections with human skeletons of recent origin. Most of them are from twentieth century cemeteries or from ancient cemeteries, in which variation could exist not only between human populations but also across centuries due to environmental changes over time. Therefore, methods developed from older collections may not be very useful when applied to current populations, i.e., forensic cases [12, 13]. Sampling of skeletal remains from different populations, resulting in new forensic collections, is necessary to understand the current diversity and thus adjust existing methods or create new ones that perform better.

A study of age-at-death estimation from the pubic symphysis and the auricular surface in a Spanish skeletal collection, which compared three established methods (Suchey-Brooks, Lovejoy, and Buckberry and Chamberlain) [14], shows that applying these methods to a Spanish population may be problematic. It suggests that more statistical studies should be carried out before using existing age assessment methods in Spanish populations. However, there are no subsequent studies on Spanish pubic symphysis collections, implying that there is still a lack of understanding of the corresponding diversity.

The present study, based on documented twentieth and twenty-first century skeletal collections, constitutes a preliminary development and evaluation of a new method for age-at-death estimation from the pubic symphysis that reduces observer subjectivity by using a simpler binary scoring system. The objectives are (1) to describe new age-related traits and a new scoring criterion and (2) to evaluate machine learning models for age estimation based on the new method.

Materials and methods

Documented collections

The pubic symphyses of two Spanish documented skeletal collections are used for this study: a documented late twentieth century skeletal collection housed at the Universitat Autònoma de Barcelona (UAB) and a twenty-first century forensic collection from the Institut de Medicina Legal i Ciències Forenses de Catalunya (IMLCFC). The UAB skeletal collection was sampled in the late 1990s from a cemetery in the city of Granollers, Barcelona. It consists of people who died between the 1970s and 1990s [15]. The IMLCFC forensic collection is composed only of pubic symphyses collected from medicolegal autopsies between the years 2000 and 2019. Table 1 shows a description of the collections.

Table 1 Age distribution of the UAB and IMLCFC collections separated by sex

Symphyseal traits and scoring system

A total of 16 traits or attributes are studied and described (Table 2). Fifteen of these traits are extracted from the traditional methods of Todd [16, 17] and McKern and Stewart [18], and they are modified into binary traits scored as present or absent. More recent descriptions can be found in an open access laboratory manual of revised osteometric definitions from the University of Tennessee [19]. Furthermore, a new age-related trait named microgrooves is observed and described for the first time in this study (Fig. 1). This trait appears as very small grooves or indentations in the surface of the symphyseal face, forming a network with a reticular aspect. It differs from microporosity in that it maintains the continuity of the cortical bone; therefore, the spongy bone is not exposed. It differs from crests in that crests do not present a reticular aspect but rather much larger and well-defined ridges and furrows that extend from the dorsal to the ventral area of the symphyseal surface. Besides their size and reticular aspect, microgrooves also differ from crests in that they do not present elevated ridges. The microgrooves or microindentations can appear across the entire symphyseal face or can be present in segments. It is advisable to observe this feature carefully under a magnifying glass so that it is not missed or confused with microporosity.

Table 2 Analyzed symphyseal surface traits and scoring system
Fig. 1 The new microgrooves trait (MG). In the image on the right, microgrooves are marked to indicate their location. The arrows represent points for visual orientation

A descriptive analysis of each trait is performed to elucidate its distribution by age. Additionally, a McNemar test is performed for the interrater agreement assessment, considering a significance level of 0.05 (5%). The JAMOVI 0.9.2.3 [20] statistical software is used for this test.
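As an illustration of the interrater test described above, the following minimal sketch computes a two-sided exact McNemar test from the paired binary scores that two observers assign to the same trait. It is an assumption-laden sketch rather than the JAMOVI procedure: the text does not state whether the exact or the chi-squared variant was used, and the function name `mcnemar_exact` is hypothetical. Only the discordant pairs (one observer scores present, the other absent) enter the test, so a non-significant result indicates no systematic difference between observers.

```python
from math import comb

def mcnemar_exact(scores_a, scores_b):
    """Two-sided exact McNemar test on paired binary scores (0 = absent, 1 = present).

    Hypothetical helper; the exact (binomial) variant is an assumption.
    Returns (b, c, p_value), where b and c are the two discordant counts.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    # b: observer A scored present while observer B scored absent; c: the reverse.
    b = sum(1 for x, y in zip(scores_a, scores_b) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(scores_a, scores_b) if x == 0 and y == 1)
    n = b + c
    if n == 0:
        # The observers never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A trait would be scored once per observer for every pubic symphysis, and the function would be called once per trait, with the resulting p value compared against the 0.05 significance level.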

Statistics

For the following statistical experiments, the data set is divided into age intervals, resulting in five different age interval sets (Table 3). In addition, the Wrapper Subset Evaluator method is applied on each age interval set in order to select the best performing traits.

Table 3 Age interval sets created for this study
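Dividing the data set into age intervals amounts to mapping each documented age to a class label. The sketch below shows one way to do this, assuming the three wide intervals mentioned in the Discussion (≤ 29, 30–69, ≥ 70); the function name `age_class` and its parameters are hypothetical, and the cut points of the other interval sets in Table 3 are not reproduced here.

```python
from bisect import bisect_right

def age_class(age, cut_points=(30, 70), labels=("<=29", "30-69", ">=70")):
    """Map an age in years to an age interval label.

    cut_points holds the lower bound of every interval except the first.
    The defaults are the wide intervals quoted in the Discussion; any other
    interval set is obtained by passing different cut points and labels.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    # bisect_right counts how many lower bounds are <= age,
    # which is the index of the interval that contains the age.
    return labels[bisect_right(cut_points, age)]
```

Each interval set then becomes a separate copy of the data in which the class attribute is the interval label instead of the age in years.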

The Wrapper method employs a search algorithm to search through the possible trait combinations or subsets and test each one by executing a machine learning algorithm. In this manner, it develops and evaluates a classification model for each candidate trait subset and then selects the best performing subset of traits. Since the Wrapper Evaluator selects the subset of traits that performs best for a specific machine learning algorithm, this same algorithm has to be used later to develop the classification model from the selected trait subset. In other words, each machine learning algorithm has its own best performing subset of traits for each data set. The Wrapper Subset Evaluator and all the machine learning algorithms in this study are implemented with the Waikato Environment for Knowledge Analysis software (Weka; version 3.8.3) [21]. A total of five data sets (one per age interval set) are used for the training and evaluation of the machine learning models.
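The wrapper idea can be sketched in a few lines of Python. This is not the Weka implementation: it assumes an exhaustive search over all non-empty trait subsets (the text does not name the search strategy, and heuristic searches are common in practice), cross-validated accuracy as the merit measure, unstratified folds, and a small Naïve Bayes for binary traits written here only for illustration. All function names are hypothetical and the toy data at the end are synthetic.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations
from math import log

def train_nb(rows, labels, traits):
    """Count class sizes and, per class, how often each chosen trait is present."""
    class_counts = Counter(labels)
    present = defaultdict(int)  # (class, trait index) -> rows with trait == 1
    for row, y in zip(rows, labels):
        for t in traits:
            present[(y, t)] += row[t]
    return class_counts, present, len(labels)

def predict_nb(model, row, traits):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_counts, present, n = model
    best, best_score = None, float("-inf")
    for y, n_y in class_counts.items():
        score = log(n_y / n)
        for t in traits:
            p1 = (present[(y, t)] + 1) / (n_y + 2)
            score += log(p1 if row[t] == 1 else 1 - p1)
        if score > best_score:
            best, best_score = y, score
    return best

def cv_accuracy(rows, labels, traits, k=10, seed=1):
    """Accuracy of the classifier on the given traits under k-fold cross-validation."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        model = train_nb([rows[i] for i in train], [labels[i] for i in train], traits)
        correct += sum(predict_nb(model, rows[i], traits) == labels[i] for i in test)
    return correct / len(rows)

def wrapper_select(rows, labels, n_traits, k=10, seed=1):
    """Evaluate every non-empty trait subset and keep the most accurate one."""
    best_subset, best_acc = (), -1.0
    for size in range(1, n_traits + 1):
        for subset in combinations(range(n_traits), size):
            acc = cv_accuracy(rows, labels, subset, k, seed)
            if acc > best_acc:  # strict comparison keeps the smallest subset on ties
                best_subset, best_acc = subset, acc
    return best_subset, best_acc

# Synthetic toy data (not from the study): trait 0 tracks the class,
# trait 1 is unrelated noise, trait 2 is always present.
toy_labels = ["young" if i % 2 == 0 else "old" for i in range(40)]
toy_rows = [[1 if y == "young" else 0, (i // 2) % 2, 1] for i, y in enumerate(toy_labels)]
best_subset, best_acc = wrapper_select(toy_rows, toy_labels, n_traits=3)
```

Because the classifier is trained and scored inside the search loop, the selected subset is tied to that classifier, which is why the same algorithm must be used for the final model. With 16 traits an exhaustive search means 65,535 subsets per classifier, so a heuristic search would be the practical choice at that scale.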

Three different supervised learning algorithms are applied on each one of the age interval sets (S1, S2, S3, S4, S5); these algorithms are the ZeroR classifier which establishes a baseline performance to which the other classification methods are compared, the Naïve Bayes classifier, and the Random Tree classifier [22]. Every model is tested with a 10-fold cross-validation. In addition, we run every algorithm ten times, with a different seed value for randomization on each run, to obtain the average precision and its standard deviation for each model.
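The evaluation protocol (a ZeroR baseline, 10-fold cross-validation, and ten runs with different seeds) can be sketched as follows. The sketch makes several assumptions that are not taken from the text: it reports the percentage of correctly classified instances as a stand-in for the reported average precision, it uses unstratified folds, and it uses the sample standard deviation. The names `zero_r`, `kfold_accuracy`, and `repeated_cv` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

def zero_r(train_rows, train_labels):
    """ZeroR baseline: ignore the traits and always predict the majority class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda row: majority

def kfold_accuracy(fit, rows, labels, k, seed):
    """Percentage of correctly classified instances under one k-fold run."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # the seed controls how folds are drawn
    correct = 0
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(rows[i]) == labels[i] for i in test)
    return 100.0 * correct / len(rows)

def repeated_cv(fit, rows, labels, k=10, seeds=range(1, 11)):
    """Repeat the k-fold run once per seed; return the mean score and its SD."""
    scores = [kfold_accuracy(fit, rows, labels, k, s) for s in seeds]
    return mean(scores), stdev(scores)

# Synthetic toy data (not from the study): 30 individuals in one class, 10 in another.
toy_labels = ["30-69"] * 30 + ["<=29"] * 10
toy_rows = [[0] for _ in toy_labels]
baseline_mean, baseline_sd = repeated_cv(zero_r, toy_rows, toy_labels)
```

Any other classifier can be passed in place of `zero_r` as long as it follows the same convention: a function that takes training rows and labels and returns a prediction function. A model is only informative if its mean score clearly exceeds the ZeroR baseline for the same age interval set.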

Results

Table 4 shows the age distribution for every trait. CR is present at younger ages, and its frequency gradually decreases towards the middle ages. BEV is largely confined to ages between 20 and 40. PR is almost equally distributed across ages, with higher frequencies between 20 and 60 years. UE and LE show a very high frequency across all ages. MG is primarily present at middle ages, increasing from younger individuals and decreasing towards older individuals. VW has a wide distribution across all age groups, with a higher frequency between 35 and 55 years. VMD and DMD are mainly present in the upper half of the age range, and their frequency increases towards older ages. Specifically, DMD is more confined to older ages than VMD. LIP is generally very frequent, and it is present in almost all age groups except for the younger individuals. There is also a slightly higher frequency for individuals older than 65 years. MIC is present at almost all ages, but it is much more frequent in older individuals, increasing rapidly from the middle ages towards the oldest ages. On the other hand, MAC is slightly more distributed towards older ages. DR, VR, UR, and LR are widely distributed across ages, and UR and LR in particular have low frequencies in general.

Table 4 Age distribution of symphyseal surface traits

The interrater agreement assessment results for the scoring system indicate that there is sufficient evidence to consider the existence of agreement between observers (McNemar’s test: all p values are greater than 0.05) (Table 5).

Table 5 McNemar’s test results for each trait

Table 6 shows that MIC, DR, and UR are not selected by the Wrapper Subset Evaluator for any age interval set. The largest selected group (S2 Naïve Bayes classifier subset) contains 9 traits, and the smallest selected group (S2 Random Tree classifier subset) contains 4 traits.

Table 6 Wrapper Subset Evaluation results. Traits selected for each age interval set and method

Among all the resulting models, the S4 Random Tree model (\( \overline{x} \) = 82.00%; SD = 1.38) and the S5 Naïve Bayes model (\( \overline{x} \) = 91.90%; SD = 0.67) have the best results (Table 7). The average precision results are higher when the age intervals are broader.

Table 7 Results of the classification methods for five sets of age intervals

Table 8 presents the average precision of each age category or class for the best performing models. In general, the best classified age categories are the younger ones, and the worst classified categories are the oldest. The first age category (≤ 24) does not get any result under 70%, and age categories from 40 to 65 years get results over 80%, except for the S2 Naïve Bayes model, which gets an average precision of 61.42% (SD = 2.84). Age categories over 60 and 70 years have average precision results under 70%, in some cases as low as 0%, except for the S5 Naïve Bayes model, which presents a result of 71.40% (SD = 0.00) for individuals 80 years of age and older.

Table 8 Class precision results of the best performing models for each age interval set

Figures 2 and 3 display the resulting decision trees from the best performing data subsets; these are the S4 Random Tree model (\( \overline{x} \) = 82.00%; SD = 1.38) and the S5 Random Tree model (\( \overline{x} \) = 90.29%; SD = 0.40), respectively.

Fig. 2 Decision tree from the S4 Random Tree model. This tree employs seven traits: MG, MAC, BEV, LE, VMD, DMD, and LIP. 0 = absence, 1 = presence. In parentheses, the value on the left is the total number of instances classified in that leaf, and the number on the right represents the incorrectly classified instances in that leaf. Seed value for randomization = 2

Fig. 3 Decision tree from the S5 Random Tree model. This tree employs three traits: MG, VR, and DMD. 0 = absence, 1 = presence. In parentheses, the value on the left is the total number of instances classified in that leaf, and the number on the right represents the incorrectly classified instances in that leaf. Seed value for randomization = 2

Discussion

Research on age-at-death estimation from skeletal remains is a challenging task. Not only do the disparities between chronological and biological age due to variability contribute to the difficulty of this challenge, but the subjectivity of scoring systems is an important limiting factor as well [1, 23, 24]. In order to develop and evaluate a new method, we studied each trait separately, transforming their traditional scoring criteria into a less ambiguous binary categorization (present or absent).

The present work is based on fifteen traits previously described in traditional age assessment methodologies [16,17,18]. Additionally, we present an entirely new age-related trait (microgrooves; MG) that has not been described before. This new trait proves to be a useful indicator for age-at-death estimation in this study, since it is selected by the Wrapper Subset Evaluator for several data subsets, including the best performing subsets. Nevertheless, further studies should be carried out to test the MG trait on other collections. It is also important that these studies evaluate the MG trait’s sensitivity to bone preservation, that is, whether the preservation of the pubic symphysis affects the visibility of this new trait.

Regarding the general attribute selection, the best performing age-related traits include MG, MAC, BEV, LE, VMD, DMD, and LIP. These binary traits present an acceptable classification capacity (from 70 to 82%) for wide age intervals (≤ 29, 30–69, ≥ 70). However, as the intervals get smaller (20-, 15-, and 10-year age intervals), the traits’ classification power decreases to unreliable precision results (less than 60%). The poor performance on short age intervals may be related to the traits’ naturally broad age distribution. Only some traits are closely confined to certain age ranges, such as crests for ages under 20 years and margin decomposition for ages over 70 and 80 years. Another study on age-at-death estimation from pelvic bones (pubic symphysis and sacropelvic surface) obtained similar results [25]. Their method separated the symphyseal surface into three areas, and each area was scored into one of 2 to 4 stage categories. They were only able to get good results when wide age intervals were considered, with the highest precision results being over 70%. When they used 10-year age intervals, unreliable classification results were obtained. Also, they got the highest class precision results for individuals younger than 29 years of age and for individuals older than 70 years of age, but with values that did not reach 50%. Compared with their results, our models did achieve highly reliable class precisions for individuals under 24 years old (from 70 to over 80%) and for individuals between 40 and 65 years of age (from 61 to over 80%).

Concerning the classification of older individuals, with two age intervals, we were able to obtain a model (S5 Naïve Bayes) that presented a high classification capacity for elderly individuals (over 70%) with an overall precision result of 91.9% (SD = 0.67), employing the MG, MAC, VMD, DMD, and LIP traits. This result is interesting considering that age-at-death estimation methods generally classify older individuals poorly [26, 27]. A specific classifier model for elders (older than 70 and 80 years) could be used in conjunction with another model that performs well for younger and middle-aged individuals as a multi-model approach. In other words, we can first identify whether an individual is an elder with a specific model for older individuals, such as the S5 Naïve Bayes model, and if it classifies the individual in the young category, then another model with reliable class precision results for young and middle age categories could be used for a more precise classification.
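The multi-model approach described above is a two-stage cascade, which the following minimal sketch illustrates. The two models are passed in as plain functions; the stand-in rules used in the example are placeholders invented for illustration only and are not the models or decision trees obtained in this study. The function name `cascade_classify` and the class labels are likewise hypothetical.

```python
def cascade_classify(traits, elder_model, fine_model, elder_label="elder"):
    """Two-stage classification of one individual's binary trait scores.

    Stage 1 asks a model specialised in older individuals. If it does not
    assign the elder class, stage 2 hands the case to a model with reliable
    precision for the young and middle age categories.
    """
    first = elder_model(traits)
    if first == elder_label:
        return first
    return fine_model(traits)

# Placeholder rules for illustration only (NOT the models from this study).
def placeholder_elder_model(traits):
    return "elder" if traits["DMD"] == 1 else "not elder"

def placeholder_fine_model(traits):
    return "young" if traits["CR"] == 1 else "middle"

result = cascade_classify({"DMD": 0, "CR": 1}, placeholder_elder_model, placeholder_fine_model)
```

In practice each placeholder would be replaced by a trained classifier, and the class precision of both stages would need to be validated together, since an error in the first stage cannot be corrected by the second.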

As to the limitations of our method, the small size of the skeletal collection and the scarce representation of some age intervals, such as the young individuals and elders, may affect the results. However, despite the small sample size, the results of our study show that the combination of binary attributes and machine learning algorithms is a promising tool to gain objectivity in forensic anthropology age-at-death assessments. The decision trees resulting from the machine learning methods are simple tools that can facilitate the age classification process and can be easily used in the field. This is an important point to consider, since a method’s simplicity is a highly valued factor when implementing age-at-death assessment methods.

It is also noteworthy that the reference collection used in this study is unique in the sense that it includes a new Spanish twenty-first century pubic symphysis forensic collection. Its importance lies in its potential for the development of future studies. It is a continuously growing resource that will help investigators research new methodologies for age-at-death estimation, such as 3D image analysis, computed tomography, or digital image processing [28,29,30,31,32]. It is necessary to stress that more contemporary forensic collections are needed to better understand the diversity of current populations. That is, continuous sampling must be considered to keep these collections up to date, since populations are dynamic entities that are in continuous change due to the effect of globalization and human migration [33, 34]. In addition, social and lifestyle changes can have an effect over time as well.

This work demonstrates the potential of the proposed methodology to facilitate age-at-death assessments in forensic anthropology. The results of this study are preliminary, and further evaluation of the binary traits must continue in order to better elucidate their relation not only with age but also with other factors such as sex or population origin. Future experiments should be designed to validate this preliminary method on different collections and on a larger sample size.