Introduction

Age estimation is an essential first step not only in the identification of unknown human skeletal remains in a forensic setting but also in the analysis of archaeological samples of human bones. Age estimation is most accurate for immature skeletons, and there are three main approaches to age estimation: dental development, skeletal growth, and skeletal maturation [1]. The most reliable methods for ageing are based on tooth formation schedules [1–4], but there are circumstances in which they cannot be used. The linear growth of the skeleton, particularly the diaphyseal growth in length of the long bones, is generally considered to be a good alternative to dental mineralization for age estimation in prepubertal children [5–8].

Although the age of immature human skeletal remains has repeatedly been estimated from long bone diaphyseal lengths, using a variety of methods and approaches [9–29], most of the available data are unsuitable for age prediction. Several publications include tables of descriptive statistics for bone lengths by age, but these studies were designed to determine the mean long bone length for a given age when assessing growth status in living children. Radiographic data for long bone length by age are the source of this information, and detailed tables have been published by Maresh [9–11], Ghantus [12], Anderson and coworkers [13], and Gindhart [14]. A few tables of descriptive statistics based on actual measurements of dry bones have been constructed as an aid for age estimation, including those published by Johnson [15], Walker [16], Stloukal and Hanáková [17], Sundick [18], and Hoppa [19], but the samples are only of estimated age as they are archaeological in origin. Fazekas and Kosa [20] also provide tables of descriptive statistics for dry bone material, but in this case for the fetal period only.

More recent studies have been concerned with exploring and modeling the relationship of long bone length with age, such as the studies carried out by Smith and Buschang [30, 31]. These studies are based on Maresh’s [11] original data and have been incorrectly described as age estimation methods [7], as the models were designed to estimate the average long bone length for a given age, and not the age from a given long bone length. Predictive models with the specific purpose of age estimation from the length of long bones were initially developed by Stewart [21] and Hoffman [22], but here age and the respective confidence intervals have to be extrapolated from graphed data of diaphyseal lengths. More recently, Facchini and Veschi [23], Rissech and coworkers [24–26], Danforth and coworkers [27], Boccone and coworkers [28], and Primeau and coworkers [29] have all used regression analysis for age estimation purposes. There are some important caveats with all these methods. The regression formulae provided by Danforth and coworkers [27], Boccone and coworkers [28], and Primeau and coworkers [29] were based on archaeological samples of unknown age, where ages were estimated. By contrast, Facchini and Veschi [23] and Rissech and coworkers [24–26] used skeletal collections of known sex and age to derive their formulae. Some of these formulae have been published without reporting error estimates and, hence, cannot be used to obtain a confidence interval for the estimated age. This applies to all formulae derived by Facchini and Veschi [23] and the femur formulae obtained by Rissech and coworkers [24]. Consequently, at present only the humerus and tibia formulae published by Rissech and coworkers [25, 26] can be considered suitable for age estimation from the diaphyseal length of the long bones, as they allow the estimation of error.
However, as important as they are, these equations were developed using conventional least squares regression to produce age estimates and are therefore likely to introduce significant biases [32].

In age estimation, the common model used is least squares regression with inverse calibration. Biologically, age is the independent variable (x) and long bone length is the dependent variable (y), but it is age that is unknown and needs to be estimated from long bone length. When inverse calibration is used to estimate age, age (x) is regressed on long bone length (y), rather than the reverse. Consequently, random errors in inverse calibration are assumed to lie entirely in age, when in fact they lie in long bone length. Considering that least squares was designed to minimize errors in the direction of the dependent variable, inverse calibration, where age is treated as the dependent variable and long bone length as the independent variable, results in a systematic bias [32]. Konigsberg and coworkers [33] and Lucy and Pollard [34] have recommended classical calibration as a more suitable statistical technique for making age estimates from skeletal indicators. In classical calibration, the variable for which estimates are to be made (age) remains the independent variable x, rather than being treated as the dependent variable as in inverse calibration. A regression of y (long bone length) on x (age) is performed as usual (not age regressed on long bone length as in inverse calibration). However, this produces an equation for long bone length (y) in terms of age (x), so to estimate age we must invert the relationship. In this case, least squares regression minimizes the errors in the correct y direction. The major drawbacks of classical calibration are the difficulty in calculating the uncertainty for any point about the calibration line and a reduction in the efficiency of estimates [32, 33], that is, the variability will be larger for classical calibration than for inverse calibration.
Although this problem has been described exclusively for adult age estimation, it is likely that it will affect age predictions from immature skeletal remains as well, albeit to a lesser degree. According to Aykroyd and coworkers [32], the greater the correlation between age and the skeletal indicator, the less systematic bias there will be in the inverse calibration model. Given the low correlation between age and skeletal indicators of age in adults, it is no surprise that the conventional use of least squares regression is of great concern. Conversely, one might assume that the bias associated with inverse calibration is negligible when applied to the immature skeleton given the very high correlation between age and long bone length. However, considering that this correlation is not perfect, inverse calibration may not necessarily translate into a bias-free regression model for age estimation. In fact, we would still expect at least some bias.
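The link between correlation and bias described above can be illustrated with a minimal sketch. It uses synthetic data (not the study sample), and the function names and simulated growth parameters are assumptions chosen only for illustration. For simple least squares lines, the inverse calibration estimate equals the classical estimate pulled toward the mean age of the reference sample by a factor of r², so the pull vanishes only when the correlation is perfect.

```python
import random

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Synthetic reference sample (hypothetical values, for illustration only).
rng = random.Random(0)
ages = [rng.uniform(2, 12) for _ in range(200)]
lengths = [100 + 20 * a + rng.gauss(0, 25) for a in ages]
mean_age = sum(ages) / len(ages)

# Inverse calibration: age regressed directly on length.
c, d = least_squares(lengths, ages)

def age_inverse(length):
    return c + d * length

# Classical calibration: length regressed on age, then solved for age.
a, b = least_squares(ages, lengths)

def age_classical(length):
    return (length - a) / b

test_length = 330.0  # a bone longer than the sample average
r2 = r_squared(ages, lengths)
shrink = (age_inverse(test_length) - mean_age) / (age_classical(test_length) - mean_age)
print(f"r^2 = {r2:.3f}, shrinkage of inverse estimate = {shrink:.3f}")
```

In this sketch the two printed values coincide, which is why inverse calibration tends to underestimate the oldest individuals and overestimate the youngest ones relative to classical calibration.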

This study addresses the issue of modeling age and long bone length for age estimation purposes, using statistical tools chosen to avoid methodological biases. To this end, a sample of immature individuals from Western European documented skeletal collections (Portugal and England) was selected to develop age estimation formulae from the diaphyseal length of the humerus, radius, ulna, femur, tibia, and fibula in prepubertal children (<12 years of age) using regression and classical calibration [35]. This study compares the performance of inverse calibration and classical calibration formulae. An additional goal is to determine which bones are most accurate in estimating age and whether there are sex and age differences in the accuracy of these formulae. One last goal is to test previously published equations for age estimation from diaphyseal lengths of the long bones.

Materials and methods

Sample

For this study, data were collected from two samples of child skeletons of known sex and age: the Portuguese sample, which includes the Lisbon collection children, and the English sample, where the Spitalfields and St. Bride’s children were combined. The Lisbon collection is a large series of identified human skeletons housed at the National Museum of Natural History and Science, in Lisbon, Portugal [36]. This collection is composed of over 1,500 skeletons, but detailed biographic information is available only for a fraction of the individuals. These fully identified skeletal remains are of Portuguese nationals who were born between 1805 and 1972 and died between 1880 and 1975 in and around the city of Lisbon [36]. The Spitalfields [37] and St. Bride’s [38] collections are curated, respectively, in the Natural History Museum and the crypt at St. Bride’s Church in Fleet Street, London, UK. The collections consist mostly of Londoners who were born and died between 1729 and 1859. The Spitalfields collection includes 968 individuals and the St. Bride’s collection includes 237 individuals.

As this study is based on diaphyseal length, the sample was restricted to skeletons of prepubertal age showing no evidence of fusion of any of the long bone epiphyses. Consequently, only individuals under the age of 13 were selected. Considering the absence of significant fetal material, the study sample is also truncated inferiorly at birth. Specimens with obvious skeletal malformations were not included. In total, the study sample comprises 184 individuals of known sex (72 females and 112 males) with ages ranging from birth to 12 years. Overall, the sample comprises children who were born and died between approximately 250 and 50 years ago in Western Europe (Portugal and England). Table 1 describes the size and composition of the sample.

Table 1 Size and composition of the sample by age, sex, and collection

Methods

The maximum diaphyseal length of the six long bones of the limbs (humerus, radius, ulna, femur, tibia, and fibula) was measured using an osteometric board or, in the case of infant skeletons, a sliding caliper, and recorded to the nearest whole millimeter. Measurements were obtained from the left side as the maximum length of the diaphysis, parallel to the long axis [39]. When bones from the left side were missing or damaged, the bones from the right side were measured instead. Intra- and interobserver measurement errors were estimated by re-measuring a random subsample of 20 individuals and calculating the relative technical error of measurement (%TEM) and the coefficient of reliability (R) [40] for each bone.
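As a minimal sketch of how these two statistics can be computed for one bone, the following assumes the commonly used definitions for paired repeat measurements: TEM = sqrt(Σd²/2N), %TEM = 100 × TEM / mean, and R = 1 − TEM²/SD². Taking SD² as the variance of all pooled measurements is an assumption, and the lengths shown are hypothetical values, not data from this study.

```python
import math
import statistics

def technical_error(first, second):
    """Absolute TEM for paired repeat measurements: sqrt(sum(d^2) / (2N))."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(first, second)]
    return math.sqrt(sum(squared_diffs) / (2 * len(squared_diffs)))

def relative_tem(first, second):
    """%TEM: the TEM expressed as a percentage of the grand mean."""
    grand_mean = statistics.mean(first + second)
    return 100 * technical_error(first, second) / grand_mean

def reliability(first, second):
    """R = 1 - TEM^2 / SD^2, with SD^2 taken as the pooled sample variance
    of all measurements (an assumption of this sketch)."""
    total_variance = statistics.variance(first + second)
    return 1 - technical_error(first, second) ** 2 / total_variance

# Hypothetical repeat measurements of one bone (mm), for illustration only.
session_1 = [65, 112, 150, 201, 248]
session_2 = [65, 113, 150, 200, 248]

tem = technical_error(session_1, session_2)
pct_tem = relative_tem(session_1, session_2)
r_coef = reliability(session_1, session_2)
print(f"TEM = {tem:.3f} mm, %TEM = {pct_tem:.2f}, R = {r_coef:.4f}")
```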

In all subsequent analyses, the sample was separated by sex and divided into two age groups. One subsample included all individuals younger than 2 years of age (<2 years) and the other subsample all individuals 2 years of age and older (≥2 years). The division of the sample at the age of 2 years reflects biological realities of the growth process. After birth, linear growth is very fast, and around the age of 2 years it slows down until puberty [41]. Given this sharp decrease in growth velocity, the growth curve may not be properly modeled by a single linear regression between birth and 12 years of age. Consequently, this separation allowed growth before and after the age of 2 years to be modeled separately, using linear regression (Fig. 1). As the simple linear regression and calibration models are very effective and easier to calculate and use, they were preferred over nonlinear models.
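The two-segment approach can be sketched as follows. The function names and the noise-free example values are hypothetical and serve only to show the mechanics of splitting the sample at 2 years and fitting one least squares line per age group.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fit_two_segments(ages, lengths, cutoff=2.0):
    """Fit length on age separately below and at/above the cutoff age."""
    young = [(a, l) for a, l in zip(ages, lengths) if a < cutoff]
    older = [(a, l) for a, l in zip(ages, lengths) if a >= cutoff]
    return {
        "<2": least_squares(*zip(*young)),
        ">=2": least_squares(*zip(*older)),
    }

# Hypothetical, noise-free lengths (mm): fast growth before 2 years, slower after.
ages = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 6.0, 8.0, 10.0]
lengths = [75, 100, 125, 150, 180, 220, 260, 300, 340]

models = fit_two_segments(ages, lengths)
for group, (intercept, slope) in models.items():
    print(f"{group}: length = {intercept:.1f} + {slope:.1f} * age")
```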

Fig. 1

Scatterplot illustrating a classical calibration model where long bone length (femur) is regressed on age, with two separate regression lines fitted to the data using least squares. One line is fitted to the data of children under 2 years of age, and the other line to the data of children 2 years of age and older, with the sexes combined. Note the differences in the slope (growth velocity) and in the dispersion of data points about the regression line

Normality and homoscedasticity of the samples were tested at the start of the statistical analysis. The samples from the two series (Portuguese and English) were then compared using an ANCOVA, to determine whether there were significant size and sex differences. Subsequently, age estimation formulae were calculated using classical and inverse calibration models [35] for each long bone length, separately for the total sample and for the subsamples of individuals younger than 2 years of age (<2 years) and 2 years of age and older (≥2 years), and by sex. For the inverse calibration formulae, the standard error of the estimate (SEE) and the coefficient of determination (R²) were calculated. The SEE cannot be obtained in classical calibration models, and in this case, the mean standard error (MSE) was calculated instead. This statistic was obtained by calculating the standard error for each individual observation as suggested by Lucy [35] and then averaging these standard errors to obtain an MSE for the entire sample. This is an approximation of the SEE calculated from inverse calibration [35]. Classical calibration models produce an equation for long bone length (y) in terms of age (x), so to estimate age the relationship has to be inverted and the formula obtained is solved for age. For each classical calibration formula, the mean long bone length and respective sample size (N), standard deviation (SD), and range (minimum and maximum) were also provided. The minimum and maximum provide the valid range of values from which age can be estimated using each of the models.
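A minimal sketch of a classical calibration model with a mean standard error is given below. The exact per-observation expression used by Lucy [35] is not reproduced in the text, so this sketch assumes the standard textbook standard error for a value of x estimated from a calibration line, s = (s_y·x / |b|) × sqrt(1 + 1/n + (y0 − ȳ)² / (b² Sxx)). The function names and the example data are hypothetical.

```python
import math

def fit_classical(ages, lengths):
    """Regress length (y) on age (x) and keep what is needed for calibration."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_length = sum(lengths) / n
    sxx = sum((x - mean_age) ** 2 for x in ages)
    sxy = sum((x - mean_age) * (y - mean_length) for x, y in zip(ages, lengths))
    b = sxy / sxx
    a = mean_length - b * mean_age
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, lengths))
    s_yx = math.sqrt(ss_res / (n - 2))  # residual SD of length about the line
    return {"a": a, "b": b, "n": n, "mean_length": mean_length,
            "sxx": sxx, "s_yx": s_yx}

def estimate_age(model, length):
    """Invert length = a + b*age to solve for age."""
    return (length - model["a"]) / model["b"]

def standard_error(model, length):
    """Assumed textbook standard error of an age estimated from one length."""
    b = model["b"]
    term = (1 + 1 / model["n"]
            + (length - model["mean_length"]) ** 2 / (b ** 2 * model["sxx"]))
    return model["s_yx"] / abs(b) * math.sqrt(term)

def mean_standard_error(model, lengths):
    """Average the per-observation standard errors over the sample (the MSE)."""
    return sum(standard_error(model, y) for y in lengths) / len(lengths)

# Hypothetical reference data (age in years, length in mm), illustration only.
ages = [2, 4, 6, 8, 10, 12]
lengths = [142, 178, 221, 259, 302, 338]

model = fit_classical(ages, lengths)
mse = mean_standard_error(model, lengths)
print(f"age = (length - {model['a']:.2f}) / {model['b']:.2f}, MSE = {mse:.2f} years")
```

Note that MSE here denotes the mean standard error as defined in the text, not the mean squared error.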

Accuracy and bias of the classical and inverse calibration models were tested on the study sample. For each long bone model, the estimated age obtained was compared with the known chronological age, and both the mean of the residuals (MR) and the mean of the absolute value of the residuals (MAR) were calculated, as estimates of bias and accuracy, respectively. Additionally, the percentage of individuals whose chronological age falls within the 95 % confidence interval (95 % CI) of the estimated age (using the MSE for the classical calibration models and the SEE for the inverse calibration models) was also calculated for each long bone length. Confidence intervals in the classical calibration models are calculated by multiplying the standard error by the appropriate value from the t distribution for n − 2 degrees of freedom [35], which is approximately 2 for all models.
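The three accuracy measures can be sketched as follows. The sketch assumes that residuals are defined as estimated minus chronological age, so that a positive MR indicates overestimation, and that the interval half-width is the critical t value (taken as 2) times the standard error. The function name and the example values are hypothetical.

```python
def accuracy_summary(estimated, known, std_errors, t_crit=2.0):
    """Return (MR, MAR, percentage of known ages inside estimate +/- t*SE)."""
    residuals = [e - k for e, k in zip(estimated, known)]
    mr = sum(residuals) / len(residuals)                    # bias
    mar = sum(abs(r) for r in residuals) / len(residuals)   # accuracy
    inside = sum(1 for r, se in zip(residuals, std_errors)
                 if abs(r) <= t_crit * se)
    coverage = 100 * inside / len(residuals)
    return mr, mar, coverage

# Hypothetical estimated and chronological ages (years), illustration only.
estimated_ages = [1.0, 3.5, 6.0, 9.0]
known_ages = [1.5, 3.0, 6.0, 12.0]
errors = [0.5, 0.5, 0.5, 0.5]  # same standard error for every individual

mr, mar, coverage = accuracy_summary(estimated_ages, known_ages, errors)
print(f"MR = {mr:.2f}, MAR = {mar:.2f}, within 95 % CI = {coverage:.1f} %")
```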

Finally, previously published models for age estimation from long bone lengths were tested on the study sample, specifically, the equations provided by Rissech and coworkers [24–26] for age estimation from the femur, the tibia, and the humerus, and the equations provided by Facchini and Veschi [23] for age estimation from the humerus, the radius, the ulna, the femur, the tibia, and the fibula. The accuracy of these models was tested by calculating MR and MAR. One-sample t tests were used to test whether MR are significantly different from zero. The percentage of individuals whose chronological age falls within the 95 % CI of the estimated age was only calculated for the tibia and humerus equations provided by Rissech and coworkers [25, 26], as Facchini and Veschi [23] do not provide the SEE for their formulae and neither do Rissech and coworkers [24] for the femur.

Results

Table 2 shows the results of the intra- and interobserver measurement tests. For intra-observer error, all variables had %TEM values under 0.77 and R values equal to 1.00. The inter-observer error test results are very similar with all variables showing %TEM values under 0.72 and R values also equal to 1.00.

Table 2 Intra- and interobserver measurement error test results for length of each long bone, estimated from the relative technical error of measurement (%TEM) and the coefficient of reliability (R)

The ANCOVA test results show that the samples from the two series (Portuguese and English) differ significantly in long bone length by age (Table 3). However, this holds only for the subsample of individuals 2 years of age and older. The two series do not differ in size when children under the age of 2 are considered. Despite the differences between samples, they were combined into one for all subsequent analyses. No significant differences in long bone length were found between the sexes for the total sample or for the subsamples of individuals younger than 2 years and 2 years of age and older, with the exception of the radius and ulna in the latter subsample (Table 3). Although females and males tend not to differ in long bone length by age, the sexes were treated separately in the analysis, as well as combined. The ANCOVA results also showed that there is no significant interaction between the effects of sex and series.

Table 3 ANCOVA results for comparisons of linear regression models between the samples from the two series (Portuguese and English) and the sexes

The classical calibration models for each of the long bone lengths are shown in Tables 4, 5, and 6 for the total sample, and for the subsamples which include only individuals younger than 2 years of age (<2 years) and individuals 2 years of age and older (≥2 years), respectively. Each model includes the N, the regression formula solved for age (inverted), the MSE, the R², the mean long bone length (M), and its respective SD and range (minimum–maximum). Given the differential preservation, sample sizes vary. Femoral diaphysis length consistently provides the best estimates of age in the total sample (MSE = 1.06 years when the sexes are combined, MSE = 0.92 in males and MSE = 1.21 in females) and in the subsamples which include only individuals younger than 2 years of age (MSE = 0.23 years when the sexes are combined, MSE = 0.23 in males and MSE = 0.25 in females) and individuals 2 years of age and older (MSE = 1.16 years when the sexes are combined, MSE = 0.97 in males and MSE = 1.40 in females). The next best bone length is the tibia, but not consistently, as the humerus and the fibula show lower MSEs in some subsamples. In fact, the tibia is the worst performing bone length in the subsample of children under 2 years of age. Overall, the ulna shows the largest amount of error. Females generally show larger MSEs, with the exception of children under the age of 2. The error associated with these formulae also increases with age, as the subsample of individuals younger than 2 years shows an MSE of about 0.26 years, whereas in the subsample of individuals 2 years of age and older the MSE is around 1.40 years (about five times as much).

Table 4 Classical calibration models for each long bone length in the total sample, divided by sex and for the sexes combined
Table 5 Classical calibration models for each long bone length in the subsample of individuals younger than 2 years of age (<2 years), divided by sex and for the sexes combined
Table 6 Classical calibration models for each long bone length in the subsample of individuals 2 years of age and older (≥2 years), divided by sex and for the sexes combined

When testing the accuracy of the classical (Table 7) and inverse (Table 8) calibration models in the study sample, the classical calibration formulae show no mean difference between estimated and chronological age (MR = 0.00), whereas the inverse calibration formulae show consistent differences between estimated and chronological age (MR ranges between −0.50 and 0.99). In the inverse calibration models, MR seem smallest for the humerus (and overall for the upper limb bones) and largest for the femur (and overall for the lower limb bones), but not consistently. MR also tend to be smaller and negative in younger children (<2 years) and larger and positive in older children (≥2 years). Most of the MR are significantly different from zero (p < 0.05), with a few exceptions. When both models (classical and inverse calibration) are compared in terms of the percentage of individuals whose chronological age is within the 95 % CI, the results vary between 90.3 and 100 % for the classical calibration models and between 57.9 and 100 % for the inverse calibration models. This percentage tends to be slightly greater in the classical calibration models, with some exceptions, such as the femur, tibia, and fibula in the older males (≥2 years). In relation to MAR, the classical and inverse calibration models show similar results, but the classical calibration models have slightly smaller MARs overall. In the total sex-combined samples, MAR is around 0.95 (values vary between 0.16 and 1.38) for the classical calibration models and 0.96 (values vary between 0.17 and 1.42) for the inverse calibration models. This difference is more noticeable in the subsample which includes only the younger children (<2 years). Figure 2 illustrates the raw residuals in the classical (a and b) and inverse calibration models (c and d). Whereas in the classical calibration models the raw residuals are scattered randomly about zero, in the inverse calibration models the residuals appear more spread out below the zero line.

Table 7 Accuracy results for classical calibration models in the total sample, and in the subsamples which include individuals younger than 2 years of age (<2 years) and individuals 2 years of age and older (≥2 years), divided by sex and for the sexes combined
Table 8 Accuracy results for inverse calibration models in the total sample, and in the subsamples which include individuals younger than 2 years of age (< 2 years) and individuals 2 years of age and older (≥2 years), divided by sex and for the sexes combined
Fig. 2

Scatterplot of the raw residuals in the classical calibration (a children under 2 years of age and b children 2 years of age and older) and inverse calibration (c children under 2 years of age and d children 2 years of age and older) models for the relationship between long bone length (femur) and age, in the combined sex sample

The study sample also provided an accuracy test for the formulae published by Facchini and Veschi [23] and those published by Rissech and coworkers [24–26]. Table 9 shows the accuracy results of Facchini and Veschi’s [23] formulae, in which the length of the humerus provides the greatest accuracy, closely followed by the femur and the tibia. MR vary between −0.03 and 0.06, showing that there is only a slight overall overestimation. MR are consistently different from zero (p < 0.05) in the subsample which includes only the younger children (<2 years), but in the subsample of older children only the MR obtained from the length of the radius and ulna in males are significantly different from zero. MAR is generally very low, varying between 0.04 and 0.12 years (under 2 months). Rissech and coworkers’ formulae (Table 10) are most accurate in the total sample when using the humerus, but in the subsample of younger children the tibia is most accurate and in the subsample of older children it is the femur. There is a general tendency for these formulae to underestimate age (MR varies between −1.51 and 0.58), and age can be estimated to within 0.74 to 1.58 years. MR show consistent differences from zero (p < 0.05) in the entire sample, but particularly for the length of the humerus and tibia.

Table 9 Accuracy of Facchini and Veschi’s (2004) regression formulae in the total sample, and in the subsamples which include individuals younger than 2 years of age (< 2 years) and individuals 2 years of age and older (≥2 years), divided by sex and for the sexes combined
Table 10 Accuracy of Rissech and coworkers’ (2008, 2012) [26] formulae in the total sample, and in the subsamples which include individuals younger than 2 years of age (<2 years) and individuals 2 years of age and older (≥2 years), divided by sex and for the sexes combined

Discussion

Previous approaches used in the estimation of age from the length of the diaphysis in long bones were either not devised specifically for this purpose, as is the case for tables of descriptive statistics from radiographic data, or show important statistical caveats that undermine the reporting and the accuracy of age estimates. In this study, a series of new regression methods based on classical calibration is proposed for the estimation of age in remains of known or unknown sex from diaphyseal lengths of the long bones, addressing those specific concerns.

This study used samples from reference collections in Portugal and England, which were shown to differ in size. English children from both the Spitalfields and the St. Bride’s collections lag behind the Lisbon children in growth. After the age of 2 years, children in the Lisbon collection have on average larger long bones than those from Spitalfields or St. Bride’s, which is likely the result of differing social conditions during growth between the Portuguese and English samples. Despite these differences, the samples were combined in order to include more variation in the models and make them potentially applicable to a wider range of populations. Considering that it is usually difficult or impossible to establish whether the model samples are representative of the growth status of unknown immature remains in a particular forensic case, or even in an archaeological sample, an approach that is not sample- or population-specific is likely to be more reliable. Under these circumstances, such an approach will fail less often in providing a reliable age estimate, but this estimate will have a larger confidence interval, due to sampling more variation.

Age estimation formulae were determined for the sexes separately and combined, in spite of the consistent similarities between males and females. In forensic casework, sex is required and age can be estimated with a slight increase in precision if sex is known. In an archaeological context, sex may not be determined prior to age estimation. Consequently, the development of sex-specific and sex-combined formulae was intended to address both forensic and archaeological applications.

The division of the sample at the age of 2 enables use of the most appropriate formulae for each age group, as these accommodate biological differences in the growth process. These differences include faster bone growth (regression slope) and reduced individual variation (regression error) in children under the age of 2 years. Although the calibration models for the total sample can be used for the entire prepubertal age range (0–12 years), the models for the two subsamples (<2 and ≥2 years) provide estimates that are more representative of the age-related differences in growth and more accurate for younger individuals.

The use of classical calibration over inverse calibration finds considerable support from the accuracy tests performed here. The classical calibration models show no bias (MR are zero), in contrast to the inverse calibration models (MR different from zero). In addition, the classical calibration models shown here are not only similarly accurate but also show no loss of efficiency, despite the expected larger variability for classical calibration when compared with inverse calibration [32, 33]. In fact, the mean of the absolute residuals suggests that both models are equally efficient.

One or more of the formulae in Tables 4, 5, and 6 can be used to estimate the age of unknown immature skeletal remains from diaphyseal long bone length, according to the information available for bone length, age group, or sex. If, for example, an unknown individual has a femur measuring 190 mm in length and a non-sex-specific estimate is required, the femur length formula in Table 4 can be used to estimate age. Femur length is substituted into the formula age = (length − 97.62)/20.28, thus obtaining an age estimate of 4.56 years after performing the calculation (190 − 97.62)/20.28. The 95 % confidence interval for age can be calculated using the MSE (1.06) for the formula. The confidence interval is obtained first by multiplying the MSE by 2, the approximate critical value in the t distribution that includes 95 % of the population (1.06 × 2 = 2.12). This value is then added to and subtracted from the age estimate (4.56 ± 2.12 years) to obtain a 95 % confidence interval for the estimated age, which in this case is between 2.44 and 6.68 years. However, confidence intervals in the classical calibration models should not be interpreted as meaning that there is a 95 % probability that the age falls between 2.44 and 6.68 years, but rather that there is 95 % confidence that the age lies between 2.44 and 6.68 years [35]. It is also important to highlight that the formulae provided here are for use in dry bone, as there may be a certain amount of shrinkage relative to wet bone, which is more significant in the early ages [42].
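The worked example above can be reproduced with a short sketch. The function name is hypothetical; the intercept, slope, and MSE are the sex-combined femur values quoted in the text, and the critical t value is taken as 2, as in the text.

```python
def estimate_age_with_interval(length_mm, intercept, slope, mse, t_crit=2.0):
    """Return (age, lower, upper) from an inverted classical calibration formula."""
    age = (length_mm - intercept) / slope
    half_width = t_crit * mse
    return age, age - half_width, age + half_width

# Sex-combined femur formula for the total sample, as quoted in the text.
age, lower, upper = estimate_age_with_interval(
    length_mm=190, intercept=97.62, slope=20.28, mse=1.06
)
print(f"Estimated age: {age:.2f} years (95 % CI {lower:.2f} to {upper:.2f})")
```

As in the text, the formula should only be applied to dry bone lengths that fall within the minimum and maximum range reported for the corresponding model.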

Although testing the accuracy of the regression models in the same sample that was used to develop them may be considered inappropriate, the purpose here was not to carry out an independent test of the models but rather a test of their comparative performance. In fact, one would expect the models to perform well in the same sample that was used to develop them, but the reality is that the expectation was only met for the classical calibration model. This confirms the notion that inverse calibration models are inherently biased and should not be used to develop age estimation methods. This has already been widely acknowledged in the adult age estimation literature [32–34], where the systematic bias seen in most adult age estimation methods is considered to be a direct consequence of the use of least squares regression and inverse calibration (age regressed on skeletal age indicator). Consequently, it is perhaps time to acknowledge this same caveat when estimating the age of non-adults. Although Aykroyd and coworkers [32] assert that systematic bias in inverse calibration models is reduced as the correlation between age and the skeletal indicator increases, the relatively high correlation between age and long bone length during growth was not sufficient to prevent significant biases. Using the sex-combined models for the total sample (Table 8), the inverse calibration formulae yield age estimates with an average bias of about 3 months (0.25 years), whereas the classical calibration formulae yield age estimates with an average bias of 0 months, that is, with no bias. With the inverse calibration formulae, an average of 94.9 % of the individuals will have their chronological age included in the confidence interval, compared with an average of 97.4 % using the classical calibration formulae.
Although differences between the classical and inverse calibration models may seem small, the difference is significant in that use of classical calibration reduces the risk of misidentification of unknown individuals based on their age. The accuracy tests for the classical calibration model also show that the mean standard error can be reliably used to calculate 95 % confidence intervals. For most age estimation formulae, 95 % or more of the children had their true chronological age included in the confidence interval.

When the classical calibration models are examined in more detail, it is clear that age can be estimated with more accuracy and efficiency in the subsample that includes individuals under the age of 2 years. This is expected, as variation in growth is smaller in this age group and hence MAR will also be smaller when compared with the subsample that includes only individuals aged 2 years and older. Conversely, the average percentage of individuals whose chronological age is included in the confidence interval is actually slightly smaller in the <2-year subsample (94.6 %), compared with the ≥2-year subsample (96.9 %). In general, the classical calibration model is most efficient for males, although it seems slightly more efficient for females in the subsample of children under 2 years of age. This may be related to the fact that the male and female sample sizes are more even under the age of 2, whereas the older female sample is smaller than that of males, such that the female model is more likely to be influenced by random fluctuation in size. In fact, females show more variation, as demonstrated by an overall larger standard deviation about the mean. The diaphyseal length of the femur is the variable with the least amount of expected error, averaging around 1.06 years (MSE) about the mean estimated age in the total sample with the sexes combined, and the least amount of real error, averaging around 0.88 years (MAR). By contrast, the diaphyseal length of the ulna shows the largest amount of expected (1.21 years) and real error (0.99 years). For the sample of <2 years, the length of the tibia is the least suitable for age estimation.

The greatest limitation of this study is the size of the samples and their age distribution. Although it would be preferable to have a more evenly balanced age distribution, and greater representation of older children, it is currently impossible to add more identified skeletal material. This study combines two of the largest series of documented immature skeletal remains, and includes slightly more individuals, particularly males, than Facchini and Veschi [23] and about twice as many as Rissech and coworkers [24–26]. A larger sample, particularly one which included a larger proportion of older children, would include more variation and possibly reduce the amount of expected error. Another limitation of this study was the lack of opportunity to provide an independent test of the accuracy of the classical calibration models in view of the scarcity of identified immature human skeletal material. It is hoped that access to similar collections by the authors and/or by other researchers will provide a much-desired independent test.

The efficiency of the formulae provided in this study is difficult to assess relative to that of other published formulae [23–26], because previous studies have only used inverse calibration for modeling age and long bone length, and only Rissech and coworkers [25, 26] have actually provided error rates for their formulae. The comparison is further complicated because the error rates provided by Rissech and coworkers [25, 26] cannot be directly compared with those of this study, as only an approximation of the standard error can be obtained from the classical calibration models. The standard errors of the estimate provided by Rissech and coworkers [25, 26] for estimating age from the diaphyseal length of the humerus and of the tibia are 1.399 and 1.777 years, respectively. The mean standard errors for the same models in the study sample, but using classical calibration, are 1.13 and 1.11 years for the humerus and tibia, respectively. In the end, it is difficult to determine which error estimates are actually greater, because of the issues discussed above, but also because Rissech and coworkers’ samples include individuals up to 19 and 15 years of age, which increases the amount of error by introducing increased age variation.

Considering that Facchini and Veschi’s [23] and Rissech and coworkers’ [24–26] recently published formulae for age estimation from long bone lengths were based on least squares and inverse calibration, the overall expectation is for these models to provide biased age estimates. Overall, Facchini and Veschi’s [23] formulae seem to outperform those of Rissech and coworkers [24–26]. This may suggest that the sample used by Facchini and Veschi [23] includes children of approximately the same size for age or only slightly larger than children in this study sample. Conversely, children in Rissech and coworkers’ [24–26] sample seem smaller for their age. However, this may be an artifact of the bias introduced by inverse calibration. In fact, one would expect greater similarities in size for age in Rissech and coworkers’ [24–26] sample, given that some individuals from the same collections were used. Conversely, the difference may also be explained by the fact that Rissech and coworkers’ [24–26] sample includes children from the Coimbra collection and these may be overall smaller in size for age, compared with the Lisbon, Spitalfields, and St. Bride’s collections. The bias introduced by inverse calibration may also explain why Facchini and Veschi’s [23] formulae are more accurate, when in fact the average long bone length in their sample seems slightly greater than that in the study sample. The results of the tests of Facchini and Veschi’s [23] and Rissech and coworkers’ [24–26] formulae illustrate the effects of using inverse calibration for age estimation, where a method developed on a more closely related sample will be less reliable than a method developed on another sample that reflects differences in size. These biases caution against using population-specific methods, particularly when they were developed using inverse calibration.
In addition to the formulae provided by Facchini and Veschi [23] and Rissech and coworkers [24–26], published techniques for fetal age estimation from measurements of the long bones based on inverse calibration [43–45] are also likely to be equally biased.
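The direction of the bias expected from inverse calibration can be shown with a short derivation for simple linear least squares. The notation is introduced here for illustration and is not taken from the cited studies: x is age, y is diaphyseal length, y_0 is the length measured on the unknown individual, S_xx, S_yy, and S_xy are the sums of squares and cross-products in the reference sample, and r is the sample correlation.

```latex
% Classical calibration: fit y on x, then solve for x at y = y_0
\hat{x}_{C} = \bar{x} + \frac{S_{xx}}{S_{xy}}\,(y_0 - \bar{y})

% Inverse calibration: fit x on y directly
\hat{x}_{I} = \bar{x} + \frac{S_{xy}}{S_{yy}}\,(y_0 - \bar{y})

% Ratio of the two deviations from the reference-sample mean age
\hat{x}_{I} - \bar{x} = r^{2}\,(\hat{x}_{C} - \bar{x}),
\qquad r^{2} = \frac{S_{xy}^{2}}{S_{xx}\,S_{yy}} \le 1
```

Because r^2 cannot exceed 1, the inverse estimate always lies between the classical estimate and the mean age of the reference sample. Under this simple model, ages are therefore pulled toward the reference mean, which makes the estimates depend on the age structure of the sample used to derive the formulae.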

A key consideration is the applicability of the age estimation formulae presented here to the analysis of human immature skeletal remains from forensic cases. Numerous studies have shown that long bone growth rates of children in past populations or in populations from developing countries which experience poor nutrition or increased risk of disease are lower than those in modern Western industrialized nations [46]. Skeletal growth profiles in the Spitalfields, St. Bride’s, and Lisbon samples show consistently slower rates of long bone growth compared with the modern reference established by the Denver Growth Study [4, 46]. In addition, secular changes have had a very strong impact on body size, particularly in the last 100 years in Europe, where height in adults and children increased steadily due to improvements in living conditions [47].

The Lisbon collection children lived during a period when most European children were experiencing increased growth in height compared with previous generations [47, 48]. However, Portuguese children did not experience the major secular increase in height until after 1960. While the height of Portuguese children increased by only 0.4–2.6 cm between 1906 and 1936, there was an increase of 10.1–15.9 cm between 1966 and 2006 [49]. The Lisbon collection children predate these major changes in height and were on average shorter than present-day Portuguese children of the same age. English children also experienced a similar increase in height over the twentieth century [50], but the Spitalfields and St. Bride’s children are from the eighteenth and nineteenth centuries. Consequently, all children in the study sample predate the secular increase in height documented for European children during the twentieth century and are more representative of populations experiencing lower levels of social and economic development, including archaeological populations and children from developing countries.

These findings have obvious implications for the use of long bone length as an indicator of age. As the study sample represents a group of children who experienced moderate to severe malnutrition and were exposed to high disease loads, they are stunted in growth [4, 46]. Consequently, the formulae provided here are unlikely to be useful in modern medicolegal contexts of the developed world, as they will not reflect the current growth status of children in most developed nations. The study sample may be considered representative of children in present-day populations which have poor nutrition and increased risk of disease, including those from developing countries and some more deprived communities in developed nations. Therefore, the age estimation formulae should only be used for estimating age in children who experienced environmental conditions similar to those of children living in urban environments in Portugal and England between 250 and 50 years ago. Although it may be impossible to determine whether the formulae are suitable for a specific population or group, the range and mean long bone lengths provided here may offer a crude gauge to make such an assessment. The fact that the growth of the long bones is more similar between Portuguese and English children in the study sample before the age of 2 years suggests that the formulae for children under 2 years of age may be more suitable for forensic purposes and for a wider range of populations. Differences in growth due to environmental influences are usually established by the age of 2 years [47], but become increasingly noticeable after that age because of the cumulative effects of environmental insults on growth. Whereas the regression formulae provided in this study are more likely to be useful in forensic investigations involving immature remains from developing nations, particularly if the child is under 2 years of age, there is little to turn to for help when estimating the age of child remains in the developed world.
If the expert is willing to forgo an error rate for the age estimate, then some of the tables [9–14] and graphs [21, 22] mentioned before can provide some guidance.

A further concern relating to the use of formulae derived from skeletal size in deceased children for age estimation is mortality bias. The growth of children who die prematurely may not be representative of their living counterparts as children who experience stunted growth also have a greater probability of dying. As a result, skeletal samples tend to include a disproportionate number of stunted children [51]. Consequently, the formulae provided here may reflect an overestimation of the growth deficit in an already deprived population. This will be less of a concern in archaeological remains, since in this case children will also have died prematurely from natural causes, whereas child remains in forensic investigations are found more commonly in circumstances where a violent death has occurred.

For reasons similar to those discussed above, Facchini and Veschi’s [23] and Rissech and coworkers’ [24–26] formulae will not provide reliable estimates of age in modern immature skeletal remains from Western Europe, and in particular from the Iberian Peninsula as argued by Rissech and coworkers [25]. Rissech and coworkers [25] assert that their formulae are valid for forensic age estimation in modern children because they did not detect delayed growth in their series compared with modern growth models. They further argued that the homogeneity observed in the maximum length of the adult femur between their series and the Spanish documented skeletal collection of the Universidad Autonoma de Barcelona (UAB) [24] implied a lack of growth deficit in their subadult series. However, the adults in the UAB collection were born between 1892 and 1959 [24], making them of the same birth cohort as the Lisbon collection children. As a result, they are likely to have experienced a similar deficit during growth to that of the Lisbon sample, resulting in a smaller adult size than would be attained by modern adults. Given the secular effects discussed above, European children are considerably taller today than they were 50 or 100 years ago. Spanish children, in particular, are among the Europeans who have experienced the greatest secular increase in height since the middle of the twentieth century, about 2.4 cm/decade [48, 52]. Unfortunately, Rissech and coworkers [24–26] have incorrectly asserted that their samples are appropriate for use in a modern Iberian or Western European forensic context. This provides a cautionary note against assumptions of temporal continuity in skeletal growth in populations when using cemetery collections of identified human skeletons to represent present-day populations, as secular trends must be considered.

Conclusions

The long bone lengths collected in this study are derived from one of the largest assemblages of known sex and age immature skeletons presently available, not only in terms of overall size but because the entire age range is reasonably well represented. Analysis of these data demonstrates that inverse calibration is an inappropriate approach for developing age estimation formulae and that it should be abandoned when modeling age and a skeletal indicator of age, even in immature skeletal remains. The classical calibration models presented here provide a series of new formulae for age estimation from the diaphyseal length of the long bones that can potentially be used in a variety of contexts. The long bone growth rates in the study sample are lower than those of the modern Western industrialized nations which have undergone considerable secular change because of improved living standards. As a result, the formulae cannot be reliably applied to forensic cases involving recent children, but can possibly provide the best available estimates when children from developing countries or from poor communities in developed nations are involved. The formulae can also be used in archaeological populations where long bone growth rates in children are similar to those in England in the eighteenth to nineteenth centuries and Portugal in the early twentieth century. Previously published formulae for age estimation from long bone lengths have provided unsatisfactory results due to failure to consider secular trends and inappropriate statistical treatment of data.