Introduction

Reliable sex determination is of paramount importance in forensic practice [1–3]. It provides one component of an individual’s biological profile, complementing age-at-death estimation, stature, and population affinity assessment [1, 4]. It is widely accepted that the most accurate sex estimation from the skeleton is obtained from the pelvis, followed by the skull [5, 6]. Unfortunately, these elements are not always preserved in forensic cases [7], and in such situations long bones can be used. Robust bones such as the femur or tibia, which are often better preserved, are especially suitable for sex estimation [8, 9]. Numerous published studies have demonstrated the usefulness of the tibia in forensic anthropology [8, 10–21]. Almost all of them agree that its proximal end is the most sexually dimorphic. Generally, the accuracy of sex determination is high (above 85 % when multiple measurements are used), especially with proximal and distal widths as well as circumference [22].

In sex assessment, discriminant function analysis (DFA) is the most frequently used classification tool. Its advantages and shortcomings have been discussed recently [23, 24]. Discriminant functions tend to be sample-specific, and their generalizability needs to be assessed in different samples [25].

Any attempt to develop standards for sex estimation of human skeletal remains must take into account that the pattern of sexual dimorphism varies among human populations. Therefore, it is important to avoid applying metric standards that were proposed for a different population or a different time period than the studied sample [26]. Also, as several studies have shown, body size has changed over generations as a consequence of the secular trend [27–29]. These changes relate especially to body height [30]. A secular trend in stature has been widely observed in humans since the nineteenth century, and this trend directly affects the dimensions of long bones [31–34].

Major problems with methods that use size-based variables are that standards can be influenced by secular trend and are usually population-specific. On the other hand, this is to a certain extent also true for shape-based characteristics [22]. Thus, it should be assumed that methods for sex determination, based on skeleton collections of known sex from the first half of the twentieth century, cannot guarantee the same reliability of results when they are used in attempts to identify unknown human remains from recent populations [35].

Application of metric data from one population into the DFA derived from different population groups results in high classification error, and the results are also affected by large sex bias [36, 37]. In a study designed to quantify the effect of applying Euro-American [38] and South African of European ancestry [39] standards to Australian population sample [40], classification accuracy was approximately the same in Australian sample as in target sample (e.g., 80–83 %). On the other hand, correct estimation of sex was distinctly distorted by unacceptable sex bias, i.e., the difference in correct sex determination of males compared to females (which was 31 and 36 %, respectively).

The effect of disregarding population specificity on the classification accuracy of DFA is often mentioned [25] but is rarely described for different parts of the skeleton. In recent publications, we found such studies only for crania from Western Australian and Indian populations [25, 30] and, in European populations, for the clavicle [41], the calcaneus [37], and the femur [36]. Recently, many authors have argued for the development and use of population-specific formulae for diverse parts of the skeleton when metric data are used [25, 30, 35, 41–45].

The first objective of the present study is to propose classification functions (CFs) for sex estimation in the recent Czech population based on CT imaging. Osteometric data collected from CT images are a reliable and accepted source and are used increasingly often for sex assessment in forensic anthropology [46–51]. The second objective is to simulate ignoring population specificity and to show in practice the range of errors in sex classification that results.

Material and methods

Material

In this study, we used surface models constructed from anonymized CT angiography scans (the slice increment was set at 0.5 mm). Tibial 3D models were reconstructed by thresholding in the specialized software Mimics (Materialise, Leuven, Belgium). Our virtual material comes from the Czech population of the twenty-first century. The age, sex, and race of all specimens are known. The sample consists of 56 nonpathological left human tibiae, 30 from males and 26 from females. Male individuals were born between 1943 and 1980, and their mean age was 56.1 years (range 31 to 68). Female individuals were born between 1920 and 1978, and their mean age was 69.0 years (range 33 to 91). Table 1 shows age distributions, mean ages, and standard deviations. The same material was originally used in the study of Brzobohatá et al. [52], where the authors performed sex determination from tibiae using geometric morphometrics.

Table 1 Characteristics of the Czech sample by age and sex

Selection of measured dimensions

Following relevant published works, we chose ten dimensions commonly measured in anthropology. They are listed in Table 2 together with the abbreviations used, the references in which they were defined, and the landmarks used for computing them. Most of the chosen dimensions were defined by Martin and Saller [53] and are marked by the letter M; in other cases, the author of the definition is given in the same table.

Table 2 Selected and measured dimensions with their name, references, and used landmarks

The selection was adjusted to several conditions:

  1. Dimensions had to correspond with the selection of discriminant functions.

  2. According to the related literature, the most dimorphic dimensions were selected (area of the knee joint mostly).

  3. Circumferential dimensions were excluded because of the difficulty of their reproducibility in a virtual environment as opposed to measuring real dry bones.

  4. Finally, length was also left out as the length of the bone does not perceptibly contribute to sex classification, compared to joint dimensions.

Measurements were carried out in the Morphome3cs software (www.morphome3cs.com), which was developed in the Department of Software and Computer Science at Charles University in Prague. In this program, we located 23 landmarks on the surface model of each bone, and one landmark (number 21) was computed by the software. For dimension M3, we first had to construct the medial plane (defined by landmarks 22–24) and then located landmarks 11 and 12 as the most laterally prominent points of the proximal part. These two points lay on a tangent plane parallel to the defined medial plane. Dimension M6 was measured by locating auxiliary landmarks 18 and 20 and defining landmark 21 as their midpoint; M6 is the distance between landmarks 17 and 21. The other dimensions were measured simply as the distances between the two relevant landmarks. The locations of all landmarks are documented in Fig. 1. All metric values were collected by one observer.
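The construction of M6 described above can be illustrated with a minimal sketch. The landmark coordinates below are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the study, and the helper names are illustrative rather than part of Morphome3cs.

```python
import math

def midpoint(p, q):
    """Midpoint of two 3D landmarks given as (x, y, z) tuples."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))

def distance(p, q):
    """Euclidean distance between two 3D landmarks (same units as input)."""
    return math.dist(p, q)

# Hypothetical landmark coordinates in mm (illustrative only).
lm17 = (0.0, 0.0, 0.0)
lm18 = (40.0, 10.0, 0.0)
lm20 = (40.0, -10.0, 0.0)

# Landmark 21 is computed as the midpoint of auxiliary landmarks 18 and 20.
lm21 = midpoint(lm18, lm20)

# M6 is the distance between landmarks 17 and 21.
m6 = distance(lm17, lm21)
```

The remaining dimensions, which are plain distances between two landmarks, would use `distance` directly.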

Fig. 1 Location of landmarks on the (a) proximal part (1–12, 22, 23), (b) proximal part, posterior view, (c) body (13–16), and (d) distal part (17–21, 24) of the tibia

Intraobserver error

Intraobserver error was assessed by relocating all main landmarks on seven randomly selected tibiae. The measurements were repeated seven times, with an interval of at least 1 day between sessions. The average error was calculated in the Morphome3cs software and reached 0.4734 mm.

Selection of published discriminant functions

The process of selection of discriminant functions follows these requirements:

  1. Only multidimensional discriminant functions were chosen, because one dimension cannot reliably determine the sex, due to overlapping features of sexual dimorphism between female and male [54].

  2. Sex bias could not exceed 7.5 %.

  3. The accuracy of sex estimation was greater than 75 %.

  4. Selection was made with regard to the criteria and limits of the selected metric dimensions.

  5. Selected functions were designed for both European and non-European populations from various periods of time.

Based on these criteria, we chose nine published discriminant functions whose coefficients for each variable, constants, sectioning points, and levels of accuracy are summarized in Table 3. Discriminant functions 1–4 were proposed by Brůžek [14] for a Portuguese population born between the early nineteenth and early twentieth centuries. DF5–8 were proposed by Kranioti and Apostol [21] for a sample of south European populations (Spanish, Italian, Greek, and pooled populations) who died in the second half of the twentieth century. The last discriminant function, DF9, was proposed by Işcan and Shaivitz [10] for a North American population born between the early nineteenth century and the first half of the twentieth century.

Table 3 Selection of published discriminant functions with coefficients, sectioning points, and accuracy of sex determination

To evaluate classification performance, accuracy and sex bias were used. According to several authors [25, 55, 56] sex bias is computed as the difference between the classification accuracy of males and that of females, both expressed as percentages. A sex bias lower than 5 % is ideal in forensic anthropology [25]; however, to increase the number of functions available for comparison, a limit of 7.5 % was used.
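As a minimal sketch, the sex bias definition and the two numerical selection criteria (sex bias not exceeding 7.5 %, accuracy above 75 %) can be written as follows. The function names and the example percentages are hypothetical, and comparing the absolute value of the bias with the limit is an assumption made here, since the text does not state how negative values were handled.

```python
def sex_bias(male_accuracy, female_accuracy):
    """Sex bias in percentage points: male accuracy minus female accuracy."""
    return male_accuracy - female_accuracy

def meets_selection_criteria(overall_accuracy, male_accuracy, female_accuracy,
                             max_bias=7.5, min_accuracy=75.0):
    """Check the two numerical criteria used to select published functions.

    Assumption: the bias limit is applied to the absolute value of the bias.
    """
    bias_ok = abs(sex_bias(male_accuracy, female_accuracy)) <= max_bias
    accuracy_ok = overall_accuracy > min_accuracy
    return bias_ok and accuracy_ok

# Hypothetical published functions (illustrative percentages only).
accepted = meets_selection_criteria(86.0, 88.0, 84.0)   # bias 4.0, accuracy 86 %
rejected = meets_selection_criteria(80.0, 95.0, 65.0)   # bias 30.0
```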

Selection of dimensions used for proposing own CF

For classification purposes, the most sexually dimorphic dimensions of the tibia were chosen. We thus preferred the dimensions on the proximal part of the bone in the area of the knee joint (BB, M3, M3a, M3b, M4a, M4b). The last dimension, M6, was selected because several authors [9, 12, 14, 20] had used dimensions measuring the distal breadth of the bone in combination with measurements on the proximal part. Then, using the software Statistica [57], we tested combinations of two or more dimensions and selected the ones with the greatest discriminatory power.

Statistical analyses

Basic statistical characteristics and the application of Czech dimensions to the selected discriminant functions were computed in MS Excel. A two-sample t test was used to quantify the level of dimorphism in the measured dimensions. Descriptive statistics for both sexes of the Czech sample, including the standard deviation and t test for each dimension, are shown in Table 4. As expected, males have greater tibial dimensions than females. The t test indicates that these differences are statistically significant (α = 0.05) for all dimensions except one (DB), whose p value was above the significance level. For all these statistical analyses, and for computing our own discriminant functions, Statistica software [57] was used. In addition, tenfold cross-validation of the linear discriminant functions was carried out in R.
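For readers who want to reproduce the dimorphism test, the following is a minimal sketch of a two-sample t statistic. It assumes the pooled-variance (Student) form, which the text does not specify, and the measurement values are hypothetical. The resulting statistic would be compared with the critical value of the t distribution for the returned degrees of freedom at α = 0.05.

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumption: equal variances in both groups (Student's t test).
    """
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(x) - mean(y)) / standard_error, df

# Hypothetical measurements of one dimension in mm (illustrative only).
males = [78.0, 80.0, 82.0]
females = [70.0, 72.0, 74.0]

t_value, dof = two_sample_t(males, females)
```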

Table 4 Descriptive statistics of the measured variables of the tibia from Czech population

Classification techniques

For the classification, we used two different approaches: linear DFA in the Statistica software and an advanced data modeling method based on the Group of Adaptive Models Evolution (GAME) [58, 59]. The GAME method was used to search for new classification functions. This inductive method is based on a feed-forward artificial neural network and combines different types of transfer functions. Both the structure of the network and the parameters of the transfer functions are set automatically during a run of GAME. The final structure of the network represents the new classification function. An example of such a network structure is shown in Fig. 2. Because of the random initialization, the GAME method provides a different discriminant function in each run. The resulting DF can be either linear or nonlinear. To obtain representative results from the GAME method, we used ten times repeated tenfold cross-validation.
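The resampling scheme can be sketched as below: ten repetitions of a tenfold split yield 100 train/test partitions, one per fitted classification function. This is an illustrative, unstratified implementation under our own assumptions (function name, seed, and fold assignment are hypothetical); it does not reproduce the internals of GAME or of the R routine used for the linear functions.

```python
import random

def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Assumption: folds are formed by a plain random shuffle, without
    stratification by sex.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx

# 56 tibiae, ten times repeated tenfold cross-validation.
splits = list(repeated_kfold_indices(56))
```

Each model would be fitted on `train_idx` and scored on `test_idx`, and the mean and standard deviation of the accuracies would then be taken across repetitions.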

Fig. 2 Example structure of the GAME neural network with one input layer, one hidden layer, and one output layer for classification of males and females. The number of hidden layers can differ for each particular run of the algorithm

Apart from the classification functions, the GAME method also includes a feature ranking method called FeRaNGA [60]. FeRaNGA reports the importance of the individual input variables used while searching for classification functions during the ten times repeated tenfold cross-validation.

Results

Discriminant analyses

Computation of DFA in Czech sample

We designed two linear discriminant function equations (Table 5) for the modern Czech population. The first (CF1) includes seven variables (BB, M3, M3a, M3b, M4a, M4b, and M6) and provides 85.7 % correct determination (83.3 % males, 88.5 % females; sex bias −5.2 %). The second (CF2) uses five variables (BB, M3a, M3b, M4a, and M4b) and determines sex with 82.1 % correct assessment (80.0 % males, 84.6 % females; sex bias −4.6 %). The cross-validated accuracy is lower than that obtained on the original data: 75.0 % for CF1 and 76.8 % for CF2. The sectioning point is equal to 0; values above 0 indicate a male, while values below 0 indicate a female.
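Applying a linear function of this kind amounts to a weighted sum of the measurements plus a constant, compared with the sectioning point. The sketch below uses hypothetical coefficients and measurements, not the values of CF1 or CF2 from Table 5. Assigning a score of exactly 0 to the female class is an arbitrary choice made here, since the text does not cover that case.

```python
def linear_score(measurements, coefficients, constant):
    """Discriminant score: sum of coefficient * measurement, plus the constant."""
    return sum(coefficients[name] * measurements[name] for name in coefficients) + constant

def classify_linear(score, sectioning_point=0.0):
    """Scores above the sectioning point indicate a male, otherwise a female.

    Assumption: a score equal to the sectioning point is labeled female.
    """
    return "male" if score > sectioning_point else "female"

# Hypothetical coefficients and constant (NOT the published CF1/CF2 values).
coefficients = {"BB": 0.5, "M6": 0.5}
constant = -60.0

# Hypothetical measurements in mm for two individuals.
individual_a = {"BB": 80.0, "M6": 50.0}
individual_b = {"BB": 70.0, "M6": 45.0}

label_a = classify_linear(linear_score(individual_a, coefficients, constant))
label_b = classify_linear(linear_score(individual_b, coefficients, constant))
```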

Table 5 The proposed classification functions in the Czech population

Advanced classification technique (GAME) in Czech sample

The GAME method automatically creates the structure of the artificial neural network, and this structure is expressed as a classification function. Owing to the random initialization of GAME, the classification function is different for each run of the method. For this reason, we used ten times repeated tenfold cross-validation (thus creating 100 classification functions) to obtain a mean classification accuracy of the GAME method that can be compared with other approaches. Each classification function has a different level of complexity; it can be linear, nonlinear, or any combination of both types. The classification function with the highest accuracy is not necessarily applicable in practice (e.g., an exponential function of a logarithmic function whose argument is a sigmoidal function is not a suitable candidate for practical use). The overall average classification accuracy from 100 runs of the GAME method was 82.35 % with a standard deviation of 2.25 %. On this basis, we selected three user-friendly and practically applicable classification functions that combine high classification accuracy with minimum complexity. One linear classification function (CF3) and two nonlinear classification functions (CF4, CF5) are shown in Table 5. The linear CF3 includes six variables (DB, M3a, M4a, M4b, M6, M9a) and provides a classification accuracy of 83.9 % (males 83.3 %, females 84.6 %) with a sex bias of −1.3 %. The first of our nonlinear functions, CF4, is exponential. Its classification accuracy and sex bias are the same as for CF3, although it needs only four variables to reach this accuracy. The last classification function, CF5, is sigmoidal and provides the highest classification accuracy of CF1 to CF5, 87.5 % (males 90 %, females 84.6 %), with a sex bias of 5.38 %.

The discriminatory value for CF3, CF4, and CF5 is equal to 0.5. A smaller value indicates a male subject, while a value greater than or equal to 0.5 indicates a female subject.
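The decision rule for the GAME-derived functions can be sketched as follows. Only the threshold of 0.5 and the direction of the rule come from the text; the sigmoid below stands in for a function of the sigmoidal type such as CF5, whose actual form and coefficients are given in Table 5 and are not reproduced here, and the input values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify_game_output(output, threshold=0.5):
    """Outputs below the threshold indicate a male, otherwise a female."""
    return "male" if output < threshold else "female"

# Hypothetical inner values passed to the sigmoid (illustrative only).
label_low = classify_game_output(sigmoid(-2.0))
label_edge = classify_game_output(sigmoid(0.0))   # output is exactly 0.5
```

Note that the direction of this rule is the opposite of the linear DFA rule, where larger scores indicate a male.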

The importance of the individual input variables, as determined by the FeRaNGA method during the search for the best classification function, is listed here from the most to the least important: M6 (28.2 %), DB (21.7 %), BB (16.4 %), M3 (14.9 %), M9a (8.7 %), M3a (4.8 %), M4a (2.5 %), M4b (1.8 %), M8a (0.6 %), and M3b (0.4 %). The four most important variables are M6, DB, BB, and M3.

Application of dimensions from Czech sample into published DFA

Table 6 compares classification success rates in the original reference sample and in the target sample (Czech population) separately for males and females. When we entered the Czech dimensions into DF1, originally created for a Portuguese sample, almost all Czech males (90 %) were determined correctly; only three of them were classified as female. Females, on the other hand, were frequently misclassified as male: only 38.5 % of the female sample was successfully classified. We obtained very similar results using DF2, with a classification success of 93.3 % among males and, coincidentally, 38.5 % among females. DF3 correctly classified 83.3 % of males and more than half (53.9 %) of females. The last tested function originally proposed for the Portuguese population, DF4, classified males flawlessly (100 %) but misclassified all females in our sample (0 %). DF5 was fitted to a Spanish population and correctly separated 100 % of males and just 11.5 % of females. Discriminant function 6, derived from an Italian population, was ultimately not included (see below). DF7 was proposed for a Greek population; it classified all males in our sample correctly but only 3.9 % of females. DF8 was derived from mixed south European populations and gave 100 % correct assessment of males and 7.7 % correct determination of females. The last tested discriminant function, DF9, proposed for a Euro-American population, successfully classified all Czech males but only 7.7 % of the female sample.

Table 6 Application of DFA proposed in different populations in recent Czech population: simulation of disregarding DFA population specificity

The same table compares the sex biases computed for the reference and target samples. While the values for the published DFA are low (<7.5 %), the sex biases of the tested functions in the Czech sample are unacceptably high (29.4 % and above).
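To make the link between per-sex success rates, overall accuracy, and sex bias concrete, the sketch below evaluates a set of predictions against known sex. The example reproduces the pattern reported for DF4 (all 30 males and none of the 26 females classified correctly); the function name and label encoding are our own illustrative choices.

```python
def evaluate(true_sex, predicted_sex):
    """Per-sex accuracy, overall accuracy, and sex bias, all in percent."""
    pairs = list(zip(true_sex, predicted_sex))
    n_male = sum(1 for t, _ in pairs if t == "M")
    n_female = sum(1 for t, _ in pairs if t == "F")
    male_ok = sum(1 for t, p in pairs if t == "M" and p == "M")
    female_ok = sum(1 for t, p in pairs if t == "F" and p == "F")
    male_acc = 100.0 * male_ok / n_male
    female_acc = 100.0 * female_ok / n_female
    return {
        "male": male_acc,
        "female": female_acc,
        "overall": 100.0 * (male_ok + female_ok) / len(pairs),
        "sex_bias": male_acc - female_acc,  # males minus females
    }

# Pattern reported for DF4: every individual is classified as male.
true_sex = ["M"] * 30 + ["F"] * 26
predicted_sex = ["M"] * 56

result = evaluate(true_sex, predicted_sex)
```

In this case the overall accuracy is simply the proportion of males in the sample (30 of 56), which shows why overall accuracy alone can hide a complete failure for one sex.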

Discussion

Forensic anthropology benefits from accurate and reliable methods for estimating the parameters of an individual’s biological profile. The Daubert standard requires forensic anthropologists to confirm the validity of their methodologies through empirical testing, peer review, publication, and calculation of error rates. External validity must be tested, and if the proposed procedures are based on one population, they must be tested on other populations [1, 61, 62]. In cases of natural mass disasters, accidents, and mutilated human remains, sex estimation is often based on a single skeletal element, most frequently the long bones of the extremities, vertebrae, or skull fragments [63, 64]. The success of sex determination depends not only on the method and the degree of sexual dimorphism of the reference population in which the method was proposed, but also on the degree of sexual dimorphism of the target population to which the method is applied. Moreover, the reliability of sex allocation is influenced by the secular trend [27, 28]. For these reasons, as outlined in the Introduction, most methods for sex estimation are population-specific. At the same time, the tibia may be more susceptible to short-term secular changes, and more research is required to test this hypothesis [21]. These short-term changes in the dimensions of the tibia are also affected by the increasing prevalence of obesity [65]. A number of methods have been proposed on identified skeletal collections that come from populations that no longer exist. Such studies were published on historical skeletal collections, and their methods are offered and available to users [8–15, 66]. As noted by Ousley and Jantz (2012), “Data from modern American show that standards derived from the nineteenth century are not appropriate for the assessment of twentieth-century groups.” and “Also, postcranial samples from the Hamann-Todd Collection used by Iscan and Cotton (1990) perform very poorly when applied to modern cases” [67]. A major problem with the use of sex classification tools is that a population is not stable but changes continuously under the influence of shifting socioeconomic factors. Therefore, it is important to know the risk of error when the population specificity of the methods is completely ignored.

Altogether, we proposed five classification functions for sex determination in the Czech sample: two linear discriminant functions obtained by DFA and, from the GAME method, one linear, one exponential, and one sigmoidal function. The maximum classification accuracy using linear DF analysis was 85.7 %, with a sex bias of −5.2 % and seven variables included (CF1). However, the highest classification accuracy, 87.5 % with a sex bias of 5.4 % and only three variables (CF5), was achieved using the feed-forward artificial neural network (GAME method). This confirms that classification functions other than classic linear DFA are suitable as classification methods [24, 68]. With all five functions, we reach an acceptable sex bias according to Franklin et al. [25], ranging in absolute value from 1.3 to 5.4 %; only two of them slightly exceed the 5 % limit. More importantly, we achieved a success rate and sex bias (Table 5) comparable to those reported by the other authors mentioned for other population samples. Brzobohatá et al. [52] also reached similar results for samples from the twentieth and twenty-first centuries using geometric morphometrics. Although CF5 (sigmoidal function) provides the best determination of sex, it has the greatest sex bias (5.4 %), which is still acceptable. It should be noted that CF3 (linear) and CF4 (exponential) have the same accuracy and the same sex bias, but CF3 includes six variables and CF4 only four. Generally, the nonlinear classification functions (CF4 and CF5) tend to require fewer input variables. This is advantageous, at least in cases of incomplete or damaged remains, where it is not possible to measure all dimensions.

In this study, we tested the accuracy of discriminant functions fitted to various populations using dimensions measured on our Czech sample. We proceeded in the same manner as a potential user would when estimating sex from the tibia. We used only multivariable methods because, according to Peckmann et al. [37], a single variable has little practical value for sex assessment owing to the large overlap between the sexes. The results reveal strong sexual dimorphism of the tibial dimensions in the Czech population, which contributes significantly to sex discrimination.

When the Czech sample was classified using seven DF from three different Mediterranean European populations [14, 21] or the DF from American Caucasians [10], sex was estimated correctly in only 53.6–69.7 % of cases. Besides this rate of misclassification, which is unacceptable for practical use in the forensic sciences, the sex bias ranged from 29.4 to 100 %. The results repeatedly showed almost completely correct determination of males (DF1–5, DF7–9) but, in contrast, failure to classify female individuals. DF6 could not be used because there is a mistake in the coefficients or the constant given for function IF4 in the original paper of Kranioti and Apostol [21]. This can be verified by solving the discriminant equation with the mean values; in that case, Italians of both sexes obtain negative values despite a sectioning point of zero.
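The consistency check described for DF6 can be sketched as follows: a usable function should place the male and female reference means on opposite sides of its sectioning point. All coefficients, constants, and means below are hypothetical; the published values for IF4 are not reproduced here.

```python
def score_at(means, coefficients, constant):
    """Evaluate a linear discriminant function at a vector of mean values."""
    return sum(coefficients[name] * means[name] for name in coefficients) + constant

def separates_sexes(male_means, female_means, coefficients, constant,
                    sectioning_point=0.0):
    """True if the two sex-specific means fall on opposite sides of the cut."""
    male_side = score_at(male_means, coefficients, constant) - sectioning_point
    female_side = score_at(female_means, coefficients, constant) - sectioning_point
    return male_side * female_side < 0

# Hypothetical reference means (mm) and coefficient (illustrative only).
male_means = {"BB": 80.0}
female_means = {"BB": 70.0}
coefficients = {"BB": 0.5}

consistent = separates_sexes(male_means, female_means, coefficients, constant=-37.5)
# A wrong constant pushes both means below zero, the situation described for DF6.
inconsistent = separates_sexes(male_means, female_means, coefficients, constant=-50.0)
```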

Low classification rates in females were also observed in the sexing of metacarpals [69]; the results of that study indicate that standards developed from continental Greece are not suitable for forensic cases in Crete because they do not represent the local Cretan population. The original sectioning point sometimes completely misses the distribution of the sexes. Low accuracy together with high sex bias is unacceptable in forensic anthropology. Similar results, showing that DF that perform well in one population may produce high error rates and sex biases when applied to other populations, were reported by several authors for crania [55, 70, 71].

The most likely explanations for such results are the variability of individual populations [72] and the different degree of sexual dimorphism of the tibia among populations [73], and thus the different sex-specific means of the dimensions in each population (variation in the values of the dimensions). We should take into consideration that, unlike the studies in which DF1–DF9 were proposed, we used 3D models of bones. Based on previous studies [74, 75], whose authors support this comparability for skulls, we believe that measurements derived from 3D models are comparable to those from dry bones. The general effect of applying standards from another population is therefore a reduction in classification accuracy, the magnitude of which is proportional to the degree of divergence in size between the original reference sample and the individual to which those standards are applied. It follows that the main limitation of published methods for estimating sex from metric data of the tibia (and other bones) is their population specificity (e.g., [21, 42]). The evidence comes from the contrast between the reported and the observed reliability and accuracy of the classification methods. Population specificity is also related to the trend of globalization [76]. The problem is that the individual whose sex we want to estimate does not necessarily come from the area in which their remains were found (e.g., [4, 77]). Assessing the population affinity or ancestry of an unknown specimen is difficult given the worldwide increase in admixture of human populations (e.g., [78–80]).

What options are there for dealing with the population specificity of methods, which is characteristic not only of sex estimation but of all methods for estimating the biological profile of an individual? Forensic anthropology is constrained by a paucity of population-specific standards, as the number of repositories of documented skeletons, traditionally the main source of population-specific data, is limited [62]. However, as Albanese et al. [81] mentioned, group specificity should increase the precision of the estimation methods, but in many cases the parameters of a given group are based on assumptions that make assigning an unknown individual to a group problematic under the best conditions and impossible at worst. Contrary to traditional assumptions, methods developed on pooled samples show a certain degree of robustness. For example, a pooled-ancestry stature equation can be more appropriate than population-specific equations obtained from a single sample [82]. It should be kept in mind that it is not possible to respect population specificity in bioarchaeology. The only possible solution could be the creation of regional standards in Europe that correspond to the variability of body size, and a robust method for sex estimation should be developed on that basis. A similar design was used in a study of stature in bioarchaeology [83], in which the authors created northern and southern European formulae.

Medical imaging (CT, MRI, ultrasonography) of living individuals offers an appropriate and reliable source of contemporary population data from selected geographic populations. These data can be pooled in a metapopulation sample from which skeletal standards for the estimation of age, sex, and stature could be developed. Combining recent data from CT images with historical collections of identified individuals could lead to highly robust methods that would provide a reliable estimation of the biological profile in forensic anthropology and bioarchaeology. For these reasons, one of the goals of the international scientific community is to develop a synergistic, well-coordinated activity to create a database of 3D imaging from clinical computed tomographic scans of the skeleton in order to formulate new nonpopulation-specific methods.

Conclusion

Sexual dimorphism of the tibia is well established in the literature, and standards for the Czech population have been proposed. However, all the DFA proposed in the Czech sample and all of the previously published functions from other populations were population-specific, with classification accuracy ranging between 80 and 90 %. Disregarding or ignoring this fact causes complete failure of sex estimation. Direct application of published population-specific DFA to the Czech sample leads to unacceptably low accuracy, with sex bias ranging from 29.4 to 100 %. This failure of sex assessment is caused by the variability of skeletal dimensions among populations. For these reasons, the misclassification rate does not meet the Daubert criteria, and therefore, DFA from the tibia developed in other populations cannot be recommended for the estimation of sex of recent skeletal remains in Europe.