Introduction

Recent research trends in forensic anthropology indicate a growing awareness of the importance of (and requirement for) contemporary population-specific standards; the development, refinement and testing of alternative methodological approaches have also been given due consideration. In relation to assessing sex in human skeletal remains, the former issue is listed as a recommended practice in the ‘Best Practice’ guidelines published by the Scientific Working Group for Forensic Anthropology. Additional objectives of this working group are to disseminate research and technology relating to the discipline and to identify best practice and future directions [1].

With regard to forensic practice in Western Australia (and Australia generally), there is a relative paucity of contemporary population-specific skeletal standards for the estimation of sex; in attempting to fortify the capability of forensic anthropologists in this country, we are developing standards for estimating sex that are applicable to a number of different skeletal elements, complete and/or fragmentary [e.g. 2, 3]. The formulation of accurate, robust (e.g. statistically quantified) and “useable” (e.g. able to be applied to incomplete bones) standards is clearly a desirable and necessary aim, but it is equally important that alternative methodological approaches be given due consideration; this ensures that only the most robust and accurate methods are disseminated to the wider forensic community. Ideally, the “end-user” (forensic scientist/investigator) will then be better prepared, able to select from a suite of available methods the one that most appropriately suits a given situation.

An important reason for developing and/or applying standards specific to the geographical region in which a forensic investigator is working is the well-established finding that applying non-population-specific standards (e.g. for estimating sex) reduces classification accuracy and/or introduces an unacceptably large sex bias [e.g. 4, 5]. In forensic anthropology, it is now (more than ever) crucial that methods are empirically tested and that the uncertainty associated with any final estimation is statistically quantified, because this is a core requirement for the admission of expert testimony [6, 7].

Repositories of documented human skeletons have traditionally provided the necessary biological data for the development of anthropological standards. Such collections, however, are not globally available. In recent years there has been an increasing movement towards utilizing radiological imaging data that more appropriately represent contemporary individuals from a variety of geographically diverse populations. In terms of forensic validation studies, it has been demonstrated that 3D CT data are suitable for the reliable quantification of traditional osteometric measurements [e.g. 8, 9]. It has also been established that radiological (or digital) skeletal data can be used to formulate sexing standards that have a high degree of expected accuracy, above 80 % and even beyond 90 % (e.g. cranial CT scans [9], humeral digital radiographs [10], lateral skull cephalographs [11] and digital photographs of the glenoid cavity [12]). Such accuracy rates are comparable to, if not higher than, results achieved in traditional morphometric studies of the same bones [e.g. 13–15].

With regard to the refinement and testing of both novel and established traditional methodological approaches, the main aim of the present study is to explore how accurately cranial form predicts sex in a sample of adult Australians by applying traditional morphometric and geometric morphometric methods to data derived from three-dimensional anatomical landmarks acquired from CT-reconstructed crania. From an original dataset of more than 1,000 pairwise inter-landmark distance (ILD) measurements, we selected a series of traditional measurements that most accurately estimate sex in this population. We concurrently assess and compare the effectiveness of three-dimensional multivariate descriptors of cranial size and shape for estimating sex. The robustness of all findings is then explored using resampling statistics.

Materials and methods

Subjects

The present study examines cranial multi-detector computed tomography (MDCT) scans of 100 male and 100 female adult patients who presented at various Western Australian hospitals for clinical cranial evaluation during 2010 and/or 2011. The scans were anonymized when received by the authors, with only sex and age data retained. The mean age of the male individuals is 36.4 years (range 18–63) and that of the females is 40.3 years (range 18–61). Specific information regarding the ethnicity of individual participants is unavailable, but the sample is taken overall as representative of a “typical” contemporary Western Australian population (e.g. primarily Caucasian; see [2]). CT scans presenting acutely abnormal morphologies resulting from trauma and/or pathology (e.g. serious fractures) were excluded if these affected our ability to accurately locate the required cranial landmarks (see the following discussion). Research ethics approval was granted by the Human Research Ethics Committee of the University of Western Australia (RA/4/1/4362).

Radiological data acquisition

The selection criteria were very specific: only scans with a slice thickness ≤1 mm that included the entire region of interest (cranial vault to alveolar margins) were accepted. Cranial patient CT imaging was performed using MDCT on a Philips Brilliance 64 scanner (Philips Healthcare, North Ryde, Australia). Following 3D volume rendering, the 3D coordinates of 46 anatomical landmarks were acquired (by AF) using OsiriX® (version 3.9–64 bit). The landmarks are defined in Table 1 and illustrated in Fig. 1.

Table 1 Definitions of the landmarks used in the present study (see Fig. 1 for illustration)
Fig. 1 Anterior and lateral views of the cranium showing the landmarks used in the present study. Only selected landmarks are illustrated (see Table 1 for definitions)

Geometric morphometrics

Raw landmark coordinates were imported into MorphoJ [16] as an ASCII text file in the format specified by the programme user’s guide. In this software, the shape coordinates were obtained using a Procrustes superimposition of the original raw landmark configurations [17]. This procedure standardizes size and minimizes differences in translation and rotation using a least-squares method. Centroid size (CS), the square root of the sum of squared distances between the landmarks and their centroid (i.e. the mean of the X, Y and Z raw coordinates), was computed for each specimen; this was used as a biologically meaningful expression of the overall size of the landmark configuration and thus of the cranium [18].
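The definition of centroid size given above can be illustrated with a minimal sketch in Python. This is not the MorphoJ implementation; the function names and the four-landmark configuration are hypothetical and serve only to make the arithmetic concrete. The second function shows only the translation and scaling steps of a Procrustes fit; the rotation step is omitted.

```python
import math

def centroid_size(landmarks):
    """Square root of the summed squared distances from each landmark to the centroid."""
    n = len(landmarks)
    dims = len(landmarks[0])
    centroid = [sum(p[d] for p in landmarks) / n for d in range(dims)]
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in landmarks))

def scale_to_unit_size(landmarks):
    """Centre a configuration on its centroid and rescale it to centroid size 1."""
    n = len(landmarks)
    dims = len(landmarks[0])
    centroid = [sum(p[d] for p in landmarks) / n for d in range(dims)]
    cs = centroid_size(landmarks)
    return [[(p[d] - centroid[d]) / cs for d in range(dims)] for p in landmarks]

# Hypothetical configuration (mm): a 2 x 2 square, for illustration only.
toy = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0), (0.0, 2.0, 0.0)]
cs = centroid_size(toy)              # every landmark lies sqrt(2) from the centroid
unit = scale_to_unit_size(toy)       # same shape, centroid size 1
```

Because size is removed in this way before shape is analysed, centroid size has to be carried as a separate variable if size is to contribute to sex estimation.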

Quantification of differences between bilateral landmarks was performed in MorphoJ using a hierarchical ANOVA to assess whether asymmetry was negligible compared to individual variation. Shape analyses were performed on the first 20 principal components (PCs) of the Procrustes coordinates, which were shown to provide an accurate summary of total shape variation using the approach of Fadda and Corti [19] and Cardini et al. [20]. Analyses designed to test the “robustness” of the results of the discriminant analysis data (see below) were performed as described in Viscosi and Cardini [21]. More detailed information about this and other statistical analyses can be found in the “Electronic supplementary material”.

Traditional morphometrics

The percentage of variance explained by sex was computed in PAST as the squared correlation coefficient (R²) between each ILD (among all possible pairs of landmarks) and a dummy variable for sex. The ILDs were ranked by R² value; those with an R² equal to or greater than that of centroid size (which summarizes size differences, simultaneously taking all landmarks into account; see “Geometric morphometrics”) were selected and individually used to predict sex. This approach is a crude (albeit simple) method of exploring whether simple cranial linear distance measurements can be used to accurately estimate sex in this particular population. Confidence intervals (CI) for the R² estimates were computed using 1,000 bootstraps in NTSYS-pc [22] following the approach of Manly [23].
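The screening step described above can be sketched as follows. The sketch is illustrative only: it does not reproduce PAST or NTSYS-pc, the percentile bootstrap is an assumed variant (the exact resampling scheme is not specified beyond the citation of Manly [23]), and the measurement values are hypothetical. It also shows why 46 landmarks yield "more than 1,000" pairwise distances.

```python
import math
import random
from itertools import combinations

def interlandmark_distances(landmarks):
    """All pairwise Euclidean distances, keyed by (i, j) landmark indices."""
    return {(i, j): math.dist(landmarks[i], landmarks[j])
            for i, j in combinations(range(len(landmarks)), 2)}

def r_squared(values, sex):
    """Squared Pearson correlation between a measurement and a 0/1 sex dummy."""
    n = len(values)
    mx, my = sum(values) / n, sum(sex) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(values, sex))
    sxx = sum((x - mx) ** 2 for x in values)
    syy = sum((y - my) ** 2 for y in sex)
    return sxy ** 2 / (sxx * syy)

def bootstrap_ci(values, sex, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for R^2 (assumed scheme: resample individuals)."""
    rng = random.Random(seed)
    n = len(values)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [values[i] for i in idx]
        ys = [sex[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples with no variation
        stats.append(r_squared(xs, ys))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Placeholder coordinates: only the number of landmarks matters here.
landmarks_46 = [(float(i), 0.0, 0.0) for i in range(46)]
n_ilds = len(interlandmark_distances(landmarks_46))  # 46 * 45 / 2 = 1035

# Hypothetical single ILD (mm) for six females (0) and six males (1).
toy_ild = [88.0, 90.0, 91.0, 92.0, 93.0, 96.0, 94.0, 97.0, 98.0, 99.0, 100.0, 103.0]
toy_sex = [0] * 6 + [1] * 6
toy_r2 = r_squared(toy_ild, toy_sex)
toy_ci = bootstrap_ci(toy_ild, toy_sex)
```

In a full analysis, the same R² function would be applied to centroid size, and only ILDs whose R² equals or exceeds that value would be retained.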

Statistical analyses of sexual dimorphism

Sexual dimorphism was assessed in PAST using a test for mean group differences with 10,000 permutations [24, 25]. Sex prediction accuracy was estimated by performing discriminant analyses (in PASW 18 [26]) using both a 50 % holdout-sample cross-validation and a jackknife leave-one-out approach.
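As an illustration of the permutation logic, the following sketch tests a difference in group means for a single variable. It is a simplified univariate stand-in, not the PAST routine (which also handles multivariate shape data), and the measurement values are hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation P value for a difference in group means."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of sex labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:
            extreme += 1
    # The observed arrangement is counted as one of the permutations.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical measurements (mm), for illustration only.
toy_females = [90.0, 91.0, 92.0, 93.0, 94.0, 95.0]
toy_males = [97.0, 98.0, 99.0, 100.0, 101.0, 102.0]
p_value = permutation_test(toy_females, toy_males)
```

With 10,000 permutations the smallest attainable P value under this counting convention is 1/10,001, which is consistent with reporting results as P < 0.0001 when no permuted difference reaches the observed one.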

For centroid size, shape and the linear measurement selected as an example of a potentially simple trait for rapid and accurate assessment of sex, the proportion of individuals whose sex was estimated with a posterior probability higher than 0.80 was also computed, as well as the proportion of individuals with typicality probabilities lower than 0.05.
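The posterior probabilities and the leave-one-out procedure can be sketched for a single measurement as shown below. The sketch assumes a two-group linear discriminant with equal priors and a pooled variance; it is not the PASW implementation, it does not compute typicality probabilities, and the data are hypothetical.

```python
import math

def lda_posterior_male(x, females, males):
    """P(male | x) for a one-variable linear discriminant with equal priors."""
    mf = sum(females) / len(females)
    mm = sum(males) / len(males)
    ss = sum((v - mf) ** 2 for v in females) + sum((v - mm) ** 2 for v in males)
    pooled_var = ss / (len(females) + len(males) - 2)
    # Log-odds of male versus female under equal-variance normal models.
    log_odds = (mm - mf) * (x - (mm + mf) / 2) / pooled_var
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)  # numerically stable branch for negative log-odds
    return e / (1 + e)

def leave_one_out(females, males, cutoff=0.8):
    """Return (accuracy, share correctly classified with posterior >= cutoff)."""
    n = len(females) + len(males)
    correct = confident = 0
    for i, x in enumerate(females):
        p_male = lda_posterior_male(x, females[:i] + females[i + 1:], males)
        correct += p_male < 0.5
        confident += (1 - p_male) >= cutoff
    for i, x in enumerate(males):
        p_male = lda_posterior_male(x, females, males[:i] + males[i + 1:])
        correct += p_male > 0.5
        confident += p_male >= cutoff
    return correct / n, confident / n

# Hypothetical measurements (mm) with some overlap between the sexes.
toy_f = [88.0, 90.0, 91.0, 92.0, 93.0, 96.0]
toy_m = [94.0, 97.0, 98.0, 99.0, 100.0, 103.0]
accuracy, share_confident = leave_one_out(toy_f, toy_m)
```

Each individual is classified by a function fitted without that individual, so the reported accuracy is not inflated by testing on the training data.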

Following Sanfilippo et al. [27], the predictive accuracy of shape and size across a series of cutoff points was compared using a receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC); both were computed in PASW 18 [26] from the discriminant analysis posterior probabilities. The 95 % confidence intervals for AUC were also estimated; females were coded 0 and males 1.
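A minimal sketch of how an ROC curve and its AUC can be obtained from posterior probabilities is given below, using the same coding (females 0, males 1). It is not the PASW procedure and does not compute confidence intervals; the scores are hypothetical.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at every observed cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve through the given points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def roc_auc(scores, labels):
    """AUC as the probability that a random male outscores a random female."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical posterior probabilities of "male" for two females and two males.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
auc_rank = roc_auc(toy_scores, toy_labels)                      # 0.75
auc_curve = trapezoid_area(roc_points(toy_scores, toy_labels))  # 0.75
```

The two routes give the same value, which is why the AUC can be read as the probability that a randomly chosen male receives a higher "male" score than a randomly chosen female.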

Results

Overall differences between the left and the right side of the cranium were small and negligible compared to the variation among individuals controlling for sex (P < 0.0001 for both size and shape); the sides were therefore averaged. Further justification of this approach is provided by the correlations between averaged and original distances, which were always higher than 0.9 (range 0.92–0.98).

The percentages of variance (R²) explained by sex with 95 % confidence intervals (1,000 bootstraps) for centroid size, shape and 13 ILD measurements are shown in Table 2. Sex was always highly significant (P < 0.0001 for all size and shape measurements). It is apparent that sex explains a much higher proportion of variance in size (49 %) than in shape (3.2 %). When sampling error is taken into account, the variance explained by centroid size ranges between 35.8 and 64.2 %; for shape, the corresponding figures are 2.7 and 4.8 % (Table 2). The R² values for the 13 ILD measurements are also shown in Table 2; only those measurements with R² values equal to or greater than the centroid size R² are presented (see “Materials and methods”). The percentages of variance explained by the ILD measurements are between 49.3 and 54.3 %; the corresponding confidence limits range from 36.1 to 69.1 % (Table 2).

Table 2 Percentages of variance explained (R²) by sex with 95 % confidence intervals estimated using 1,000 bootstraps and cross-validated sex-predictive accuracy for multivariate shape, centroid size, and the selected ILD measurements

The cross-validated sex prediction accuracy of centroid size, shape and the 13 ILD measurements is presented in Table 2. The multivariate cranial shape data (summarized by the first 20 PCs) are powerful enough to produce a relatively high cross-validated accuracy (83.5 % using the leave-one-out method) despite the large amount of variance unrelated to sex differences (Table 2). The associated mean male and female shape differences are illustrated in Fig. 2; females (relative to males) have a flatter and narrower face, reduced anterior projection of the glabella, proportionally larger orbits, less inferior projection of the mastoid process and a more brachycephalic neurocranium.

Fig. 2 Lateral and anterior images showing the female (light grey) and the male (black) mean shape differences at ×5 magnification. In Procrustes-based shape analysis, the size of the landmark configuration is conventionally standardized to unit size. For this reason, the female cranium appears to be as large as that of the male but with a comparatively smaller face and larger vault

In consideration of the sex prediction effectiveness of centroid size and the ILD measurements, size appears to be, in absolute terms, a more accurate predictor of sex, with jackknife classification accuracies ranging between 82.5 and 88 % (Table 2). Among the ILD measurements that most accurately predict sex, the distance between glabella and zygion (GZ, mean of ILD 2-41 and 2-42) provides an example of the accuracy achievable (87.5–88 %) using a potentially simple linear measurement to rapidly and quantitatively estimate sex in Australian adult crania. GZ was selected because glabella and zygion are two of the most common anatomical landmarks in the ILD measurements outlined in Table 2 and because it has an R² value higher than that of centroid size. Furthermore (and importantly), the distance between these two landmarks is relatively easy to measure in a standardized manner. With shape, centroid size or GZ, very few specimens (less than 5 %) have a typicality probability smaller than 0.05, and at least 60 % (CS and GZ) and up to 80 % (shape) of individuals have posterior probabilities higher than 0.8.

ROC curves were computed for shape, centroid size and GZ (Fig. 3). The three curves are very similar and their AUC confidence intervals largely overlap (Table 3). These analyses suggest, therefore, that differences in accuracy may be negligible. However, it is interesting to note that in terms of absolute cross-validated accuracy, GZ is more accurate than centroid size, which is in turn more accurate than shape; in relation to the AUC statistics, shape is more accurate than GZ, which is more accurate than centroid size. It is evident, however, that the overall difference between the highest and lowest AUC values is small (0.948 to 0.921; Table 3).

Fig. 3 A comparison of ROC curves for shape, centroid size and the GZ ILD

Table 3 AUC of ROC curves computed from the DA posterior probabilities for shape, centroid size and the GZ ILD

Further analyses of the GZ ILD were performed to fully quantify the statistical robustness of this simple linear measurement for predicting sex. These analyses indicate that: (1) sex is correctly predicted in 87.5–88 % of individuals (Table 2), (2) 65 % of individuals are not only correctly sexed but also have a high posterior probability (≥ 0.8), (3) the thresholds for reaching the 0.8 posterior probability of correct classification in the discriminant analysis are GZ ≤ 92 mm for females and GZ ≥ 97 mm for males (Fig. 4), (4) the smallest GZ for males is 88 mm; therefore, all individuals with GZ ≤ 88 mm were females and (5) misclassified individuals are evenly distributed between males (12) and females (13), which indicates that there is no real bias towards one or the other sex in relation to the predictive accuracy of this measurement.
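The relationship between a posterior probability cutoff and a measurement threshold, as in point (3), can be made explicit with a short sketch. It assumes a one-variable linear discriminant with equal priors and a pooled variance, which may differ in detail from the PASW analysis; the function names and the input values in the check are hypothetical. Under these assumptions, the posterior for "male" reaches p where x equals the midpoint of the group means plus pooled variance × ln(p / (1 − p)) / (male mean − female mean), and the female threshold lies symmetrically below the midpoint.

```python
import math

def posterior_male(x, mean_f, mean_m, pooled_var):
    """P(male | x) under equal priors and a common (pooled) variance."""
    log_odds = (mean_m - mean_f) * (x - (mean_f + mean_m) / 2) / pooled_var
    return 1 / (1 + math.exp(-log_odds))

def posterior_thresholds(mean_f, mean_m, pooled_var, p=0.8):
    """Values at which the posterior for female (lower) and male (upper) reaches p."""
    midpoint = (mean_f + mean_m) / 2
    half_width = pooled_var * math.log(p / (1 - p)) / (mean_m - mean_f)
    return midpoint - half_width, midpoint + half_width

# Hypothetical group means (mm) and pooled variance (mm^2), for illustration only.
f_cut, m_cut = posterior_thresholds(mean_f=90.0, mean_m=100.0, pooled_var=25.0)

# Reading the relation backwards with the thresholds reported in the text
# (92 and 97 mm): the half-width of 2.5 mm implies this ratio of pooled
# variance to the difference between the sex means, if the model holds.
implied_ratio = (97.0 - 92.0) / 2 / math.log(0.8 / 0.2)
```

If the reported thresholds are taken at face value, the midpoint between them (94.5 mm) corresponds to the sectioning point, and the implied ratio of pooled variance to mean difference is roughly 1.8 mm; both figures are approximate because the thresholds themselves are rounded.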

Fig. 4 Glabella–zygion ILD measurement (shown on the lateral and anterior skull images) and associated frequency histograms with estimated normal curves for females and males. The approximate range of measurements where sex is estimated with a posterior probability higher than 0.8 (80 %) is emphasized with a light grey (females) and dark grey (males) background

Discussion

It is imperative in forensic anthropology that alternative (non-traditional) approaches to quantifying biological form are given due consideration for their potential forensic applications; it is equally essential that refinements to established approaches are explored. Statistical quantification of any methodology and/or standard is also necessary because such data are integral to evaluating the admissibility of forensic expert testimony based on the methods and practice of science. To this end, this study is a proof of concept: using a sample of Australians, and despite missing information on its composition (specific individual ethnicities), we effectively extracted accurate shape and size sex predictors and statistically quantified the accuracy of using cranial form to predict sex in a Western Australian population. A multifactorial approach was taken, whereby both traditional morphometric and geometric morphometric methods were applied to data acquired from CT-reconstructed crania.

Although in the present study we have focused on evaluating the sex prediction power of three-dimensional landmark configurations representing cranial size and shape, in addition to a single simple linear measurement (GZ), many other ILD size variables were shown to have comparable R² values (see Table 2); they are therefore also potentially useful for estimating sex. This is further supported by the high degree of sex classification accuracy for those ILD measurements (Table 2). It is important to note, however, that as all of those variables estimate the same quantity (i.e. size), they will be similarly affected by sampling; measurement choice, therefore, is likely to be a matter of practicality, as some of the measurements may be easier to acquire accurately using traditional instruments (e.g. spreading and sliding calipers).

Other measurements (among those that have a high correlation with sex) might also be considered; they potentially include some of the variables with R² smaller than the arbitrary threshold (see “Traditional morphometrics”) we applied. For instance, an ILD between nasion and asterion (a measurement that has an R² just below our threshold but may be more likely to be preserved in fragmentary material) correctly classifies sex in more than 82 % of individuals and has an AUC of 0.920, which is virtually identical to that of centroid size. Indeed, that threshold was simply used as an expedient method of reducing the total number of predictors examined in this exploratory study from an original set of more than 1,000 variables. Using the same threshold in a different sample (or a different threshold in our study sample) might reveal other interesting predictors that are simple to measure, thus enabling sex to be accurately estimated with a single linear measurement. The predictive accuracy in a linear discriminant analysis, however, will likely be similar because, as already stated, these are all size measurements and therefore capture the same aspect of cranial variation. Resampling statistics suggest that these size estimators are reasonably stable in terms of their association with sexual dimorphism, as they all produced comparably accurate predictions and similar confidence intervals around the observed R² values.

With regard to the alternative simple linear GZ (and the other ILD) measurement(s), it is important to note that the data presented here represent a proof of concept; these are clearly non-traditional linear measurements that are practical to acquire, are biologically meaningful, and can be used to accurately predict sex. The sex classification accuracy of our single ILD measurements (range 82.5–88 %; Table 2) falls within, if not above, the range reported in selected published research using established single traditional linear measurements (e.g. South African White, 80.2 % [28]; South African Black, 78 % [5]; Cretan, 81.9 % [29]; Japanese, 84.6 % [30]).

Further exploration of the forensic utility of the alternative ILD measurements presented in the present study will ideally require quantification of their intra- and inter-observer error and evaluation of congruence between digital and caliper measurements, in addition to estimates of generalizability to other samples within and outside Australia. Our previous research [3] using CT-reconstructed data and the same data acquisition methods, however, demonstrated that for 33 traditional linear measurements, the level of intra- and inter-observer error is well within accepted standards (e.g. rTEM < 5 %; low TEM and high R values [31–33]). Furthermore, based on our analyses of purposely scanned skulls from the Centre for Forensic Science teaching collection, we have established that there are no significant differences between measurements acquired from CT scans and the same measurements taken from dry skulls using traditional instruments.

The present study has demonstrated highly significant sexual dimorphism in cranial shape in this population; the visualization of the associated mean male and female shape differences also revealed a series of sexually dimorphic features, including the shape of the forehead contour, anterior projection of glabella and inferior projection of the mastoid (amongst others; Fig. 2). Such shape features, which are well established for distinguishing male and female crania [e.g. 34–36], are related to basic biomechanical and functional differences between the sexes [e.g. 37–40]. It is, however, the relative expression of these morphological features, under different genetic and environmental influences, which varies between global populations [18, 41, 42]. This is in essence why contemporary population-specific standards are a necessary requirement in the formulation of a forensic biological profile.

In considering the cross-validated sex prediction accuracy of the three-dimensional configurations of the 46 anatomical landmarks analysed in the present study, it is evident that multivariate cranial shape and size both produced a relatively high accuracy rate (83.5–85.5 %; Table 2). Previous research using multivariate cranial shape data yielded a similar degree of expected classification accuracy (87 %) in a South African population [43]. Other published research in a variety of populations has indicated a wide range of prediction accuracy using cranial size and shape data (e.g. Portuguese, 77.86 % [44]; American Black and White, 89.65 and 86.65 % [4]; Bohemian, 100 % [45]); a point of commonality between all those studies is that the inclusion of centroid size in the analysis considerably increased classification accuracy.

The ROC curves and AUC confidence intervals calculated for shape, centroid size and the GZ ILD measurement demonstrated that differences in overall accuracy are negligible (Fig. 3; Table 3). Taken together, however, the ROC and AUC data suggest that complex measurements like shape might, on the whole, lead to higher confidence in statements about sex discrimination compared to simple size measurements. The use of three-dimensional classifiers is nevertheless a specialization that may not be available in many forensic laboratories. To that end, the use of straightforward linear measurements still yields accurate and statistically robust classifications of sex (Table 2); accordingly, a series of cranial linear standards is currently being devised based on the statistical analyses of an expanded Western Australian sample.

Conclusions

In the present study, significant cranial sexual dimorphism in size and shape was demonstrated in a Western Australian population and the accuracy of using cranial form to predict sex was statistically quantified. It is evident that the cranium is a useful element for estimating sex in this population, with classification accuracy using geometric morphometric data (specifically three-dimensional multivariate descriptors of size and shape) well above 80 %. A single linear measurement (glabella–zygion) is also able to classify Western Australian individuals according to sex with a high degree of expected accuracy (87.5–88 %). The present study has demonstrated the forensic potential of alternative and established traditional approaches for sex estimation in the Western Australian milieu.