Introduction

It has been observed that researchers have to be courageous to investigate or write about sex differences (Furnham, 2017). Even the terminology is a sensitive issue: the terms “man” and “woman” are typically used in reference to gender, whereas the terms “male” and “female” are used in reference to sex. In this paper we shall examine sex differences and refer to male and female.

What is most surprising in this complex research area is the radically different conclusions that researchers and reviewers reach on exactly the same topic. Early intelligence researchers put in considerable effort to ensure tests showed minimal sex differences (Mackintosh, 1986; Mackintosh, 2011), though personality researchers seemed less concerned with evidence of sex differences. However, over the course of the last 20 years there have been a great number of studies concerned with gender differences in personality world-wide (Del Giudice, 2009; Del Giudice et al., 2012; Schmitt, 2015; Schmitt et al., 2008; Weisberg et al., 2011). Some have concentrated on particular group differences using clinical populations, different age groups, and different cultures, or on whether scores change much over time (Furnham & Cheng, 2019). Few have been interested in devising valid tests that minimise sex differences; most have instead tried to establish the size, and more importantly the cause, of the differences that currently exist.

There have also been meta-analyses in the area, some done many years ago. Thus, using tests of an earlier era that are now less well known and less used, Feingold (1994) concluded that males were more assertive and had higher self-esteem than females, who scored higher than males in extraversion, anxiety, trust, and, especially, tender-mindedness. There were no sex differences in social anxiety, impulsiveness, activity, locus of control, and orderliness.

Some argue that even if there are small, actual, verifiable (not artefactual) sex differences, they should not be explored or explained because of the divisive personal and social effects that this can have on both sexes. Others believe there are important explicable reasons for sex differences which warrant scientific description and explanation (Buss, 1995; Eagly, 1995; Furnham, 2017).

Another curiosity in this highly disputed area is the apparent contradiction between popular and scientific writers (Gray, 1992; Pease & Pease, 2002). There are many popular books that portray a simple evolutionary perspective that describes, and even rejoices in, sex differences in almost all human behaviour, but particularly communication, relationships and work. These are contrasted with the measured and cautious academic books and papers that note how complex some of these seemingly simple questions are, and how all the answers require numerous qualifiers (Halpern et al., 2007).

Inevitably there are two strongly competing, opposite forces: those who stress the biology of difference and those who stress the sociology of similarity. The former often suggest that these differences are immutable, though it is accepted that all innate traits can be changed with experience. Whilst nearly everyone acknowledges that we are biopsychosocial beings there are those who see us more as BIOpsychosocial as opposed to biopsychoSOCIAL. This all concerns explaining how and why observed differences occur (Furnham, 2017).

At the heart of the issue is the quality and quantity of sex differences, their cause and consequence. Though the focus has always been on differences, the trend has been to talk of similarities which is what a great deal of the literature suggests. Indeed, it has been argued that the word difference is too easily confused with deficiency. The same is true of the words sex and gender: the former applying to biological distinctions and the latter sociological categories (Furnham, 2017).

As a result, there is a sophisticated and subtle nature-nurture debate that persists across many interrelated disciplines whose practitioners study human behaviour (Furnham & Kanazawa, 2020). Many academics who view gender as the product of socialization and cultural factors are split into opposing camps based on whether there are large or insignificant differences between women and men, or males and females (discussed by Buss & Schmitt, 2011).

There is an extensive and growing literature on how evolution creates systematic variation in personality (Nettle, 2006; Penke et al., 2007). Scholars attempt to explain how culture, biology, and evolution interact to collectively shape personality (Fischer, 2018). Evolutionary psychology posits various sources of sex differences, such as sexual selection (intersexual selection and intrasexual competition) and the theory of obligatory parental investment (Archer, 1996, 2009; Buss, 1995; Geary, 2010). Moreover, evolutionary psychologists attempt to describe and explain how evolutionary processes shape sex differences in personality and the specific reasons why we might expect, or not expect, to see sex differences in specific personality traits (Del Giudice et al., 2012; Lippa, 2010; Schmitt et al., 2008). There is also a salient literature on the proposed cultural origins of gender, more particularly the purported sociocultural factors that shape gender symmetry (Hyde, 2007) or differences (Eagly & Wood, 1999).

Behaviour geneticists decompose total variance in personality and other individual traits into three components: heritability (genes), shared environment (everything that happens within the family that makes siblings from one family similar to each other but different from those from other families), and unshared environment (everything that happens within and outside the family that makes siblings from one family different from each other) (Plomin et al., 2012). Behaviour geneticists contend that the rough rule of thumb when it comes to the determinants of adult personality and other traits is 50–0–50, that is, roughly 50% of the variance in personality, behaviour, and other traits is heritable (influenced by genes), roughly 0% is attributable to the shared environment (what happens within the family and is experienced similarly by all siblings), and roughly 50% to the non-shared environment (what happens inside and outside the family not shared by siblings). Tooby and Cosmides (2005) talk of the standard social science model (SSSM) of the brain. Adherents of the SSSM argue that the brain is a general-purpose device that is almost entirely shaped by culture and that individual differences are explained by social environment and learning (Vrabel & Zeigler-Hill, 2017). Tooby and Cosmides (2005) purport that their integrated model is superior to the SSSM because it integrates both culture and evolutionary biology. However, among evolutionarily-minded scholars, some believe that this distinction represents a false dichotomy (Richardson, 2007; Wallace, 2010). Suffice it to say this remains a highly contentious academic area of research.

Reviewers of this topic can be described as Maximizers vs Minimizers. Maximizers want to find and explain the (many large) differences between the sexes while Minimizers want to emphasize how few real and meaningful differences there are (Furnham, 2017). Part of this debate can be seen in the interpretation of Cohen’s d, which is an indicator of the size of a difference. Whilst there are conventions about how to label a difference as none, trivial, small, medium, large or very large, these are also contested. Most researchers quote Cohen, who suggested that d = 0.2 be considered a ‘small’ effect size, 0.5 a ‘medium’ effect size and 0.8 a ‘large’ effect size. This means that if two groups’ means do not differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant, although these cut-off points have been disputed. However, there is an interesting literature which suggests that d differs according to a number of factors (sub-discipline, sample size) and that in some areas of research a d of .25 to .35 could be considered medium (Hemphill, 2003; Greenwald et al., 2015; Schäfer & Schwarz, 2019).
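As an illustration of the statistic discussed above, the following is a minimal sketch of how Cohen’s d can be computed for two groups using the pooled standard deviation. The function name `cohens_d` and the example scores are hypothetical and are not data from this study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference (group_a minus group_b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scale scores, for illustration only.
males = [52, 55, 49, 60, 54]
females = [50, 53, 47, 58, 52]
d = cohens_d(males, females)
```

A positive value indicates that the first group scored higher; the magnitude expresses the mean difference in pooled standard deviation units, which is what the conventional labels refer to.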

To what extent do these results matter? In the applied context, it might be useful to contrast the approach taken with IQ tests, which were originally designed to eliminate, as much as possible, sex differences. This does not seem to have been done by personality test creators. Most researchers do not worry about sex differences on IQ tests, because in essence there are none, but perhaps we should worry about personality tests, where such differences could clearly have an impact in selection contexts. It is particularly interesting when different personality tests essentially measure the same trait but yield large d differences.

This Study

This study is on sex differences in personality. We report data on six questionnaires, four of which are well known and have overlapping dimensions (like Conscientiousness). We were fortunate to have large data sets on each of these, though we are aware that there are other important and well-used personality tests, some of which measure other dimensions (e.g. HEXACO). We are also aware that one test we report on, namely the MBTI, has been heavily criticized academically, though it is still very frequently used in applied and consulting settings, and thus we examine it along with the others (Barbuto Jr, 1997; Furnham, 2018). We also examine the MVPI (see below), which strictly speaking measures motives and values rather than traits, but yields some interesting and important results.

We believe this study has various unique features. First, while it replicates many other studies, it does so in overpowered samples often comprising many thousands of adults. Second, it examines the differences in six different well-known tests, whereas previous studies nearly all examined only one test. This allows the possibility of looking at differences between tests that measure the same construct (e.g. NEO Neuroticism, HPI Adjustment, HPTI Adjustment). Third, for two of the well-known measures (NEO; HPI) we were able to examine differences at both domain and facet level. Fourth, for two questionnaires there were two large samples so that it was possible to examine replications. In all studies the respondents were first-language English-speaking adults.

Usually test manuals provide information on group differences such as ethnicity and gender. Sometimes this data is very out-of-date and restricted to one continent. Surprisingly, the N is also often modest. Moreover, it seems to be the case that test publishers are eager to show as few group differences as possible as this may influence potential buyers of the test (Furnham, 2018). For each of the tests used in this study the manuals were consulted to examine the data on sex differences. Each provides good evidence of the internal and test-retest reliability of the test scores. This led to the development of the hypotheses, though the major concern was the size of the differences.

Based on an observation of various previous studies in personality and evolutionary psychology and the test manuals, the following hypotheses were derived (Del Giudice, 2009; Del Giudice et al., 2012; Furnham, 2008; Schmitt et al., 2017):

  1. MBTI: Males would score significantly higher than females on Thinking vs Feeling (H1) and Judging vs Perceiving (H2) (Furnham, 1996, 2018).

  2. NEO-PI-R: Males would score significantly lower than females on Neuroticism (H3) and Agreeableness (H4) (Costa Jr. et al., 2001).

  3. HPI: Males would score significantly higher than females on Adjustment (H5) and Ambition (H6) but lower on Interpersonal Sensitivity (H7) (Hogan et al., 2007).

  4. MVPI: Males would score significantly higher than females on Commerce (H8), Power (H9) and Science, but lower than females on Aesthetics (H10) and Altruism (H11) (Hogan et al., 2007).

  5. HDS: Males would score significantly higher than females on Bold (H12) and Mischievous (H13) but lower on Cautious (H14), Dutiful (H15) and Excitable (H16) (Hogan et al., 2007).

  6. HPTI: Males would score higher on Adjustment (H17), Risk Approach (H18) and Competitiveness (H19) (MacRae & Furnham, 2020).

Method

Participants

There were seven different samples, most with over 1000 participants, used in this study. The focus was on sex differences and the number of males and females is shown in each table. Participants ranged in age from 24 to 69 years, with the majority being in their late thirties. For each questionnaire there was no significant difference in age between males and females. In most of the samples the majority (over 50%) were graduates, and once again it was established that there was no difference in education level between males and females. Most were working adults in supervisory and management positions from a very wide range of organisations. We did not have data on the participants’ socio-economic status or their work history. Because the participants were nearly all at middle-manager level in their organisations there was a bias towards males, who were often twice as numerous as females (see study limitations). Participants self-identified as either male or female: there was very little missing data for this question.

Instruments

  1. The Myers-Briggs Type Indicator, Form G (MBTI; Myers & McCaulley, 1985). This is a Jungian-based inventory that is composed of 94 forced-choice items that yield scores on each of the eight factors as well as the famous four dimensions: Introversion-Extraversion, Sensation-Intuition, Thinking-Feeling and Judging-Perceiving. Respondents are classified into one of 16 personality types based on the largest score obtained for each bipolar scale. The test provides linear scores on each dimension which are usually discussed in terms of types based on cut-off scores. The Myers-Briggs Type Indicator has been the focus of extensive research and substantial evidence has accumulated suggesting the inventory has satisfactory concurrent and predictive validity and reliability.

  2. The NEO Personality Inventory Revised (NEO-PI-R; Costa & McCrae, 1992). This questionnaire is a 240-item measure designed to assess the Five Factor Model (FFM) domains (Neuroticism, Extraversion, Openness, Agreeableness and Conscientiousness) as well as six primary traits/facets for every domain. The test takes approximately 35 min to complete. Research has provided evidence for the validity and the reliability of this instrument (McCrae et al., 2011). In the current study only the five domains and not the traits were taken into consideration.

  3. Hogan Personality Inventory (HPI; Hogan et al., 2007). The HPI consists of 206 items that are used to produce seven personality traits and six criterion scores. Participants respond to each question on a five-point Likert scale. The scales’ internal consistency and test-retest reliabilities are well established, with both the manual and independent research citing internal consistency alphas of over .71 and test-retest reliability between .74 and .86 (Hogan et al., 2007).

  4. Hogan Development Survey (HDS; Hogan et al., 2007). Similar to the HPI, the HDS is a contextualised measure as it seeks to identify dysfunctional behaviours that impair work performance. The HDS taxonomy is closely related to the classical personality disorders (PD) described by the DSM-IV-TR (American Psychiatric Association, 2015). The HDS adopts a dimensional model, as opposed to a categorical one. The HDS consists of 154 items that participants complete by stating either their agreement or disagreement. The manual reports internal reliabilities ranging between .50 and .79 (average alpha = .67) and test-retest reliabilities between .58 and .87 (average = .75).

  5. The Motives, Values, Preferences Inventory (MVPI; Hogan et al., 2007) measures 10 Motives/Preferences. Each scale is composed of five themes: a) Lifestyles, which concern the manner in which a person would like to live; b) Beliefs, which involve ‘shoulds’, ideals and ultimate life goals; c) Occupational Preferences, which include the work an individual would like to do, what constitutes a good job, and preferred work materials; d) Aversions, which reflect attitudes and behaviours that are either disliked or distressing; and e) preferred Associates, which include the kind of persons desired as co-workers and friends. MVPI scores are quite stable over time, with test-retest reliabilities ranging between .64 and .88 (M = .79). More than 100 validation studies have been conducted on the MVPI, with results indicating that the inventory is effective in predicting job performance and outcome variables such as turnover.

  6. High Potential Trait Indicator (HPTI; MacRae & Furnham, 2014). The HPTI is a measure of personality traits, specifically within a workplace context. It comprises six factors: Adjustment, Curiosity, Ambiguity Acceptance, Conscientiousness, Courage, and Competitiveness. The inventory is 78 items in length. Each trait is converted into a percentile rank based on the normal distribution of the sample. Various papers have been published using this measure (Furnham & Treglown, 2018).

Procedure

Participants were tested by three well-established British-based psychological consultancies over a period of 10 to 16 years, where participants attended assessment centres and their data was logged. The same participants tended to complete the MBTI and NEO-PI-R, for which the data were obtained from one consultancy; the three Hogan instruments (HPI, HDS, MVPI), for which data were obtained from a second consultancy; and the HPTI, for which data were obtained from the third consultancy. Testing was done either in assessment centres or online as part of a recruitment or development process, and all participants were given full feedback on their test performance. They came from a wide range of organisations in the private and public sector. Participants agreed to take part in research and anonymised data was used in the analysis with their permission. Data sets were given to the authors for analysis with all tests already scored, which means we could not calculate alphas, though we have no reason to believe there were any problems with them (Hogan et al., 2007). Ethics permission was requested and received (CEHP: 2017; 514).

Results

Data was first screened for random responding, missing data, and other errors. In each analysis we started with MANOVAs, each of which was significant, followed by one-way ANOVAs. Bonferroni corrections (p < .01) were made, which meant a number of analyses (12 in all) ceased to be significant. In the interpretation we only focused on results where p < .001, though our primary focus was on the Cohen’s d score. We assumed that d < .20 was a small difference and d < .50 a medium-sized difference.
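The banding of effect sizes used in the paragraphs that follow can be sketched as below. This is a minimal illustration assuming three bands (d < .20, .20 to .50, and .50 or above); how values falling exactly on a boundary were treated is not stated, so the boundary rule here is an assumption, as are the function names and the example values.

```python
from collections import Counter


def band_effect_size(d):
    """Assign an absolute Cohen's d to one of three bands (boundary handling assumed)."""
    size = abs(d)  # the sign only indicates which sex scored higher
    if size < 0.20:
        return "< .20"
    if size < 0.50:
        return ".20 to .50"
    return ">= .50"


def tally_bands(d_values):
    """Count how many effect sizes fall in each band."""
    return Counter(band_effect_size(d) for d in d_values)


# Hypothetical d values, for illustration only.
example = [0.05, -0.14, 0.32, 0.33, -0.53]
counts = tally_bands(example)
```

Tallying absolute values in this way separates the question of how large a difference is from the question of which sex scored higher.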

  1. MBTI

Table 1 shows that males scored higher than females on all dimensions, particularly Thinking-Feeling where the d was in the .20 < d < .50 range. Males scored higher in Sensing and higher on Judging, which is consistent with the literature. This confirms H1 and H2.

Table 1 Data from the Myers-Briggs Type Indicator (MBTI)

  2. NEO-PI-R

Table 2 shows that all Big Five factors showed significant sex differences. Females scored higher on four of the five traits, particularly Openness and Neuroticism, but lower on Conscientiousness. All but three of the facets revealed significant differences. With few exceptions the facets within each domain showed consistent differences. Exceptions were Assertiveness and Excitement Seeking in Extraversion, where males scored higher than females. Of the 30 d scores, 17 were < .20, 16 were .20 < d < .50 and only one > .50 (Feelings in the Openness factor). This confirms H3 and H4.

Table 2 Data from the NEO-PI-R

  3. HPI

Table 3 shows that all seven domain factors were significant. Males scored significantly higher on Adjustment, Ambition, Sociability and Inquisitive, but lower on Interpersonal Sensitivity, Prudence, and Learning Approach. Once again, the facets within each domain score tended to be consistent both in direction and significance. Of the 50 analyses, 32 showed d scores < .20, 17 were .20 < d < .50 and one > .50 (Curiosity in the Inquisitive factor). This confirms H5, H6 and H7.

Table 3 Data from the Hogan Personality Inventory (HPI)

  4. MVPI

Table 4 shows the results from the two different samples. The results were reasonably consistent. In both samples males scored significantly higher than females on Recognition, Power, Commerce and Science but lower on Hedonism, Altruism, Affiliation, and Aesthetics. Combining the two samples, 8 of the d values were < .20, 10 were .20 < d < .50 and two were > .50 (Commerce in Sample 1 and Science in Sample 2). This confirms H8 to H11.

Table 4 Data from the Motives and Values Preference Inventory (MVPI)

  5. HDS

Table 5 also shows data from two different samples, which once again were reasonably consistent. In both samples females were significantly higher on Excitable (Borderline), Cautious and Dutiful, while males scored higher on Sceptical, Reserved, Bold, and Mischievous. Combining the two samples, of the 22 differences 16 were d < .20 and 6 were .20 < d < .50. This confirms H12 to H16.

Table 5 Data from the Hogan Dark Side Inventory (HDS)

  6. HPTI

Table 6 shows the results of gender difference tests for the six HPTI traits. Significant differences were noted for all six traits, with males scoring higher on Conscientiousness, Adjustment, Risk Approach, Ambiguity Acceptance, and Competitiveness, whereas female participants scored higher on Curiosity. Effect sizes revealed that only Risk Approach (d = .32) and Ambiguity Acceptance (d = .33) had small effect sizes, whilst the rest can be regarded as negligible (d < .20).

Table 6 ANOVAs and effect sizes of gender differences in the six HPTI traits

Discussion

The results of this study can be interpreted in various ways. A sex-difference maximiser would note that a cursory glance at the six tables shows that the vast majority of ANOVAs (over 80%) show significant sex differences, many at p < .001, illustrating the fundamental point that there are many and important sex differences in personality, using a variety of measures and assessed at both the domain and facet level. On the other hand, the minimiser might take comfort in the effect size data (Cohen’s d) and note that there are very few large or even medium effect sizes, though this depends on how size is categorised.

Nearly all the hypotheses based on the previous literature were confirmed. Overall, the MBTI showed relatively small differences except in the Thinking-Feeling variable, which has been the topic of much debate. It has been suggested (and refuted) that this factor is essentially measuring Neuroticism, hence the higher score for females, which is consistent with the previous literature (Furnham, 2018).

The results from the NEO-PI-R confirmed some previous studies which showed males higher only on Conscientiousness but lower on Extraversion, Agreeableness, Openness and Neuroticism. The biggest domain differences were for three traits where females scored higher than males. The most unusual finding was the big difference on Openness (which was also shown in the HPTI trait of Curiosity), where there is a limited literature and few speculations on sex differences. The smallest and fewest differences were on Conscientiousness and its facets. The facet analysis gave some indication of variability within domains, but there were few cases where the differences went in the opposite direction. Two exceptions were the facets of Assertiveness and Excitement Seeking in Extraversion where, as in many other studies, males scored higher than females. Interestingly, the highest d was for the Openness facet Feelings (d = .53), which reflects the finding in the MBTI (Furnham, 1996).

The results of the HPI confirm previous studies, with the biggest domain d’s being for Adjustment, Ambition and Curiosity with males scoring higher, and Interpersonal Sensitivity with females scoring higher. Again, most of the facet scores went in the same direction though they did occasionally differ greatly in size: compare Empathy and Calmness in Adjustment.

The results of the replicated MVPI study showed two things: where there were significant differences the results went in the same direction, and the biggest differences lay in males’ interest in power, business and science, values associated with entrepreneurship and work success (Furnham, 2018). Further, as in previous studies, females scored higher in Altruism and Aesthetics.

The findings from the HDS show similar outcomes in the two studies. When grouping the eleven traits into the recommended tri-partite system the results are clear: females tend to have higher scores on those traits in the moving away from others category (Cautious but not Reserved) and the moving toward others category (Dutiful but not Diligent), while males score higher on traits in the moving against others category (especially Mischievous).

The final scale showed two of the six HPTI scales with relatively large differences: males score higher in Risk Approach and Ambiguity Acceptance which has been shown many times before. Although there was a sex difference on Competitiveness, the size of this was modest.

One interesting comparison is between the scores of different tests which essentially (claim to) measure the same construct. Thus, the sex difference d for Neuroticism in the NEO-PI-R was .35, Adjustment in the HPI was .30 and Adjustment in the HPTI was .14. Similarly, the d for Conscientiousness in the NEO-PI-R was .12 and in the HPTI was .11, and for Prudence in the HPI was .06. Equally, the sex difference d in Agreeableness in the NEO-PI-R was .32 and in Interpersonal Sensitivity in the HPI was .30. Therefore, the results seem to suggest similar sex differences on scales that differ in length and question wording but measure the same phenomenon. There were, however, exceptions: females were more Extraverted and Open on the NEO-PI-R, but less Sociable and Curious on the HPI.

One interesting issue concerns revisiting each question and facet to determine whether there was any inherent sex bias in the question wording and whether, if such questions were removed, the overall d would decline. This is not an issue of attempting to deny or reduce differences that exist but rather of trying to reduce artefacts arising from question selection. Certainly, with changes in society, particularly with reference to sex and gender differences, questionnaire wording could cause both offence and differences in interpretation unless it is constantly updated.

Another issue to arise from this study is the great variability in the facet score items and labels that are essentially measuring the same dimension. Compare for instance the six Openness facets of the NEO-PI-R with six facets of the HPI. Given these labels it is expected that these two measures are relatively weakly correlated and measuring rather different factors.

Finally, accepting that there are some real, biologically based, stable sex differences, as opposed to socialised gender differences, in personality traits, the question arises as to why they occur. Results such as these cannot inform the nature-nurture debate, with (most) evolutionary psychologists offering a cohesive (and for some convincing) argument as to why there are replicable, consistent and cross-cultural findings. Minimizers who reject the “biology as destiny” approach attempt to explain all these differences in terms of primary and secondary socialisation (Buss, 1995). However, in a big review study Schmitt et al. (2017) concluded: “Social role theory appears inadequate for explaining some of the observed cultural variations in men’s and women’s personalities. Evolutionary theories regarding ecologically-evoked gender differences are described that may prove more useful in explaining global variation in human personality” (p. 45).

This study, like all others, has limitations. All participants were British adults taking part in a compulsory assessment centre. Though they might have been tempted by impression management, there is no reason to suspect that there were sex differences in this behaviour. The fact that males outnumbered females reflected the profile of middle managers in those organisations, which spanned all sectors, public and private. The sample was thus biased in terms of age, education and class, and the question remains whether a more representative sample of people from a wider age range and social class background would have shown more or fewer sex differences. Furthermore, nearly all participants were from Europe and the effects of culture were thus not explored. It could be that sex differences are smaller in more Western, individualistic, democratic, egalitarian, and higher gender-parity cultural contexts than in more traditional, developing countries.

It has been argued that personality changes over time and it may be that sex differences and similarities in personality are different for young, middle-aged and older participants (Roberts et al., 2001). Finally, there is always the possibility that there are sex differences in self-report behaviours and biases, such that females exhibit more humility and males more hubris, and that therefore some observed differences are due more to other factors and artefacts than to actual personality differences.