1 Introduction

Differences in cognitive skills are strongly related to differences in wages between females and males (Murnane et al. 1995; Murnane et al. 2000; Grogger and Eide 1995; Weinberger 1999; Altonji and Blank 1999). Specifically, males have higher cognitive test scores than females on average (Strand 2003; Bell et al. 2006; Upadhayay and Gurugain 2014) and these scores contribute to higher wages for the former than for the latter (Hedges and Nowell 1995; Paglin and Rufolo 1990). Furthermore, differences in cognitive skills could contribute to the gender wage gap not only because of differences in means but also because of differences in returns: scoring an additional point on a cognitive test results in a larger gain in terms of wages for a male than for a female.

Recently, studies have focused on the relation between socioemotional skills and productivity. The main finding is that there is a positive connection between wages and certain socioemotional skills. Early work by Marxist economists showed that employers in low-skill labor markets value docility, dependability, and persistence more than cognitive ability or independent thought (see Bowles and Gintis (1976) and Edwards (1976)). Recent studies (see Heckman et al. (2006), Cunha and Heckman (2007), and Hanushek and Woessmann (2008)) also support this finding with evidence of a positive relation between socioemotional test scores and labor market outcomes such as wages or occupational choice.

Regarding how people form socioemotional skills, some studies propose that test scores are bad proxies of abilities due to measurement errors and endogeneity with schooling. The important features that drive wages are latent abilities. However, to the best of our knowledge, few studies exist that address the role of latent socioemotional and cognitive abilities in the gender wage gap. Moreover, they do not address this question for the developing world and for Latin American countries in particular.

Based on the latent ability model proposed by Heckman et al. (2006), we estimate the contribution of socioemotional skills to the gender wage gap. In contrast to Heckman et al. (2006), our identification strategy is based on panel data collected by the Young Lives Study (YL) for Peru. We argue that dependence through time between test scores is due to latent abilities. After estimating latent abilities, we use them in an Oaxaca–Blinder decomposition to explore the role that abilities and their returns play in explaining the gender wage gap. Moreover, we estimate a joint model of schooling, employment, occupational choice, and earnings to disentangle the effect of inter-gender differences in abilities on each of these choices. Since the YL database lacks information on wages, we estimate latent abilities as linear combinations of characteristics common to both the YL and the Peruvian Skills and Labor Market Survey (ENHAB). The latter is a recent survey in Peru that gathers data on cognitive and socioemotional test scores, individual characteristics, educational trajectories, and wages.

The preliminary results show that there is a significant gender wage gap in Peru. In fact, in a model with measured abilities, we find significant inter-gender differences in the endowment of cognitive skills but no relevant differences in terms of socioemotional abilities (in endowment or returns). Estimating the joint model shows that differences in socioemotional abilities between men and women are important, but only for the choices made prior to the determination of wages. Measured cognitive skills are relevant in determining years of schooling and occupational choice, and measured socioemotional ability is relevant for wages and employment. Applying our proposed estimation procedure shows that latent ability is highly statistically significant for mean wages and also accounts for inter-gender differences. In particular, differences in the endowment of socioemotional abilities contribute (negatively) to the gender wage gap. However, we do not find any significant inter-gender differences in the returns to cognitive abilities. Moreover, the estimation of the joint model sheds light on the fact that the observed gender wage gap is mainly attributable to differences in occupational choices. Cognitive and socioemotional abilities have different values for men and women in terms of schooling, employment, and wages; but basically men earn higher wages because their equilibrium assignment is to occupations with higher rewards for the cognitive skills with which they are better endowed.

The paper is organized as follows: the following section presents a review of the literature. Section 3 presents our empirical baseline model of wages in terms of abilities as well as the Oaxaca–Blinder decomposition. In Section 4, we describe the data and sample. Section 5 presents our econometric implementation of estimating latent abilities. Section 6 presents the results. Section 7 concludes.

2 Literature review

For decades, researchers have focused on studying the relation between abilities and labor market outcomes. These studies are mostly related to cognitive test scores. Murnane et al. (1995) assess the role of the mathematics skills of graduating high school seniors in their wages at age 24 and find a positive and increasing effect of cognitive skills on wages (especially in years closer to graduation). In a more recent study, Cunha et al. (2006) state that cognitive ability affects the likelihood of acquiring higher levels of education and advanced training as well as the economic returns to these activities. Other studies have focused on explaining the black-white wage gap, and they confirm the importance of the cognitive factor: large observed cognitive gaps between black and white workers in their late twenties are an important determinant of the wage gap (Neal and Johnson 1996) and unobserved cognitive ability is the most important variable in explaining the racial differences in wages (Urzúa 2008).

Somewhat more recently, studies have directed their attention towards socioemotional abilities and their relation to labor market outcomes. Early work by Bowles and Gintis (1976), Edwards (1976), and Klein et al. (1991) shows that socioemotional skills such as dependability and persistence are highly valued by employers. Other studies such as that of Heckman et al. (2006) also support this finding with evidence of a positive relation between socioemotional test scores and outcomes in the labor market. More recently, social skills are one type of non-cognitive ability that has received attention. Specifically, Deming (2017) finds that a pre-market measure of self-reported sociability, the number of clubs and sports in which the respondent participated in high school, is a significantly more important predictor of labor market success for a more recent cohort than for an earlier one: the association between social skills and the probability of full-time work increased more than fourfold and the return to social skills among full-time workers grew significantly. Additionally, as in Weinberger (2014), Deming (2017) finds that cognitive and social skills are complements but does not find complementarity between cognitive skills and the widely used measures of non-cognitive skills.

The research that explores gender inequality at first focused on studying the role of cognitive skills in explaining gender differences in labor market outcomes. Paglin and Rufolo (1990) report that most of the gender gap in average starting salaries for college graduates is between, rather than within, detailed college majors and that differences in starting salaries across majors have a strong positive relation with average math scores. Moreover, the gender difference in math scores accounts for around 20% of the gender gap in the starting salaries of college graduates. Likewise, Landes (1977), Ragan and Smith (1981), and Mincer and Ofek (1982) estimate that the accumulation of human capital, occupational selection, and turnover account for a large portion of the gender wage gap.

However, because a sizeable component of the gender wage gap continues to go unexplained, researchers started reaching beyond the confines of traditional economic models for explanations. One of these is the hypothesis that psychological attributes or non-cognitive skills can explain gender differences in the labor market. The evidence indicates that women are more risk averse than men, that women may systematically underperform relative to men in competitive environments, and that women have a higher level of altruism and stronger preferences for redistribution (Bertrand 2011). Despite this, Blau and Kahn (2017) make some clarifications with respect to the possible psychological gender differences: first, even if men and women do differ on average, it is not possible at this point to know the role of nature versus nurture; and second, gender differences in non-cognitive skills do not necessarily all favor men.

Some authors detect considerable effects from socioemotional abilities on the gender wage gap. For instance, Grove et al. (2011) find that personality traits and preferences regarding family, career, and jobs account for a quarter of the “explained” gender wage gap. Similarly, Fortin (2008) finds that socioemotional factors account for a small but nontrivial part (about 2 log points) of the gender wage gap of workers in their early thirties. This magnitude is larger than that of educational attainment and cognitive skills (math scores), which account for about 1.2 log points, and almost as important as that of labor market experience and job tenure, which account for 2.4 log points.

Nonetheless, other research finds modest effects of socioemotional abilities. Manning and Swaffield (2008), with respect to the gender gap in early career wage growth in the UK, conclude that the whole set of psychological factors they use (risk aversion, competitiveness, self-esteem, other-regarding behavior, and career orientation) can explain at most 4.5 log points of the roughly 25 log point gap in earnings that has built up between men and women 10 years after entering the labor market, whereas human capital factors account for about 11 of these 25 log points. Hence, as Bertrand (2011) highlights, the research on the effect of gender psychological differences is clearly just in its infancy, is far from conclusive, and has many contradictory findings.

It is also remarkable that most of the research regarding the role of cognitive and socioemotional test scores in labor market outcomes has focused only on developed countries. To the best of our knowledge, only a few studies address this issue in developing countries, and in Latin American countries in particular.

Díaz et al. (2012) use the ENHAB on a sample of the working age population and apply an instrumental variable approach to address issues regarding the endogeneity from schooling. They find that schooling and both cognitive and socioemotional skills are valued in the Peruvian labor market: one standard deviation increase in the years of schooling generates an increase of 15% in earnings, while a change in cognitive skills and socioemotional skills of a similar magnitude generate a 9 and 5–8% increase in earnings, respectively. However, their strategy considers that skills only affect wages directly and through schooling. Unlike us, they do not consider that skills also affect the decision to work and the kind of occupation chosen. Moreover, Urzúa et al. (2009) go a step further by analyzing gender discrimination in the labor market of Chile with a rich data set. They follow the studies that estimate labor market models with multiple sources of unobserved heterogeneity through cognitive and socioemotional abilities. Nonetheless, due to data limitations, they consider only one underlying source of unobserved heterogeneity as a combination of cognitive and socioemotional abilities. Their results show the existence of gender gaps in the labor market variables such as experience, employment, hours worked, and hourly wages that cannot be explained by observable or unobservable characteristics, or the underlying selection mechanisms that generate endogeneity. As far as we know, this is the only study that addresses the role of cognitive and socioemotional (latent) abilities (accounting for the endogeneity from schooling) in the gender wage gap in a developing country. So, our attempt to uncover the importance of latent abilities to the gender wage gap will be one of the first on developing countries.

Other ways to identify the effect of abilities on labor outcomes are by using fixed effects estimates or relying on instrumental variables. A benefit of using twin fixed effect estimates is that they control for genetic factors and background perfectly. Thus, there is no need to rely on test scores to control for ability. However, we do not have a dataset that lets us exploit this type of fixed effects. We do have information on test scores.

The problem with test scores is that they do not reflect the real (latent) abilities because they have a measurement error (Heckman et al. 2006). Therefore, using them in wage and schooling regressions is troublesome. Conditioning on schooling, both cognitive and socioemotional tests predict wages. However, schooling is a choice variable and thus its endogeneity must be addressed. Omitting schooling from the wage equation increases the correlation of both abilities with wages. Estimates comprise both the direct (on productivity) and indirect (on schooling) effects of abilities on wages. Nonetheless, there is an important difference between cognitive (and socioemotional) tests and achievement tests. Although IQ is well set by age 8, the research has demonstrated that achievement tests are quite malleable and increasingly so with schooling. This malleability creates a reverse causality problem.

To overcome this, Hansen et al. (2004) develop two methods for estimating the effect of schooling on achievement test scores that control for the endogeneity of schooling by postulating that both schooling and test scores are generated by a common unobserved latent ability. Their analysis shows that schooling has small equalizing effects on measured test scores, especially for those with low ability and low levels of schooling. In the same line, Behrman et al. (2008) find that the rates of return are much higher for investing in primary school than for investing in middle schools and at the primary school level, the returns are higher for expanding low-quality schools than for increasing the quality of existing schools.

More recently, Behrman et al. (2011) estimate the effect of cognitive skill and physical health on wages. Their results, when accounting for this endogeneity, show that only the cognitive skill has an impact on wages; health does not. This result opposes earlier evidence that showed positive associations between health human capital and wages. In this respect, we are aware that wages and other labor outcomes are not only driven by abilities, but the physical characteristics and health of an individual also affect productivity. Unfortunately, even though we acknowledge the importance of health to productivity, we cannot contribute to this discussion due to data limitations.

To summarize, the main contribution of our analysis is exploring the role that latent cognitive and socioemotional abilities and their returns in terms of wages play in explaining the gender wage gap in a developing country. We achieve this by using a data combination method and by modeling the endogeneity present in the job decision process.

3 Model

The model is based on Heckman et al. (2006), Cunha et al. (2010), and Cunha and Heckman (2008). Latent cognitive and socioemotional abilities are two underlying factors. Conditioning on observables, these factors explain all the dependence across choices and outcomes. Individuals make decisions regarding schooling, working, and occupation. If the individual works, he or she earns a wage.

3.1 The model for wages

As in Heckman et al. (2006), we let \(f^C\) and \(f^S\) denote the latent cognitive and socioemotional abilities, respectively, and assume they are independent. The logarithm of wages is given by:

$$LnW = \beta _YX_Y + \alpha _Y^Cf^C + \alpha _Y^Sf^S + e_Y$$
(1)

where \(X_Y\) is a vector of observed controls, \(\beta _Y\) is the vector of returns, \(\alpha _Y^C\) and \(\alpha _Y^S\) are the loadings on (returns to) the latent cognitive and socioemotional abilities, respectively, and \(e_Y\) represents the error term. In order for Eq. (1) to be identified, we assume that the error term is independent of all other factors. However, unobserved factors that affect wages and productivity, such as health, could be correlated with cognitive and non-cognitive skills. In this context, we are cautious in interpreting \(\alpha _Y^C\) and \(\alpha _Y^S\) as causal effects. Instead of estimating the total effect, our analysis focuses on estimating the contribution of skills to the variation in wages.

The identification strategy is similar to that in Heckman et al. (2006). We restrict the latent cognitive ability to only affecting cognitive measures and the latent socioemotional ability to only affecting socioemotional measures. The model of the cognitive measure is:

$$C = \beta _CX_C + \alpha _Cf^C + e_C$$
(2)

Likewise, the model of the socioemotional measure is:

$$S = \beta _SX_S + \alpha _Sf^S + e_S$$
(3)

Our assumptions indicate that, conditional on the X variables, the dependence across time of measurements comes from \(f^S\) and \(f^C\).
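
To illustrate why repeated measurements are informative about the latent factor, the following minimal simulation sketch generates two rounds of a cognitive measure that share the same \(f^C\) but have independent errors, as in Eq. (2). The loading value, the sample size, the unit variances, and the omission of the X controls are all assumptions made for illustration only; they are not taken from our data or estimates.

```python
import random

random.seed(0)
ALPHA_C = 0.8   # hypothetical loading of the cognitive measure on f^C
N = 20000       # hypothetical sample size


def cov(a, b):
    """Sample covariance between two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Latent cognitive ability: fixed over time, variance normalized to one.
f_c = [random.gauss(0, 1) for _ in range(N)]

# Two rounds of the same measure: same f^C, independent measurement errors.
c_round2 = [ALPHA_C * f + random.gauss(0, 1) for f in f_c]
c_round3 = [ALPHA_C * f + random.gauss(0, 1) for f in f_c]

# The errors are independent across rounds, so the covariance between rounds
# should be close to alpha_C^2 * Var(f^C).
cov_across_rounds = cov(c_round2, c_round3)
print(round(cov_across_rounds, 2), "vs alpha_C^2 * Var(f^C) =", round(ALPHA_C ** 2, 2))
```

Under these assumptions, all of the covariance between the two rounds is attributable to the latent factor, which is the sense in which the dependence across time comes from latent ability.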

3.2 The model for schooling

Each individual chooses the level of schooling that maximizes his or her lifetime expected benefit. Following a linear-in-the-parameters specification, and letting \(I_{S_c}\) represent the net benefit associated with schooling level \(S_c\), we have:

$$I_{S_c} = \beta _{S_c}X_{S_c} + \alpha _{S_c}^Cf^C + \alpha _{S_c}^Sf^S + e_{S_c}$$
(4)

where \(S_c\) is the schooling level chosen by the individual among \(\overline {S_c} \) possibilities; \(X_{S_c}\) is a vector of observed variables that affect schooling; \(\beta _{S_c}\) is its associated vector of parameters; \(\alpha _{S_c}^C\) and \(\alpha _{S_c}^S\) are the factor loadings associated with cognitive and socioemotional latent abilities, respectively; and \(e_{S_c}\) represents an idiosyncratic component assumed to be independent of \(f^S\), \(f^C\), and \(X_{S_c}\). The error terms for each schooling level are mutually independent.

The observed schooling level corresponds to:

$$D_{S_c} = \arg \max _{S_c \in \{ 1, \ldots ,\overline {S_c} \} }\left[ {I_{S_c}} \right]$$
(5)

We consider two educational levels: (i) complete tertiary education, and (ii) up to complete secondary education. We take the relevant schooling decision to be whether or not to complete tertiary education, so that this decision is consistent with the occupational choice between a high-skilled and a low-skilled job (Section 3.4). We use an indicator variable \(D_{S_c} = 1(I_{S_c} > 0)\) to indicate the choice of attaining complete tertiary education.

3.3 The model for employment

Let \(I_E\) denote the net benefit associated with working. Assuming a linear-in-the-parameters specification, we have:

$$I_E = \beta _EX_E + \alpha _E^Cf^C + \alpha _E^Sf^S + e_E$$
(6)

where \(\beta _E\), \(X_E\), \(\alpha _E^C\), \(\alpha _E^S\), and \(e_E\) are defined as in the schooling model. We then observe whether the individual is employed, which corresponds to a binary variable \(D_E = 1(I_E > 0)\) that equals one for employment and zero otherwise. The error term is orthogonal to the control variables.

3.4 The model for occupational choice

Let \(I_0\) denote the utility that is associated with choosing a white collar occupation (where the alternative is a blue collar occupation). We are not using the conventional definition of white and blue collar work (type of occupation). Here, white collar means working in a job that requires higher skills, and we define such a high-skilled job as one that requires a tertiary education. This definition makes sense for Peru, where skilled labor is scarce: only 15% of individuals graduate with a tertiary education (Ministry of Education of Peru). We assume the following linear model for \(I_0\):

$$I_0 = \beta _0X_0 + \alpha _0^Cf^C + \alpha _0^Sf^S + e_0$$
(7)

where \(\beta _0\), \(X_0\), \(\alpha _0^C\), \(\alpha _0^S\), and \(e_0\) are defined as in the schooling and employment models. \(D_0 = 1(I_0 > 0)\) is an indicator of the choice of white collar occupational status. The error term is orthogonal to the control variables.
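
The three choice equations share the same structure: a linear index in observables and in the two latent abilities, and an indicator that equals one when the index is positive. The sketch below makes this concrete for a single hypothetical individual. All coefficient values, the control vector, and the ability values are invented for illustration, and the idiosyncratic shocks are set to zero for readability.

```python
def latent_index(beta, x, alpha_c, alpha_s, f_c, f_s, e):
    """Linear-in-the-parameters index: beta'x + alpha^C f^C + alpha^S f^S + e."""
    return sum(b * xi for b, xi in zip(beta, x)) + alpha_c * f_c + alpha_s * f_s + e


def indicator(index):
    """Observed binary choice D = 1(I > 0)."""
    return 1 if index > 0 else 0


# Hypothetical individual: a constant plus one control, abilities in standard-deviation units.
x = [1.0, 0.5]
f_c, f_s = 1.2, -0.3

# Hypothetical parameters for Eqs. (4), (6), and (7).
params = {
    "schooling": {"beta": [-0.5, 0.2], "alpha_c": 0.6, "alpha_s": 0.1},
    "employment": {"beta": [0.1, 0.1], "alpha_c": 0.1, "alpha_s": 0.5},
    "occupation": {"beta": [-0.8, 0.3], "alpha_c": 0.4, "alpha_s": 0.2},
}

choices = {
    name: indicator(latent_index(p["beta"], x, p["alpha_c"], p["alpha_s"], f_c, f_s, 0.0))
    for name, p in params.items()
}
print(choices)
```

Because the same \(f^C\) and \(f^S\) enter every index, the choices are dependent across equations even when the shocks are independent, which is the source of endogeneity discussed next.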

Both cognitive and non-cognitive factors are known by each individual but not by the econometrician. Controlling for this dependence is equivalent to controlling for the endogeneity in the model. Wage equations are usually specified as functions of measured abilities or test scores. However, these test scores contain measurement error and are functions of schooling and latent abilities. For that reason, estimates based on measured abilities do not recover the parameters associated with the effect of abilities on choices and labor market outcomes.

In order to deal with this endogeneity problem, Heckman et al. (2006) estimate the distributions of latent abilities, an approach that relies on having at least three measurements. In addition, our identification strategy relies on having panel data information on measurements; specifically, on observing the same measure at two different moments in time. Finally, even though we are assuming a linear-in-the-parameters specification, the model can be interpreted as an approximation of a more flexible behavioral model as in Heckman et al. (2006).

4 Data and sample

The identification strategy relies on having panel data information on the measured abilities, schooling, labor force participation, occupational choices, and wages. Unfortunately, no single database with all of this information is available for Peru. We therefore propose an empirical method that exploits two datasets.

The first one is the YL database for Peru. The YL contains longitudinal information on two cohorts of children (younger cohort and older cohort) for each of four countries: Ethiopia, India (Andhra Pradesh), Peru, and Vietnam. In Peru, data were collected from 20 sites in 14 regions that represent 95% of the Peruvian child population (excluding the 5% with higher incomes). Children and their caregivers were interviewed three times: in 2002 (baseline survey), when they were 8 years old; in 2006–2007, when they were 12 years old; and again in 2009–2010, when they were around 15 years old. The survey contains information on aspects related to child development, cognitive test scores, psychosocial traits (attitudes and aspirations), and anthropometric measures as well as a rich set of other individual and household characteristics. In particular, household characteristics such as household socio-status, wealth indices, log household consumption, and caregivers’ measured ability are also shown as well as other individual characteristics.

In order to analyze the distribution of skills among Peruvian children, we focused on the older cohort which, for Peru, comprised around 700 children who were 8 years old at the beginning of the study (born in 1994–5). We worked with the subsample of children with available information on items related to cognitive and socioemotional abilities as well as individual characteristics for rounds 2 and 3 (Footnote 1). Finally, we worked with the subsample of children living in urban areas. The final sample comprised 349 individuals. The children in our sample and those urban children whose full information we were not able to find in rounds 2 and 3 came from households with similar wealth and consumption per capita and were, on average, of similar age, weight, and height. We found that mothers of children in our sample had attained more years of schooling at the time of round 1 than the mothers in the missing subsample. While the balance in general indicates that the missing information was random, we cautiously interpreted our results as valid for children with higher cognitive and non-cognitive abilities, as their parents’ educational levels were important determinants of ability (Footnote 2).

The YL subsample was evenly distributed among boys and girls (165 and 184, respectively) with an average age of 149 months and a mean of 6 years of schooling in round 2. Table 1 presents the descriptive statistics on the main variables of interest from both rounds as well as information on the child’s mother tongue and the parents’ educational levels from round 1 (what we call “permanent characteristics”). Some facts worth highlighting are that in both rounds, while girls scored below average in items related to cognitive ability, boys did so in self-efficacy items (results were mixed for self-esteem between rounds). Most household characteristics and family backgrounds were similar between genders. Caregivers’ measured socioemotional abilities differed between children’s genders; boys’ caregivers showed higher levels of self-efficacy by the time of round 2 and lower levels of self-esteem by the time of round 3. Important differences appeared between both rounds of the survey, a fact that was helpful for our identification strategy. The measures used to represent socioemotional abilities were built based on respondents’ degree of agreement or disagreement with a number of statements related to self-esteem. For this measure, the statements explored in the YL survey focused on positive and negative dimensions of pride and shame based on the Rosenberg Self-Esteem Scale. The degree of agreement was measured on a 4-point Likert scale that ranged from strong agreement to strong disagreement. We constructed the self-esteem index as the average score of these items and used the standardized index for our estimations.

Table 1 Descriptive statistics (Young Lives)

The second database corresponds to a novel household survey collected by the World Bank in 2010 that not only contained information on wages and individual characteristics, but also on measured cognitive and socioemotional abilities for a sample of currently employed working age individuals. The ENHAB is a nationally representative household survey that comprises information on urban areas of 11,235 randomly selected individuals aged 14–65 from 2600 cities. The data contain information on household living conditions, demographic information, academic achievement, employment and earnings, and novel information on (i) cognitive and socioemotional test scores, (ii) schooling trajectories, (iii) early labor market participation, and (iv) family characteristics. The measured abilities were assessed by means of cognitive tests that evaluated numerical and problem-solving skills, working memory, verbal fluency and receptive language, and socioemotional abilities according to GRIT scales (Duckworth et al. 2007) and the Big Five personality factors (Goldberg 1990). For this analysis we focused on seven of these measures, the standardized values of each of the big-five factors (emotional stability, extraversion, agreeableness/kindness, agreeableness/cooperation, conscientiousness strong, and openness) and a compound of the two measures of Grit as well as a compound of cognitive measured abilities. Information regarding individual characteristics included personal educational background, family characteristics, and socioeconomic status (parental education and occupations, family size, information on access and school characteristics, when parents attended basic and secondary education, perceived socio-economic status, etc.).

We worked with three subsamples: (i) individuals with available information on measured abilities (test scores), N = 2415; (ii) individuals with positive earnings, N = 4063; and (iii) individuals with available information on relevant individual characteristics, N = 7499. In general terms, individuals in the data set were evenly distributed among men and women and had a mean age of 33 years, monthly earnings of around 1000 soles in constant Peruvian currency for year 2010 (around 350 USD), worked an average of 51 h a week, and had on average a complete secondary education. Table 2 shows some other relevant descriptive statistics for the three subsamples and the difference in each between men and women. Some facts worth highlighting are that men have higher earnings (monthly and hourly), work longer hours, and have higher levels of measured cognitive abilities than women. These differences are statistically significant at the 99% confidence level. However, the results are mixed regarding socioemotional skills. While women appear to be more consistent, kind, cooperative, and conscientious, men appear to be more persistent, extraverted, emotionally stable, and open.

Table 2 Descriptive statistics (ENHAB)

5 Econometric implementation

The main objective of this paper is to identify the contribution of abilities to the gender wage gap. The main equation is a function of schooling and ability:

$$LnW_i = \alpha + \gamma S_{ci} + \gamma _{f^C}f_i^C + \gamma _{f^S}f_i^S + \mu _i$$
(8)

where \(LnW_i\) denotes log earnings, \(S_{ci}\) represents years of schooling, and \(f_i\) is the latent ability: cognitive (\(f_i^C\)) or socioemotional (\(f_i^S\)). The main problem with this equation is that \(f_i\) is unobserved by the econometrician. Thus, if schooling is correlated with ability, and ability is omitted, the estimation of \(\gamma\) is inconsistent. In particular, if ability is positively correlated with schooling, \(\gamma\) will be overestimated. The empirical literature has dealt with this issue by including test scores as proxies for these abilities:

$$LnW_i = \alpha + \gamma S_{ci} + \gamma _CC_i + \gamma _SS_i + v_i$$
(9)

where \(C_i\) and \(S_i\) are standardized test scores for measured cognitive and socioemotional abilities. However, using test scores does not solve the problem satisfactorily. Test scores are likely to be determined not only by the latent abilities of the individual but also by schooling. Thus, the coefficients on test scores partially capture the indirect effect of schooling on earnings through the measured skills, and the true effect of schooling on earnings cannot be obtained. Moreover, since \(f_i\) is still omitted, \(\gamma\), \(\gamma _C\), and \(\gamma _S\) are overestimated.
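
The bias argument can be illustrated with a small simulation sketch. It uses a single latent ability, a test score that is a noisy measure of that ability, and hypothetical values for the true returns; as a simplification that differs from the discussion above, the simulated test score does not depend on schooling. None of the numbers below come from our data.

```python
import random

random.seed(1)
N = 5000
GAMMA, GAMMA_F = 0.10, 0.30   # hypothetical true returns to schooling and to latent ability


def ols(rows, y):
    """OLS coefficients via the normal equations; each row must include the constant."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


f = [random.gauss(0, 1) for _ in range(N)]                 # latent ability (unobserved)
school = [fi + random.gauss(0, 1) for fi in f]             # schooling rises with ability
score = [fi + random.gauss(0, 1) for fi in f]              # test score = ability + noise
lnw = [GAMMA * s + GAMMA_F * fi + random.gauss(0, 0.5) for s, fi in zip(school, f)]

gamma_omitted = ols([[1.0, s] for s in school], lnw)[1]                      # ability omitted
gamma_proxy = ols([[1.0, s, c] for s, c in zip(school, score)], lnw)[1]      # test score as proxy
gamma_latent = ols([[1.0, s, fi] for s, fi in zip(school, f)], lnw)[1]       # infeasible benchmark

print("ability omitted:", round(gamma_omitted, 3))
print("test-score proxy:", round(gamma_proxy, 3))
print("latent ability observed:", round(gamma_latent, 3))
```

In this setup the schooling coefficient is overstated when ability is omitted, and including the noisy test score reduces but does not remove the overstatement; only the infeasible regression on the latent ability itself recovers the assumed value of \(\gamma\).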

We propose an econometric procedure to estimate latent abilities, \(f_i\). For that purpose, we exploit panel data information on the measured abilities from the YL database and information on wages and measured abilities from the ENHAB. Specifically, the econometric implementation is divided into four stages.

First, we use time variation (from rounds 2 and 3) in the measured cognitive and socioemotional test scores and years of schooling among children in the YL sample to recover the (unobserved) fixed effects. In particular, we try to explain the variation in the two measures of ability (\(MA_{it}\)): one socioemotional ability (self-esteem) and one cognitive ability (Peabody Picture Vocabulary Test scores). The identification procedure requires controlling for characteristics (\(X_{it}\)) that may have varied between the ages of 12 and 15 and that may explain the variation in the measured abilities during that period. In this way we can explain the changes in the measured ability and partial out any unobserved fixed effect, which we interpret as the latent ability. This latent ability collects all the information about the ability formed up to age 12.

$$\Delta MA_{it} = \gamma _0 + \gamma _X\Delta X_{it} + \Delta \mu _{it} \quad \mathrm{for}\; i = 1, 2$$
(10)
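
A minimal sketch of this first stage, under assumed values, shows why differencing between rounds removes the fixed effect. The simulated data use a single time-varying covariate, a hypothetical slope and round effect, and a latent ability that is correlated with the covariate; the names and numbers are illustrative assumptions only.

```python
import random

random.seed(2)
N = 4000
GAMMA_X = 0.5   # hypothetical effect of the time-varying covariate on measured ability
TREND = 0.2     # hypothetical common change between rounds (gamma_0 in Eq. 10)


def fit_line(x, y):
    """Simple regression of y on x; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


la = [random.gauss(0, 1) for _ in range(N)]                    # latent ability, fixed over time
x2 = [0.8 * a + random.gauss(0, 1) for a in la]                # covariate in round 2
x3 = [0.8 * a + random.gauss(0, 1) for a in la]                # covariate in round 3
ma2 = [GAMMA_X * x + a + random.gauss(0, 0.5) for x, a in zip(x2, la)]
ma3 = [TREND + GAMMA_X * x + a + random.gauss(0, 0.5) for x, a in zip(x3, la)]

# First differences: the latent ability cancels out of MA_3 - MA_2.
d_ma = [b - a for a, b in zip(ma2, ma3)]
d_x = [b - a for a, b in zip(x2, x3)]
gamma0_hat, gamma_x_hat = fit_line(d_x, d_ma)

# Pooled regression in levels for comparison: the fixed effect contaminates the slope.
_, pooled_slope = fit_line(x2 + x3, ma2 + ma3)

print("first-difference slope:", round(gamma_x_hat, 3))
print("pooled levels slope:", round(pooled_slope, 3))
```

The differenced regression recovers a slope close to the assumed value, whereas the pooled regression in levels does not, because the covariate is correlated with the omitted fixed effect.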

Second, we estimate the correlation between these fixed effects and characteristics that remain unchanged in the child’s life from 12 to 15 years old. For this, we capture the fixed effect or unobserved component of each ability by using the first-stage estimates to predict the measured ability in rounds 2 and 3 and then averaging, across the two rounds, the deviation of the observed value of the measured ability from its predicted value.

$$\widehat {MA_{it}} = \widehat {\gamma _0} + \widehat {\gamma _X}X_{it}$$
(11)
$$\widehat {LA_i} = \frac{1}{2}\left[ {\left( {MA_{i1} - \widehat {MA_{i1}}} \right) + \left( {MA_{i2} - \widehat {MA_{i2}}} \right)} \right]$$
(12)
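As an illustration of Eqs. (10) to (12), the following is a minimal sketch with one covariate and four hypothetical children. The data, the helper `ols_simple`, and all variable names are assumptions made for this example and are not taken from the YL database. The data are generated without noise, so the estimated proxy equals the simulated latent ability up to a constant shift.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of a one-regressor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: four children observed in two rounds.
true_latent = [1.0, -0.5, 0.3, 2.0]   # unobserved in practice
x_round1 = [0.0, 1.0, 2.0, 1.0]       # time-varying covariate, first round
x_round2 = [1.0, 3.0, 2.5, 4.0]       # time-varying covariate, second round

# Simulated measured ability: latent ability + 2 * X, plus a common gain of 0.5
# in the second round (an assumption of this toy example).
ma_round1 = [a + 2.0 * x for a, x in zip(true_latent, x_round1)]
ma_round2 = [a + 2.0 * x + 0.5 for a, x in zip(true_latent, x_round2)]

# Eq. (10): regress the change in measured ability on the change in X.
delta_x = [x2 - x1 for x1, x2 in zip(x_round1, x_round2)]
delta_ma = [m2 - m1 for m1, m2 in zip(ma_round1, ma_round2)]
gamma_0, gamma_x = ols_simple(delta_x, delta_ma)


def residual(ma, x):
    """Observed measured ability minus its prediction from Eq. (11)."""
    return ma - (gamma_0 + gamma_x * x)


# Eq. (12): average the two residuals to obtain the latent-ability proxy.
latent_proxy = [
    0.5 * (residual(m1, x1) + residual(m2, x2))
    for m1, x1, m2, x2 in zip(ma_round1, x_round1, ma_round2, x_round2)
]
```

In this toy setting the ranking of and distances between children in `latent_proxy` match those in `true_latent`. With noisy test scores the proxy would also absorb the average of the idiosyncratic errors.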

With these estimated proxies of latent abilities, we use the YL sample to estimate the effects of the variables that remain constant when a child is between 12 and 15 years old and that may determine the latent ability. Since we are using two databases, we require that these variables are available in both the YL and the ENHAB questionnaires. This availability allows for predicting the value of the “latent ability” for the ENHAB sample, which has the information on wages. Good candidates are gender, mother tongue, and parents’ educational level (years of schooling); we denote them as Zi.

$$\widehat {LA_i} = \gamma _0^{LA} + \gamma _1^{LA}Z_i + \mu _i^{LA}$$
(13)

Third, we use the estimated parameters of the second stage to predict the fixed effects that would correspond to the ENHAB working age sample. This is possible because the “permanent” characteristics are also available for the ENHAB sample.

$$\widehat {\widehat {LA_i}} = \widehat {\gamma _0^{LA}} + \widehat {\gamma _1^{LA}}Z_i$$
(14)
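The second and third stages, Eqs. (13) and (14), can be sketched in the same spirit. The example below uses a single permanent characteristic (parents' years of schooling) and made-up numbers for both samples. The names `la_proxy_yl`, `z_yl`, and `z_enhab` are hypothetical, and the actual specification includes several characteristics in Zi.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of a one-regressor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical YL sample: latent-ability proxies from Eq. (12) and one
# permanent characteristic Z (parents' years of schooling).
la_proxy_yl = [-0.4, 0.1, 0.2, 0.9]
z_yl = [6.0, 9.0, 11.0, 14.0]

# Eq. (13): regress the proxy on the permanent characteristic in the YL sample.
gamma0_la, gamma1_la = ols_simple(z_yl, la_proxy_yl)

# Hypothetical ENHAB sample: only Z is observed, not the proxy.
z_enhab = [5.0, 10.0, 16.0]

# Eq. (14): carry the YL coefficients over to the ENHAB sample.
la_pred_enhab = [gamma0_la + gamma1_la * z for z in z_enhab]
```

Because the prediction is a linear function of Zi only, it varies across ENHAB respondents only through their permanent characteristics.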

An assumption in this “matching” procedure is that the YL and ENHAB samples share similar characteristics such as national representativeness (see Footnote 3). With this prediction, we estimate the wage equation and analyze the gender wage gap as the theoretical model suggests: modeling wages essentially as a function of latent cognitive and socioemotional abilities. The usual empirical approach is to model wages as a function of measured abilities (test scores), which leads to biased estimates. Thus, we exploit the calculated proxies for the latent abilities in the ENHAB to compare the usual approach with these results.

Finally, we use the Oaxaca–Blinder decomposition based on those estimated fixed effects as controls in the wage equation, and we estimate a theoretical model of log wages on latent ability as proposed by Heckman et al. (2006).

5.1 Gender wage gap and Oaxaca–Blinder decomposition

According to the model, measuring the wage gap based on test scores gives a wrong appreciation of the contributions of abilities and their returns to the gender wage gap. In particular, let us consider the following relation between wages and test scores:

$$LnW = \gamma _YX_Y + \gamma _Y^CC + \gamma _Y^SS + {\it{\epsilon }}_Y$$
(15)

Estimating this equation provides biased estimators of γC and γS. In this equation, latent abilities are unobserved and are therefore absorbed into the error term εY. Since cognitive (C) and socioemotional (S) test scores are functions of latent abilities, they are correlated with the error term. Therefore, the estimated coefficients do not reproduce the effect of the latent abilities on wages. Once the gender wage gap is identified and the proxies of latent cognitive and socioemotional abilities are estimated and predicted, we can apply an approach that allows us to evaluate the role of certain variables in the gender wage gap: the Oaxaca–Blinder decomposition.

The Oaxaca–Blinder decomposition is a method that aims to decompose differences in mean wages across two groups, in this case, between genders. The setting assumes a linear model that is separable in observable and unobservable characteristics:

$$Y_g = X\beta _g + \eta _g\,{\mathrm{for}}\,{\mathrm{g}} = {\mathrm{male}},{\mathrm{female}}$$
(16)

Thus, letting d be an indicator variable for group membership, yd be the scalar outcome of interest for a member of group d, Xd be a vector of observable characteristics (including a constant), \(\widehat \beta ^d\) be the column vector of coefficients from a linear regression of yd on Xd, and letting overbars denote means, one can re-express the difference in mean wages in terms of differences in observable characteristics and differences in coefficients:

$$\overline {Y^1} - \overline {Y^0} = \left( {\overline {X^1} - \overline {X^0} } \right)\,\widehat {\beta ^1} + \overline {X^0} \,\left( {\widehat {\beta ^1} - \widehat {\beta ^0}} \right)$$
(17)

where the first and second terms on the right-hand side of the equation represent the explained and unexplained components of the difference in mean outcomes, respectively. This is what we call a “two-fold decomposition”. An extension of this method is the “three-fold decomposition”, which weights the differences in observable characteristics by the coefficients of group 0 and adds a third term that interacts the (simultaneous) differences in observable characteristics and in coefficients:

$$\overline {Y^1} - \overline {Y^0} = \left( {\overline {X^1} - \overline {X^0} } \right)\,\widehat {\beta ^0} + \overline {X^0} \,\left( {\widehat {\beta ^1} - \widehat {\beta ^0}} \right) + \left( {\overline {X^1} - \overline {X^0} } \right)\,\left( {\widehat {\beta ^1} - \widehat {\beta ^0}} \right)$$
(18)

where the last term on the right-hand side of the equation represents the interaction.
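A minimal numerical sketch of both decompositions follows, taking group means and coefficient vectors as given. The function name `oaxaca_blinder` and the numbers are hypothetical. In the three-fold version the endowment term is weighted by the group-0 coefficients, which is what makes the three terms add up exactly to the gap.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def oaxaca_blinder(xbar1, xbar0, beta1, beta0):
    """Two-fold and three-fold decompositions of the mean outcome gap.

    xbar1, xbar0: mean characteristics of groups 1 and 0 (constant included).
    beta1, beta0: OLS coefficients estimated separately for each group.
    """
    dx = [a - b for a, b in zip(xbar1, xbar0)]   # difference in endowments
    db = [a - b for a, b in zip(beta1, beta0)]   # difference in coefficients
    # With a constant in each regression, fitted means equal sample means.
    gap = dot(xbar1, beta1) - dot(xbar0, beta0)
    two_fold = {
        "explained": dot(dx, beta1),
        "unexplained": dot(xbar0, db),
    }
    three_fold = {
        "endowments": dot(dx, beta0),
        "coefficients": dot(xbar0, db),
        "interaction": dot(dx, db),
    }
    return gap, two_fold, three_fold


# Hypothetical inputs: constant, years of schooling, standardized ability score.
xbar_men = [1.0, 12.0, 0.3]
xbar_women = [1.0, 11.0, 0.1]
beta_men = [0.2, 0.10, 0.5]
beta_women = [0.1, 0.09, 0.4]

gap, two_fold, three_fold = oaxaca_blinder(xbar_men, xbar_women, beta_men, beta_women)
```

In both versions the components sum to the raw gap. The two versions differ only in whether the interaction is folded into the explained part or reported separately.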

6 Results

In this section, we compare the results of estimating the effect of cognitive and socioemotional abilities on wages by using measures of these skills (test scores) with those obtained by using two definitions of latent abilities. In each case we start by presenting the Mincer equation of log wages that controls for schooling and abilities. Then, we apply the Oaxaca–Blinder decomposition to estimate the effect of the abilities on the gender wage gap. Further, to disentangle this effect in each of the choices made by the individual before earning a certain wage we estimate a joint model of schooling, employment, occupational choice, and wages. To proceed in this manner, we apply the procedure explained previously to obtain proxies for cognitive and socioemotional abilities and present the results obtained in each of the four stages.

6.1 Wages and measured abilities

Considering the previous discussion on the issues of estimating the effect of abilities on wages, Table 3 shows the results of a basic Mincer equation under the naïve assumption that there is no correlation between measured skills and schooling. Column 1 of Table 3 shows that after controlling for work experience, place of residence, mother tongue, and birth order, an additional year of schooling leads to an 11.3% increase in earnings. In column 2, we present results that control for the parents’ schooling, as it may explain part of the correlation between earnings and schooling. As suspected, the point estimate drops from 0.113 to 0.0983. The inclusion of measures of cognitive and socioemotional abilities shows that cognitive ability and emotional stability lead to higher wages while agreeableness and consistency of effort reduce them.

Table 3 Mincer equation with measured abilities

6.1.1 Oaxaca–Blinder decomposition

In order to estimate the contribution of measured abilities to the gender wage gap, we apply the Oaxaca–Blinder decomposition. The sample mean of log hourly wages is 1.430 for men and 1.167 for women, which yields a statistically significant wage gap of 0.263. The wage gap can be attributed to differences in the predictors and in the coefficients. While women’s hourly wages would increase significantly if they had the same characteristics (mean values of the regressors) as men, around 80% of the gender wage gap would be closed if women shared the men’s coefficients or returns, given their own characteristics. The gender wage gap as well as the endowment effect and the differences in coefficients are significant even after controlling for standard individual characteristics. Columns 1 and 2 of Table 4 illustrate the results obtained by means of the two-fold decomposition using a simple specification and after adding standard controls, respectively. Columns 3 and 4 present the results obtained by means of the three-fold decomposition.
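To give a sense of magnitudes, the short calculation below converts the reported log-point gap into an approximate percentage difference. The conversion via the exponential compares geometric means of hourly wages, and the 0.80 share is our reading of “around 80%” treated as an exact number for illustration only. The variable names are hypothetical.

```python
import math

mean_log_wage_men = 1.430     # reported sample mean of log hourly wages, men
mean_log_wage_women = 1.167   # reported sample mean of log hourly wages, women

# Raw gap in log points, as reported in the text.
log_gap = mean_log_wage_men - mean_log_wage_women

# Ratio of geometric mean wages minus one: an approximate percentage premium.
wage_premium = math.exp(log_gap) - 1.0

# Illustrative split, assuming the unexplained share is exactly 0.80.
unexplained_share = 0.80
unexplained_log_points = unexplained_share * log_gap
```

Under these assumptions, a gap of 0.263 log points corresponds to men's geometric mean hourly wage being roughly 30% higher than women's, with about 0.21 log points attributed to differences in returns.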

Table 4 Oaxaca–Blinder decomposition with measured abilities

The detailed decomposition shows that the explained part of the wage gap is mainly driven by the difference in endowments of cognitive ability between men and women. The results also show that (measured) non-cognitive or socioemotional ability (see Footnote 4) does not make a significant individual contribution to the explained part of the gap.

As noted in Jones (1983) and Oaxaca and Ransom (1999), the interpretation of the detailed decomposition of the unexplained part of the gender gap (or returns) depends arbitrarily on the reference category of predictors. We abstain from interpreting the detailed decomposition of the unexplained part and focus on the contribution of the individual predictors to the explained part of the gender gap, which is unaffected by the choice of the base category.

6.1.2 Joint estimation: schooling, employment, occupation, and wages

The previous results describe the correlation between cognitive and socioemotional skills and wages but do not account for choices made by the individual before receiving a wage. To disentangle the effect of measured skills on each of these choices, we proceed with a joint estimation that considers sequential choices of schooling, employment, and occupation. The model follows an individual’s sequence of choices. First, the individual is aware of his or her own level of abilities and chooses a schooling level using this information. After completing the chosen level of schooling, the individual decides whether or not to enter the labor market. Once he or she decides to participate in the labor market, he or she chooses the occupation (white collar or blue collar) and, finally, receives a wage according to his or her previous decisions.

Table 5 shows the result of the maximum likelihood estimation of the joint model. The procedure requires the maximization of the joint likelihood of attaining a certain level of education, being employed, choosing a certain occupation, and earning a certain wage. Thus, the individual contribution to the likelihood is:

$$l_i = \overbrace {L_{si}\left( {\theta _S\left| {LA_i} \right.} \right)}^{{\mathrm{Schooling}}}\underbrace {L_{hi}\left( {\left. {\theta _h} \right|LA_i,\,s_i} \right)}_{{\mathrm{Working}}}\overbrace {L_{oi}\left( {\left. {\theta _o} \right|LA_i,s_i,h_i = 1} \right)}^{{\mathrm{Occupation}}}\underbrace {L_{wi}\left( {\left. {\theta _W} \right|LA_i,\,s_i,\,h_i = 1,\,o_i} \right)}_{{\mathrm{Wages}}}$$
(19)

Table 5 Joint likelihood with measured abilities

In the above equation, Lsi(θS|LAi) is the conditional density of schooling (having attended through secondary school), given the latent abilities LAi. Lhi(θh|LAi,si) is the likelihood of being employed, given the latent abilities and schooling si. The following term, \(L_{oi}\left( {\left. {\theta _o} \right|LA_i,s_i,h_i = 1} \right)\), is the likelihood of choosing a white collar job, given the latent abilities, schooling, and that the individual is employed (hi = 1). Finally, \(L_{wi}\left( {\left. {\theta _W} \right|LA_i,\,s_i,\,h_i = 1,\,o_i} \right)\) represents the conditional distribution of wages, given the latent abilities, years of schooling, being employed, and occupational choice oi.
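The structure of Eq. (19) can be illustrated with a short sketch of one individual's likelihood contribution. The functional forms are assumptions made only for this example, since the text does not state them: probit links for the three binary choices, a normal density for log wages, and a contribution limited to the schooling and employment terms for individuals who are not employed. The parameter names in `theta` are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def binary_prob(y, index):
    """Probit probability of the observed binary outcome y (assumed link)."""
    p = norm_cdf(index)
    return p if y == 1 else 1.0 - p


def likelihood_contribution(la, s, h, o, log_w, theta):
    """Product of the four terms of Eq. (19) for one individual.

    la: ability, s: schooling indicator, h: employed indicator,
    o: white-collar indicator, log_w: log wage (used only if h == 1).
    """
    l_s = binary_prob(s, theta["s_const"] + theta["s_la"] * la)
    l_h = binary_prob(h, theta["h_const"] + theta["h_la"] * la + theta["h_s"] * s)
    if h == 0:
        # Occupation and wages are not observed for the non-employed (assumption).
        return l_s * l_h
    l_o = binary_prob(
        o, theta["o_const"] + theta["o_la"] * la + theta["o_s"] * s
    )
    mean_w = (
        theta["w_const"] + theta["w_la"] * la + theta["w_s"] * s + theta["w_o"] * o
    )
    l_w = norm_pdf((log_w - mean_w) / theta["w_sigma"]) / theta["w_sigma"]
    return l_s * l_h * l_o * l_w


# Hypothetical parameter values: all indices at zero, unit wage variance.
theta_example = {
    "s_const": 0.0, "s_la": 0.0,
    "h_const": 0.0, "h_la": 0.0, "h_s": 0.0,
    "o_const": 0.0, "o_la": 0.0, "o_s": 0.0,
    "w_const": 0.0, "w_la": 0.0, "w_s": 0.0, "w_o": 0.0, "w_sigma": 1.0,
}
l_employed = likelihood_contribution(0.5, 1, 1, 1, 0.0, theta_example)
l_not_employed = likelihood_contribution(0.5, 1, 0, 0, 0.0, theta_example)
```

Maximum likelihood estimation would then sum the logarithm of this contribution over individuals and maximize it over `theta`; that step is omitted here.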

Each column of Table 5 corresponds to each of the choices involved in the model. The results indicate that while measured cognitive skills seem to matter the most in determining the years of schooling and occupational choice, the measured socioemotional abilities gain relevance for wages and employment. In terms of inter-gender differences, men have higher returns to socioemotional abilities than women in terms of being employed and earning higher wages. Women have a higher return to cognitive abilities only in the choice of schooling. Nevertheless, these estimated contributions consider the measured abilities, which could be capturing the effect of other factors correlated with the outcome variables and measured abilities.

6.2 Wages and latent abilities

To properly estimate the contribution of abilities to the gender wage gap, we consider latent abilities. In the following subsections, we present the results of the proposed procedure for estimating latent cognitive and socioemotional abilities. Then, we estimate the previous models with the resulting proxies for the latent abilities instead of the measured abilities.

6.2.1 Estimating latent abilities and factor loadings

Table 6 shows the results of the first stage in columns 1 and 2, which correspond to cognitive ability (PPVT scores) and self-esteem, respectively. Each regression controls for the child’s caregiver’s measured self-esteem (see Footnote 5) and the child’s standardized height for age, standardized body mass index, age, and an indicator for having missed school for more than one week due to illness (a not entirely exogenous source of variation in schooling) as well as the household’s perceived status, wealth index, and log real consumption per capita. Standard errors are clustered at the community level.

Table 6 First stage estimation (fixed effect model of measured ability)

In the case of self-esteem, changes in the caregiver’s self-esteem and the child’s height for age, body mass index, missed school for more than one week due to illness, and perceived household socioeconomic status (SES) are statistically significant. One interesting result is the large effect associated with SES. Because SES indicates status within social groups, elevated self-esteem should result from elevated SES (Rosenberg and Pearlin 1978). If an individual aspires to success in the form of social status and achieves these goals, elevated self-esteem should result (Twenge and Campbell 2002). This is especially relevant for adults whose SES reflects their own earned status (Rosenberg and Pearlin 1978). Research on socioemotional skills has a long way to go; however, that is beyond the scope of this paper. Furthermore, we included additional control variables in both estimations and found that the magnitude and level of significance of the coefficients remained the same as those presented in Table 6.

Another interesting result is the negative association between BMI and cognitive ability. The BMI (body mass index) is an attempt to quantify the amount of tissue mass (muscle, fat, and bone) in an individual. Basically, it serves as a measure of overall healthiness. Because of poverty levels in Peru, malnourishment among children is common, so one would expect a higher BMI to go together with better cognitive outcomes; a negative correlation between BMI and cognitive skills is therefore unexpected. However, according to the data, the a priori expectation of undernourished children does not hold. The sample mean of BMI is around 18 points, which is a borderline level between underweight and healthy weight. An interesting fact is the number of overweight children (BMI > 25): 88 out of 349 children. Even though BMI does not measure body fat directly, a high BMI can be an indicator of high body fatness. Research shows that body fatness is negatively related to cognitive skills.

For instance, Baccouche et al. (2014) estimate the relation between BMI and cognitive performance in rugby players and find that BMI is negatively correlated with verbal proficiency. In the same vein, Smith et al. (2011) summarize diverse studies that point out that increased adiposity is associated with poor cognitive performance, especially in the executive functions in children, adolescents, and adults. Also, Basatemur et al. (2013) find that maternal pre-pregnancy BMI (which determines children’s BMI) is negatively associated with children’s cognitive performance, even after adjusting for multiple sociodemographic confounders. They also find that the relation appears to become stronger as children get older. Identification of the causes of this negative association requires a more thorough investigation. However, the literature seems to be shedding some light on the subject.

Table 7 shows the results of the second stage of our procedure. In this stage, we estimate the coefficients associated with the permanent characteristics that we use to predict latent ability using the ENHAB sample. Based on the literature and the availability of these variables in both the YL and ENHAB datasets, we control for gender, order of birth, parents’ educational level, and mother tongue. The literature on skill formation shows that the latent ability is innate and thus should be affected by characteristics that are determined for the child up to its first three years of life. What we estimate as latent ability is actually the ability formed by the time the child was 12 years old, so one could expect, a priori, that variables that are fixed until that moment should be important in determining this latent ability. Covariates such as parents’ educational level, gender, and the child’s first language should be important, but not others such as characteristics of secondary education (which would also be endogenous). This is what motivates the reduced model. For the proxies for cognitive and non-cognitive latent abilities, all included controls are statistically significant for one measure or the other. Men have a higher endowment of cognitive ability, while the difference in self-esteem is not statistically significant after controlling for the order of birth, mother tongue, and parents’ education. Lower ability, both cognitive and non-cognitive, is also related to a higher order of birth. Almost consistently, parents’ education has a positive impact on both proxies of ability. Finally, the child’s first language is an important determinant of self-esteem but not of cognitive ability.

Table 7 Second stage estimation (latent ability on permanent characteristics)

The third stage of the procedure predicts the latent abilities in the ENHAB sample by using the coefficients and regressors displayed in Table 7. Since both the YL and ENHAB surveys are nationally representative, the matching procedure should be plausible. Table 8 shows the descriptive statistics of both predictions (for the YL and ENHAB samples) in the full sample and by gender. Due to limitations in our data, our predictions of cognitive and non-cognitive abilities are not able to explain a great part of the variation. However, both predictions share similar characteristics and directions in the gender differences, which indicates that our procedure is valid within this limitation.

Table 8 Third stage statistics (predicted latent abilities in both databases)

Finally, we estimate the effect of latent abilities on wages. Table 9 compares the results of the basic Mincer equation obtained by including the measured abilities (column 3) and those obtained by controlling, instead, for our predicted latent abilities (column 4; see Footnote 6). In this regression, we also control for personality traits by including two latent measures of Goldberg and GRIT items. A common latent factor arises from the principal component analysis of the measured personality traits that have a positive effect on wages (extraversion, emotional stability, openness and effort). We call this factor the latent positive trait. Similarly, we capture a latent common factor, the “latent negative trait”, from the Goldberg and GRIT items that have a negative relation with earnings (kindness, cooperation, conscientiousness, and consistency of interest).

Table 9 Mincer equation with latent abilities

Two results are worth highlighting. First, the return to schooling in column 4 is larger than that of column 3. This is consistent with our previous suspicion that measured abilities capture part of the effect of schooling on wages (the reason behind the drop in returns to schooling from column 2 to column 3). Second, latent cognitive ability is statistically significant. This shows that cognitive abilities have an effect on wages, and that the coefficient also captures the indirect effect of abilities through schooling, now that we are able to control for both schooling and latent abilities. As expected, the latent negative and positive personality traits have significant effects on wages, whereas we find no significant effect of socioemotional skills on wages.

6.2.2 Oaxaca–Blinder decomposition

This subsection describes the results obtained after applying the Oaxaca–Blinder decomposition to the whole sample of the working age population (see Footnote 7) but accounting for differences in latent abilities (the previously estimated proxies). As in the specification with measured abilities, a significant gender wage gap exists and, again, this gap is largely explained by returns (or the unexplained part of the wage gap). Regarding the explained part of the gap, in contrast to the results obtained in the previous subsection, we find that socioemotional skills also play an important role in explaining the wage gap.

Table 10 shows the results corresponding to the Oaxaca–Blinder decomposition for the ENHAB sample that accounts for differences in latent cognitive ability as well as latent self-esteem (as a proxy for the socioemotional latent ability) and also controls for latent personality traits. After applying the two-fold and three-fold approximations, the data indicate that the gender wage gap is attributable to group differences in both the coefficients and the predictors. In terms of differences in the endowment of abilities, Table 10 shows that the differences in cognitive and socioemotional abilities favor men regarding earnings. While the higher endowment of cognitive ability amongst men appears to reduce the gender wage gap, if women had the same endowment of socioemotional ability as men, they would earn significantly higher wages. The difference in the endowments of the positive personality traits favors men and also explains part of the wage gap.

Table 10 Oaxaca–Blinder decomposition with latent abilities

6.2.3 Joint estimation: schooling, employment, occupation, and wages

Table 11 shows the results of the joint estimation that considers cognitive and socioemotional latent abilities. Interpreting the role of both abilities in each of the choices considered leads to interesting results. First, the non-cognitive ability is crucial for attaining higher levels of education, and this is so for both men and women. Cognitive abilities are not determinants of this choice. The other three choices must be interpreted together. We observe that self-esteem is important in determining occupational choice. Although the interaction term is not statistically significant, the direction of the coefficient points to an advantage for men regarding occupational choice. We could interpret this as men earning higher wages because, when employed, their equilibrium assignment is towards occupations with higher rewards for cognitive skill. This, combined with the fact that men have higher cognitive skills, helps explain the gender wage gap.

Table 11 Joint likelihood with latent abilities

In contrast with the results from the joint estimation that uses measured abilities, we can observe that inter-gender differences in cognitive and socioemotional abilities in favor of men drop when considering latent abilities. Moreover, returns to non-cognitive latent abilities gain significance for the occupational choice. This significance supports our idea that most differences attributable to abilities occur within occupational choice.

7 Conclusions

This study presents evidence on the role of cognitive and socioemotional skills in closing the gender wage gap. In a first attempt to estimate their effect on wages, we followed the basic empirical approach of modeling wages in terms of measured ability. Second, we applied a procedure that estimated a model in terms of latent ability. We based the model on the setting proposed by Heckman et al. (2006). While these authors identify latent abilities based on the dependence across different test scores for the same time period, we use variation over time for the same test score. This is possible due to the availability of panel data information on measures of cognitive and socioemotional skills. In addition, we estimate a joint model of schooling, employment, occupational choice, and wages in order to disentangle the effects of the latent abilities on the gender wage gap throughout an individual’s choices prior to earning a certain wage. Our main contribution is analyzing the role of latent socioemotional and cognitive abilities in the gender wage gap in a developing country by estimating and accounting for proxies of the latent abilities and disentangling the effect of these abilities by means of a joint model of schooling, employment, occupational choice, and wages.

There is a significant gender wage gap in Peru. Estimations with measured abilities confirm the empirical literature regarding endogeneity issues that result from using test scores as measures of ability. The Oaxaca–Blinder decomposition in a model with measured abilities shows significant inter-gender differences in the endowment of cognitive skills but no relevant differences in terms of socioemotional abilities (in endowment or returns). Estimating the joint model shows that differences in socioemotional abilities between men and women are important but only for choices prior to wage determination. Measured cognitive skills are relevant in determining years of schooling and occupational choice, while measured socioemotional abilities are relevant for wages and employment.