1 Introduction

Brazil has recently shown an important decrease in personal and regional inequality. Nevertheless, its income inequality is still one of the highest worldwide (it has the 16th highest Gini index among 141 countriesFootnote 1), reaching 0.523 in 2013 down from 0.603 in 1995. Regional income disparities are also considerably high, demonstrated by the fact that average personal income was 1.78 times higher in the South-South-east than in the North-North-east in 2013.

Following the tendency towards urbanisation observed all over the world (Glaeser 2011; UNFPA 2007), there is a significant spatial concentration of population and economic activity in Brazil. In 2010, 84.4 % of the population lived in urban areas, occupying 1.07 % of the national territory.Footnote 2 Regarding economic concentration, in 2012, cities with more than 500,000 inhabitants hold a 40.8 % share of total GDP, while having 29.3 % of the total population.Footnote 3 Also, the correlation between the logarithm of wages and the logarithm of population density at the municipal level reached 0.06 in 2010 for the whole labour market and 0.05 for the formal sector.

Moreover, it is widely acknowledged that living costs increase significantly in bigger cities, meaning that only with a positive relationship between wages and the city size is it possible to explain the existence of cities (Duranton and Puga 2014; Oort and Lambooy 2014). In this context, the proximity of economic agents generates agglomeration economies by reducing transportation costs. Different theoretical approaches analyse these costs reductions under different perspectives—of goods, people and ideas (Glaeser and Gottlieb 2009).

Following the substantial empirical literature on agglomeration economies, this paper aims to understand the nature and magnitude of regional disparities in Brazil by exploring the industrial scope of the extraordinary gains obtained in bigger cities. The strategy followed here presupposes that salaries are directly related to productivity, and investigates whether competition, specialisation, and/or diversity are more or less relevant to explain the regional residual variation in salaries in different sectors (which is not related to individual characteristics).

In comparison with the previous literature, we provide different estimation strategies, controlling or not for individual unobserved characteristics constant in time (Glaeser and Maré 2001; Combes et al. 2008a, 2010; Groot et al. 2014). Moreover, we consider a two-stage estimation on two different levels (the first stage is at the individual level, while the second stage is at the sector–region–year level).

Concerning the empirical literature on agglomeration economies in Brazil, we provide an analysis at the microlevel, controlling for individual characteristics. In addition, we compare the potential effects of specific sources of externalities related to the industrial scope of agglomeration economies (diversity, specialisation and competition) over sectors with different levels of technology and knowledge intensity. Finally, it is also possible to discuss how the formalisation of the Brazilian labour market may have affected all these relationships by analysing the formal sector in two different time periods.

This study aims to contribute to the literature of static agglomeration effects, by controlling for individual skills and comparing different estimation strategies. The focus is on the industrial scope of agglomeration economies, a strand of the literature that has recently attracted more attention because individual-level longitudinal data have become more easily available. Moreover, so far most empirical studies have focused on developed countries, but we are instead evaluating the relative importance of the local industrial mix for different sectors in the context of a developing economy, that of Brazil, which is another important contribution.

The next sections are organised as follows: Sect. 2 contains a review of the related literature; Sect. 3 discusses the methodological approach; Sect. 4 describes the data, and then, Sect. 5 presents the results. And, finally, Sect. 6 draws the main conclusions from the analysis.

2 Urban wage premium and increasing returns

The relationship between city size and the income of firms and workers has been examined based on the assumption that urbanisation reflects the gains from agglomeration. The next subsections will briefly discuss the main issues investigated in the literature and explore the advances of the empirical studies in this area.

2.1 Principles of agglomeration economies

The urban wage premium has been the subject of analysis in several studies aiming to identify how the density of economic activity affects the productivity of workers (Heuermann et al. 2010). There is usually a positive association between city size and salaries (Combes et al. 2010). However, cities present a trade-off between costs and benefits, which are directly related to dispersion and agglomeration forces (Glaeser 1998; Combes et al. 2011; Duranton and Puga 2014). The challenge is to isolate these effects from other explanatory factors of productivity differentials in space, while investigating the possibility of convergence and divergence between regions, sectors and educational levels, among other dimensions (Lindley and Machin 2014).

Spatial wage differentials can be explained by three main sets of variables (Combes et al. 2008a): the skill composition of the local labour force; geographical characteristics and local factors of production; and interactions between firms or workers. The latter set is related to the generation of agglomeration economies.

In fact, under the conditions defined by the Spatial Impossibility Theorem (Starret, 1978), increasing returns constitute one of the possible explanations for the existence of cities with trade flows between them. When there are positive externalities to the urban size, the competitive paradigm in space is no longer valid (Combes et al. 2008b), thus creating incentives for the agglomeration of economic agents (Fujita and Thisse 2012; Glaeser 1998).

Agglomeration economies are generated by the interactions between firms and individuals. Synthesising different contributions, Duranton and Puga (2004) and Puga (2010) identify three main microfoundations: (1) the sharing of facilities and gains from individual specialisation and variety, and risk sharing, through labour pooling; (2) the higher probability of a matching, with higher quality; and (3) learning, related to knowledge generation, diffusion and accumulation. These externalities are static when they explain the cross-sectional distribution of economic activity, and are dynamic when related to productivity growth and knowledge spillovers (Groot et al. 2009; Pessoa 2014). According to Rosenthal and Strange (2004), they may be related to the city size (urbanisation economies, associated with Jacobian economies) or to the size of the particular economic sector (localisation economies, linked to Marshallian economies).

The microfoundations discussed above come from a reinterpretation of Marshall (1890), who pointed out that proximity gains could be stronger between firms and workers within the same industry (labour market pooling, input sharing and knowledge exchange). Glaeser et al. (1992) summarised further developments in this direction with the MAR model (Marshall–Arrow–Romer), in which industrial specialisation could promote knowledge spillovers by favouring the interaction between firms with a similar structure. Moreover, it may be easier for firms to internalise innovation gains when they have a higher local monopolistic power (Capello 2014). Another perspective is brought by Jacobs (1969), according to whom knowledge externalities are fostered by the interaction between economic actors of different sectors. Therefore, the source of agglomeration economies is external to the industry, allowing the exchange of complementary knowledge, thus generating urbanisation economies. In this setting, competition is seen as an incentive for firms to innovate. Porter (1990) also defends this last aspect, but his theoretical framework advocates that a higher level of specialisation reinforces those gains (knowledge exchange is stronger in vertically integrated industries).

Depending on the development stage of each industry in a certain country, they will benefit from different industrial structures in their locality. For instance, new sectors with a strong focus on innovation efforts may be located in larger and diversified urban areas where they can profit from a wide range of experiences and ideas (Henderson 2010). Following Groot et al. (2009), it is possible to organise the sources of externalities and their expected effects over employment and productivity according to each theoretical view discussed above (see Table 1):

Table 1 The effect of agglomeration externalities on employment and productivity.
figure a

This strand of the literature was initially synthesised during the 1960s and the 1970s (Capello 2014). However, recently there has been a recent new wave of theoretical and empirical studies covering the relationship of productivity, geographical proximity and local industrial composition. From the perspective of the empirical literature, detailed databases based on microdata at the individual level have allowed the brand new theoretical models to be tested by controlling for individual heterogeneity in order to assess regional-level relationships (Rosenthal and Strange 2004). In the Brazilian context, the empirical literature using this recent approach is still very limited, as is discussed in the next subsection.

Therefore, the industrial scope of agglomeration economies can be further investigated by controlling for individual characteristics and other confounding factors. This strategy allows the identification of a proper local sector mix, the optimum city size, and an adequate degree of competition for different industries. This framework is now considered in the case of Brazil, a developing economy with a continental dimension and huge regional disparities.

2.2 Empirical studies on agglomeration externalities

The literature that relates productivity and different measures of agglomeration externalities has been reviewed elsewhere (Melo et al. 2009; Beaudry and Schiffauerova 2009). In general, the results are significantly heterogeneous, and one of the explanations for this is that the outcome related to productivity can vary. It can be either directly derived from the production function or indirectly derived (from salaries, rents, job creation, new establishments, among others), according to data availability. When salaries are considered, the main underlying assumption is that they equalise the marginal productivity of labour (under perfect competition), or that at least they are higher in places that are more productive (Rosenthal and Strange 2004). An important drawback of this strategy is that elasticities in the labour market will determine the extent to which wages capture local productivity (Moretti 2011).

Apart from that, the heterogeneous results found in the literature can be associated with different measures for competition, diversity and specialisation (Beaudry and Schiffauerova 2009). Melo et al. (2009) also remark that the estimates of the elasticity of urbanisation economies vary among countries, data structure, geographical aggregation and economic sector, in the range from 3 to 8 % (an increase of 1 % in urban density, for instance, is expected to increase wages from 3 to 8 %).

The theoretical framework most commonly applied to this sort of analysis is based on the wage equation of the urban economics literature (Combes et al. 2008b). Glaeser and Maré (2001) conducted one of the first studies aimed at measuring the urban wage premium, while controlling for individual heterogeneity. They consider longitudinal individual data for the USA and find evidence that there is an urban premium even when observed and unobserved individual characteristics are taken into account. In a similar setting, Mion and Naticchioni (2009) show that individual abilities and firm size explain a large share of the spatial heterogeneity of wages in Italy.

Combes et al. (2008a) find that controlling for the sorting of individuals (with individual fixed effects) reduces the estimated coefficient of urbanisation externalities by 40–50 %. However, apart from creating a black box for the sorting process, the inclusion of individual fixed effects may introduce a selection bias, as only those individuals who have migrated will provide the variation to estimate the agglomeration coefficient. Furthermore, pooled cross sections will provide an upper bound for the estimates, while the panel estimation will generate a lower bound (Groot et al. 2014).

There is a strong concern for the potential endogeneity of urban size measures (endogenous quantity of labour), such as population density (Combes et al. 2011). This is so because cities with higher wages may attract more individuals, leading to an increase in population and consequently, in city size. The approach proposed by Ciccone and Hall (1996) and adopted by Combes et al. (2008a), Groot et al. (2014)), and Graham et al. (2010), among others, is based on long lags of the endogenous variable, while Combes et al. (2010) also suggest the inclusion of geographical characteristics as instruments. On the other hand, the individual fixed effects mentioned above aim to control for the endogenous quality of labour, namely the sorting process of the labour force (Glaeser and Maré 2001; Combes et al. 2008a).

As previously suggested, the industrial scope of agglomeration externalities can be evaluated in order to assess which theoretical framework seems more suitable to explain regional wage disparities. Glaeser et al. (1992) explore city-level data in the USA, finding results consistent with the framework presented by Jacobs (1969)—employment growth is encouraged by urban diversity and local competition. Groot et al. (2014) estimate a two-stage regression, explaining the spatial residual of a Mincerian regression. The authors find an employment density elasticity of 4.8 % for NUTS-3 areas in the Netherlands, with a stronger indication of MAR externalities and small negative effects from competition and diversity measures.

Longitudinal data on the labour market in Brazil have become available only recently, and access is still very limited. Therefore, a large part of the empirical literature is based on individual-level cross sections, or aggregated data at the area level. In an attempt to measure agglomeration externalities for the country, Henderson’s (1986) estimates indicate that localisation economies seemed more important than urbanisation economies in the 1970s, meaning that manufacturing activities might be less productive in larger urban areas. Moreover, the mean of the coefficients estimated for different sectors is 0.046, ranging from 0.003 to 0.18.

Amaral et al. (2010) estimate a New Economic Geography (NEG) wage equation and find a positive relationship between wages and market potential. Also based on a NEG model, Fally et al. (2010) find that market potential and supply access (intermediary goods) are positively related to individual wages. Silva and Silveira-Neto (2009) explore the determinants of manufacturing employment growth at the state-sector level in Brazil between 1994 and 2004. Among the explanatory variables, there are: the average wage; average firm size; connections (a concentration index based on backward and forward linkages); a proxy for transportation costs; and dynamic externalities (specialisation and diversity). They find an indication that competition is relevant to employment growth (Jacobs’s and Porter’s theoretical arguments), as well as diversity (Jacobs).

Hierarchical wage equations are applied by Fontes et al. (2010) to control for variables at different levels (individual and territorial unit of analysis). Based on Census data for 1991 and 2000 (cross sections), the authors evaluate municipalities with at least 50,000 inhabitants and find significant regional disparities even after controlling for observed individual heterogeneity. There is evidence of a positive effect of urban scale on wages, as well as of a positive relationship between the industrial concentration and the density of modern productive services with wage levels.

With an urban economics wage equation, Simões and Freitas (2014) apply municipal data to find that urbanisation economies are more relevant to sectors with high technological intensity, while sectors with low and medium technological intensity benefit more from medium-sized urban centres, which are relatively less diversified. Considering data from RAIS (Annual Report of Social Information, from the Ministry of Labour), Freguglia and Menezes-Filho (2012) find that, when controlling for individual heterogeneity, almost 63 % of the total spatial differential disappears, meaning that local policies should focus on human capital development in order to promote regional development.

The investigation of the industrial scope of agglomeration economies, while controlling for individual heterogeneity, has become the subject of analysis of recent studies, owing to an increase in the number of available longitudinal individual-level databases. There is scene to contribute to this literature, especially in the context of a developing country such as Brazil, where empirical studies have only recently started to cover these possibilities.

3 Analytical framework

The theoretical framework adopted here is based on the wage equation developed by the urban economics literature. This approach is not new, especially in its aggregated regional version, which has been presented in a variety of studies ever since the 1970s. However, the recent wave of newly available individual-level longitudinal databases has allowed a proper control of individual heterogeneity. Following Combes et al. 2008b, profit optimisation by a price-taking firm j may generate the following equilibrium wage:

$$\begin{aligned} w_j =\mu \left( {1-\mu } \right) ^{\left( {1-\mu } \right) /\mu }s_j \left( {\frac{p_j A_j }{r_j^{1-\mu } }} \right) ^{1/\mu }=f\left( {s_j ,A_j ,p_j ,r_j } \right) . \end{aligned}$$
(1)

In this case, the estimation can be conducted for individual i working in firm j, which requires detailed information at the individual level, while \(p_j \) and \(r_j \) capture, respectively, agglomeration and dispersion forces. The average unit value \(p_j \) of the good produced by the firm can be raised by higher demand, weaker competition or cheaper intermediate goods, leading to more agglomeration. On the other hand, \(r_j \) measures the effects of the prices of other production inputs, which can increase with congestion, and may provide a dispersion force. The term \(A_j \) is related to technological externalities (among others, knowledge and learning spillovers and existing technology associated with workers’ abilities). Finally, it is necessary to control for individual skills \(s_j \) in order to capture the correct effect of local characteristics over salaries.

Within this context, when wages are regressed against a measure of city size, such as population density, the main underlying assumption is that this relationship happens through the elements discussed above. It is also important to control for all possible confounding variables. Moreover, the industrial mix must also be taken into consideration, especially because local wages also vary between sectors, which are heterogeneously affected by the degree of diversity, specialisation and concentration. These last factors are embedded in the term measuring technological externalities, namely \(A_j \).

Combining all these elements, the next sections will discuss the estimation of the equation presented above in two stages. In the first stage (Eq. 2), individual-level data are used to explain wages with individual characteristics and a dummy for each combination of region–sector–year, the spatial residual (Groot et al. 2014)—henceforth referred to as the spatial wage. This relationship can be written as:

$$\begin{aligned} \ln \left( {w_{i,r,s,t} } \right)= & {} \beta _0 +\beta _1 age_{i,r,s,t} +\beta _2 age_{i,r,s,t}^2 +\sum \nolimits _{edu} \beta _4 D_{i,r,s}^{edu} \nonumber \\&+\sum \nolimits _{factor} \theta _{factor} Skill_{i,r,s,t}^{factor} +\sum \nolimits _{firm\_size} \beta _5 D_{i,r,s,t}^{firm\_size} \nonumber \\&+\sum \nolimits _r \sum \nolimits _s \sum \nolimits _t \lambda _{r,s,t} +\varepsilon _{i,t}. \end{aligned}$$
(2)

Then this estimated spatial wage \(\hat{\lambda }_{r,s,t} \) will be used as the dependent variable in the second stage (Eq. 3), as it is associated with measures of agglomeration externalities and geographical characteristics.

$$\begin{aligned} \hat{\lambda }_{r,s,t}= & {} \delta _0 +\delta _1 \ln \left( {density_{r,t} } \right) +\delta _2 area_{r,t} +\delta _3specialisation_{r,s,t} \nonumber \\&+\delta _4 diversity_{r,t} +\delta _5 competition_{r,s,t} \nonumber \\&+\sum \nolimits _{geog} \delta _{geog} geography_{r,t} +\sum \nolimits _s \delta _6 D_s^{sector} \nonumber \\&+\sum \nolimits _t \delta _7 D_t^{year} +\sum \nolimits _{macro-region} \delta _8 D_{mr}^{macro-region} +\xi _{r,s,t}. \end{aligned}$$
(3)

Such a strategy in two stages is based on the argument that individuals residing in the same labour market and working in the same sector may share some specific characteristics that are not entirely captured by the controls included in Eqs. 2 and 3. In this case, the error term of a single-stage estimation will be positively correlated across individuals from the same sector and region, generating downwardly biased standard errors for the regional and sectoral level variables, which produce higher t statistics (Combes et al. 2008a; Moulton 1990; Bell et al. 2002, and Card 1995, in a different context). A two-stage estimation procedure aims to overcome this issue.

A descriptive analysis of all variables included in the model will be presented in the next section. Among the individual characteristics considered in Eq. 2: \(w_{i,r,s,t} \) is the hourly wage; \(age_{i,r,s,t} \) is measured in years; and \(D_{i,r,s}^{edu} \) assumes value 1 when the individual has attained a certain educational level (edu \(=\) {illiterate, incomplete primary school, complete primary school to incomplete high school, complete high school to incomplete college, college degree or more}), or 0 otherwise. Moreover, \(Skill_{i,r,s,t}^{factor} \) measures the degree of certain skills that are required for the job, following Maciente (2013).Footnote 4 Finally, \(D_{i,r,s,t}^{firm\_size} \) represents a group of dummies for each firm size in which the individual works (aiming to control for technological patterns and competition differences between firms, Glaeser et al. 1992); and \(\lambda _{r,s,t} \) is the sector-spatial wage.

Equation 3 presents a set of independent variables which aim to measure how the city size and the industrial mix affect the spatial wage. Urbanisation externalities are measured by the logarithm of employment density, and the area of the regionFootnote 5 is included to control for the scale effect. In addition, different measures of specialisation, diversity and competition are considered in alternative specifications (see Table 6). Other controls include two-digit-sector dummies, year and macro-region dummies, as well as geographical characteristics (altitude and distance to the Equator line).

The estimation strategies have the following variations: in the first stage, an individual fixed effect may be included to control for the unobserved individual heterogeneity constant in time. Furthermore, in the second stage, the potential endogeneity of employment density is taken into account with instrumental variables (IV) techniques. The instruments considered here are the following: population density in 1940; distance to the coast; and dummies for sugar cane and coffee production and gold exploration during the colonial period.Footnote 6 Table 7 provides a detailed description of all the variables considered, including methods of calculation and data sources.

In relation to specialisation, diversity and competition, there is a handful of measures applied in the literature. Even if it is possible to classify their suitability according to the problem concerned, it is interesting to compare the results of different combinations of measures. Table 8 in Appendix 3 provides a list of the alternative variables considered here, with their respective method of calculation, interpretation and source. Evaluating all of them increases the robustness of the results.

The last methodological issue to be highlighted is that the regressions presented in Eqs. 2 and 3 are estimated separately for five aggregated sectors, defined in Appendix 1. The reason of for doing this is that it is possible to compare the effect of agglomeration externalities and the industrial mix over the different development stages of manufacture and service industries, based on their technological and knowledge intensity. It also reduces the computational complexity of the estimation of the spatial wage.Footnote 7 The major drawback here is that the balanced panels consider only those individuals who do not move between these five aggregated sectors over time, but only within the industries belonging to each of them. Therefore, part of individuals who move between sectors is not observed. As the group which moves is usually associated with a higher productivity, it is possible that the estimations will be downwardly biased.

4 Data

The estimation of the wage equation discussed above is based on data of the formal Brazilian labour market. For this purpose, the main database considered here is the Identified RAIS (Annual Report of Social Information, from the Ministry of Labour), which consists of identified registration data of all formal firms and their employees, which focuses on the characteristics of the work contract. It provides annual information on all formally employed individuals in the private sector (or part of the public sector, depending on the type of contract), with a significant regional disaggregation (municipal level).Footnote 8

Among the main advantages of this database is that it has a longitudinal structure at both the firm and the individual level. Moreover, as it is a mandatory report, all firms are obliged to fill in the required information, covering the entire formal sector. This is also the reason why, differently from individual self-reporting surveys, there is a smaller risk of wage under-reporting. However, a potential drawback of using this database is that it covers only the formal sector, and it does not include entrepreneurs (who are self-employed) or individuals in managerial positions. Another relevant detail is that the reporting process is more accurate for the bigger firms, which are usually located in larger urban areas, and sometimes firms with multiple branches may report incorrectly the municipality of each employee.

Following the process of formalisation of the labour market, the size of RAIS has increased significantly in the past decade. In 2004, 30.3 % of total jobsFootnote 9 were formal, while in 2012 this percentage increased to 39.3 %. It is also important to note that there is a significant regional heterogeneity in this process. In 2004, only 17.5 % of total jobs were formal in the North region of the country and 17.1 % in the North-east, while 39.4 % were formal in the South-east, 35.1 % in the South and 28.6 % in the Centre-West. Then, in 2012, these percentages went up to 23.6 % in the North and 25.4 % in the North-east, while reaching 48.2 % in the South-east, 44.9 per cent in the South and 40.2 % in the Centre-West.Footnote 10

This issue is taken into account by comparing the relationships studied here over time (2004, 2008 and 2012). In order to explore the longitudinal structure of the database, a few steps were taken to generate a balanced panel for each period (2004–2008; 2008–2012; 2004–2008–2012), and the comparison between the first two data sets allows the discussion of the potential effect of formalisation on the characteristics of the formal labour force.Footnote 11 Finally, database tractability was obtained by generating random 20 % samples for each of the five technological sectors.Footnote 12

The descriptive statistics of the main variables of the first stage for the sample based on the period 2004–2008–2012 in 2012 are presented in Table 2, at the individual level. The average wage increases for sectors with higher technological or knowledge intensities. While Manufacturing high-tech (S3) and Services high-knowledge (S5) have relatively more workers with higher education, Manufacturing medium-tech (S2) workers are more present in the groups with incomplete and complete middle school education. As expected, Manufacturing low-tech (S1) and Services less-knowledge (S4) are characterised by less qualified workers (up to incomplete primary school education).

Table 2 Descriptive statistics of the main variables of the first stage of the model, 2012 (for the database containing individuals present in the whole period, 2004–2008–2012).
figure b

In terms of the skills required for the job, Cognitive skills are more relevant to sectors with a higher intensity of knowledge or technology (S3 and S5). The same pattern is observed for Telecommunication, Attention, On-the-job experience and Team-work skills. In the case of the service sectors (S4 and S5), they have more Assistance, Transportation, Artistic, Conflict management, Sales and Monitoring skills, while the manufacturing sectors (S1, S2 and S3) are associated with Maintenance and operation, Design and engineering, Accuracy and automation and Independence skills. Age does not seem to change significantly among these groups of sectors, except for S4, which seems slightly higher, but still with a considerably high standard deviation.

In relation to their regional distribution, both service sectors (S4 and S5) have a similar distribution of workers in the sample in all regions, and the biggest disparities in this distribution are observed for the manufacturing sectors. S3 and S2 are relatively more concentrated in the South-east of the country (the richest region in terms of GDP), while S1 is more present in the North-east. Comparing the regional distribution of services and manufacturing, the South has a greater concentration of workers for S1, S2 and S3, while the Centre-West is relatively less important for S2 and S3.

Firm size also varies among sectors, as there seems to be a larger share of workers in bigger firms in S3 and S5, which indicates the need for scale to develop productive activities in these sectors, at least in 2012. Furthermore, the panel structure is extremely relevant to understand the results found in the next section. The share of individuals who do not change industryFootnote 13 or REGIC area is very high, over 80 % for most sectors (except S5, in which it is slightly smaller than 70 %). This is an indication that the sample size necessary to obtain the variation to identify the spatial wage when individual fixed effects are included will be relatively small. Table 10 in Appendix 5 provides a comparison of selected descriptive statistics. Apart from the fact that individuals who change industry (within the technological sectors) and/or REGIC area are more educated, their mean salaries can be even smaller than the full sample. This last element goes against the literature, which states that workers who move are positively selected and should be more productive.

Finally, it is not possible to observe individuals in the sample for all 482 REGIC areas of the country, because the 20 % sample restricts the analysis and the formal sector is not necessarily present in all areas of the country (especially when public administration is excluded). The share of the areas is even smaller for S3, an indication that the high-technology manufacturing industries are more concentrated in specific parts of the country.

5 Results

In the previous section, it was mentioned that three different data sets would be analysed. Here the main results will be discussed in the case in which there are individuals present in the database for the years 2004, 2008 and 2012. At a certain point though, there will be a comparison using data for 2004–2008 and 2008–2012.

The first set of results refers to the first stage of the model (Eq. 2), at the individual level (Table 11). As mentioned in Sect. 3, there are two main alternatives considered here, a simple OLS estimation and a regression including individual fixed effects. When analysing the OLS estimation version, it is noticeable that most coefficients are statistically significant for all sectors, with the relative importance of age increasing with technology and knowledge intensity (when comparing the five sets of regressions, higher coefficients for age are found for S3 and S5). For each sector, age is also increasing at decreasing rates (the age-squared coefficient has a negative sign). In the case of education, all coefficients are positive and increase in value with higher levels of educational attainment. In fact, individuals with a graduate or postgraduate degree obtain higher salaries than illiterate individuals (this difference ranges from 67 to 83 %, depending on the sector). Only for the service sectors are the two lowest educational levels non-significant. The estimated coefficients for each skill seem to follow the expected signs. For instance, Cognitive skills are positively associated with wages in all sectors, while Transportation skills are positive only for S4 and S5 (services). On the opposite side, the Physical strength coefficient has a negative sign for most sectors (non-significant only for S1), which is in accordance with the literature. In the case of firm size, it is positively correlated with the individual wage, which means that bigger firms are supposedly more productive.

When individual fixed effects are considered (the last five columns), education variables are omitted, as they do not offer sufficient variation over time for each individual. Another relevant result is that many variables lose significance (i.e. age and some skills), as they are also captured by fixed effects. However, some elements are still present, such as the positive and significant coefficients for Cognitive skills and increasing values with firm size. Both groups of regressions also include interaction dummies between REGIC areas, industries and years (as expressed in Eq. 2). As mentioned in Sect. 3, the estimated coefficients of these dummies are directly used as dependent variables in the second stage and are referred to as the spatial wage.

While the first stage aimed to control for individual heterogeneity, the second stage explores the relationship between the spatial wage and different measures of diversity, specialisation and competition, as well as of urban size. Therefore, it will be possible to investigate the industrial scope of agglomeration economies. Table 3 presents a brief descriptive analysis of the main variables included in the second stage, at the REGIC area industry level for 2012.

Table 3 Descriptive statistics of the main variables for the second stage of the model for 2012 (database 2004–2008–2012).
figure c

The urbanisation measure indicates that S3, S5 and S2, in that order, are associated with denser areas. The mean of diversity increases with technology intensity for the manufacturing sectors and with knowledge level for the service sectors. This pattern is observed even for the diversity measure related to the share of the five biggest industries, because it is inversely related to diversity (a higher percentage indicates low diversity). In the case of specialisation measures, notably for the degree of specialisation, this relationship is exactly the opposite (higher specialisation is more relevant to the lower technological and knowledge intensity sectors). The location quotient achieves a much lower value for the service sectors, probably indicating that services are more homogeneously spread across the country. The degree of competition shows a very similar pattern to that of the diversity measures. Finally, the mean spatial wage seems higher in S5, S4 and S2, but there is no clear pattern.Footnote 14 The maps below show its regional distribution in detail (Fig. 1).

Fig. 1
figure 1figure 1figure 1

Regional distribution of the mean logarithm of the hourly wage and the spatial wage (for OLS and FE in the first stage), by sector in 2012 (database 2004–2008–2012)

The set of maps presented below allows a number of different analyses. First, for each sector, the comparison of the first map on the left with the two othersFootnote 15 shows that when individual characteristics are taken into account, at least part of the regional distribution changes significantly. This is an indication that controlling for individual characteristics in this matter is essential for the analysis in the regional context. For instance, for S1 to S4, if we compare the left and the centre maps, the Northern region of the country shows darker areas for the individual wage than for the spatial wage. For all sectors, the South-South-east of the country remains relevant in terms of having higher levels of the spatial wage (after controlling for individual characteristics). With regard to S3, there is a significant share of areas without information (44 %), due to a high regional concentration.

The inclusion of individual fixed effects in the first stage significantly changes the regional distribution of the spatial wage. As previously mentioned, in this case the spatial wage is calculated only for those individuals who moved between industries (inside the technological sector) and/or REGIC area over time (between 2004, 2008 and 2012). All maps are based on the 2012 data, which are associated with the last place in which these workers are observed. The fact that just a selected part of the sample of the workers who moved has been used to build the third column of maps explains why certain counterintuitive patterns emerge (such as darker areas in the countryside of the North-east, far from the coast) for most sectors.

Even if there is an indication of a possible spatial pattern in the data, for a number of reasons the possibility of estimating a spatial model was excluded. Firstly, the dependent variable in the second stage is not available for all REGIC areas. Moreover, by taking into account REGIC areas, many possible neighbourhood effects will be lost. This is because these areas are defined by daily commuting patterns, which already measure most of the relevant relations in the local labour market. If spatial effects were included here, they would possibly be capturing spurious relationships, which are absent in the specified model. These spatial effects could also capture part of the effects of local industrial composition. The main conclusion in this respect is that the spatial scale considered here minimises the need for spatial dependence analysis, which would be more justified on a smaller scale.

The dependent variable of the second stage is similar to the spatial wage depicted in the middle and right-hand maps (except that the maps show the average of the spatial wage over the various industries). Following Eq. 3, a set of controls in the industry and area levels are included as explanatory factors, in addition to different combinations of measures for the degree of urbanisation, specialisation, diversity and competition. These different combinations, called Compositions, aim to identify a robust pattern in the results from the analysis of the industrial scope of agglomeration economies, even when the explanatory variables are changed.

Table 4 Second-stage OLS regressions for the spatial wage, with and without individual fixed effects in the first stage.
figure d

If we focus initially on the five first columns of Table 4, which present the results for the second stage when there was OLS estimation in the first stage, there is a very robust indication for the urbanisation measure. Its coefficients are positive and statistically significant for all sectors, ranging from 0.0511 to 0.0940 in different Compositions. In all cases, the effects are ordered from the highest to the lowest as S3, S1, S5, S4 and S2. This can be considered as evidence that high-tech and low-tech manufacturing benefit more from urban scale. Another robust result is found for the diversity measures: there is a positive relationship for all sectors between more diverse areas and higher productivity (measured by spatial wages). The highest coefficients are observed for the manufacturing sectors, especially S3 (except in Composition 4, in which S1 and S2 have higher values).

The specialisation measure considered in the first specification (location quotient) is non-significant for all sectors (and negative for most of them). However, all other Compositions include the degree of specialisation for this dimension, with positive coefficients (which are non-significant only for S5). Therefore, the location quotient seems unsuitable for the problem concerned, while the degree of specialisation is more suitable to capture the importance of this dimension for local productivity. Finally, the estimated coefficients of the degree of competition are negative in all cases. High significance is found only for S2 and S4, and S1 is slightly significant in most specifications. Therefore, spatial wages in S3 and S5 seem to be unaffected by local competition, while sectors with lower technological or knowledge intensity are negatively affected.

It is possible to compare the results presented in Table 4 with the theoretical frameworks summarised in Table 1. The MAR approach is associated with (\(+\)) specialisation, (−) diversity and (−) competition, while that of Jacobs is related to (−) specialisation, (\(+\)) diversity and (\(+\)) competition and that of Porter is synthesised by (\(+\)) specialisation, (−) diversity and (\(+\)) competition. Hence, there is no clear pattern to indicate the more appropriate theoretical framework for each sector. In fact, for S5, it seems that only diversity is significant (and positive), suggesting at least partially that Jacobs’s perspective is more adequate. For S1, S2 and S4, even if diversity is positive and significant, the combination of positive specialisation and negative competition coefficients indicates that Marshall could be more adequate to explain their patterns. In the case of S3, there are elements from both Marshall and Jacobs’s perspectives. The result to be highlighted here is that the best possible industrial mix to foster productivity can vary for each sector, as it is associated with their technological intensity.

The last five columns of Table 4 depict the results for the spatial wage regression when individual fixed effects are included in the first stage. Therefore, only workers who have moved (between different REGIC areas and/or industries inside a technological sector) measure the spatial wage. Even if many of the coefficients change significantly from the previous analysis, in the case of density they are still positive and significant for all sectors. It is noteworthy that for S2 there is an important increase in the size of the elasticity (which was around 0.05 and went up to something around 0.16 and 0.19), but for the other sectors these elasticities assume a very similar value to the previous case. In the literature, most studies find that, when controlling for sorting (by including individual fixed effects), the estimated coefficient for urbanisation decreases significantly (Combes et al. 2008a). Following Groot et al. (2014), there is no consensus on whether the inclusion of individual fixed effects is the best strategy, because it may create a selection bias as the spatial wage will be measured only by migrants (who should be more associated with risk-taking, entrepreneurship and adaptation, among other characteristics, which can be positively related to productivity).

Table 5 Second-stage regressions for the spatial wage, with and without individual fixed effects on the first stage, and with instrumental variables for ln(density).
figure e

In relation to the other variables which capture industrial composition, the signs are similar, but there are important changes in significance. Diversity measures are still positive, but non-significant for S3 and less significant for S4 in all specifications. The degree of specialisation becomes non-significant or slightly significant for S2, S3 and S5 in the different specifications. In the case of the degree of competition, elasticities are still negative but lose significance for S1, S3, and S5. Then, S4 gets closer to Marshall’s explanation, while S1 and S5 spatial wages will be positively associated with diversity (Jacobs’s perspective). Finally, the explanatory power of each set of models varies significantly. When the first stage is based on OLS, second-stage variables are more relevant to explain the spatial wage, with an \(R^{2}\) ranging from 0.28 to 0.44. On the other hand, the last five columns show a more heterogeneous pattern for this measure of fit: while the \(R^{2}\) of S2 ranges around 0.49, the one for S3 is much lower (around 0.06). The fit of the model is also relatively high for S5, an indication that, for the remaining sectors (S1, S3 and S4), after controlling for individual heterogeneity, spatial wage variation is less associated with the local industrial composition.

The second issue explored here involves the potential endogeneity of the urbanisation measure (density). As discussed in Sect. 3, the inclusion of individual fixed effects aims to control for a sorting process, meaning that more productive individuals will migrate to bigger cities where they can find higher salaries, reinforcing city size and local productivity. However, it is also important to tackle the potential endogeneity of the quantity of labour (Combes et al. 2011). Table 5 presents a similar set of results as to those in Table 4, but with IV estimations. Once again, Composition 1 seems less adequate, especially for the regressions with OLS in the first stage. The main conclusions seem to hold, especially for the density coefficients, which are still positive and significant. Focusing on the first five columns, once again (with OLS in the first stage), diversity seems to be positively associated with the spatial wage, even if for S5 in some specifications this coefficient is non-significant.

There is no clear order of effects among sectors, except to highlight the fact that S3 has the highest elasticities. For competition, the coefficient of S5 is non-significant and the one for S1 becomes less significant. These results also indicate that specialisation is relatively more important for S1 and then S3, while the elasticities of competition show a stronger negative effect for S2 and S4. Therefore, the inclusion of instrumental variables does not change, in a relevant way, the main conclusions drawn from Table 4, which is the evidence of robust results.

Another robustness check involves comparing these models for different time periods. Instead of analysing all years simultaneously (2004–2008–2012), it is possible to break them down into two groups: individuals who are observed in 2004–2008, and individuals present in 2008–2012. It is important to note that the sample will not necessarily be composed by the same individuals as in the previous case, because now the requirement is that the individual is observed in the database in only two instead of 3 years.

Brazil has seen a significant formalisation process over the last decade, leading to an increase in the original database of this study (see Sect. 4). It is not possible to analyse the whole labour market simultaneously, owing to the lack of identified information at the individual level for the informal sector. However, it is necessary to investigate whether and how this formalisation may have changed productivity determinants over the country. The aforementioned comparison of two periods provides elements to address this issue at least partially and, at least indirectly, controls for the regional heterogeneity in the formalisation process identified in Sect. 4. Table 12 in Appendix 7 provides the main results for the models with IV in the second stage and OLS in the first. This set of results can be compared with the first five columns of Table 5.

In fact, significance does not seem to be affected in a relevant way by the split of the database, meaning that the relationships explored here are present during the whole period. The main change refers to the size of the main elasticities, which decrease from 2004–2008 to 2008–2012. This is especially true in the case of the urbanisation measure, with the highest decreases happening for S2, S4 and S1. A possible explanation for this result is that this formalisation process was stronger in more remote areas, places in which initial informality was higher. It is expected that the urban size will be smaller for these new areas with incoming workers, reducing density elasticities in relation to the spatial wage. For the other variables, there is no clear pattern, with increasing or decreasing coefficients, depending on the sector. However, the fact that signs, significance and relative size do not change much is another indication of the robustness of the results.

In summary, the main results found here seem to be reasonably robust to different specifications and time spans. The most suitable industrial mix for each sector can vary, meaning that not necessarily only one theoretical framework can explain the economy as a whole. There seems to be a heterogeneous effect for different sectors, a fact that must be taken into account when proposing public policies aiming to foster productivity. Moreover, private sector agents should balance the most relevant factors for their industry when choosing its location.

6 Conclusions

This study has aimed to contribute to the literature on static agglomeration effects, controlling for individual skills and comparing different estimation strategies. Focusing on the industrial scope of agglomeration economies in Brazil, we provide a large set of results covering different concerns usually present in this literature.

The analysis was based on identified registration data covering all formal firms and their employees in Brazil in three particular years (2004, 2008 and 2012). After all proper cleaning procedures, a sample was selected, with information for almost the whole country. Then the two-stage estimation procedure allowed the following variations: a first stage for the logarithm of the hourly wage at the individual level, with and without individual fixed effects; and a second stage for the spatial wage (obtained as the estimated interaction dummies area–sector–year in the first stage), with and without instrumental variables for employment density. In addition, the analysis was divided into two time periods, 2004–2008 and 2008–2012, in order to assess whether the formalisation process seen in Brazil in this period could have affected the results.

With a separate analysis for each of the five broad sectors (S1—Manufacturing low-tech; S2—Manufacturing medium-tech; S3—Manufacturing high-tech; S4—Services less-knowledge; and S5—Services high-knowledge), the main conclusion is that the best possible industrial mix to foster productivity can vary for each sector, being associated with their technological intensity. An unambiguous and universally valid theoretical framework can apparently not explain the results found for the different sectors.

Synthesising the theoretical alternatives, the MAR approach is associated with (\(+\)) specialisation, (−) diversity and (−) competition; Jacobs’s approach is related to (−) specialisation, (\(+\)) diversity and (\(+\)) competition; and Porter’s approach is summarised by (\(+\)) specialisation, (−) diversity and (\(+\)) competition. In the simplest estimation (with OLS in the first and second stages), for S5 it seems that only diversity is significant (and positive), suggesting that the Jacobian perspective is more appropriate. For S1, S2 and S4, even if diversity is positive and significant, the combination of positive specialisation and negative competition coefficients indicates that the MAR framework could be more adequate to explain these patterns. In the case of S3, there are elements from both Marshall’s and Jacobs’s perspectives. These results seem to be robust to different specifications and estimation strategies.

Another important aspect to be highlighted is that, when splitting the database into two time periods, even if the main conclusions remain unchanged, the size of the elasticities for urbanisation economies is much smaller in the second period, for all sectors. This can be evidence of the fact that the recent formalisation process was more concentrated in the remote areas of the country (with higher initial informality). Therefore, the urban size in these new areas is expected to be smaller, reducing density elasticities in relation to the spatial wage for the newcomers in the formal labour market.

Finally, the urbanisation economies coefficient (the logarithm of employment density) is positive and significant for all sectors, ranging from 0.0511 to 0.0940 in different specifications, under the simplest estimation (using OLS in the first and the second stages). Ordering the effects between the sectors from the highest to the lowest, we find the following sequence: S3, S1, S5, S4 and S2. This can be considered as evidence that the high-tech the and low-tech manufacturing sectors benefit more from urban scale in Brazil, followed by services associated with higher knowledge.

Consequently, different city sizes can be more appropriate for different industries. The local industry mix can also vary for each type of activity, which means that sometimes diversity is more relevant, while on another occasion specialisation can be more important, as well as competition. This implies that both public policy and private sector actions must consider this heterogeneity when searching for the most appropriate incentives and locational strategies to increase productivity in a certain region or specific production unit.