1 Introduction

Typically, the relationship between development and urbanization is illustrated with a graph plotting the degree of urbanization (fraction of population living in an urban area) against GDP per capita or its rate of growth. This is true of international comparisons (Henderson 2003a, p. 281; Spence et al. 2009, Annex 2), as well as regions (Crédit Suisse 2012, p. 16; Zhu et al. 2012, Fig. 2). Urban concentration has also been the subject of much attention in relation to economic development. Indeed, about the urbanization process that occurs with development, Henderson (2003b) writes: “There are two key aspects to the process. One is urbanization itself and the other is urban concentration (…)”. Brülhart and Sbergami (2009) also measure agglomeration alternatively through urbanization shares and through indexes of spatial concentration. In this paper, we develop a measure which combines both aspects.

Specifically, the issue we address is how to measure urbanization, using readily available data, in a way that reflects the potential for agglomeration economies of the urbanization type. Our approach is founded on the view that agglomeration provides opportunities for interactions between economic agents, a key mechanism by which urbanization economies are generated. Our measure will therefore be closely related to the view of an “urban area” as an integrated market. And as it turns out, we “rediscover” an index originally proposed by demographer Eduardo Arriaga (1970, 1975) as a measure of urbanization. That index possesses three properties that are fundamental for a measure of agglomeration: (1) it increases with the concentration of population and conforms to the Pigou–Dalton transfer principle; (2) it increases with the absolute size of constituent population interaction zones; and (3) it is consistent in aggregation. The index has other important properties: it does not require an arbitrary population threshold to separate urban from non-urban areas; it is easily adapted to situations where a population interaction zone lies partly outside the geographical area for which agglomeration is measured (boundary problem); and it can be satisfactorily approximated when data is truncated or aggregated into size-classes.

The rest of this paper is organized as follows. The next section presents a view of urbanization economies as resulting from opportunities for interaction. From this view follow the three fundamental properties that we believe an agglomeration index should possess. Concentration measures, in particular, fail to meet these conditions, but Arriaga’s index does. The following section develops this index from our theoretical view of agglomeration and discusses its properties. Then the index is computed for the Spanish NUTS III regions, and its performance is compared to that of the degree of urbanization and the Hirschman–Herfindahl concentration index. A concluding section completes the paper.

2 Measuring Urban Agglomeration as Opportunities for Interaction

A tailor, a physician and a sports coach all take measurements of the human body. But because their purposes differ, they use different measures. How should we measure agglomeration for the purpose of examining the potential for agglomeration economies of the urbanization type (urbanization economies, for short)? To answer that question, we need to have a theory, a model, or at least a general view of how urbanization economies arise.

The concept of agglomeration economies, first proposed by Weber (1909), is central in regional and urban economics. Ohlin (1933), Hoover (1937) and Isard (1956) clarify the idea and distinguish different types of agglomeration economies: (1) large-scale economies, (2) localization economies and (3) urbanization economies. Here we are concerned with urbanization economies, and more specifically with urbanization economies insofar as they are the result of interaction between economic agents. The concentration of population in an urban area multiplies opportunities for interaction; the greater the number of possible interactions, the greater the potential for urbanization economies, as proximity encourages formal and informal exchanges of ideas which nourish innovation and contribute to the diffusion of knowledge (knowledge spillovers). Consequently, the measure we are looking for is a measure of opportunities for interaction. This aspect of urbanization economies has been examined by Glaeser et al. (1992) in relation to city growth. The authors compare the evolution of employment in individual industries across cities, to confront the predictions of competing views of how knowledge spillovers stimulate growth (specialization vs. diversity; competition vs. monopoly and the internalization of externalities). In this paper, however, we deal with a different issue, namely, measuring opportunities for interaction in a geographical area (region, group of regions, country) that may comprise several cities, or a single city, or even none at all (a region of villages, for example).

Now, in the abstract, restricting our attention to interaction between pairs of individuals, the number of possible pairs in a group increases as the square of the number of individuals.Footnote 1 Not all links, however, are equally probable: distance (physical or social) may impede communication, and congestion may interfere with exchanges. But, practically speaking, it is impossible to weigh all pairs according to distance and congestion factors. Yet these can be taken into account indirectly, using labor mobility as an indicator of whether interactions are likely or not.

Local labor market areas (LLMAs) are delineated on the basis of labor mobility, so they are well-suited as basic territorial units for examining interaction-based urbanization economies. Labor mobility is also a key criterion used by statistical agencies to circumscribe metropolitan or urban areas. Consequently, an LLMA cannot in principle comprise only part of a metropolitan or urban area. LLMAs may (and do) however transcend municipal boundaries and include several population nuclei, as do metropolitan areas. But, contrary to metropolitan and urban areas, the set of LLMAs covers the whole territory, not just the main urban areas (it is a partition of the territory). It follows that any area that would not be comprised in any metropolitan or urban area is nonetheless included in some LLMA; so LLMAs containing metropolitan areas may also contain some additional areas. And, of course, not all LLMAs are metropolitan areas or even urban areas. The delineation of LLMAs has been implemented in several countries.Footnote 2 The partitioning of geographic space into LLMAs is an exercise in the spatial analysis of commuting patterns, in order to “define geographical units where the majority of the interaction between workers seeking jobs and employers recruiting labour occurs (i.e. to define boundaries across which relatively few people travel between home and work)” (Casado-Díaz 2000).

Now, population is often used as an indicator of potential urbanization economies for a metropolitan area. In such a case, when the focus is on opportunities for interaction, the implicit simplifying assumption is made that the interactions which matter are the ones taking place within the limits of the metropolitan area. The same simplifying assumption can be made regarding LLMAs to estimate the number of interaction opportunities in each LLMA. And since LLMAs define a partition of geographic space, including zones that would be classified as non-urban, it becomes possible to estimate the number of interaction opportunities for a region by aggregating the number of interaction opportunities in the LLMAs that constitute the region, following the procedure detailed below. Finally, to relate the number of interaction opportunities to productivity (output per unit of input), it must be divided by population. By computing interaction opportunities at the LLMA level before aggregating, we use the labor mobility criterion which defines LLMAs to determine which interactions are likely or not. This is how we implicitly take into account distance decay.Footnote 3

What does this interaction-based view of agglomeration tell us about how to measure the possibility of urbanization economies? First, since the number of opportunities for interaction increases more than proportionately with population, an agglomeration index should increase with concentration. More specifically, an agglomeration index should conform to (an inverted form of) the Pigou–Dalton transfer principle:Footnote 4 the number of opportunities created by moving one person to a larger LLMA (increasing concentration) is greater than the number of opportunities destroyed by his/her leaving the smaller LLMA. Second, a measure of concentration only is not a proper agglomeration index, because concentration is a property of the distribution of population between LLMAs, and it is insensitive to their sizes: there are more interaction opportunities per capita in a region consisting of two LLMAs with populations of 150,000 and 50,000 than in a region consisting of two LLMAs with populations of 15,000 and 5,000; or, to take an absurd example, a desert with a single inhabitant is 100 % concentrated, but offers no opportunities for interaction. Third, an agglomeration index should be consistent in aggregation: the same rule that is applied to aggregate individual LLMA agglomeration indexes into a regional index should also be valid to aggregate regional indexes into an agglomeration index of a group of regions.
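The two numerical claims in this paragraph can be checked with a short sketch. It assumes the per-capita measure used in this paper (sum of squared LLMA populations divided by total population, with the factor 1/2 omitted); the function and variable names are illustrative only and do not come from the original.

```python
def interaction_opportunities_per_capita(populations):
    """Sum of squared LLMA populations divided by total population.

    Assumption: pairwise interactions are approximated by n**2 and the
    factor 1/2 is dropped, which does not affect comparisons.
    """
    total = sum(populations)
    return sum(n * n for n in populations) / total


def concentration(populations):
    """Sum of squared population shares (a pure concentration measure)."""
    total = sum(populations)
    return sum((n / total) ** 2 for n in populations)


large_region = [150_000, 50_000]
small_region = [15_000, 5_000]

# Both regions have identical concentration (shares 0.75 and 0.25) ...
same_concentration = (
    abs(concentration(large_region) - concentration(small_region)) < 1e-12
)
# ... but very different interaction opportunities per capita.
per_capita_large = interaction_opportunities_per_capita(large_region)
per_capita_small = interaction_opportunities_per_capita(small_region)

# Transfer principle: move one person from the smaller to the larger LLMA.
before = sum(n * n for n in [150_000, 50_000])
after = sum(n * n for n in [150_001, 49_999])
net_gain = after - before  # opportunities created minus opportunities destroyed
```

In this sketch the larger region offers ten times as many opportunities per capita as the smaller one although a concentration measure cannot tell them apart, and the one-person transfer yields a positive net gain.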

There are other desirable properties, which will be discussed later. Since there are interaction opportunities even in the smallest of villages, an agglomeration index should be defined without reference to a population threshold below which an area is excluded from calculations. On the other hand, data is sometimes truncated, or aggregated into size-classes, so it would be advantageous to be able to adapt an agglomeration index to such circumstances. More generally, an index with modest data requirements is more widely applicable. Another attractive property is the possibility of computing an agglomeration index for a region that includes only part of some LLMAs (more about the boundary problem below).

Let us now examine measures of urbanization found in the literature to see whether they have the fundamental properties enunciated above, namely: (1) to increase with the concentration of population and conform to the transfer principle; (2) to increase with the absolute size of constituent LLMAs; and (3) to be consistent in aggregation. The degree of urbanization (percentage of population living in an urban area) fails as a measure of agglomeration with respect to all three criteria, in addition to raising the difficulty of drawing a line between urban and non-urban.

Much of the literature on urbanization and development, however, focuses on the relationship between urban concentration and economic development or productivity. Yet, by definition, measures of concentration do not satisfy our second criterion. As a matter of fact, according to Cowell (2009), a measure of concentration (which Cowell applies to measure income inequality) should satisfy the “income scale independence principle”, which, transposed to the measurement of agglomeration, is the exact opposite of our second criterion of an interaction-based measure of agglomeration. It will nonetheless be useful to briefly review urban concentration measures. Wheaton and Shishido (1981) and Henderson (1988) use the Hirschman–Herfindahl index, to which we shall return later. In the present context, it is defined as the sum of squared population shares of LLMAs.Footnote 5 Henderson (2003a, 2003b) uses urban primacy, the share of urban population that lives in the largest city. It does not satisfy the transfer principle, but Henderson argues that it is conveniently available for many years and many countries, and that it is closely correlated with Hirschman–Herfindahl indexes. Rosen and Resnick (1980)Footnote 6 measure urban concentration using the econometrically estimated exponent of the Pareto distribution for the city-size distributions of 44 countries. They find that it is quite sensitive to the definition of the city and the choice of city sample size (number of cities or city-population threshold). Brülhart and Sbergami (2009) apply Theil entropy indexes developed in Brülhart and Traeger (2005). But, as mentioned earlier, these are all concentration measures which fail to satisfy our second criterion.

Uchida and Nelson (2010) look for a remedy to inconsistencies in published UN data on the degree of urbanization which are due to divergences between reporting countries in the way they delineate urban areas and in the population thresholds applied to define what is urban and what is not. They propose an “agglomeration index” which focuses on the key indicators of the sources of agglomeration economies: population density, the size of the population in a “large” urban centre, and travel time to that urban centre. Combining data from several sources (including GIS data on transport networks), and applying interpolation techniques, they map all three factors on the surface of the earth in 1-kilometer pixels. For each of the three factors, a threshold is defined, and the estimated population in pixel areas that meet all three criteria is classified as “urban”. This yields an estimate of the degree of urbanization. The Uchida-Nelson agglomeration index approach is promising, but it remains tied to the urban/non-urban dichotomy and as such, it fails all three of the criteria mentioned above.

Reflecting on the weaknesses of the percentage of population living in “urban areas” as a measure of urbanization, demographer Eduardo Arriaga (1970, 1975) proposed an index which “takes into account the statistical concept of the expected value of the size of the locality where a person, randomly chosen, resides” (p. 208). In the following section, we develop an index based on interaction opportunities, which turns out to be Arriaga’s index applied to LLMAs. This index, as we shall see, satisfies all three basic conditions listed above.

3 Arriaga’s Index Applied to LLMAs

3.1 Arriaga’s Index as a Measure of Interaction Opportunities

We derive Arriaga’s agglomeration index from our view of urbanization economies as the result of interaction between economic agents. In so doing, we make two simplifying assumptions. First, we limit our attention to pairwise interactions, so that the number of interaction opportunities among n persons is \(n(n-1)/2\), which we approximate as \(n^{2}/2\). Second, invoking the space-analytic foundations of LLMA delineation, we take into account interaction opportunities within LLMAs, while ignoring interactions that may take place across LLMAs.
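As a quick check on the first assumption, the exact pair count can be written as the approximation times a correction factor; this worked step is ours and is only meant to show the order of magnitude of the error:

$$\frac{n\left( {n - 1} \right)}{2} = \frac{{n^{2} }}{2}\left( {1 - \frac{1}{n}} \right)$$

The relative error of the approximation is therefore 1/n: about 1 % for a village of 100 inhabitants and 0.001 % for an LLMA of 100,000 inhabitants.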

Formally, let \(n_{i}\) be the population size of the ith LLMA in the geographical area under consideration. Then the total amount of interaction opportunities is approximated as \(\frac{1}{2}\sum\nolimits_{i} {n_{i}^{2} }\). We relate productivity (output per unit of input) to the number of interaction opportunities of the average individual, which leads to the following measure of agglomeration:

$${\text{Measure of agglomeration}} = \frac{{\sum\limits_{i} {n_{i}^{2} } }}{{2\sum\limits_{i} {n_{i}^{{}} } }}$$
(1)

Expression (1) is precisely one half of Arriaga’s mean city-population size. Now let

$$f_{i} = \frac{{n_{i} }}{{\sum\limits_{j} {n_{j}^{{}} } }}$$
(2)

be the fraction of population in the ith LLMA. Dropping the division by 2, we have an index of agglomeration defined as

$$I = \frac{{\sum\limits_{i} {n_{i}^{2} } }}{{\sum\limits_{i} {n_{i}^{{}} } }} = \sum\limits_{i} {\left( {\frac{{n_{i}^{{}} }}{{\sum\limits_{j} {n_{j}^{{}} } }}} \right)n_{i}^{{}} } = \sum\limits_{i} {f_{i}^{{}} n_{i}^{{}} }$$
(3)

This is Arriaga’s (1970) index. His interpretation of the index stands out from formula (3)Footnote 7: since \(f_{i}\) is the probability that a randomly chosen individual resides in the ith LLMA, I is the mathematical expectation of the size of the LLMA in which a randomly chosen individual lives.Footnote 8 In other words, the average individual lives in an LLMA of size I.

Pursuing the development and substituting from (2), we find:

$$I = \sum\limits_{i} {f_{i}^{{}} n_{i}^{{}} } = \left( {\sum\limits_{j} {n_{j}^{{}} } } \right)\sum\limits_{i} {f_{i}^{{}} \left( {\frac{{n_{i}^{{}} }}{{\sum\limits_{j} {n_{j}^{{}} } }}} \right)} = \left( {\sum\limits_{j} {n_{j}^{{}} } } \right)\sum\limits_{i} {f_{i}^{2} }$$
(4)

where:

$$H = \sum\limits_{i} {f_{i}^{2} }$$
(5)

is the Hirschman–Herfindahl (HH) concentration index. So:

$$I = \left( {\sum\limits_{j} {n_{j}^{{}} } } \right)H$$
(6)

Our index is the HH concentration index, multiplied by the total population in the geographical area under consideration. This illustrates how it accounts for both concentration and size, which we consider as two aspects of the potential for economies of agglomeration of the urbanization type.

The reader may be surprised that the index takes the same value I = n for a territory with a single LLMA of size n as it does for a territory with K LLMAs, all of size n. This is true, and it is as it should be. It can be seen from Eq. (6) that in the first case, I = n × 1; in the second, using the numbers equivalent propertyFootnote 9 of H, which gives H = K × (1/K)² = 1/K, I = (K × n)(1/K) = n, as greater size makes up for less concentration. In both cases, the average individual lives in an LLMA of size n (following Arriaga’s interpretation). Insofar as I measures the average individual’s interaction opportunities, the potential increase in productivity associated with agglomeration economies is equal in both cases.
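The relationship between Eqs. (3), (5) and (6), and the equal-size case just discussed, can be illustrated with a minimal sketch. The LLMA populations are hypothetical (they are not the paper's data), and the function names are ours.

```python
def arriaga_index(populations):
    """Eq. (3): expected size of the LLMA of a randomly chosen individual."""
    total = sum(populations)
    return sum(n * n for n in populations) / total


def hh_index(populations):
    """Eq. (5): sum of squared population shares."""
    total = sum(populations)
    return sum((n / total) ** 2 for n in populations)


# Hypothetical LLMA populations, for illustration only.
llmas = [200_000, 120_000, 60_000, 20_000]

index_direct = arriaga_index(llmas)            # Eq. (3)
index_via_hh = sum(llmas) * hh_index(llmas)    # Eq. (6): total population times H

# One LLMA of size n versus K LLMAs, all of size n.
n, K = 50_000, 4
single = arriaga_index([n])
replicated = arriaga_index([n] * K)
```

Both routes give the same value up to floating-point rounding, and `single` and `replicated` are both equal to n, in line with the argument above.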

Finally, Arriaga’s index has an interesting geometric interpretation: in a graph of the stepwise cumulative distribution of population according to LLMA sizes, it is the area above the curve. We illustrate this using fictitious data as an example. Table 1 gives the population of each of 5 LLMAs in some geographic area under investigation.

Table 1 Fictitious data

As mentioned before, a common measure of the degree of urbanization is the proportion of population living in urban areas above a certain size. For instance, given the data in Table 1, the degree of urbanization could be measured as the proportion of population living in LLMAs with a population of at least 50,000. In our example this would be 92.4 % (425,000/460,000). Referring to the distribution of population (Table 1, column 2), this way of measuring urbanization amounts to lumping together all categories but the first. The conventional measure of urbanization is therefore based on a highly simplified representation of the more detailed distribution of population, represented in Fig. 1. It is completely insensitive to the distribution of population among the LLMAs with more than 50,000 inhabitants.

Fig. 1 Cumulative distribution of population according to LLMA sizes

Let K be the number of LLMAs in the geographical area under consideration (here 5). Then the area above the curve is equal to:

$$I = \sum\limits_{i = 1}^{K} {\left( {1 - F_{i - 1} } \right)\;\left( {n_{i} - n_{i - 1} } \right)}$$
(7)

where LLMAs are assumed to be ordered from smallest to largest, and \(F_{i} = \sum\nolimits_{j = 1}^{i} {f_{j} }\) is the cumulative distribution, with the conventions \(F_{0} = 0\) and \(n_{0} = 0\). In our example, this is equal to 0.0242. It is shown in Appendix 1 that (7) is equivalent to (3).
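The equivalence between the area formula (7) and formula (3) can also be checked numerically. The sketch below uses hypothetical populations (not those of Table 1) and assumes the conventions F_0 = 0 and n_0 = 0 for the first term of the sum; function names are ours.

```python
def arriaga_index(populations):
    """Eq. (3): sum of f_i * n_i."""
    total = sum(populations)
    return sum(n * n for n in populations) / total


def area_above_curve(populations):
    """Eq. (7): area above the stepwise cumulative distribution.

    LLMAs are sorted from smallest to largest; F_0 = 0 and n_0 = 0
    are assumed for the first term.
    """
    sizes = sorted(populations)
    total = sum(sizes)
    area = 0.0
    cumulative_share = 0.0  # F_{i-1}
    previous_size = 0       # n_{i-1}
    for n in sizes:
        area += (1.0 - cumulative_share) * (n - previous_size)
        cumulative_share += n / total
        previous_size = n
    return area


# Hypothetical LLMA populations, for illustration only.
llmas = [100_000, 10_000, 60_000, 30_000]

from_formula_3 = arriaga_index(llmas)
from_formula_7 = area_above_curve(llmas)
```

With these figures the four rectangles in (7) have areas 10,000, 19,000, 24,000 and 20,000, which sum to the same value as formula (3).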

Before moving on to examining the properties of the index, it should be pointed out that Arriaga’s original presentation dealt not with LLMAs, but with traditional city-population data. Arriaga examined the sensitivity of his index to the choice of an urban threshold, a cut-off point in the distribution below which localities are classified as rural and excluded from the computation of the index; he found that it was fairly robust.

3.2 Properties of the Index

To start with, the domain of the Arriaga index is well defined. It is non-negative, and its lower bound is zero. This extreme case would be approximated if all of the population lived in rural areas, in very small autarkic villages (of, say, 100 inhabitants); the cumulative distribution curve would then be close to an upside-down «L», with the horizontal line at the 100 % level, and the vertical line at the 100 population level. The upper bound of the index is equal to the population of the largest LLMA in the geographical area under consideration, and would occur if all population were concentrated in that largest LLMA. The curve would then be a mirror-image of an «L», with the vertical bar to the right.

3.2.1 Axiomatic Properties

We now verify whether the Arriaga index possesses the fundamental properties of an agglomeration index. First, considering Eq. (6), the value of the index clearly increases with concentration as measured by the HH index.Footnote 10 Moreover, it is demonstrated in Appendix 3 that it conforms to the transfer principle. Second, once again turning to Eq. (6), the index clearly increases with the average size of LLMAs.

Let us examine whether the Arriaga index is consistent in aggregation. First, notice that, for a geographical area containing a single LLMA of population n, formula (3) shows that the value of the index is simply \(n^{2}/n = n\). The same formula shows that for K LLMAs, the value of the index is a weighted average of individual LLMA indices \(n_{i}\), where the weights \(f_{i}\) are the population shares of LLMAs. Now consider a geographical area of interest partitioned into two regions, defined by a pair of complementary sets A and \(\bar{A}\): the ith LLMA belongs to one region if i ∊ A, and to the other if i ∉ A. Now let

$$F_{A} = \left( {\frac{{\sum\limits_{i \in A}^{{}} {n_{i} } }}{{\sum\limits_{i}^{{}} {n_{i} } }}} \right)\;{\text{and}}\;F_{{\bar{A}}} = \left( {\frac{{\sum\limits_{i \notin A}^{{}} {n_{i} } }}{{\sum\limits_{i}^{{}} {n_{i} } }}} \right) = 1 - F_{A}$$
(8)

where \(F_{A}\) is the fraction of population that lives in an LLMA belonging to the set A of LLMAs that constitute one region, while \(F_{{\bar{A}}}\) is the fraction of population that lives in the other region. Also let

$$f_{i\left| A \right.} = \frac{{n_{i} }}{{\sum\limits_{j \in A} {n_{j} } }} = \frac{{n_{i} }}{{\sum\limits_{j} {n_{j} } }}\frac{{\sum\limits_{j} {n_{j} } }}{{\sum\limits_{j \in A} {n_{j} } }} = \frac{{f_{i} }}{{F_{A} }}\quad{\text{and}}\;{\text{likewise}},\;f_{{i\left| {\bar{A}} \right.}} = \frac{{f_{i} }}{{F_{{\bar{A}}} }}$$
(9)

According to formula (3), we have

$$I_{A} = \sum\limits_{i \in A} {f_{{i\left| A \right.}} n_{i} } \quad{\text{and}}\;I_{{\bar{A}}} = \sum\limits_{i \notin A} {f_{{i\left| {\bar{A}} \right.}} n_{i} }$$
(10)

Recalling that \(n_{i}\) is the agglomeration index of the ith LLMA, formula (3) applied to the two regions translates as

$$I = F_{A} I_{A} + \left( {1 - F_{A} } \right)I_{{\bar{A}}}$$
(11)

Substitute from (9) and (10), and use \(F_{{\bar{A}}} = 1 - F_{A}\) to find

$$I = F_{A} \sum\limits_{i \in A} {\left( {\frac{{f_{i} }}{{F_{A} }}} \right)n_{i}^{{}} } + F_{{\bar{A}}} \sum\limits_{i \notin A} {\left( {\frac{{f_{i} }}{{F_{{\bar{A}}} }}} \right)n_{i}^{{}} } = \sum\limits_{i} {f_{i} n_{i}^{{}} }$$
(12)

So indeed, the same rule that is applied to aggregate individual LLMA agglomeration indexes into a regional index is also valid to aggregate regional indexes into an agglomeration index of a group of regions.
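A small numerical sketch of this aggregation property follows. The two regions and their LLMA populations are hypothetical, and the names are ours; the point is only that the population-weighted average of regional indexes in Eq. (11) reproduces formula (3) applied to all LLMAs at once.

```python
def arriaga_index(populations):
    """Eq. (3): sum of f_i * n_i."""
    total = sum(populations)
    return sum(n * n for n in populations) / total


# Hypothetical LLMA populations, split into two complementary regions.
region_a = [300_000, 50_000]
region_not_a = [100_000, 40_000, 10_000]

pop_a = sum(region_a)
pop_not_a = sum(region_not_a)
share_a = pop_a / (pop_a + pop_not_a)  # F_A in Eq. (8)

# Eq. (11): population-weighted average of the two regional indexes.
aggregated = (
    share_a * arriaga_index(region_a)
    + (1 - share_a) * arriaga_index(region_not_a)
)

# Eq. (3) applied directly to the whole set of LLMAs.
direct = arriaga_index(region_a + region_not_a)
```

Here both `aggregated` and `direct` come out at 208,400 (up to rounding), so the order in which LLMAs are grouped into regions does not matter.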

3.2.2 Boundary Problem and Other Applicability Considerations

So far, the Arriaga index has been developed under the assumption that every LLMA is entirely contained in the geographical area for which the index is to be computed. But this cannot be guaranteed, since LLMAs are delineated without regard for administrative boundaries.Footnote 11 Let \(f_{ir}\) be the fraction of the population in region r residing in the ith LLMA. Then, bearing in mind Arriaga’s (1970) interpretation, the average size of the LLMA where a randomly chosen individual lives is given by a formula slightly different from (3):

$$I_{r} = \sum\limits_{i} {f_{ir}^{{}} n_{i}^{{}} }$$
(13)

where \(f_{ir}\) replaces \(f_{i}\). Eq. (13) is the way to compute our index for regions when LLMAs extend across regional boundaries. Now, however, the tight relationship with the HH concentration index breaks down.
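To make the boundary adjustment concrete, here is a minimal sketch of Eq. (13). The region, the LLMA identifiers and all population figures are hypothetical, and the dictionary-based data layout is our assumption, not the authors' implementation.

```python
def regional_index(residents_by_llma, llma_population):
    """Eq. (13): I_r = sum over i of f_ir * n_i.

    residents_by_llma: residents of region r living in each LLMA.
    llma_population:   total population n_i of each LLMA, including
                       any part lying outside region r.
    """
    region_total = sum(residents_by_llma.values())
    return sum(
        (residents / region_total) * llma_population[llma]
        for llma, residents in residents_by_llma.items()
    )


# Hypothetical case: LLMA "X" straddles the regional boundary
# (500,000 inhabitants, of whom 100,000 live in region r);
# LLMA "Y" lies entirely inside region r.
llma_population = {"X": 500_000, "Y": 100_000}
residents_in_r = {"X": 100_000, "Y": 100_000}

index_r = regional_index(residents_in_r, llma_population)

# Ignoring the boundary problem would treat each LLMA as if it ended
# at the regional border, which understates interaction opportunities.
naive_index = regional_index(residents_in_r, residents_in_r)
```

In this example half of the region's residents live in an LLMA of 500,000 and half in one of 100,000, so the average resident lives in an LLMA of 300,000, whereas the truncated calculation would report 100,000.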

To sum up, the Arriaga index applied to LLMAs satisfies the three fundamental properties of an agglomeration index. It is less difficult to compute than the Uchida and Nelson (2010) agglomeration index, while implicitly taking into account their three criteria—population density, the size of a “large” urban centre, and travel time—through the spatial-analytic underpinnings of LLMA delineation, and without the requirement of defining an arbitrary urban threshold. On the other hand, if a delineation of LLMAs is not available, the Arriaga index is applicable to traditional city-population data, although it is preferable that “cities” be defined as functional areas as are metropolitan areas (in general, every metropolitan area is an LLMA). Finally, we have shown elsewhereFootnote 12 that it would also be possible to compute the index, and obtain similar results, when the underlying population data is available only in LLMA size-categories, rather than for individual LLMAs.

4 Empirical Application

In this section, we apply this index to the Spanish provinces (NUTS III regions), using LLMAs as basic units for the construction of the index. Spain is a very good example because data availability forces much of the empirical research to be conducted at the NUTS II or NUTS III levels, even though a different geography may be preferable. Most of the economic information provided by the Spanish National Statistical Institute (INE) (GDP, stock of capital, wages or employment data…) is available only for the whole country or at the NUTS III level. There is little data available at a finer level of geographical detail, such as municipalities. To find geographically disaggregated economic information, one has to look up some very specific databases or use the data from tax or unemployment registers. This scarcity of finer information also prevails in many other countries. Fortunately, detailed population data is often available. And from detailed population data, it is possible to construct the Arriaga index and relate the degree of agglomeration to other key economic concepts such as regional productivity or growth.

4.1 The Spanish Provinces

Administratively, Spain is divided into 8,105 municipalities that are aggregated into 52 provinces (NUTS III level) and 17 Autonomous Communities or NUTS II regions. The number of municipalities within each province ranges from 34 (Las Palmas) to 371 (Burgos). Furthermore, there are Autonomous Communities with several provinces (for example, Andalusia with eight), and others with only one, like Asturias. For comparison with other European Union member-states, the seventeen Autonomous Communities can be aggregated into seven administrative regions or NUTS I regions, which have no real internal political or administrative meaning.

It is important to point out that municipalities are not the basic territorial units from which we construct our index. In Spain, a municipality is an administrative division of the territory which has not necessarily been defined with economic significance in mind. Indeed, in many cases, there is a high level of commuting between neighboring municipalities. And so municipalities have been aggregated into LLMAs which may transcend municipal boundaries, and possibly make up a metropolitan area, which might include several population nuclei surrounding a core one. To delineate LLMAs in Spain, Boix and Galleto (2006) have applied the regionalization method developed for Italy by Sforzi (ISTAT 1997, 2005, 2006; Sforzi 2012). The Spanish LLMAs have been delineated through a multi-stage process. Applying an algorithm that consists of four main stages and a fifth stage of fine-tuning, Boix and Galleto aggregate the 8,106 Spanish municipalities into 806 LLMAs. The algorithm starts with the municipal administrative unit and it generates the LLMAs using data on resident employed population, total employed population and home-to-work commuting, from the 2001 Spanish Population and Housing Census (INE).Footnote 13, Footnote 14

The LLMA data is used to compute the Arriaga index of urban agglomeration for each province. Since there are LLMAs which straddle provincial boundaries, the actual formula used in the calculations is Eq. (13), developed above to deal with the boundary problem.Footnote 15 Finally, for ease of presentation, all index values were divided by the population of the Madrid LLMA, the largest in the country, so that their range of variation is from 0 to 1.

4.2 Comparison of the Indexes

Figure 2 represents the cumulative distribution of population according to LLMA size for the province of Asturias, 2001 (similar to Fig. 1). The area above the curve is equal to the Arriaga index, the value of which is 233,637, or 4.4 % of the population of the Madrid LLMA.

Fig. 2 Cumulative distribution of population according to LLMA size, Asturias (2001). Source: own elaboration based on INE (2001) data

All the Spanish provinces are plotted in Fig. 3, ranked by the value of the index. Madrid is followed by Barcelona and, at a much greater distance, by Vizcaya, Valencia and Seville, which contain some of the biggest cities in the country. At the opposite end are the provinces with the lowest population densities in the country (Huesca, Cuenca, Soria and Teruel).

Fig. 3 Arriaga index for Spanish provinces, 2001 (% of Madrid LLMA population). Source: own elaboration based on INE (2001) data

Figure 4 plots the HH index of urban concentration and Fig. 5 the classical degree of urbanization (the percentage of population living in cities of more than 50,000 inhabitants). Both figures show how ranking provinces according to the HH index or the classical degree of urbanization leads to obvious aberrations if one wants to compare provinces with respect to their potential for interaction-based urbanization economies.Footnote 16 Ceuta and Melilla, two autonomous cities located in North Africa, with a joint population of less than 140,000 inhabitants in 2001, stand at the top of the hierarchy, with a 100 % degree of urbanization and an HH index of 1. The province of Teruel has no LLMA of 50,000 inhabitants or more, so its degree of urbanization is zero, and the tail-end of the curve in Fig. 5 drops abruptly. In spite of this, and although it is the fourth least populated province of Spain with the second lowest density, Teruel ranks significantly better, 39th, with respect to the HH index; the Arriaga index puts it in 49th position. There are other cases of misrepresentation with the HH index: for example, Alicante is in 50th position with respect to that index, but it occupies the 5th rank in Spain for population, and the 7th in terms of density; the Arriaga index, which takes into account both concentration and scale, ranks Alicante 24th. Barcelona, the province with the country’s second largest city, is surprisingly ranked 15th according to the HH index. Aberrations also appear with the classical index: Seville, for instance, is the 4th largest province in terms of its population (1.7 million), 70 % of which live in the Seville metropolitan area, the 4th largest city in Spain; yet it falls to the 18th place according to its degree of urbanization, and to 10th place according to the HH index, behind Álava with a mere 300 thousand in population and a much lower density. We quote two final examples of sharp changes in ranking between the three indexes. Málaga, the 6th largest city in Spain, occupies the 18th place according to the classical index, but the 8th with Arriaga’s (9th with the HH index); Guadalajara, among the least populated provinces, with one of the lowest densities in the country, ranks 7th with the HH index and 13th with the classical index, while it is 27th with Arriaga’s. We conclude that the Arriaga index yields a more suitable classification of the Spanish provinces, which better reflects potential agglomeration economies from urbanization, something the other two indexes are unable to capture.Footnote 17

Fig. 4 HH concentration index for Spanish provinces, 2001 (%). Source: own elaboration based on INE (2001) data

Fig. 5 Degree of urbanization of Spanish provinces, 2001 (% of population living in LLMAs of 50,000 inhabitants or more). Source: own elaboration based on INE (2001) data

4.3 Index Performance Evaluation

To illustrate that our index better captures economic patterns, we correlate it with the location quotients of some of the activities known to be highly sensitive to agglomeration economies of the urbanization type: high order producer services, also called knowledge intensive business services (henceforth KIBS). Numerous empirical studies use location quotients (alongside other measures) to confirm the tendency of these industries to concentrate in response to agglomeration economies of the urbanization type (see footnote 18). Hence, we expect a good index of urban agglomeration to be highly correlated with the location quotients of these services, and among competing indexes, we would prefer the one with the highest correlation.

The location quotient (LQ) that we use is the simplest one, defined as follows:

$$LQ_{jp} = \frac{e_{jp} / e_{p}}{E_{j} / E}$$
(14)

where \(LQ_{jp}\) is the location quotient of sector j in province p; \(e_{jp}\) is employment in sector j in province p; \(e_{p} = \sum_{j} e_{jp}\) is total employment in province p; \(E_{j} = \sum_{p = 1}^{n} e_{jp}\) is total employment in sector j in Spain (n is the number of spatial units: 52 provinces); and \(E = \sum_{j} E_{j}\) is total employment in Spain.
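As a minimal sketch of Eq. (14), the location quotients can be computed from a table of employment by province and sector. The function name `location_quotients`, the nested-dictionary layout and the two-province example are assumptions made for illustration; the figures are invented, and every province and sector is assumed to have positive employment.

```python
def location_quotients(employment):
    """Return LQ[p][j] given employment[p][j], jobs in sector j in province p."""
    sectors = {j for row in employment.values() for j in row}
    # e_p: total employment in each province
    e_p = {p: sum(row.values()) for p, row in employment.items()}
    # E_j: national employment in each sector (sum over provinces)
    E_j = {j: sum(row.get(j, 0) for row in employment.values()) for j in sectors}
    # E: total national employment
    E = sum(E_j.values())
    return {
        p: {j: (row.get(j, 0) / e_p[p]) / (E_j[j] / E) for j in sectors}
        for p, row in employment.items()
    }


# Hypothetical employment counts; illustrative values only.
employment = {
    "A": {"finance": 30, "other": 70},
    "B": {"finance": 10, "other": 190},
}
lq = location_quotients(employment)
print(lq["A"]["finance"], lq["B"]["finance"])
```

A quotient above 1 indicates that the sector's share of provincial employment exceeds its national share; in this toy example, finance is over-represented in province A and under-represented in province B.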

Table 2 shows the correlations between each index (the degree of urbanization, the HH concentration index and our index) and the location quotients of eight high order producer-service industries. In all cases, our index is more closely correlated with the location quotients than the others. The second part of Table 2 shows the same correlations without the two outlier observations, Ceuta and Melilla (see above). Interestingly, the correlations for our index barely change, while those for the two others improve substantially (but not to the point of surpassing our index); this may indicate that the Arriaga index deals more effectively with outliers.

Table 2 Correlations between the location quotients of KIBS industries and the proposed index, the degree of urbanization and the HH index (2001)
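The comparison reported in Table 2 can be sketched as follows, assuming that the correlations are ordinary Pearson coefficients computed across provinces (the text does not state the coefficient used). The helper names and the data are hypothetical; the values are invented to show the mechanics of dropping outlier observations and are not taken from Table 2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_excluding(index, lq, excluded=()):
    """Correlate index values with LQs over all provinces not in `excluded`."""
    keep = [p for p in index if p not in excluded]
    return pearson([index[p] for p in keep], [lq[p] for p in keep])


# Hypothetical concentration-type index and LQs; "OUT" mimics a small,
# fully concentrated area with a low location quotient.
index = {"P1": 0.1, "P2": 0.2, "P3": 0.3, "P4": 0.4, "OUT": 1.0}
lq = {"P1": 0.6, "P2": 0.8, "P3": 1.0, "P4": 1.2, "OUT": 0.5}

r_all = correlation_excluding(index, lq)
r_trimmed = correlation_excluding(index, lq, excluded={"OUT"})
print(r_all, r_trimmed)
```

In this toy case a single small but fully concentrated area is enough to reverse the sign of the correlation for a concentration-type index, which is the kind of sensitivity the second part of Table 2 is meant to reveal.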

Figure 6 illustrates the relationship between each of the three indexes and the location quotients of four of the KIBS industries. To make the graphs more legible, our proposed index is plotted on a logarithmic scale, which makes the linear trend line appear as a curve.
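Why a straight trend line appears curved on a logarithmic abscissa can be checked numerically. The sketch below uses a hypothetical linear trend (the intercept and slope are arbitrary, not estimates from the data) evaluated at index values that are evenly spaced on a base-10 logarithmic axis.

```python
from math import log10


def trend(x, intercept=0.5, slope=1e-6):
    """A hypothetical linear trend line y = intercept + slope * x."""
    return intercept + slope * x


# Index values evenly spaced on a log10 axis (one decade apart).
xs = [10_000, 100_000, 1_000_000, 10_000_000]
points = [(log10(x), trend(x)) for x in xs]

# Slope of each segment as drawn on the semi-log plot
# (change in y per unit of log10(x)).
drawn_slopes = [
    (y1 - y0) / (u1 - u0)
    for (u0, y0), (u1, y1) in zip(points, points[1:])
]
print(drawn_slopes)
```

The drawn slope increases from one decade to the next, so a line that is straight in the original units bends upward on the semi-logarithmic graph.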

Fig. 6 Relation between the location quotients of selected KIBS industries and the proposed index, the degree of urbanization and the HH index (2001). Relation between a the proposed index and financial services LQs, b the proposed index and computing and information technologies LQs, c the proposed index and advertising services LQs, d the proposed index and audiovisual and entertainment services LQs, e the degree of urbanization and financial services LQs, f the degree of urbanization and computing and information technologies LQs, g the degree of urbanization and advertising services LQs, h the degree of urbanization and audiovisual and entertainment services LQs, i the HH concentration index and financial services LQs, j the HH concentration index and computing and information technologies LQs, k the HH concentration index and advertising services LQs, l the HH concentration index and audiovisual and entertainment services LQs. Note: the curves in panels a, b, c and d are not quadratic. They are straight lines displayed on a semi-logarithmic scale. The logarithmic abscissa is used to make the graphs legible: with a linear abscissa, the majority of data points would be crammed together near the origin. Source: own elaboration based on INE (2001) data.

For all these activities, the proposed index captures the effect of the country's main metropolitan areas much better: Madrid, represented by the top right-hand point on the trend lines, and Barcelona, the first point to the left of Madrid. For the rest of the provinces, too, this index clearly gives a better fit between the location quotients of KIBS industries and the measure of urban agglomeration. By contrast, the relation with both the degree of urbanization and the HH index shows, in many cases, apparent heteroskedasticity: the largest deviations from trend appear for the higher values of the index, that is, for the provinces where the biggest cities are. In addition, both the degree of urbanization and the HH index are equal to 1 for Ceuta and Melilla, two autonomous cities located in North Africa, with a joint population of less than 140,000 inhabitants in 2001. In general, the degree of urbanization and the HH index account for concentration, but not for size. As a consequence, they take on high values for provinces whose population is substantial, though not very large, and concentrated around a medium-sized city. In such cases, these two indicators clearly overstate the interaction opportunities and the potential for urbanization economies. The Arriaga index, which takes account of both size and concentration, does not suffer from the same distortion.

5 Summary and Conclusions

In this paper, we put forth the view that the potential for urbanization economies increases with interaction opportunities. From that premise follow three fundamental properties that an agglomeration index should possess: (1) to increase with the concentration of population and conform to the transfer principle; (2) to increase with the absolute size of constituent LLMAs; and (3) to be consistent in aggregation. Concentration measures, in particular, fail to meet condition (2).

We then develop an index of agglomeration based on the number of interaction opportunities per capita in a geographical area of interest. This is made possible thanks to two simplifying assumptions: (1) we limit our attention to pairwise interactions, and (2) invoking the space-analytic foundations of LLMA delineation, we take into account interaction opportunities within LLMAs, while ignoring interactions that may take place across LLMAs. This leads to Arriaga’s mean city-population size, which is the mathematical expectation of the size of the LLMA in which a randomly chosen individual lives.

We apply the index to the Spanish provinces and compare it to the degree of urbanization and the Hirschman–Herfindahl concentration index. We find that the three indexes rank the provinces quite differently. An examination of the more extreme cases of rank change shows that ranking according to the proposed index better reflects the geographical distribution of population, with respect to both size and concentration, and allows the potential for agglomeration economies from urbanization to be captured correctly. Next, we correlate all three indexes with the location quotients of eight knowledge intensive business services (KIBS) industries known to be highly sensitive to agglomeration economies of the urbanization type. We find that our index clearly displays a better fit between the location quotients of KIBS industries and the measure of urban agglomeration, as confirmed by the much higher correlation coefficients.

The index has other advantages. It does not require defining an arbitrary population threshold that excludes areas classified as non-urban from the calculations. It is easily extended to accommodate situations where an LLMA lies partly outside the geographical area for which agglomeration is measured. Finally, its already modest data requirements can be relaxed if necessary, since a satisfactory approximation of the index can be computed from data that is truncated or aggregated into size-classes. All these properties, together with the fact that the practice of delineating LLMAs is spreading among statistical agencies, make the index easily reproducible for different areas or countries, so that it will become increasingly convenient to use.

This index, both in its original version and applied to LLMAs, is well rooted in a theoretical view of agglomeration economies; its data requirements are modest; and we have shown that, at least in the case of Spain, it performs better than other commonly used agglomeration indicators. We look forward to seeing its use expand.