1 Introduction

Public Use Microdata Sample (PUMS) files are one of the U.S. Census Bureau’s most valuable data products for social science and policy analysis, providing detailed questionnaire responses from the decennial censuses and the American Community Survey (ACS) for a large sample of the U.S. population. Using PUMS data, researchers can generate custom cross-tabulations with great flexibility and investigate relationships among all reported characteristics for individual respondents and their households. One limitation of PUMS files is that, in order to protect privacy, the Census Bureau restricts the detail of reported geographic information. The only sub-state geographic units identified are Public Use Microdata Areas (PUMAs), which are custom-designed agglomerations of other standard units (census tracts, counties, etc.), each required to have at least 100,000 residents. This restriction makes it impossible to identify smaller communities and neighborhoods in PUMS data, and identifying larger regions is also often complicated by mismatches between PUMAs and other geographic units.

IPUMS USA (https://usa.ipums.org), a website that disseminates harmonized U.S. census microdata, has developed numerous tools and resources to facilitate microdata access and use. To expand on the limited geographic information provided in PUMS files, IPUMS USA supplies supplemental variables that identify several standard geographic units other than PUMAs. IPUMS can identify only units that correspond well to a set of PUMAs, but this approach has still enabled the identification of hundreds of counties, cities, and metropolitan areas for most decennial and ACS microdata samples.

We introduce here two PUMA-based indices that IPUMS USA recently added to its collection of supplemental variables in order to facilitate analysis of demographic variation across different levels of urbanization. The two indices—average tract population density and average metro/micro-area population—correspond to two distinct dimensions of settlement patterns: “concentration” (the local intensity of settlement) and “size” (the total population of the commuting system). For analysts seeking to distinguish levels of urbanization in microdata, IPUMS USA has also long provided a categorical variable named “METRO,” which identifies metropolitan status and central/principal city status based on PUMA information. We demonstrate here how the new indices offer valuable advantages relative to METRO. Crucially, because they are continuous and represent two distinct dimensions, the new indices distinguish a much broader range of variation across the rural–urban universe of settlement patterns.

In succeeding sections, we first discuss limitations of the standard metropolitan and urban classifications, particularly for analyses of microdata, and the potential advantages of using two continuous indices. We then provide the exact definitions of the new IPUMS USA indices and discuss how they correspond conceptually to two important dimensions of settlement patterns. Finally, we demonstrate the utility of the new indices in an examination of poverty across the rural–urban spectrum. We find that poverty rates are lowest in moderately dense parts of major metro areas, and they are high in both low-density and high-density areas, as well as in smaller commuting systems. We also find that correlations between poverty and demographic characteristics vary considerably across settlement types. More generally, our findings demonstrate the value of modeling urban/rural status as a continuously varying, multi-dimensional phenomenon, an approach that is directly facilitated by the new indices from IPUMS USA.

2 Limitations of Standard Classifications

To distinguish rural and urban populations, analysts commonly use one of two classification systems defined by federal agencies: the core-based statistical area (CBSA) definitions of the Office of Management and Budget (OMB), which delineate metropolitan and micropolitan statistical areas (i.e., metro and micro areas), or the official urban/rural classification of the Census Bureau.Footnote 1 The Bureau’s criteria and guidelines for PUMA delineationsFootnote 2 do not require any agreement with CBSAs or urban/rural delineations, so PUMAs do not consistently align with either of these systems (Fig. 1). To distinguish suburban populations, analysts sometimes also use the OMB’s central or principal city definitions,Footnote 3 treating as “suburban” the population living in metro areas but outside of central/principal cities (e.g., Mattingly & Bean, 2010), but PUMAs need not align with city boundaries either.

Fig. 1

2010 PUMAs, 2010 urban areas and 2013 CBSAs (metropolitan and micropolitan areas) in a section of south-central Texas

PUMAs do occasionally align with CBSAs but almost never with urban areas (Fig. 1). The boundaries of urban areas are complex and idiosyncratic, and urban areas can also have relatively small populations (down to 2500), so outside the cores of major urban areas, nearly all PUMAs encompass a mix of urban and rural areas. CBSA boundaries, on the other hand, always follow county boundaries, which often also form PUMA boundaries. Metro areas generally have populations larger than 100,000, enough for a single metro area to comprise one or more whole PUMAs. Likewise, central and principal cities are often large enough to comprise whole PUMAs.

The numerous correspondences between PUMAs and the OMB delineations make it possible to identify the exact metro/nonmetro status for most PUMS records and occasionally also the central/principal city status. Nevertheless, identifying a status for all PUMS records, as IPUMS USA does through its METRO variable, requires special handling for the many PUMAs with populations both within and outside of metro areas, or both within and outside of central/principal cities. In such cases, IPUMS USA assigns a “mixed” status, resulting in 5 distinct METRO classes, including 3 “pure” and 2 “mixed” classes (Fig. 2). The Economic Research Service (ERS) of the U.S. Department of Agriculture has produced a similar classification that identifies all PUMAs as either metro or nonmetro, allocating each “mixed” PUMA to one of these two classes based on where the majority of PUMA residents live (U.S. Department of Agriculture, 2019b).Footnote 4 Translating the standard OMB classes into microdata, using the approach of either IPUMS USA or the ERS, offers the benefits of familiarity and conceptual consistency with many other applications that use the OMB definitions. This framework, however, unavoidably yields inexact class identifications because of the many discrepancies between PUMAs and OMB delineations.
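The two strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by IPUMS USA or the ERS: the function names, the label strings, and the strict all-or-nothing cutoffs for a "pure" class are hypothetical, and ties in the majority rule are arbitrarily sent to nonmetro.

```python
def metro_class(pop_nonmetro: int, pop_central: int, pop_metro_other: int) -> str:
    """Assign a five-way metro status label to one PUMA.

    Inputs are PUMA resident counts in three mutually exclusive areas:
    outside any metro area, inside a central/principal city of a metro area,
    and inside a metro area but outside its central/principal cities.
    The strict cutoffs (any residents on both sides means "mixed") are an
    assumption of this sketch.
    """
    if pop_nonmetro + pop_central + pop_metro_other <= 0:
        raise ValueError("PUMA must have a positive population")
    if pop_central + pop_metro_other == 0:
        return "nonmetro"
    if pop_nonmetro > 0:
        return "mixed metro/nonmetro"
    if pop_metro_other == 0:
        return "metro, central/principal city"
    if pop_central == 0:
        return "metro, outside central/principal city"
    return "metro, mixed central/principal city status"


def majority_class(pop_nonmetro: int, pop_central: int, pop_metro_other: int) -> str:
    """Two-way label: allocate the PUMA to where most of its residents live."""
    pop_metro = pop_central + pop_metro_other
    # Assumption: an exact tie is labeled nonmetro.
    return "metro" if pop_metro > pop_nonmetro else "nonmetro"


# Hypothetical PUMA with 70,000 nonmetro and 50,000 suburban metro residents.
print(metro_class(70_000, 0, 50_000))
print(majority_class(70_000, 0, 50_000))
```

The first function yields three "pure" and two "mixed" labels, while the second forces every PUMA into one of two classes, which shows why either translation is inexact for PUMAs that straddle a boundary.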

Fig. 2

IPUMS-USA METRO classes for 2010 PUMAs in a section of south-central Texas

An important related problem is that as both the PUMA and OMB delineations change over time, so does the correspondence between them, which can impair studies of demographic change by metro status. For example, when the Census Bureau changed the PUMAs identified in ACS PUMS files from the 2000 to the 2010 definitions, IPUMS USA also changed which metropolitan definitions it used as the basis of the METRO variable, switching from the 1999 to the 2013 OMB delineations. As a result, the portion of the population having a mixed metro/nonmetro status grew from 7% for the 2011 ACS (using 2000 PUMAs) to 12% for the 2012 ACS (using 2010 PUMAs), and the portion with a mixed central/principal city status grew from 30% to 36%. These shifts are much larger than typical annual changes, and both result in larger portions of the population in “mixed” PUMAs, indicating that they are mainly artifacts of a weakened correspondence between PUMAs and OMB delineations. The degree to which any such changes in METRO status are meaningful is difficult to determine.

Even if the correspondence between PUMAs and OMB delineations were exact and persistent, there remain two important conceptual problems for analyses that rely exclusively on the metro/nonmetro classification. First, subdividing the full range of U.S. settlement patterns into only a few classes is imprecise, potentially masking important variations within each class and separating similar locations into distinct classes (Waldorf, 2006). For example, the largest U.S. metro areas have a hundred times more residents than the smallest, and socio-economic conditions may vary enormously across this spectrum. A single “metro” class nevertheless groups all these areas together.

A second limitation of the metro/nonmetro classification—especially when analysts use it alone to distinguish “urban” and “rural” populations—is that it emphasizes only one of the multiple dimensions of urbanization: population size. The delineation of metro areas does incorporate other factors; both population density and size are used to define the core urban areas of CBSAs, and commuting flows are used to determine which counties are associated with which urban cores, but after these associations are established, the single basic feature that distinguishes a metro county is that it is part of a commuting system where the urban core has a population of 50,000 or more. In other regards, a metro county may be very urban or very rural; it may have a high or low population density, and it may be at either the core or the distant periphery of a commuting system. The metro/nonmetro classification alone tells us little about these other important dimensions of settlement patterns, which are correlated with but distinct from population size (Coombes & Raybould, 2001; Isserman, 2005; Wang et al., 2012). A phenomenon of interest could be associated with each of these dimensions in different ways, but that is impossible to determine using only the metro/nonmetro classes.Footnote 5

Similarly, the Census Bureau’s urban/rural classification also incorporates multiple aspects of settlement patterns while emphasizing a single dimension of urbanization: in this case, concentration. Unlike CBSAs, urban areas are designed to correspond closely to the extents of urbanized land use (Ratcliffe, 2015; Ratcliffe et al., 2016), and the Bureau is able to achieve this goal with a high degree of precision by building up from individual census blocks, units much smaller than the counties that form CBSAs. According to the 2000 and 2010 standards, the main criteria for a group of blocks to be identified as an urban area are that they surpass a density threshold (1000 persons per square mile at the core and 500 persons per square mile in neighboring areas, with other rules used to add some types of low-density blocks) and that their combined population is at least 2500. The classification is therefore associated with both population size and density, but in combination, the two criteria serve as a single indicator of concentration at a certain resolution. For example, if a few neighboring blocks in an isolated community have high densities, but their combined population is below 2500, then for a somewhat larger area, e.g., a space of 5 square miles, the density would be less than 500 persons per square mile, effectively a “rural” level of concentration at that resolution. From this perspective, each urban area represents a substantial region of concentrated settlement, and rural areas may contain some groups of dense blocks but never in a “large concentration.” Meanwhile, this distinction alone tells us little about other dimensions of urban/rural variation, such as the size of the commuting system or accessibility to a major city center.

The coarseness of the standard binary classifications and their distinct emphases are useful and appropriate in many contexts. Many regional disparities have a basic metro/nonmetro or urban/rural divide, and the standards used to delineate these classes have evolved through decades of refinement to correspond well with important, observable distinctions in settlement and commuting patterns. Problems can arise, however, when analysts draw general conclusions about urban/rural differences using only one of these classifications. For example, using the binary OMB classes, nonmetro areas had higher poverty rates than metro areas in 2017, but using the official urban/rural classification, the relationship is reversed; the rural areas had the lowest poverty rates overall (Fig. 3). In other words, poverty is higher outside of large commuting systems but also lower outside of concentrated settlements. How could this be? Poverty may be especially high in concentrated settlements outside of large commuting systems (urban nonmetro population) or especially low in the exurbs of large commuting systems (rural metro population). Investigating such possibilities and accurately characterizing them requires a more robust analytical framework than is supported by either of these standard classifications alone.
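A small numerical sketch shows how both patterns can hold at once. All populations and poverty rates below are hypothetical values invented for illustration; they are not estimates from the ACS or from Fig. 3.

```python
# HYPOTHETICAL populations (millions) and poverty rates, invented for illustration.
groups = {
    ("metro", "urban"): (240.0, 0.13),
    ("metro", "rural"): (35.0, 0.09),
    ("nonmetro", "urban"): (20.0, 0.19),
    ("nonmetro", "rural"): (25.0, 0.14),
}


def pooled_rate(position: int, label: str) -> float:
    """Poverty rate for all groups whose key has `label` at index `position`."""
    selected = [(pop, rate) for key, (pop, rate) in groups.items()
                if key[position] == label]
    poor = sum(pop * rate for pop, rate in selected)
    return poor / sum(pop for pop, _ in selected)


metro, nonmetro = pooled_rate(0, "metro"), pooled_rate(0, "nonmetro")
urban, rural = pooled_rate(1, "urban"), pooled_rate(1, "rural")
print(f"metro {metro:.1%}  nonmetro {nonmetro:.1%}")
print(f"urban {urban:.1%}  rural {rural:.1%}")
```

With these invented inputs, the nonmetro rate exceeds the metro rate while the rural rate falls below the urban rate, because the large, low-poverty rural metro group dominates the rural total and the high-poverty urban nonmetro group raises the nonmetro total.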

Fig. 3

Poverty rates using standard metropolitan/non-metropolitan and urban/rural classifications. 2017 American Community Survey 1-Year Summary File, retrieved from IPUMS NHGIS (Manson et al. 2018)

Toward this end, the ERS has produced several alternative classifications (U.S. Department of Agriculture, 2019b), including the rural–urban continuum codes (distinguishing 9 classes of counties), urban-influence codes (12 classes of counties), rural–urban continuum areas (10 primary and 21 secondary classes of census tracts), and frontier and remote area codes (4 levels of remoteness among ZIP Code areas). These schemes offer more granularity than the standard OMB and Census classifications, and they distinguish more than one dimension of variation. However, in accord with ERS’s focus on agricultural economics, their classifications mainly differentiate types of rural areas and only minimally distinguish higher levels of urbanization. Also, aside from ERS’s metro/nonmetro classification of PUMAs, none of their classifications are PUMA-based, so they are not directly applicable for public-use microdata.

3 Two PUMA-Level Indices

When developing new rural–urban indices for IPUMS USA, we sought options that not only could be computed at the PUMA level consistently across time but also vary continuously across the full spectrum of rural and urban settlement types. A secondary consideration was to select measures that are relatively easy to compute and to extend forward when integrating new PUMS files into IPUMS USA. The two newly added indices satisfy these aims.

3.1 Two-Dimensional Conceptual Framework

Conceptually, the two new indices correspond to two basic dimensions of settlement patterns: concentration, ranging from sparse to dense, and size, ranging from small to large (Fig. 4). In common usage, the terms “rural” and “urban” indicate variation in both dimensions: rural places are more sparsely settled and have smaller populations (bottom-left quadrant in Fig. 4); urban places are more densely settled and have larger populations (upper-right quadrant). But places may be distinctly “rural” or “urban” along one dimension and not the other. An isolated town center (left-hand side) may be somewhat urban in its concentration level but decidedly rural in population size. Conversely, exurban large-lot developments (lower-right quadrant) may have rural levels of concentration but urban levels of access to amenities and services due to the large population of a nearby city. We expect the upper-left corner to be empty because the highest levels of concentration can occur only where there are ample populations (depending on the exact spatial resolution at which densities are measured).

Fig. 4

Conceptual model of two continuous dimensions of settlement patterns

Our choices for which two dimensions to emphasize and how to measure them are inspired by the two standard federal classification systems. As discussed in Sect. 2, the Census Bureau’s urban/rural classification emphasizes the concentration dimension, corresponding roughly to the horizontal dividing line in Fig. 4. The OMB’s metro/nonmetro classification in turn emphasizes the size dimension (specifically, the size of the urban core of a commuting system), corresponding roughly to the vertical dividing line in Fig. 4.

The model also accords with other previously developed frameworks. Isserman (2005) and Wang et al. (2012) both argue—and demonstrate through case studies—that the two standard classifications can and should be treated as distinct, complementary indicators of urbanicity and rurality, though neither of these research efforts developed continuous indices. They instead use county-based classifications, translating the Census Bureau’s urban/rural classes to a county basis through a new system of “Rural Urban Density Codes,” which classify counties into four groups based on population density and the amount of urban population in each county. Coombes and Raybould (2001) suggest continuous measures for three dimensions of settlement patterns: settlement size (from hamlet to metropolitan), concentration (from sparse to dense), and accessibility (from remote to central). Our model re-uses their concentration dimension directly, and our second dimension corresponds with the settlement size dimension and, to a lesser extent, the accessibility dimension.Footnote 6

3.2 Index of Concentration

The name of the new IPUMS USA variable that indexes concentration is DENSITY, and the specific measure it reports is the population-weighted geometric mean of census tract population densities in each PUMA. In the initial release, DENSITY is available for 2000 census samples (using 2000 tract densities) and for the 2010 census and ACS samples (using 2010 tract densities).

We use a population-weighted average density rather than the density of the whole PUMA because the latter (the PUMA’s population divided by its area) is often a weak indicator of the local densities where PUMA residents live. Many PUMAs span both densely and sparsely settled areas, as demonstrated by the varying densities of tracts within PUMAs (Fig. 5a). The density of each whole PUMA (Fig. 5b) is effectively an “area-weighted average” of these varying densities (Craig, 1984), representing the typical density across all subdivisions of the PUMA rather than among the residents of the PUMA. Because PUMA residents are (by definition) more concentrated in the denser parts of a PUMA, the average of their local densities is generally higher (and cannot be lower) than the entire PUMA’s density.
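The "cannot be lower" claim can be checked with a short derivation. As a sketch, assume the tracts exactly partition the PUMA and write \(P_{i}\), \(A_{i}\), and \(d_{i} = P_{i}/A_{i}\) for the population, area, and density of each populated tract \(i\) (this notation is introduced here for illustration). The whole PUMA's density can then be written in two equivalent ways:

$$\frac{\sum P_{i}}{\sum A_{i}} = \frac{\sum A_{i} d_{i}}{\sum A_{i}} = \frac{\sum P_{i}}{\sum P_{i}/d_{i}}$$

The middle expression is the area-weighted arithmetic mean of tract densities, and the right-hand expression is their population-weighted harmonic mean. By the standard inequality among weighted harmonic, geometric, and arithmetic means with the same weights,

$$\frac{\sum P_{i}}{\sum P_{i}/d_{i}} \le \exp\left(\frac{\sum P_{i} \log d_{i}}{\sum P_{i}}\right) \le \frac{\sum P_{i} d_{i}}{\sum P_{i}}$$

with equality only when all tract densities are equal. Under these assumptions, neither the population-weighted geometric mean nor the population-weighted arithmetic mean can fall below the whole PUMA's density. Unpopulated tracts add area but no population weight, which can only lower the whole PUMA's density further.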

Fig. 5

Four measures of 2010 population density within PUMAs in a section of south-central Texas

To summarize local concentrations throughout a large area, a better strategy is to compute densities in smaller “local” units, such as census tracts, and then compute the average of these densities, weighted by the local units’ populations, so each resident’s local density is given equal weight. The right-hand panels of Fig. 5 illustrate the outcomes of measuring the population-weighted average of tract densities using the arithmetic mean (c) and the geometric mean (d). Following the notation of Craig (1984), the population-weighted arithmetic mean density is computed as:

$$d_{AM} = \frac{{\sum P_{i} d_{i} }}{{\sum P_{i} }}$$
(1)

where \(P_{i}\) and \(d_{i}\) are the population and density of subdivision \(i\) (in our case, a tract). The population-weighted geometric mean is computed as:

$$d_{GM} = \prod d_{i}^{{\alpha_{i} }}$$
(2)

where \(\alpha_{i}\) is the proportion of the containing unit’s (the PUMA’s) population living in subdivision \(i\). It can be helpful to think of the geometric mean density as measuring the average density on a logarithmic scale, which recasts Eq. (2) into this form:

$$d_{GM} = \exp \left( {\frac{{\sum P_{i} \log d_{i} }}{{\sum P_{i} }}} \right)$$
(3)

In practice, our computations deviate somewhat from these equations where a PUMA boundary subdivides a tract.Footnote 7 In such cases, we use the whole tract's density, but we limit the population weight to the portion that also resides in the PUMA (determining this portion by summing the populations of the census blocks with centroids in each PUMA).
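Equations (1) and (3), together with the handling of split tracts, can be sketched as follows. The function name, the tuple layout, and the example values are hypothetical; the sketch assumes every tract has a positive area and that the population residing in the PUMA has already been tabulated for each tract.

```python
import math


def density_measures(tracts):
    """Population-weighted arithmetic and geometric mean tract density for one PUMA.

    `tracts` is a list of (pop_in_puma, tract_pop, tract_area) tuples.
    pop_in_puma is the weight: the part of the tract's population that lives
    in the PUMA. tract_pop / tract_area is the density of the WHOLE tract,
    even where the PUMA boundary splits the tract.
    """
    # Tracts with no PUMA residents carry zero weight, so they are skipped.
    weighted = [(w, pop / area) for w, pop, area in tracts if w > 0]
    total_weight = sum(w for w, _ in weighted)
    arithmetic = sum(w * d for w, d in weighted) / total_weight        # Eq. (1)
    geometric = math.exp(
        sum(w * math.log(d) for w, d in weighted) / total_weight      # Eq. (3)
    )
    return arithmetic, geometric


# HYPOTHETICAL PUMA: two whole tracts plus one split tract that contributes
# only 500 of its 2,000 residents to this PUMA.
example = [(4_000, 4_000, 2.0), (1_000, 1_000, 100.0), (500, 2_000, 1.0)]
am, gm = density_measures(example)
print(f"arithmetic mean {am:.0f}, geometric mean {gm:.0f} persons per square mile")
```

Working on the logarithmic form of Eq. (3) rather than the product in Eq. (2) avoids multiplying many fractional powers together and makes the parallel with Eq. (1) explicit.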

Some prior applications of population-weighted average tract densities have used arithmetic means (e.g., Wilson et al., 2012; Kolko, 2016), but we agree with others (Craig, 1984; Dorling & Atkins, 1995) that a geometric mean is more suitable. Densities generally have a log-normal distribution, heavily concentrated at the lower end of the distribution with a long positive tail. For such distributions, the geometric mean is appropriately less sensitive to large outliers, more sensitive to variations among small values, and typically closer to the median than is the arithmetic mean. In practical terms, a logarithmic scaling makes sense because a difference between densities of 10 and 100 is about as significant for the character of a place as any other factor-of-10 difference (e.g., 1,000 and 10,000), and it is clearly more significant than an equal absolute difference of 90 at high densities (e.g., 10,010 and 10,100).
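The comparison in the preceding example can be made concrete with a few lines of Python. The helper name `log10_gap` is hypothetical, and the choice of base 10 is only for readability; any logarithm base gives the same ordering.

```python
import math


def log10_gap(low: float, high: float) -> float:
    """Difference between two densities on a base-10 logarithmic scale."""
    return math.log10(high / low)


pairs = [(10, 100), (1_000, 10_000), (10_010, 10_100)]
for low, high in pairs:
    print(f"{low:>6} -> {high:>6}: absolute gap {high - low:>5}, "
          f"log10 gap {log10_gap(low, high):.4f}")
```

The first two pairs are the same distance apart on the log scale despite very different absolute gaps, while the third pair, with the same absolute gap of 90 as the first, is nearly indistinguishable on the log scale.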

Figure 6 illustrates how four PUMA-level density measures relate to the distribution of tract densities in the PUMAs labeled in Fig. 5. The first PUMA, 04503, is roughly coincident with The Woodlands, a suburb of Houston. In this case, there is relatively little variation in densities among the tracts in the PUMA, so all four measures (the PUMA density, the population-weighted arithmetic mean, the population-weighted geometric mean, and the population-weighted median) are close to each other on both a linear and log scale, ranging only from 2061 to 2290 persons per square mile.

Fig. 6

Distributions of 2010 census tract population densities in four Texas PUMAs, plotted on a linear scale and logarithmic scale, along with four PUMA-level density measures

In PUMA 06100, which encompasses lower-density exurbs, small cities, and rangeland southwest of San Antonio, the tract densities vary less than PUMA 04503’s on a linear scale but more than 04503’s on a log scale. PUMA 06100’s four density measures therefore bunch closely together on a linear scale but differ substantially on a log scale. As expected for a log-normal distribution, the median (51) and the geometric mean (48) are similar on either scale, but the arithmetic mean (84) is 65% greater than the median, and all three population-weighted densities are well above the whole PUMA’s density (29).

In PUMA 06200, the tract densities have a relatively wide distribution on either scale, and on the log scale, the distribution is clearly bimodal, split between a set of large, sparse tracts and a set of small, dense tracts. The whole PUMA’s density (12) lies within the lower cluster of tract densities, which is a good indication of the large expanses of sparsely populated rangeland in the PUMA, but it poorly represents the much higher local densities of most PUMA residents. The arithmetic mean (1436) and median (953) are both much higher, lying in the upper cluster of tract densities, but this in turn poorly represents the sparse local densities of many PUMA residents. The geometric mean (194) is located between the two modes, suitably splitting the difference. (Of course, no single statistic can represent well the “typical” value of a bimodal distribution, but if a single statistic must be selected, something that lies between the two modal clusters seems most appropriate.)

The last of the four example PUMAs, 06302, has the widest distribution of tract densities. While the PUMA’s population is concentrated in dense tracts around Laredo, most of the PUMA’s area lies in three very sparse outlying tracts. This results in a PUMA density (35) that is much lower than the median (1285) and geometric mean (1245). At the other end, a few tract densities that are exceptionally high on a linear scale (but not on a log scale) result in an arithmetic mean (2945) that is more than double the median and geometric mean.

These four example PUMAs indicate well the variety of density distributions across all PUMAs, and in each example, we find the population-weighted geometric mean of tract densities to be as good as or better than the other density measures as a general index of concentration. Other measures could still be considered, and one issue is of particular concern: a census tract is only one arbitrary approximation of a person’s local context (Fowler et al., 2019), and averages of tract densities are subject to the Modifiable Areal Unit Problem, or MAUP (Openshaw & Taylor, 1981). For example, two PUMAs with identical population distributions could have very different mean tract densities depending on how the tract boundaries are drawn. One measure that would be less sensitive to the MAUP would be an inverse-distance-weighted average of block-level densities in a moving window around each census block. This is similar to the approach that Coombes and Raybould (2001) propose for an index of concentration. We have opted to rely on tract densities (for now) because doing so simplifies the computation and description of the index, because measures of population-weighted density are often based on tracts, and because we suspect that the liabilities of a tract-based index relative to a more robust measure are not important for most applications.

3.3 Index of Size

For an index of size, we use the population-weighted geometric mean of the populations of CBSAs (metro and micro areas) in each PUMA. The general aim is to summarize the typical population size of the commuting systems where PUMA residents live. Where a PUMA lies entirely within a single metro area, as is the case for 78% of 2010 PUMAs, this measure simply equates to the metro area's population. Elsewhere, the measure summarizes the sizes of all CBSAs where PUMA residents live. For the “noncore” counties located outside of any CBSA, we use the county’s population as an approximation of the commuting system size.Footnote 8

We refer to this index as METPOP, and currently, IPUMS USA provides two versions of the index through two variables: METPOP10 summarizes the 2010 populations of 2013 CBSAs and noncore counties, and METPOP00 summarizes the 2000 populations of 2003 CBSAs and noncore counties. Figure 7 illustrates how 2010 PUMAs correspond to the CBSA and county populations that METPOP10 summarizes, and Fig. 8 illustrates METPOP10 values.

Fig. 7

2010 PUMAs and the 2010 populations of 2013 CBSAs and noncore counties, which form the basis for the METPOP10 variable, in a section of south-central Texas

Fig. 8

METPOP10 values in a section of south-central Texas

The formula we use to compute the METPOP index mirrors the DENSITY formula [Eq. (3)]:

$$P_{GMj} = \exp \left( {\frac{{\sum P_{ij} \log P_{i} }}{{\sum P_{ij} }}} \right)$$
(4)

where \(P_{GMj}\) is the population-weighted geometric mean of the populations of CBSAs and noncore counties in PUMA \(j\), \(P_{i}\) is the population of CBSA or noncore county \(i\), and \(P_{ij}\) is the population in the area of intersection between \(i\) and \(j\). We again use a geometric mean because commuting system populations, like tract densities, have a roughly log-normal distribution, and relative differences in populations are more important than absolute differences. For example, a difference between populations of 100,000 and 200,000 is about as significant for the character of a commuting system as any other factor-of-2 difference (e.g., 1 million and 2 million), and it is clearly more significant than an equal absolute difference of 100,000 in large commuting systems (e.g., 10.1 million and 10.2 million).
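Equation (4) can be sketched in the same way as the density index. The function name, the input layout, and the example populations are hypothetical; the sketch assumes the population of each PUMA-by-CBSA (or PUMA-by-noncore-county) intersection has already been tabulated.

```python
import math


def metpop(intersections):
    """Population-weighted geometric mean of commuting-system populations, Eq. (4).

    `intersections` is a list of (pop_in_overlap, unit_pop) pairs for one PUMA:
    pop_in_overlap is the PUMA population living in a CBSA or noncore county
    (P_ij), and unit_pop is that unit's total population (P_i).
    """
    populated = [(p_ij, p_i) for p_ij, p_i in intersections if p_ij > 0]
    total = sum(p_ij for p_ij, _ in populated)
    log_mean = sum(p_ij * math.log(p_i) for p_ij, p_i in populated) / total
    return math.exp(log_mean)


# HYPOTHETICAL PUMA lying wholly inside one metro area of 2 million residents:
print(round(metpop([(150_000, 2_000_000)])))

# HYPOTHETICAL PUMA with 90,000 residents in a CBSA of 1 million and 30,000
# residents in a noncore county of 10,000:
print(round(metpop([(90_000, 1_000_000), (30_000, 10_000)])))
```

In the single-unit case the index reduces to the unit's own population, as the text notes for PUMAs lying entirely within one metro area. In the mixed case the result sits between the two populations on a logarithmic scale, pulled toward the unit where more PUMA residents live.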

Like the DENSITY index, METPOP may also be impaired by an inexact spatial basis. The extents of “true” commuting systems need not correspond well with counties, and this is a limitation not only where METPOP is based on noncore counties but even where it is based on CBSAs. For example, because of the great extents of its component counties, the Riverside–San Bernardino–Ontario CBSA in California includes the small city of Needles, a 220-mile drive from the CBSA’s largest city, Riverside. The PUMA that contains Needles consists mainly of desert, and the fraction of its residents who commute to the CBSA’s urban core is likely small, but the PUMA nevertheless has a high METPOP value. A more effective index of size might delineate commuting systems based on tracts rather than counties, or it might use ERS commuting zones (Fowler & Jensen, 2020; U.S. Department of Agriculture, 2019a), a system that allocates every county to a zone, eliminating the problem of “noncore” counties. Alternatively, as with the index of concentration, the most effective approach may be to use a moving window, but instead of using a “local” moving-window average of densities, the index of size would use a larger “regional” moving window that sums populations within a typical commuting distance. We leave these possibilities for future research.

3.4 Pairing the Indices

Figure 9 illustrates the two-dimensional spread of average tract densities (DENSITY) and average CBSA populations (METPOP10) for all 2010 PUMAs. The point colors indicate the METRO class of each PUMA. The overall distribution mirrors closely the conceptual model in Fig. 4: the upper right contains PUMAs with high densities in large metro areas; the lower right contains PUMAs with low densities in large metro areas; the lower left contains PUMAs with low densities and outside (or mostly outside) of any CBSA; and as expected, the upper left is empty, indicating that PUMAs with high average densities occur only in or around medium-to-large CBSAs.

Fig. 9

Relationships among three urbanization indices for 2010 PUMAs. Labels identify the four Texas example PUMAs. (Color figure online)

The colors in Fig. 9 indicate that most PUMAs that lie entirely within metro areas have relatively high average densities, but some have low average densities. Such low-density metro PUMAs may or may not fit our expectations for “rural” areas. They may or may not share characteristics with other low-density PUMAs. Similarly, the nonmetro and mixed PUMAs with moderately high densities may have more in common with metro PUMAs at similar densities than with nonmetro and mixed PUMAs at lower densities. We believe that this two-dimensional framework offers great potential as a means to investigate such possibilities and to determine whether “concentration” or “size” are important factors, separately or together, in any study of urban–rural discrepancies. Because the indices are continuous measures, the framework also makes it possible to distinguish fine gradations of variation and to identify inflection points across all densities or across all levels of the urban hierarchy.

4 Illustrative Results

To demonstrate the utility of the two indices, we analyze how poverty rates vary across settlement types in the U.S. Past analyses of “rural” poverty have often used the “metro/nonmetro” classification alone to distinguish urban and rural populations (e.g., Cromartie, 2017; Ziliak, 2018). This practice is problematic. As Fig. 3 shows, the basic question of whether rural or urban areas have higher poverty rates has distinctly different answers depending on how “rural” and “urban” areas are distinguished. The availability of two continuous measures, indexing two dimensions of urbanization, allows researchers to complete a more thorough and nuanced analysis of variations across geographic regions.

In this section, using the new indices with 2012–2017 ACS microdata from IPUMS USA (Ruggles et al., 2019), pooling six 1-year samples, we illustrate how both poverty rates and individuals’ likelihood of being in poverty vary across levels of urbanization as distinguished by both concentration (DENSITY) and population size (METPOP10). Using microdata with the new indices enables us to fit regression models predicting poverty while controlling for other demographic factors. The use of our continuous measures shows that the correlations between poverty, rurality, and other demographic characteristics vary in ways that cannot be captured by a simple metro/nonmetro distinction.

We begin our analysis with the basic METRO classification that has long been available in IPUMS USA. PUMAs classified as wholly nonmetro (neither metro nor mixed) have a higher poverty rate than other PUMAs, and the “mixed” PUMAs, those that straddle metro and nonmetro areas, have a poverty rate between those of nonmetro and metro PUMAs (Table 1).

Table 1 Poverty rates by METRO category, 2012–2017

Examining how poverty rates vary with the two new indices uncovers a more nuanced geographic pattern. Each point in Fig. 10 represents a PUMA while the color represents levels of poverty: blue represents lower poverty rates and red represents higher rates. As in Table 1, we see again that metro areas have generally lower rates of poverty than nonmetro areas, but by using two continuous indices, we can see how the metro/nonmetro dichotomy masks significant differences in poverty rates within metro areas. The high-density PUMAs in large metro areas generally have high poverty rates—similar to or even higher than the rates in the PUMAs of small commuting systems—while the lower-density PUMAs in large metro areas, encompassing mostly suburban and exurban communities, appear to have the lowest poverty rates overall.

Fig. 10 Relationship between poverty and two indices of urbanization for 2010 PUMAs. IPUMS USA 2012–2017 ACS samples. (Color figure online)

To quantify how rates vary across this two-dimensional space, we first classify PUMAs along both dimensions with four levels of METPOP10 values (breaks at 50,000, 400,000, and 3.2 million) and five levels of DENSITY values (breaks at 80, 400, 2000, and 10,000 persons per square mile). To avoid having a class represented by only one PUMA, we drop the lowest DENSITY break (at 80) for PUMAs in large commuting systems (above 400,000). This produces 14 classes of PUMAs, each with unique ranges of METPOP10 and DENSITY values. Table 2 shows how poverty rates vary among these classes, with the highest rate of poverty (22.8%) found in the densest PUMAs with moderately large CBSA populations (between 400,000 and 3.2 million residents). The lowest poverty rates are found in areas with medium density in the largest CBSAs (over 3.2 million residents). Table 3 shows how the poverty rate for each class differs from the rate for the lowest-poverty class, which we use later as a benchmark for analyzing poverty rates in a multiple regression framework. Both tables show that, within each of the four size classes, the highest poverty rates occur in the highest-density classes. The PUMAs in small commuting systems also have relatively high rates, but not as high as in the high-density PUMA classes.

Table 2 Poverty rates (%) by DENSITY and METPOP10, 2012–2017. (Color figure online)
Table 3 Difference in poverty rate (%) by DENSITY and METPOP10 from lowest-poverty class (METPOP10 of 3.2 m+, DENSITY of 400–2000), 2012–2017. (Color figure online)
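
The two-way classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, each break is assumed to be the inclusive lower bound of its bin, and “large commuting systems” is assumed to mean a METPOP10 value of at least 400,000. The text does not specify how values that fall exactly on a break are handled.

```python
from bisect import bisect_right

# Breaks reported in the text (METPOP10 in persons, DENSITY in persons per square mile).
METPOP10_BREAKS = [50_000, 400_000, 3_200_000]
DENSITY_BREAKS = [80, 400, 2_000, 10_000]


def bin_label(value, breaks):
    """Return a 'low-high' label for the bin containing value.

    Assumption: each break is the inclusive lower bound of the bin above it.
    """
    i = bisect_right(breaks, value)
    low = breaks[i - 1] if i > 0 else 0
    if i == len(breaks):
        return f"{low}+"
    return f"{low}-{breaks[i]}"


def classify_puma(density, metpop10):
    """Return (size label, density label) for one PUMA (hypothetical helper)."""
    size = bin_label(metpop10, METPOP10_BREAKS)
    # For large commuting systems, drop the lowest DENSITY break (80),
    # so that the two lowest density bins are merged into one.
    is_large = metpop10 >= METPOP10_BREAKS[1]
    density_breaks = DENSITY_BREAKS[1:] if is_large else DENSITY_BREAKS
    return size, bin_label(density, density_breaks)
```

For example, `classify_puma(1500, 5_000_000)` returns `('3200000+', '400-2000')`, which corresponds to the reference class named in the caption of Table 3. The sketch can label up to 18 combinations and makes no attempt to reproduce the count of 14 classes, which presumably depends on which combinations actually occur among PUMAs.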

Of course, classifying populations according to PUMA-level averages, as in Tables 2 and 3, may obscure variations that are apparent only within PUMAs. Perhaps the highest rates of poverty actually occur within the lowest-density census tracts, but if all the PUMAs that contain these tracts also include a mix of higher-density tracts, then the distinct characteristics of the lowest-density tracts would be “averaged out.” Why then use PUMA-level indices? We reiterate that a key motivation is to enable microdata-based analyses that are impossible using existing census summary data. For example, the main reason for the higher poverty rates in high-density areas could be that those areas have disproportionately high concentrations of higher-poverty demographic groups, such as younger and/or minority populations, in which case the relationship between density and poverty rate might be insignificant after controlling for individuals’ demographic characteristics. Directly controlling for individuals’ characteristics is not possible with census tract summary data, but it is possible using microdata and PUMA-level indices.

To demonstrate the value of the indices with an analysis that requires microdata, we begin with two models that associate poverty with metropolitan status. The first model predicts poverty status based only on the METRO status of a person’s PUMA:

$$poverty = \alpha + {\boldsymbol{\beta}}{\text{METRO}},$$
(5)

where \({\boldsymbol{\beta}}\) is a vector of coefficients for each METRO class (mixed and metro, with nonmetro omitted). The second model expands on the first by controlling for a large range of demographic characteristics:

$$poverty = \alpha + {\boldsymbol{\beta}}{\text{METRO}} + {\boldsymbol{\gamma}} {\boldsymbol{D}}$$
(6)

where \({\boldsymbol{D}}\) is a vector of individual-level demographic controls, available only through microdata. Specifically, \({\boldsymbol{D}}\) includes age, sex, race, ethnicity (Hispanic/Latino), nativity, citizenship, marital status, health insurance coverage, educational attainment, employment status and sector, year, and geographic subregion (census division). Table 4 gives the metro and mixed coefficients after fitting these two models through linear regression on 2012–2017 ACS microdata.

Table 4 Coefficients for percent likelihood of poverty by METRO class, with and without controls, 2012–2017

Based only on metropolitan status, people residing in wholly metro PUMAs are about 3.2 percentage points less likely to be in poverty than those living in nonmetro PUMAs. (This is consistent with Table 1, which shows the difference between poverty rates in nonmetro and metro PUMAs to be 17.6 − 14.4 = 3.2.) However, when we include a battery of demographic controls, the metro coefficient decreases from −3.2 to −4.0, meaning that for individuals with the same demographic characteristics, those living in metro PUMAs are about 4 percentage points less likely to be in poverty than those in nonmetro PUMAs. This indicates that the demographics that predominate in nonmetro PUMAs would generally yield lower poverty rates than the demographics in metro PUMAs, but living in nonmetro PUMAs increases the likelihood of poverty enough to produce a higher poverty rate in those areas despite their demographics. Clearly, using microdata to control for demographic characteristics can help to reveal key dynamics in rural–urban poverty discrepancies.
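
The agreement between the uncontrolled metro coefficient and the difference in Table 1 is expected: when a 0/1 outcome is regressed only on category indicators, the least-squares intercept equals the mean for the omitted category and each coefficient equals that category's mean minus the omitted category's mean. The following Python sketch illustrates this for Eq. (5). The function name and the toy records are hypothetical, and the sketch ignores person weights and standard errors, both of which an actual analysis would presumably need to handle.

```python
from collections import defaultdict


def dummy_coefficients(records, reference="nonmetro"):
    """Coefficients of a linear probability model with category dummies only.

    records: iterable of (category, poor) pairs, where poor is 0 or 1.
    Returns (alpha, betas) in percentage points: alpha is the poverty rate
    of the reference category, and each beta is a category's poverty rate
    minus alpha, which is what least squares yields for this model.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [number poor, number of people]
    for category, poor in records:
        totals[category][0] += poor
        totals[category][1] += 1
    rates = {c: 100 * poor / n for c, (poor, n) in totals.items()}
    alpha = rates[reference]
    betas = {c: rate - alpha for c, rate in rates.items() if c != reference}
    return alpha, betas


# Hypothetical toy records, not ACS data: (METRO class, in poverty?)
records = (
    [("nonmetro", 1)] * 1 + [("nonmetro", 0)] * 4
    + [("mixed", 1)] * 3 + [("mixed", 0)] * 17
    + [("metro", 1)] * 1 + [("metro", 0)] * 9
)
alpha, betas = dummy_coefficients(records)
```

With these toy records the intercept is 20.0 and the coefficients are −5.0 for mixed and −10.0 for metro, matching the differences in the group poverty rates (15% and 10% versus 20%). Once the controls in Eq. (6) are added, this shortcut no longer applies and the model has to be fit by regression.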

Incorporating the two continuous urbanization indices into the analysis adds further value. To demonstrate, we estimate the following linear probability model:

$$poverty = \alpha + {\boldsymbol{\beta}}{\text{DENSITYxMETPOP10}} + {\boldsymbol{\gamma}} {\boldsymbol{D}}$$
(7)

where \({\boldsymbol{D}}\) is the same vector of controls as in Eq. (6) and \({\boldsymbol{\beta}}\) represents coefficients for each DENSITY-by-METPOP10 class. Table 5 provides the results in a format that can be directly compared with the uncontrolled rate differences in Table 3, with both tables color-coded on the same scale. This comparison shows that the differences in the likelihood of poverty, relative to the lowest-poverty reference class, are consistently smaller after controlling for demographics. Without controls, the highest poverty rate among the classes is 13.6 percentage points above the lowest rate. After controlling for demographics, the same two classes differ by only 8.3 points. This indicates that 61% (8.3/13.6) of the difference between these classes is explained by the difference in densities and population sizes, and the remaining 39% is attributable to demographic differences between the populations in these areas.

Table 5 Coefficients for percent likelihood of poverty by DENSITY and METPOP10, with controls, 2012–2017. (Color figure online)

A final angle we take is to examine how associations between poverty and specific demographic characteristics vary across settlement types. We run this model

$$poverty = \alpha + {\boldsymbol{\gamma}} {\boldsymbol{D}},$$
(8)

separately for each of the fourteen DENSITY-by-METPOP10 classes, where \({\boldsymbol{D}}\) is again the same vector of controls as in Eqs. (6) and (7). Table 6 reports the resulting Hispanic/Latino coefficient for each class. The pattern is altogether different from the pattern in previous tables. In the class with the highest overall poverty rate (22.8 in Table 2), the difference in poverty between Hispanic/Latino populations and other groups is at its smallest (2.1 in Table 6). The increase for the Hispanic/Latino population is greatest (at +9.2) in the least dense PUMAs in mid-size metro areas. Table 7 similarly shows how coefficients for noncitizens vary among the classes. In this case, the likelihood that a noncitizen is in poverty differs most from citizens’ likelihood (+6.6) in medium-density PUMAs in moderately small commuting systems, and it is generally smallest in both the upper right (denser PUMAs in large metro areas) and lower left (low-density PUMAs in small commuting systems).

Table 6 Coefficient for Hispanic/Latino likelihood of poverty by DENSITY and METPOP10, with controls, 2012–2017. (Color figure online)
Table 7 Coefficient for noncitizen likelihood of poverty by DENSITY and METPOP10, with controls, 2012–2017. (Color figure online)

In all, these results demonstrate the utility and flexibility offered by continuous PUMA-level indices of urbanization, enabling researchers to distinguish a diverse range of settlement types and to quantify associations with robust demographic controls.

5 Conclusion

The limited precision of the geographic information in public use microdata from the U.S. Census and ACS makes it impossible to identify all urban areas and commuting systems exactly as they are defined by federal standards, but the spatial units that are identified in these microdata (PUMAs) are still small enough and diverse enough to enable analysts to investigate and account for demographic variations across multiple dimensions of urbanization. This ability, combined with the unique flexibility microdata offer for developing robust models of demographic relationships, makes these data a potentially critical source for understanding and responding to the evolving differences among rural, suburban, and urban populations.

To support the effective analysis of multiple dimensions of urbanization using public use microdata, we have introduced two PUMA-based indices—average tract population density (IPUMS variable: DENSITY) and average metro/micro-area population (IPUMS variables: METPOP00 and METPOP10)—which correspond to two distinct dimensions of settlement patterns: “concentration” (the local intensity of settlement) and “size” (the total population of the commuting system). We have specified how the indices are computed, explained how to interpret them, and demonstrated their value by using them to distinguish a broad range of nonlinear variations in poverty rates and in demographic covariates of poverty across the rural–urban universe of settlement patterns.

We expect these indices to be similarly useful for many other research applications, on a range of subjects at least as wide as the subject coverage of ACS microdata, including migration, commuting, racial disparities, regional economics, housing markets, and socio-economic sorting. We would also emphasize that the analytical framework we use in our case study of poverty is only one of many approaches supported by the indices. For example, rather than classifying PUMAs into discrete groups, as we have, analysts could directly include the continuous measures in models, making it possible to quantify and plot formulaic relationships between population density, commuting-system size, and demographic variables, potentially identifying interactions or specific inflection points. We hope our framework will also serve as a basis for further innovation in indices of settlement patterns, including some exploration into the alternatives we proposed here, such as a block-based (rather than tract-based) local density measure or a regional moving-window population total (rather than an average of CBSA populations).
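
As one hedged illustration of the continuous approach, the Python sketch below regresses an outcome on the two indices and their interaction. Everything in it is an assumption made for illustration: the log10 transformation, the function names, and the choice to fit by the normal equations. The small solver is included only to keep the example self-contained; an actual analysis would more likely rely on a statistical package and would account for survey weights.

```python
import math


def design_row(density, metpop10):
    """Intercept, log10 density, log10 size, and their interaction.

    The log transformation is an illustrative assumption, not a
    specification taken from the analysis above.
    """
    d = math.log10(density)
    m = math.log10(metpop10)
    return [1.0, d, m, d * m]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)
```

Given one design row per PUMA (or per person) and a vector of outcomes, `fit_ols` returns four coefficients: the intercept, the two main effects, and the interaction term. A nonzero interaction coefficient would indicate that the association between density and the outcome changes with commuting-system size.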