Introduction

Growing evidence supports a relationship between neighborhood-level characteristics and important health outcomes.1 Secondary sources of neighborhood data include commercial and administrative databases that are integrated with geographic information systems to measure the availability of certain types of businesses or destinations that may have either favorable (e.g., grocery stores) or adverse (e.g., liquor outlets) effects on health outcomes. For example, locations of food stores and restaurants provide information on food choices and access to healthy foods, which may be particularly important for low-income and minority populations.2–4 Availability of places that provide opportunities for physical activity has been associated with individual physical activity and obesity levels.5–8 Building on this work, the neighborhood service environment, as measured by availability of various types of businesses and organizations hypothesized as favorable or unfavorable to health, has been examined for its association with health outcomes.9,10

Researchers are beginning to assess the quality of secondary data sources with business information,11–14 given that these sources are increasingly used in research and that some have expressed concerns about their accuracy and completeness.5,15–18 Information on the completeness and accuracy of multiple data sources for ascertaining local businesses is important not only to health researchers, but also to community organizations, policy-makers, transportation planners, and others who seek to map local businesses and resources.

The aim of this study was to assess the concordance of two commercial databases for ascertaining the presence, locations, and characteristics of businesses by type and area-level poverty status, racial composition, and population density. Capture–recapture methods were applied to estimate the total number of business listings, including listings excluded from either database. These methods are used to estimate a population size when a census may be infeasible or impossible to conduct.19–21 Originally used for population estimation and for wildlife research and management, this technique has been applied to epidemiology for estimating incidence and prevalence of various diseases and health-related problems using data collected by two or more incomplete sources (e.g., hospital records and death certificates).19–22 For the purposes of the present study, the capture–recapture method allowed estimation of the coverage (i.e., sensitivity or extent to which the databases included the complete number of listings) of two databases for ascertaining businesses.

Methods

Study Area

The study area consisted of the City of St. Louis and St. Louis County, Missouri, United States. This area includes 590 square miles, 286 census tracts, and 1,686,724 people.23

Data Sources

Business names and locations were obtained from two of the major commercial vendors of business databases in the United States.

  • Database A includes data from the InfoUSA database bundled with the ArcGIS 9.2.5 Business Analyst software (ESRI, Redlands, California, USA). InfoUSA data are compiled from phone books, business directories, 10Ks and Securities and Exchange Commission information, government data, business magazines, newsletters and newspapers, and information from the U.S. Postal Service, verified by calling businesses.24 The InfoUSA database includes business name, industry description (i.e., Standard Industry Classification [SIC] or North American Industry Classification System [NAICS]), sales, employees, and location (latitude and longitude) based on geocoding to Tele Atlas address and street databases.24 The addresses in the database bundled with the Business Analyst software are pre-geocoded. Latitude and longitude are provided, but addresses are removed. The data were current as of January 2008.

  • Database B includes data purchased from Dun and Bradstreet, a business information provider. Companies can register in the database free of charge for credit purposes. According to the company’s website, the data are collected, aggregated, edited, and verified “from thousands of sources daily.”25 Data include address, a primary four-digit SIC code, a primary six-digit NAICS code, company names, business descriptions, number of employees, sales volume, and square footage of buildings. The database for this study included businesses active through the year 2007 and cost approximately $4,800.

Businesses were selected for inclusion in this study based on their four-digit SIC codes and classified as in previous studies7,9,26,27 into five broad and nine more specific categories (see Table 1 for categories and Appendix for SIC codes).

Table 1. Agreement and coverage of two commercial databases, by business type and area-level characteristics

Geocoding Addresses

Records from the database with addresses (database B) were geocoded using the 2005 Streetmap Extension of ArcGIS. Any records not coded or coded with scores less than 80 (n = 2,646, 13.6%) were geocoded again using the TeleAtlas web-based geocoder.

Matching Businesses

Because business addresses were not included in database A, business listings in the two databases were matched by standardized business name within specified distances. Initially, business matches were sought within nine adjacent 1,000-m cells surrounding a database B business location (hereafter, 1,000-m grids; Figure 1). When this system produced multiple matches for a single business listing, a tiered approach was adopted, whereby matches for business listings were sought within 10-m grids (tier 1), followed by 100-m grids (tier 2) and then 1,000-m grids (tier 3). Duplicates within a single database and matches between databases were removed prior to the initiation of subsequent tiers.

FIGURE 1

Illustration of grid system for identifying matches between business listings in two commercial databases. Note: The size of each individual cell was 10 m for tier 1, 100 m for tier 2, and 1,000 m for tier 3. The shaded, nine-celled grids illustrate the grids surrounding database B listings (denoted by the letter B). The letter A denotes listings from database A. The A listings within the shaded area are candidates for matching.
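The tiered grid search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes coordinates have been projected to meters, that names are already standardized and compared by exact equality, and that a database B listing with more than one candidate in a tier is simply left unmatched (the source does not describe how such ties were resolved). The function names `cell` and `tiered_match` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from math import floor


def cell(x, y, size):
    """Return the integer (column, row) index of the grid cell containing (x, y)."""
    return (floor(x / size), floor(y / size))


def tiered_match(listings_a, listings_b, cell_sizes=(10, 100, 1000)):
    """Match B listings to A listings by name within nine-cell grids.

    Each listing is a tuple (id, standardized_name, x_m, y_m), with
    coordinates in a projected system measured in meters (an assumption).
    Returns a dict mapping each matched B id to (A id, tier number).
    """
    matches = {}
    unmatched_a = {rec[0]: rec for rec in listings_a}
    unmatched_b = {rec[0]: rec for rec in listings_b}

    for tier, size in enumerate(cell_sizes, start=1):
        # Index the still-unmatched A listings by (name, cell) for this tier.
        index = defaultdict(list)
        for a_id, name, x, y in unmatched_a.values():
            index[(name, cell(x, y, size))].append(a_id)

        for b_id, name, x, y in list(unmatched_b.values()):
            col, row = cell(x, y, size)
            # Candidates: same-named A listings in the 3 x 3 block of cells
            # centered on the cell that contains the B listing.
            candidates = [
                a_id
                for dc in (-1, 0, 1)
                for dr in (-1, 0, 1)
                for a_id in index.get((name, (col + dc, row + dr)), [])
                if a_id in unmatched_a
            ]
            # Simplification: accept only unambiguous matches.
            if len(candidates) == 1:
                a_id = candidates[0]
                matches[b_id] = (a_id, tier)
                # Matched listings are removed before later tiers.
                del unmatched_a[a_id]
                del unmatched_b[b_id]
    return matches
```

Running the smallest tier first means that the closest same-named pairs are claimed before the search radius widens, which is one way to limit multiple matches for a single listing.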

Multiple methods were employed to standardize the business name to identify matches within the 10-, 100-, and 1,000-m grids, including the following:

  1. Finding common names for chain businesses in multiple locations. Business names were parsed into words in sequence. Commonly occurring single-word names and two- to five-word combinations were identified to select standardized names. Most multi-location businesses (∼4,500) were matched in this manner.

  2. Reordering business names for listings that included an individual’s name as the business name or part of the name. One database would consistently use the reverse order from the other source (last name first vs. first name first), so to facilitate matching, the standard name for one source was reordered.

  3. Searching similar character strings. The COMPLEV function in the SAS software program was used to calculate the Levenshtein distance between two strings, a measure of how similar two character strings are (the minimum number of single-character edits needed to change one into the other). This function identified the top ten most likely matches for unmatched businesses within the tier 3 grids.
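The name-reordering and string-similarity steps can be illustrated with a short Python sketch. The study used the SAS COMPLEV function; the code below is a plain dynamic-programming Levenshtein distance written for illustration, and the helper names `reorder_person_name` and `top_candidates`, as well as the two-word reordering rule, are assumptions rather than the authors' procedure.

```python
def levenshtein(s, t):
    """Return the Levenshtein (edit) distance between strings s and t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def reorder_person_name(name):
    """Turn 'LAST FIRST' into 'FIRST LAST' for two-word names (hypothetical rule)."""
    parts = name.split()
    return " ".join(reversed(parts)) if len(parts) == 2 else name


def top_candidates(name, candidate_names, k=10):
    """Return the k candidate names closest to `name` by edit distance."""
    return sorted(candidate_names, key=lambda c: levenshtein(name, c))[:k]
```

For example, `top_candidates("WALGREENS", names_in_grid)` would rank the names found in a listing's tier 3 grid so that near-identical spellings appear first; a reviewer could then confirm or reject each suggested pair.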

Analysis

Concordance and coverage estimates were calculated for all businesses and by business type, with further stratification by census-tract-level population density, percent below poverty, and race based on cutpoints used by others (Table 1).28,29

Capture–recapture methods were used to estimate the total number of listings (N) and the number of listings missed from both databases (x), based on the generic data structure in Figure 2.19–22

FIGURE 2

Generic data structure used to assess concordance and coverage of two commercial databases.

If we assume the databases are independent, then the probability of a business being present in database A if it is present in database B is equal to the probability of a business being present in database A if it is not present in database B:

$$ P\left( {{\hbox{in A}}|{\hbox{in B}}} \right) = a/\left( {a + b} \right) = P\left( {{\hbox{in A}}|{\hbox{not in B}}} \right) = c/\left( {c + x} \right) $$

Rearranging the formula, we can calculate x:

$$ \begin{array}{*{20}{c}} {a/\left( {a + b} \right) = c/\left( {c + x} \right)} \hfill \\{x = \left( {b \times c} \right)/a} \hfill \\\end{array} $$
(1)

Therefore,

$$ N = a + b + c + x $$

or, using formula 1:

$$ \begin{array}{*{20}{c}} {N = a + b + c + \left( {b \times c} \right)/a} \hfill \\{ = \left[ {\left( {a + b} \right) \times \left( {a + c} \right)} \right]/a} \hfill \\\end{array} $$

The following statistics were calculated, where M and C denote the total numbers of listings in databases A and B, respectively (M = a + c; C = a + b):

  1. \( \% {\hbox{ difference}} = \left( {M-C} \right)/M \times 100\% \)

  2. \( \% {\hbox{ agreement}} = a/\left( {a + b + c} \right) \times 100\% \)

  3. Coverage of \( {\hbox{Database A}} = M/N \times 100\% \)

  4. Coverage of \( {\hbox{Database B}} = C/N \times 100\% \)

  5. Coverage of both \( {\hbox{databases }}\left( {{\hbox{i}}.{\hbox{e}}.,{ }\% {\hbox{ captured by either or both lists}}} \right) = \left( {a + b + c} \right)/N \times 100\% \)
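As a worked illustration of these formulas, the following Python sketch computes x, N, and the five statistics from three counts. It assumes, as the coverage formulas imply, that M is the total number of listings in database A and C the total in database B; the function and key names are hypothetical. The example call uses the overall counts reported in the Results (18,199 listings in database A, 16,615 in database B, and 8,434 matched listings).

```python
def capture_recapture(matched, total_a, total_b):
    """Two-source capture-recapture estimates, assuming independent sources.

    matched -- listings found in both databases (a)
    total_a -- all listings in database A (M = a + c)
    total_b -- all listings in database B (C = a + b)
    """
    a = matched
    only_b = total_b - a          # b: in database B but not in A
    only_a = total_a - a          # c: in database A but not in B
    x = only_b * only_a / a       # estimated listings missed by both
    n = a + only_b + only_a + x   # estimated total; equals (a + b)(a + c) / a
    return {
        "x": x,
        "N": n,
        "pct_difference": (total_a - total_b) / total_a * 100,
        "pct_agreement": a / (a + only_b + only_a) * 100,
        "coverage_a": total_a / n * 100,
        "coverage_b": total_b / n * 100,
        "coverage_both": (a + only_b + only_a) / n * 100,
    }


stats = capture_recapture(matched=8434, total_a=18199, total_b=16615)
for key, value in stats.items():
    print(f"{key}: {value:.1f}")
```

With these inputs the sketch reproduces the overall percent difference (8.7%), percent agreement (32.0%), and combined coverage (73.6%) reported in the Results. The estimate depends on the independence assumption stated above; if inclusion in one database makes inclusion in the other more likely, N would be underestimated and coverage overestimated.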

Concordance was also measured for attributes of the listings present in both databases (hereafter, matched listings). Distance between geocoded points was examined, as well as percent agreement for four-digit SIC code and weighted kappas for US Census-based categories of number of employees and sales volume (Table 2).
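For readers unfamiliar with the weighted kappa, the following Python sketch shows one common form of the statistic for ordinal categories. It assumes linear disagreement weights and integer category codes 0 to k − 1; the source does not state which weighting scheme was used, so this is an illustration of the technique rather than the authors' exact calculation. The function name is hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's weighted kappa with linear weights for categories 0..k-1.

    ratings_a and ratings_b are equal-length sequences giving the category
    assigned to each matched listing by database A and database B.
    """
    k = n_categories
    n = len(ratings_a)

    # Observed joint proportions of (category in A, category in B).
    observed = [[0.0] * k for _ in range(k)]
    for i, j in zip(ratings_a, ratings_b):
        observed[i][j] += 1 / n

    # Marginal proportions for each database.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Linear disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return abs(i - j) / (k - 1)

    disagree_obs = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    disagree_exp = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    # Undefined (division by zero) if every listing falls in a single category.
    return 1 - disagree_obs / disagree_exp
```

Unlike simple percent agreement, this statistic gives partial credit when the two databases place a listing in adjacent categories (for example, neighboring employee-size bands) and corrects for the agreement expected by chance from the marginal distributions.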

Table 2. Distance in meters and agreement in four-digit SIC codes, business size, and sales among business listings contained in both commercial databases

Results

After excluding duplicates (189 in database A; 90 in database B), database A included 18,199 listings, and database B included 16,615 listings (8.7% difference; Table 1). With the exception of religious and membership organizations, database A contained more listings than database B across all business classifications.

The percent agreement between the databases was 32.0% for all types of businesses combined, ranging from 21.8% for leisure and entertainment businesses to 51.0% for eating places (Table 1). The coverage of database A exceeded the coverage of database B for all business categories except religious and membership organizations. Most of the differences in coverage estimates between databases were small, with the exception of those for libraries/post offices and banks. The coverage of both databases for all businesses combined was 73.6%, but ranged from 58.7% for leisure and entertainment businesses to 89.6% for eating places.

As shown in Table 1, agreement between databases for all businesses combined varied only slightly by area-level characteristics. The most apparent trend appeared for census-tract population density; agreement and coverage tended to increase with population density. A positive trend was also present, but much weaker, for the agreement and coverage estimates by poverty status.

For some business types, percent agreement and coverage of the individual and combined databases varied by area-level characteristics (Figure 3). The most consistent trends were observed for census-tract population density, where the coverage of both databases differed by more than 10% between at least two strata. Coverage estimates increased with increasing population density for businesses with undesirable amenities, banks, food stores, and retail businesses. An inverse association was observed for libraries and post offices. While some stratum-specific estimates for coverage of both databases varied by more than 10% for the poverty and race strata, the directions of the trends were inconsistent (data not shown). The only exceptions were observed for physical activity facilities, where the coverage estimates varied by racial composition (77.3% for ≥75% white; 75.8% for mixed race; and 66.1% for ≥75% black), and for services with undesirable amenities, where the coverage estimates increased with higher poverty levels (74.6% for <10% below poverty; 78.8% for 10–19.9% below poverty; and 83.4% for ≥20% below poverty). Similar patterns were observed for percent agreement.

FIGURE 3

Coverage of both databases by business classification and population density of census tracts. Asterisk indicates >10% difference in coverage estimates between at least two strata.

Agreement in attributes of the 8,434 matched listings (distance, four-digit SIC code, number of employees, and sales volume) was also examined (Table 2). The mean (SD) distance between matched listings was 108.2 (179.0) m (∼0.07 miles). The 95th percentile and maximum for distances between matched listings were 369.0 m (∼0.23 miles) and 2,283.2 m (∼1.4 miles), respectively (data not shown). The average distance between matched listings varied somewhat by business type, from 71.3 m difference for alcohol and tobacco establishments to 136.9 m difference for physical activity facilities.

The percent agreement for four-digit SIC was 84.6% for all businesses combined and ranged from 61.9% for banks to 97.3% for libraries and post offices. Agreement in number of employees was moderately high for all businesses combined (weighted kappa = 0.63), with the weighted kappa for most business categories falling between 0.55 and 0.80. Agreement in sales volume was poor for all businesses combined (weighted kappa = 0.04) and across all business categories (all weighted kappas <0.25). Most of this discrepancy was attributed to differences in the distribution of matched listings with <$100K sales (27.6% of database A and 53.9% of database B).

Discussion

Although commercial databases provide a feasible means to characterize availability and density of a variety of businesses, our study supports recent evidence2,12–14 that data from commercial databases may contain substantial errors, with little agreement between one another. We found mostly fair agreement between databases (32%), which varied somewhat by business type and area-level characteristics. Agreement and coverage were highest for popular walking destinations, eating places, and alcohol and tobacco establishments. Perhaps these are businesses that benefit most from advertisement and registration in the databases. Agreement and coverage were lowest for health-care facilities and leisure and entertainment businesses. The low agreement among health-care facilities is likely attributable to the heterogeneity of businesses within this category, from individual health professionals to large hospitals.

Four published studies of smaller geographic areas (either selected census tracts or city blocks) assessed the quality of commercial data by comparing such data on physical activity facilities (e.g., health clubs, dance studios) and/or food stores (e.g., grocery stores, convenience stores) with field data11–13 or governmental records.14 These studies found moderate agreement for physical activity facilities12,13 and moderate to high agreement for food stores.11,13,14 We are aware of only two other studies that have directly compared databases A and B. The first compared aggregated counts of various classifications of food-related businesses in 50 zip codes in the Minneapolis/Saint Paul, Minnesota metropolitan area.2 Although the counts for the broader business classifications for food and beverage stores and food services and drinking places showed only minor differences between databases A and B (−13% and −5%, respectively), large discrepancies in the counts of subtypes of these businesses were found. Another smaller study assessed concordance for retail services and personal care businesses in one zip code in west Miami Dade, Florida, showing 46.5% agreement, with coverages of 84.2% for database A and 62.4% for database B.30 Our findings extend these results by including many different types of businesses across a much larger geographic area.

Agreement and coverage of both databases varied slightly by census-tract population density, racial composition, and poverty level for some businesses. This finding differs from that observed by Boone et al., who found slightly higher agreement for physical activity facilities among nonurban vs. urban census tracts.12 Paquet et al. observed no variation in agreement for food stores or physical activity facilities by area-level socioeconomic status.13 Bader et al. found no consistent pattern between area-level sociodemographic characteristics and disagreement between field data and a commercial data source for the presence of various food outlets.11 The variation in agreement by population density in this study may be attributable to geocoding errors, particularly for addresses that were pre-geocoded in database A. Zip code centroids are often the default location when addresses cannot be located and may be a considerable distance from actual addresses, particularly in lower-density, suburban areas. Also, the low-to-moderate agreement in the number of employees and sales volume for businesses present in both databases raises questions about the utility of these data for characterizing the size of businesses, as observed by others.30

Limitations of our study include possible outdated business listings, false negative matches, and false positive matches. We did not assess the extent of outdated business listings, but differential errors in ascertainment between the databases would result in an underestimation of the coverage of listings in the more accurate database. Despite the possibility of this error, coverage remained low for individual databases. Also, some matches may have been missed (false negatives) if their business name and locations differed between sources or fell outside the catchment area for detecting matches. Significant manual labor and programming skills were required for geocoding addresses (database B) and comparing business names at the varying catchment areas (approximately 500 hours of total person-time), so error is possible but unlikely to change the conclusions about the low levels of agreement. Field validation was infeasible given the number of businesses and our geographic area. This represents a significant limitation of this study because the direction of error for each database cannot be confirmed. False negative matches may have also resulted from businesses being differentially classified by four-digit SIC code between the data sources. Some discordant listings may be included in both databases but may have been classified by SIC codes excluded by our a priori selection in one of the databases.12 Four-digit SIC codes are not standardized across databases.2 Finally, different businesses may have been incorrectly matched (false positives) if they contained similar names or were branches of the same chain business within the catchment area. The fact that approximately 95% of matched listings were within 400 m of one another provided some support that the matches were true matches.

Despite these limitations, this study contributes important evidence about the low concordance of two of the most widely used commercial business databases. The geographic area and number of businesses examined exceed those of previous studies. Moreover, this study applied capture–recapture methods to estimate the total number and coverage of the databases—a promising method to estimate exposure to specific destinations when multiple, incomplete data sources are available.

To measure exposure to certain businesses for neighborhood-effects studies, researchers must select between existing databases or collect field data when feasible. Based on our findings, combining commercial databases may be impractical for studies covering large geographic areas when the databases lack standardized business classifications and common identifiers to match businesses between databases. Overall, health researchers should cautiously interpret findings when using either of these commercial databases to yield measures of the neighborhood service environment. Differences in agreement raise questions about differential misclassification when using these databases to characterize neighborhoods and their effects on health outcomes.