1 Introduction

Geography plays an important role in the establishment and growth of communities. People tend to interact more when they live in proximity, and these interactions in turn determine the social norms they are likely to form and follow (Coleman, 1990; Bendor & Swistak, 2001; White & Johnson, 2016). The concept of neighborhood or community, and its effects on attitudes and values, is fundamental to human ecology, sociology, and geography, and indeed to all areas of social science. A "neighborhood" is defined as an area where the residents are "interrelated and integrated with reference to its daily requirements" (Hawley, 1950, p. 257). These intra-community relationships can generate correlated data when individuals from multiple communities are studied. Thus, ignoring neighborhood effects, or spatial autocorrelation in general, may bias the results of models investigating the effects of individual and household variables on behavior, including migration decisions (Bilsborrow, 2016; Bilsborrow et al., 1984; Chen et al., 2009; Sullivan et al., 2017; Zhang et al., 2021; Zvoleff et al., 2013).

Spatial processes in migration and their impacts on populations have long been examined in spatial demography (Rogers, 1968, 1975; Wilson, 1974; Howell & Frese, 1983). Migration can have significant impacts on population composition in both origins and destinations, especially when people move from poorer to wealthier countries or from rural to urban areas, moves that are usually made primarily for better economic opportunities (e.g., Ravenstein, 1885; Sjaastad, 1962; Lee, 1966; Bilsborrow et al., 1984; Chandrasekhar & Sharma, 2015; Vega & Brazil, 2015; Mazza et al., 2018; Raymer et al., 2020). Interest in the neighborhood effects inherent in migration decisions and consequences has emerged more recently (Findley, 2019; Massey et al., 1990). Migrant flows have been found to be highly differentiated by citizenship and nativity (e.g., Lichter & Johnson, 2009; Raymer et al., 2013). Social networks created by geographical proximity and shared experiences have also been observed to be among the major factors affecting migration (and non-migration, from the psychic costs of leaving family, friends, and one's local community) (Curran & Rivero-Fuentes, 2003; Lee et al., 1994; Lichter & Johnson, 2009; Massey, 1990; White & Johnson, 2016). The likelihood of migration, and the way it is funded and facilitated, correlate highly with the spatial patterns of migrant origins (Bell & Ward, 2000; Crowder & South, 2008). For our study area, the Fanjingshan National Nature Reserve (FNNR), out-migration from the rural reserve area to distant cities is part of the major trend of rural–urban migration ongoing in China since the "opening up" in the 1980s. In addition to socioeconomic factors, migration decisions are also affected by spatial location and interactions with neighbors.

To detect and reduce the spatial autocorrelation from which neighborhood effects arise, studies in geography and ecology have used several kinds of models, including multilevel models and eigenvector spatial filtering (ESF; see Beisner et al., 2006; Rangel et al., 2010). While multilevel models require substantive information about the spatial units, ESF is less demanding and uses a form of data mining in which alternative spatial definitions are tested until the best-performing one is found (Getis, 1992; Griffith, 2000).

In this paper, we compare the effectiveness of multilevel and ESF methods in reducing spatial autocorrelation in an out-migration model. Using two ways of measuring neighborhoods in ESF (defined by Euclidean vs. topological distance), we evaluate the potential of topologically based distance matrices in an area with large variations in elevation.

2 Background

2.1 Eigenvector Spatial Filtering (ESF)

Traditional regression analyses that do not account for neighborhood effects may be compromised because they violate the essential assumption that regression residuals are independent. Although individuals or elements in multilevel statistical models can be organized hierarchically to deal with neighborhood effects (Goldstein, 2011), such models require prior knowledge about the sizes of the classes or clusters (e.g., schools for studies of students, hospitals for patients, political districts for voters). However, in many if not most situations, the most relevant geographic area or size for such classes or clusters is not known, and may not even be identifiable; it varies greatly with the prevalence and quality of transportation linkages and the specific type of behavior or decision under consideration (Bilsborrow et al., 1984; Hawley, 1950).

In cases when we have little or no a priori knowledge of the appropriate neighborhood, an effective way to examine neighborhood impacts is to apply spatial filtering methods (Getis, 1992; Griffith, 2000). Other spatial models, including the simultaneous autoregressive (spatial error) model and the autoregressive response (spatial lag) model, are restricted to OLS regression models (Chun et al., 2016). The ESF method decomposes key variables in standard multiple regression models into spatial and non-spatial components, thereby eliminating spatial autocorrelation. The non-spatial components can then be analyzed in any standard regression model, which offers considerable potential to detect mechanisms that might be overlooked in models that ignore the spatial component. ESF defines an n × n spatial weights matrix C (n being the number of observations or data records) composed of 1s and 0s, each indicating whether or not a pair of objects is considered to be spatial neighbors. In our data mining approach, we designate households within varying distances of the household under investigation as neighbors (i.e., these households are assigned a 1 and all the rest a 0), to account for different yet unknown neighborhood sizes. Transforming the C matrix, we get:

$$MCM = \left(I - \mathbf{1}\mathbf{1}^{T}/n\right) C \left(I - \mathbf{1}\mathbf{1}^{T}/n\right) \qquad (1)$$

where I is the n × n identity matrix, 1 is an n × 1 column vector of ones, and T denotes the matrix transpose. It has been shown that all the eigenvectors of MCM, i.e., E1, E2, …, En (associated with eigenvalues ranked in descending order as λ1 > λ2 > … > λn, where λn is the eigenvalue of the nth eigenvector), are orthogonal. Through a selection procedure (e.g., stepwise regression or choosing the top k eigenvectors, where k < n), a subset of eigenvectors can be chosen and used as regressors in multivariate regression analysis. Adding these eigenvectors to regression models can remove or at least reduce the contribution of spatial components and generate less biased estimates of regression coefficients (Chun & Griffith, 2011; Griffith, 2000; Tiefelsdorf & Griffith, 2007).
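The construction in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (which uses R): the function names, the toy coordinates, and the use of shifted power iteration to obtain the leading eigenvector are all choices made for this example. A full analysis would extract several eigenvectors with a linear algebra library.

```python
import math

def neighbor_matrix(coords, threshold):
    """Binary weights matrix C: 1 if two distinct points lie within `threshold`."""
    n = len(coords)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                C[i][j] = 1.0
    return C

def double_center(C):
    """MCM = (I - 11^T/n) C (I - 11^T/n), computed entry by entry:
    subtract the row mean and column mean, then add back the grand mean."""
    n = len(C)
    row_mean = [sum(row) / n for row in C]
    col_mean = [sum(C[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[C[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
            for i in range(n)]

def leading_eigenpair(A, iters=500):
    """Eigenpair with the largest (algebraic) eigenvalue of a symmetric matrix.
    The matrix is shifted by a Gershgorin bound so that power iteration
    converges to the largest eigenvalue rather than the largest in magnitude."""
    n = len(A)
    shift = max(sum(abs(v) for v in row) for row in A)
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(Av, v)), v

# Toy layout (hypothetical): two tight clusters of three households each.
coords = [(0, 0), (0.05, 0), (0, 0.05), (1, 1), (1.05, 1), (1, 1.05)]
MCM = double_center(neighbor_matrix(coords, threshold=0.1))
eigenvalue, E1 = leading_eigenpair(MCM)
```

In this toy layout the leading eigenvector takes one sign in the first cluster and the opposite sign in the second, which is the kind of map pattern that, when added as a regressor, absorbs cluster-level spatial structure.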

ESF has been widely used in ecology (Beisner et al., 2006; Rangel et al., 2010) and in the social sciences for topics such as land prices (Seya & Tsutsumi, 2013) and car ownership (Hankach et al., 2022). A few recent studies on migration also utilize this approach, but they focus on network autocorrelation, where the migration of someone from a given origin to a given destination is correlated with that of other persons from the same origin to the same destination (Liu et al., 2017; Gu et al., 2020). Gu et al. (2020), for example, examine the intentions of migrants to transfer their hukou from the origin to the destination (that is, their legal household registration determining permanent residency, conferring residency benefits such as free education and healthcare). Their research found significant network autocorrelation in migration intentions, whose impact on the regression model was filtered out by ESF. It has therefore been argued that the inclusion of variables to control for network autocorrelation can significantly improve models of the determinants of migration (Chun & Griffith, 2011).

In the context of our study, we are interested in out-migration decisions from FNNR and therefore have coupled the ESF method with the Cox model (see details in Sect. 3.2) to examine neighborhood effects. Households close to each other are more likely to make similar decisions and are also subject to the same local government policies and environmental conditions. We hypothesize that neighborhood effects exist in the model at three different distance ranges (short, medium, and long distances; see details in Sect. 3.2) based on our prior knowledge of the site. We also hypothesize that models using topological distances can detect more neighborhood effects than the Euclidean-distance-based models because topological distances better capture the distance traveled in the landscape of our study site. The data-mining approach of ESF allows us to determine which neighborhood definitions have the most significant impacts on the variables we are interested in. In addition, in social analyses, ESF has mostly been applied in urban contexts where there is no meaningful variation in elevation. In a prior study, neighborhoods were identified in the same FNNR area with a similar approach but for a different model (Zhang et al., 2021), in which only Euclidean distances were used. Here, we add a topological distance matrix to the Euclidean distance matrix, which may shed light on how ESF could be applied in areas with large elevation variations, as in vast rural parts of the world.

2.2 Rural-to-Urban Migration in China

In the past 40 years, China has undergone profound socioeconomic changes following the "reform and open-door policy" initiated in 1978. Modernization and urbanization have expanded from coastal provinces to inland regions, as well as from big cities to many remote rural areas. Meanwhile, over 200 million adult migrants have left their rural homes and towns for metropolitan areas such as the Yangtze and Pearl River Deltas, seeking better opportunities for personal development and higher incomes to support their families (Zhao, 1999; Liang, 2016). Rural–urban migration is important in most developing countries (it was so earlier in present-day developed countries as well) as it is linked directly to the processes underlying socioeconomic development. Rural families may migrate as units or may send off a household member to earn a higher income and then benefit from receiving the migrant's remittances. In the latter case, and consistent with the theory of the New Economics of Labor Migration (Stark, 1991; Stark & Bloom, 1985; Stark & Taylor, 1991), the migrant is expected to earn more in the destination area and share it with the household of origin by sending (or bringing) back money or goods, increasing and diversifying the income sources of the origin household.

This type of migration appears to have accelerated in China following the large, national payments for ecosystem services (PES) programs, such as the Grain to Green Program (GTGP) and the Natural Forest Conservation Program (NFCP) (Liu et al., 2008; Zhang et al., 2018). The GTGP program has encouraged the transformation of farmland (or pasture) to secondary forests (or grassland), compensating landowners for withdrawing land from cultivation or pasturing. In addition, the Chinese government had already loosened long-standing restrictions on migration from rural to urban areas, such as the hukou system, which, as previously mentioned, was used in part to deter migrants from becoming permanent residents in destination cities. This institutional change released farm labor from cultivating farmland or pasturing and allowed them to out-migrate, especially to metropolitan or coastal areas (Zhang et al., 2018; Zhao, 1999).

The goal of most migrations is to seek a better life, but the paths and fruits of the pursuits vary. In the remote villages of the Fanjingshan National Nature Reserve (FNNR) of south-central China, over half of the households surveyed had at least one outmigrant, but their socioeconomic circumstances differed considerably. Some families still relied mainly on agriculture for their livelihoods, while others had already diversified their sources of income by participating in tourism or other non-agricultural employment. Households in this study area also differed in ethnicity (most being minorities), age composition, education level, and even access to natural resources. Finally, they were participating in the two aforementioned programs (GTGP and NFCP) to varying degrees, receiving different amounts of subsidies based on how much farmland and forest land, if any, they had enrolled in the two programs (Yost et al., 2020).

This research examines the potential impacts of spatial neighborhood size on out-migration from households in the FNNR. Building on the basic Cox statistical model of household migration previously developed for the study area (Yang, 2019), we examine differences in model results before and after incorporating neighborhood effects. In addition to adding to the literature on spatial effects in migration, our methodology can also inform the study of other processes subject to neighborhood effects.

3 Data and Methods

3.1 Study Site and Data Collection

Located in the Wuling mountain range in the Guizhou province, China, the FNNR is a highly biodiverse area and home to many endangered species such as the Guizhou snub-nosed monkey (Rhinopithecus brelichi), the Chinese giant salamander (Andrias davidianus), and the forest musk deer (Moschus berezovskii). The altitude in the reserve ranges from 700 to 2600 m, encompassing a variety of ecosystems (Yang et al., 2002). At the time of the survey in 2014, there were 3256 households residing within or near the boundaries of the FNNR. In addition to resource collection in the forest and agricultural practices, remittances from outmigrants are an important source of income. Following the rural-to-urban migration trend discussed previously, an increasing number of residents have chosen to migrate to cities with more job opportunities.

We conducted a household survey in 58 randomly selected natural villages from all the 123 natural villages located in or near the boundary of FNNR (Fig. 1). We interviewed the household head, if present at the time of the survey, or another knowledgeable adult, usually the spouse, if the head was not present. We collected data on each household's agricultural land area and land use, sources of household income, and the household's enrollment in and value of subsidies received from the GTGP and NFCP programs in the previous 14 years (2001–2014). For migration, we considered all persons aged 15–59 in the household as the persons of interest (number of laborers in Table 1), who could be making migration decisions (whether or not to migrate) in the year of migration. In every household, we also collected data on each adult's age, gender, education, marital status, etc., along with changes over time each year since 2000, notably in education, marital status, residence (whether an outmigrant or return migrant), main work activity (on the farm, off-farm, managing non-farm businesses, none). The main source of household income was also obtained for each year.

Fig. 1 Fanjingshan National Nature Reserve, showing the locations of interviewed households in this study

Table 1 Variable descriptions and summary statistics

In households with an outmigrant still living away at the time of the survey (summer of 2014), the data above were collected for each year using an individual event history table (if more than one adult migrant existed in the household, one was selected randomly). For all households with or without an outmigrant, one (non-migrant) member aged 15–59 was also selected to obtain the same event history since 2000. All such individuals selected were considered the "population at risk of migration."

3.2 Cox Model

The dependent variable in the model is whether the selected individual was an outmigrant from the household in any year during the reference period (2001–2014)—an outmigrant being defined as a member (or former member) of the household who lived outside the county for more than 6 months and was living away at the time of the survey. Based on the migration literature and visits to the survey households, we identify a suite of independent variables at both individual and household levels to predict out-migration decisions (Table 1; drawing on Yang, 2019). We include individual-level variables, such as age, gender, years of school completed, marital status, main work activity, as well as household-level variables, i.e., household size, main source of income, farmland area, participation in GTGP and NFCP, etc. A household is defined as having a migration network if the household respondent reported having a close relative (defined as a parent, child, or sibling) living outside the local county in the year prior to migration, or for a household with no migrant in the reference years, having such a relative living outside the county five years before the survey. Note that most variables were measured with a time-sensitive perspective (i.e., are time-varying), so changes during the modeling period (2001–2014) are incorporated in the survival analysis model, as will be explained later. Thus, some young persons aged into the main "population at risk of migration" (15–59), while others aged out during the reference period.

To include contextual variables in the model, we recorded the village ID of each household, which links the household to its village cluster (23 in total; see Footnote 1). We also used GPS devices to record each household's exact geographical location, which was then used to calculate neighborhood metrics. Further details on the survey design, implementation, variable selection, and the multilevel Cox model specification are available in Yang (2019) and Yost et al. (2020). In this study, we identify 513 households (16% of the total population; see Footnote 2) in the study area where all the variables needed for the model are available (Fig. 1).

Survival analysis, a proportional hazards statistical estimation model, is an appropriate technique for examining the occurrence and timing of events (Allison, 2010; An & Brown, 2008; Klein & Moeschberger, 2003). In particular, Yang (2019) and Yost et al. (2020) utilize a multilevel Cox hazard model to predict the determinants of an individual's out-migration. The dependent variable in survival analysis is the “hazard,” expressed as the binary result of out-migration in a year, with 1 indicating out-migration of that individual and 0 non-migration (Therneau, 2018):

$$y\left( t \right) = y_{0} \left( t \right)e^{X\beta + Zb} \qquad (2)$$

where \({y}_{0}\) is the baseline hazard function, X and Z are the design matrices for the fixed and random effects, respectively, and \(\beta\) and \(b\) are the corresponding vectors of regression coefficients. In our analysis, we start with this basic non-spatial model, and then incorporate dummy variables into it, using the village cluster ID to capture overall contextual effects on migration decisions. To better account for spatial effects, we integrate the ESF method into this basic model (An et al., 2016; Chun & Griffith, 2011; Griffith, 2000). We calculate eigenvectors for each household at several predefined neighborhood sizes.
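To make the exponential term in Eq. (2) concrete, the following minimal sketch evaluates the hazard of one individual relative to the baseline. The function name and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from this study. Selected eigenvectors would enter the same way, as additional entries of the covariate vector.

```python
import math

def relative_hazard(x, beta, z=(), b=()):
    """Return exp(X*beta + Z*b), the hazard relative to the baseline y0(t)."""
    linear = sum(xi * bi for xi, bi in zip(x, beta))
    linear += sum(zi * bi for zi, bi in zip(z, b))
    return math.exp(linear)

# Hypothetical example: one binary covariate whose coefficient is ln(2),
# so individuals with x = 1 face twice the baseline hazard.
doubled = relative_hazard([1], [math.log(2)])
baseline = relative_hazard([0], [math.log(2)])
```

A positive coefficient therefore raises the hazard of out-migration in every year by a constant multiplicative factor, which is the proportional hazards assumption.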

To do this, we first define neighborhoods based on Euclidean distance; all households within a certain fixed distance are identified as "neighbors" of the household of interest. We use the "spdep" package in R to generate the neighborhood matrices, starting with 0.04 km (the minimum distance that allows more than half of the data points to have at least one neighbor) and ending at 10 km, which covers around a third of the spatial extent of the entire area. In addition to ensuring that each specification includes enough neighbors, these neighborhood definitions are selected based on previous studies and theorization: they have been found to have significant impacts on another model at the same study site (Zhang et al., 2021). While the short-distance definitions (0.04 km to 0.1 km) capture interactions between close neighbors who might see each other on a daily basis, neighbors under the moderate-distance definitions (0.1 km to 1 km) would not interact with each other as frequently, but are likely to share key infrastructure (e.g., access to schools, roads, and markets) and similar environmental conditions. Finally, the long-distance definitions (1 km to 10 km) are included because such more remote neighbors may belong to the same village, or another larger administrative unit, and be subject to similar local government policies.
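The fixed-distance rule and the criterion used to pick the smallest distance can be sketched as follows. This is a plain-Python stand-in for illustration, not the spdep workflow used in the study; the function names and the toy coordinates (in km) are assumptions.

```python
import math

def neighbors_within(coords, threshold_km):
    """For each point, list the indices of all other points within the threshold."""
    n = len(coords)
    return [[j for j in range(n)
             if j != i and math.dist(coords[i], coords[j]) <= threshold_km]
            for i in range(n)]

def share_with_neighbor(coords, threshold_km):
    """Fraction of points that have at least one neighbor at this threshold."""
    lists = neighbors_within(coords, threshold_km)
    return sum(1 for nbrs in lists if nbrs) / len(coords)

def smallest_usable_threshold(coords, candidates_km):
    """Smallest candidate distance at which more than half of the points
    have at least one neighbor; None if no candidate qualifies."""
    for d in sorted(candidates_km):
        if share_with_neighbor(coords, d) > 0.5:
            return d
    return None

# Hypothetical household coordinates in km, invented for this example.
toy = [(0, 0), (0.03, 0), (0.5, 0), (0.56, 0), (3, 0)]
chosen = smallest_usable_threshold(toy, [0.02, 0.04, 0.1, 1])
```

With these toy points only two of five households have a neighbor at 0.04 km, so the rule moves on to 0.1 km, where four of five do.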

We also calculate the topological distance between each pair of households with the "topoDistance" package in R and replace the Euclidean distance matrix with the resulting topological distance matrix. For this method, we use the same fixed-distance definitions from 0.04 to 10 km. Topological distances are defined here as distances that also take terrain differences into account. Taking an additional digital elevation model (DEM) raster layer, the tool overlays the household locations on the DEM to find the elevation of each point. It then calculates the topological distances by finding the shortest topographic path between points, thus better representing the distances actually traveled. The differences between the Euclidean and topological distances range from 0.03 to 3.78 km, which is substantial relative to our neighborhood definitions.
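The idea of a shortest topographic path can be illustrated with a small grid search. The sketch below is not the algorithm of the topoDistance package; it assumes a DEM stored as a list of rows, movement between the eight adjacent cells, and a step length equal to the straight-line surface distance between cell centers. All names and the toy elevations are hypothetical.

```python
import heapq
import math

def topographic_distance(dem, cell_size, start, goal):
    """Length of the shortest surface path between two grid cells (Dijkstra).
    `dem` holds elevations, `cell_size` is the horizontal cell width, and
    both must use the same unit (e.g., metres)."""
    rows, cols = len(dem), len(dem[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return dist
        if dist > best.get((r, c), math.inf):
            continue  # stale queue entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    horizontal = cell_size * math.hypot(dr, dc)
                    rise = dem[nr][nc] - dem[r][c]
                    new = dist + math.hypot(horizontal, rise)
                    if new < best.get((nr, nc), math.inf):
                        best[(nr, nc)] = new
                        heapq.heappush(queue, (new, (nr, nc)))
    return math.inf

# Hypothetical 3 x 3 DEMs with 100 m cells: flat ground and a 300 m ridge.
flat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ridge = [[0, 300, 0], [0, 300, 0], [0, 300, 0]]
d_flat = topographic_distance(flat, 100, (0, 0), (0, 2))
d_ridge = topographic_distance(ridge, 100, (0, 0), (0, 2))
```

The two cells are 200 m apart in plan view in both cases, but the ridge makes the surface path more than three times as long, so a pair of households that count as neighbors under a Euclidean threshold may not under a topographic one.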

To find the neighborhood size appropriate for our model within the ranges mentioned above, we use a data mining approach: from the multiple distances tried, we choose the one that meets thresholds on the diagnostic indicators (the Moran's I z scores and AIC scores described below), in order to uncover or approach the optimal size. For each distance specification, we calculate the top 5 eigenvectors (i.e., the ones with the highest eigenvalues: An et al., 2016; Chun & Griffith, 2011; Sullivan et al., 2017) and attach them to the original Cox model developed by Yang (2019) to re-estimate the regression. As the dependent variable is binary, the residuals d are calculated from the following equations:

$$\ln\frac{p_{i}}{1 - p_{i}} = b_{0} + \sum_{j} b_{j} X_{ij}, \quad \text{and} \quad d_{i} = \begin{cases} \sqrt{2\left|\ln\left(p_{i}\right)\right|}, & y_{i} = 1 \\ -\sqrt{2\left|\ln\left(1 - p_{i}\right)\right|}, & y_{i} = 0 \end{cases} \qquad (3)$$

where p is the probability of migration calculated from the statistical model, y the dependent variable, X the set of independent variables, and b the coefficients of the independent variables.
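Eq. (3) translates directly into code. The following sketch, with hypothetical function names and inputs, first recovers p from the logit and then computes the signed residual d for a single observation.

```python
import math

def predicted_probability(b0, b, x):
    """Invert the logit in Eq. (3): p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    logit = b0 + sum(bj * xj for bj, xj in zip(b, x))
    return 1.0 / (1.0 + math.exp(-logit))

def deviance_residual(p, y):
    """Signed residual d from Eq. (3) for a binary outcome y in {0, 1}."""
    if y == 1:
        return math.sqrt(2 * abs(math.log(p)))
    return -math.sqrt(2 * abs(math.log(1 - p)))

# With a logit of zero the predicted probability is 0.5, and the residual
# has the same magnitude for a migrant (y = 1) and a non-migrant (y = 0).
p_half = predicted_probability(0.0, [1.0], [0.0])
d_migrant = deviance_residual(p_half, 1)
d_stayer = deviance_residual(p_half, 0)
```

The residual is positive for observed migrants and negative for non-migrants, and it grows in magnitude as the predicted probability moves away from the observed outcome.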

We then calculate the Moran's I statistic (see Footnote 3) and the related z scores of the regression residuals of both the spatial and non-spatial models in R using the spdep package. For the non-spatial model, we calculate the Moran's I statistic with each neighborhood definition used in the spatial models in order to compare the results with those from the spatial models. For the spatial models, we calculate the statistics with the same neighborhood definition (e.g., the Moran's I results of the 5-km spatial model are calculated with a 5-km neighborhood definition). Akaike information criterion (AIC) scores (see Footnote 4) are also calculated for each model and used to select models with better fit. Finally, we calculate the Moran's I statistics for all non-binary variables to identify specific variables that could be more spatially autocorrelated than others and thus cause more bias.
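For readers unfamiliar with the statistic, a minimal Moran's I calculation is sketched below. It is not the spdep routine used in the study: the z score here uses the variance under the normality assumption, which is an assumption of this example and may differ from the variance option used by the authors. The function name and the four-point example are hypothetical.

```python
import math

def morans_i(values, W):
    """Moran's I and its z score (normality assumption) for weights matrix W."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)
    cross = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    I = (n / s0) * cross / sum(d * d for d in dev)
    expected = -1.0 / (n - 1)
    s1 = 0.5 * sum((W[i][j] + W[j][i]) ** 2 for i in range(n) for j in range(n))
    s2 = sum((sum(W[i]) + sum(W[j][i] for j in range(n))) ** 2 for i in range(n))
    variance = (n * n * s1 - n * s2 + 3 * s0 * s0) / ((n * n - 1) * s0 * s0) - expected ** 2
    return I, (I - expected) / math.sqrt(variance)

# Four points on a line, each linked to its adjacent points, with a steady
# trend in the values (hypothetical data for illustration).
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
I, z = morans_i([1, 2, 3, 4], W)
```

Here I is positive because adjacent points have similar values, but with only four points |z| stays below 1.96, so the pattern would not be judged significant at the 5% level.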

4 Results

4.1 The Basic Cox Model Results

The difference between the random-intercept and fixed-intercept multilevel models is found to be insignificant in ANOVA analysis (p > 0.05), so we report the results only for the fixed-intercept models.

An individual's age and the household having agriculture as its main source of income are consistently significant and negatively linked to out-migration, while gender (male), being married, the number of working-age adults in the household, and the household having migration networks are significant positive predictors (Table 2). Education and the area farmed are only weakly positively linked to out-migration, while the other variables, retained in the model for theoretical reasons (they were expected to be important), are not statistically significant in the full multivariate model. Overall, the results for all the statistically significant variables are consistent with theoretical expectations.

Table 2 Results of basic non-spatial model and dummy variable model on factors affecting determinants of out-migration

We then compare the results from the two non-spatial multilevel models, first the basic one and then the one including village dummy variables to capture the overall effects of village factors on individual out-migration. The significance levels of the independent variables are nearly identical in the two models, and their coefficients do not change much (Table 2). Five of the 23 village clusters have statistically significant effects, and three more have marginal effects. In this case, however, no significant changes are observed in the individual or household variables: there is only a very slight weakening of the effects of education and marriage at the individual level and of the household having agriculture as its main income source at the household level, and tiny increases in the importance of household farm area and migration networks.

When ESF is incorporated into the model, spatial autocorrelation in the regression residuals (see Sect. 3.2 for details) is significantly reduced (Table 3). While the significance levels of most variables remain the same, several do change (Table 4).

Table 3 Moran's I test and AIC results for the ESF models where the decrease in residual spatial autocorrelation is significant
Table 4 Cox model coefficients for non-spatial and ESF models with significant improvements in z values, as determined from the Moran's I test and AIC score in Table 3

First, the household labor availability variable, significant (at the p = 0.05 level) in the non-spatial model, becomes insignificant in some of the ESF models (Models 1, 2, 4, 5, and 6). Second, the p-values of the area farmed are greatly reduced in the ESF models, switching from insignificant (at the p = 0.05 level) to significant in several models (Models 1, 5, and 6) after filtering out neighborhood effects. Third, the education variable decreases in significance, its p-value changing from 0.055 in the non-spatial model to greater than 0.1 in most ESF models (Models 1, 2, 3, 5, 7, 8, 9, and 10). All of the coefficient values are similar to those in the non-spatial model.

4.2 Spatial Autocorrelation

For the Euclidean distance definitions, the Moran's I results from the non-spatial model show that in 5 out of 15 neighborhood definitions examined (3-km, 4-km, 6-km, 7-km, and 8-km), model residuals are significantly spatially autocorrelated (|z|> 1.96; Table 5). We observe a decrease in the absolute values of the z score in 11 out of 15 of the definitions, the exceptions being under the 0.5-km, 1-km, 2-km, and 5-km neighborhoods, suggesting an overall reduction in the spatial autocorrelation of residuals after applying the ESFs. The AIC score also decreases as we increase the measure of distance, becoming smaller than that of the non-spatial model (4281) at 2 km and then rising again at 9 km. This means that between the neighborhood definitions of 2 km and 9 km in this study area, ESF is more effective in reducing spatial autocorrelation and providing a better model fit. Among these models where the spatial autocorrelation is reduced, we identify 4 models (3-km, 4-km, 7-km, and 8-km) where the reduction is significant (|z| decreases from more than 1.96 to less than 1.96). The AIC scores of these models are also optimal among all the ESF models.
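The decision rule used above to flag a "significant" reduction can be written out explicitly. The sketch below is illustrative only: the function name is hypothetical, and the z scores are invented placeholders, not values from Table 5.

```python
def classify_filtering_effect(z_before, z_after, critical=1.96):
    """Compare |z| of Moran's I on residuals before and after adding ESF terms."""
    before, after = abs(z_before), abs(z_after)
    if before > critical and after < critical:
        return "significant reduction"  # crosses the 1.96 threshold
    if after < before:
        return "reduction"
    return "no reduction"

# Hypothetical (z_before, z_after) pairs, invented for illustration only.
examples = {
    "definition A": (2.40, 1.10),
    "definition B": (1.50, 0.90),
    "definition C": (0.80, 1.20),
}
labels = {name: classify_filtering_effect(*pair) for name, pair in examples.items()}
```

Under this rule a definition counts as a significant reduction only when the non-spatial residuals were significantly autocorrelated to begin with and the filtered residuals no longer are.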

Table 5 Moran's I test and AIC results for ESF models and the non-spatial model based on model residuals

For the topological distance definitions, the model residuals of the ESF models are significantly spatially autocorrelated (|z| > 1.96) under 6 neighborhood definitions (3-km, 4-km, 5-km, 7-km, 8-km, and 9-km). Similarly, there is an overall reduction in |z|, except for two definitions (1-km and 2-km). The AIC scores follow the same pattern as for the Euclidean distance definitions, decreasing until 10 km. We identify 6 models (3-km, 4-km, 5-km, 7-km, 8-km, and 9-km) where there is a significant reduction in |z|.

Finally, we conduct Moran's I calculations on each non-binary independent variable at each of the ten neighborhood definitions (Moran's I calculation is not appropriate for binary variables). While age and education are not spatially autocorrelated under any neighborhood definition, the other statistically significant variables (number of working-age adults, area farmed, and household size) are highly spatially autocorrelated (|z| > 1.96). Again, by comparing the change in the p-values of their coefficients, we can see the effect of incorporating ESF on specific spatially autocorrelated variables. As discussed above, the number of laborers and area farmed are two independent variables that undergo significant changes in p-value. Both variables have larger z-scores from the Moran's I statistics, suggesting that they are spatially autocorrelated. The other spatially autocorrelated variable, household size, however, does not experience a significant change after the application of ESF, and is insignificant in every model anyway. Interestingly, one variable that is not spatially autocorrelated, education, also changes its significance level in the spatial models (Table 6).

Table 6 Moran's I results based on individual variable values at different neighborhood definitions

5 Discussion

5.1 Effects of ESFs on Cox Model

While the multilevel Cox model with dummy variables does not capture significant neighborhood effects, the Moran's I results show that spatial autocorrelation is still present in the model residuals under several neighborhood definitions (Table 5). The application of ESF significantly reduces this spatial autocorrelation in model residuals, resulting in changes in the significance levels of three variables (education, number of laborers, and area of farmland). The Moran's I tests on individual variables also confirm spatial autocorrelation in some of the variables.

Education is weakly linked to migration in the non-spatial model (0.05 < p < 0.1), perhaps due to limited educational opportunities beyond primary school in the rural study area; migrants usually leave to seek low-skilled work. Adding eigenvectors to the model consistently increases the p values under all neighborhood definitions. With the exception of Models 4 and 6, the p values rise from below 0.1 to above 0.1, indicating that the impact of neighborhood is significant for this variable (Table 4). The Moran's I test on the education variable, however, does not indicate spatial autocorrelation at any neighborhood definition (Table 6). The change in the significance level of education is therefore likely due to the change in significance of other variables that are somewhat correlated with level of education. This may also be due to the size of the dataset, which might be too small to capture its neighborhood effect.

At the household level, our results show that after the application of ESF, the farm area variable becomes more significant while the number of laborers in the household becomes less significant. Their significance levels likely change because of the high spatial autocorrelation in the data (Table 5); once the appropriate ESF is used to filter out the distorting effects of spatial autocorrelation, the "hidden" impacts of these variables on migration are recovered. Farm area has a more significant, positive impact on migration decisions (coefficient p-values decrease in all ESF models, falling below 0.05 in Models 1, 5, and 6), while the number of laborers appears to have only mild significance.

The positive relationship between farmland area and out-migration may seem contradictory, since in other studies on developing countries, access to land, either cultivated or non-cultivated (forest in this case), tends to be a key factor that reduces out-migration (e.g., Shaw, 1975; Bilsborrow et al., 1984, Chapters 2, 10; Massey et al., 1993) as having more land provides more opportunities to engage in agricultural production. But there are cases in which the ownership of more household assets, including agricultural land, facilitates out-migration (Bilsborrow et al., 1987; Davis et al., 2016). First, households with more farmland are likely to have more household income from land. In the case of China, they have an additional advantage: they are more likely to have land to enroll in the GTGP, which provides a modest cash compensation each year and could help fund out-migration to an urban destination (Davis et al., 2016; Yost et al., 2020). Apart from the household's capacity for funding out-migration, a person's willingness to migrate also relates to the household's food security in his/her absence. Households with more farmland and thus more grain production are more likely to have enough land and crops to meet their basic subsistence needs even after GTGP enrollment of land and out-migration. This is because farmers may intensify agricultural production on the remaining land, made possible by remittances from the outmigrants which can be used to purchase better farming equipment (Xu et al., 2006); in the FNNR, the labor-to-farmland ratio is indeed high. Finally, greater availability of farmland is often associated with remote, poorer places, where farmers are more likely to leave for higher-paying jobs (Zhang et al., 2018).

Interestingly, the number of laborers in the household becomes insignificant when ESF is applied, which might also arise from spatial autocorrelation in the data (Table 5). Given the one-child policy in China over the past four decades (terminated in 2016), the numbers of children in rural households are very similar, rarely differing from one or two, making the variable relatively stable (mean = 2.13; max = 6; min = 0; standard deviation = 0.96; Table 1). Such low variation in the number of laborers makes it difficult to detect any effects of the variable: while there is still the expected positive relationship between the number of laborers in the household prior to migration and out-migration (even after the ESFs are applied), p-values are consistently lower than 0.1.

5.2 Euclidean and Topological Distance Definitions

Comparing the Moran's I results of model residuals at the same distance under the Euclidean and topological neighborhood definitions, we can see that the topological distance models have lower |z| values from 3 to 10 km (Table 5; e.g., at 3 km, model 1 (Euclidean) has a |z| of 1.589 while model 5 (topological) has a |z| of 1.396). The topological models also reduce the level of spatial autocorrelation better than the Euclidean models: there are two additional distances (5 km and 9 km) where |z| values become significantly lower in the ESF models using topological distances. In Table 4, we can also see that the changes in the p-values of variables are slightly larger in the topological models (e.g., for area of farmland at the 4-km definition, the variable has a p-value of 0.50 in model 2 (Euclidean) and a p-value of 0.064 in model 6 (topological)). This shows that the topological models are better at detecting and eliminating neighborhood effects. Although the topological models do not identify additional variables that undergo significant changes after the ESF process, they do generally highlight the strength of neighborhood effects in the dataset. Therefore, the topological models appear better suited for our study area, where elevation differences are substantial (elevation ranges from 484 to 1632 m for the data points). This is especially true for larger distances because the difference between the Euclidean distance and the topological distance is small when the households are very close, i.e., only 0.2 or 0.5 km away from each other. But when the distance becomes larger, it can involve a significant hike up or down a mountain.
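The intuition that elevation matters more at larger separations can be illustrated with a rough sketch. It does not reproduce the topological-distance calculation used in this study; as a loudly labeled simplification, it substitutes the straight-line three-dimensional distance (planar distance combined with elevation difference) as a crude stand-in. The three household locations are hypothetical, with elevations chosen within the 484 to 1632 m range noted above.

```python
import numpy as np

def planar_distance(xy_km):
    """Pairwise Euclidean distance in the horizontal plane (km)."""
    diff = xy_km[:, None, :] - xy_km[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def elevation_adjusted_distance(xy_km, elev_m):
    """Straight-line 3D distance (km); an assumed stand-in for terrain distance."""
    horiz = planar_distance(xy_km)
    dz_km = (elev_m[:, None] - elev_m[None, :]) / 1000.0
    return np.sqrt(horiz ** 2 + dz_km ** 2)

# Hypothetical households: A and B are close; C is 3 km away and far uphill.
xy = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 0.0]])
elev = np.array([484.0, 500.0, 1632.0])

planar = planar_distance(xy)
adjusted = elevation_adjusted_distance(xy, elev)

# Neighbor sets under a 3-km threshold differ between the two definitions.
planar_neighbors = planar <= 3.0
adjusted_neighbors = adjusted <= 3.0
```

In this toy case, A and C are neighbors under the planar 3-km definition but not under the elevation-adjusted one, while the two definitions are nearly identical for the close pair A and B. A true path-based terrain distance would typically be longer still than this straight-line approximation.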

For both the Euclidean and topological models, greater neighborhood effects are detected using the longer-distance definitions, being especially significant for 3 km and 4 km. These distances are roughly consistent with the area of an administrative village, which may implement local policies (e.g., land or forest management) or carry out agricultural education programs, making households within the area more similar to one another. But there are also likely to be neighborhood effects that exist even outside the shared administrative boundaries; e.g., people near such boundaries may interact with one another or share natural resource conditions, but will still be classified into separate administrative units. Therefore, dummy variables for the administrative village will not fully capture all neighborhood effects. Even though there is a significant reduction in spatial autocorrelation in model residuals at longer distances (7 km, 8 km, and 9 km) as well, definitions at those distances do not show the same changes in significance levels of independent variables as they do at 3 km and 4 km. As the distance becomes larger, a large portion of the dataset becomes categorized as neighbors, so the ESF method becomes less effective in reducing spatial autocorrelation. Even though our results demonstrate that neighborhood effects are more statistically significant under the longer-distance definitions, this does not explain why some distances capture neighborhood effects better than others. For example, in the Euclidean definitions, the model residuals are spatially autocorrelated at 3 km, 4 km, 6 km, 7 km, and 8 km, but not at 5 km (Table 5). Such inconsistencies could result from our relatively small sample size or a few outliers (e.g., extreme values) in the data, but our results still demonstrate how a data mining component can be useful to detect neighborhood effects and refine prior theorization.
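The point that larger thresholds classify a growing portion of the dataset as neighbors can be checked with a short sketch. The coordinates below are randomly generated within a hypothetical 10 km by 10 km extent and do not represent the survey households; the function name `neighbor_share` is likewise an illustrative assumption.

```python
import numpy as np

def neighbor_share(coords_km, thresholds_km):
    """Share of all household pairs classified as neighbors at each threshold."""
    diff = coords_km[:, None, :] - coords_km[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(coords_km), k=1)  # count each pair once
    pair_dist = dist[upper]
    return {d: float((pair_dist <= d).mean()) for d in thresholds_km}

# Hypothetical locations: 200 households in a 10 km x 10 km area.
rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 10.0, size=(200, 2))

shares = neighbor_share(coords, thresholds_km=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15])
for d, share in shares.items():
    print(f"{d:>2} km: {share:.1%} of pairs are neighbors")
```

As the threshold approaches the extent of the study area, the weights matrix approaches one in which every household neighbors every other, leaving little spatial contrast for the eigenvectors to represent.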

6 Conclusions

The neighborhood context is recognized as a potentially important predictor of individual-level behaviors as well as socio-demographic outcomes for individuals and for origin households (Bilsborrow et al., 1984; Hawley, 1950; Lee & Cubbin, 2002; Pickett & Pearl, 2001; Sampson, 2003). In addition to sharing similar socioeconomic characteristics, individual agents (e.g., persons or households) in the same neighborhood or community interact closely with each other, and thus tend to have many similarities in behaviors, values, and decision-making processes.

Nonetheless, identifying what constitutes a "neighborhood" has proved to be a challenge, both in theory and in practice. Due to difficulties in defining "neighborhood" and the cost of collecting detailed substantive community-level data (e.g., from interviewing community-level leaders directly to seek specific data on population size and characteristics, presence of infrastructure of various types, transportation facilities, wage and price levels, etc.; see Bilsborrow et al., 1984 regarding migration), researchers often resort to using existing administrative or political boundaries, creating artificial sampling clusters (grouping data from natural villages into artificial clusters), or otherwise creating "neighborhoods" arbitrarily, rather than on substantive or environmental grounds. Such boundaries can divide communities that function as one: for example, the boundary between administrative districts may pass through the middle of a valley where inhabitants of villages naturally interact regularly. Our study experimented, in one rural area of China, with defining "neighborhoods" according to Euclidean distance or topological distance using various neighborhood sizes, and finds that, in rural areas of China at least, it is only at certain sizes that we can observe significant neighborhood effects, i.e., differences in results compared to those of non-spatial models. Although the neighborhood sizes used (3 km and 4 km, in our case) might correspond to administrative units such as the village, simply adding dummy variables for each village in a multilevel model cannot necessarily capture these effects. Thus, relationships may remain hidden in the non-spatial and multilevel models, or may be observed where they do not exist. Further study is needed to better understand the relationship between a "true" neighborhood size and the sizes identified by our data mining approach.

It is important to recognize that the neighborhood effects observed here are not very significant, as indicated by the consistency of most coefficients' significance levels (p-values; Table 4) in the spatial and non-spatial models. The sizes of the coefficients of all variables also remain similar after ESF is applied. The Moran's I and |z| scores of the regression residuals of the spatial models (with ESF), however, do undergo significant changes under some of the neighborhood definitions after eigenvector filters are applied. By calculating Moran's I statistics for the independent variables, we also find that several appear to have spatial autocorrelation. Thus, we select ESF models to test for spatial autocorrelation based on model residuals from the non-spatial models. The results identify three out of five non-binary variables as spatially autocorrelated, two of which (number of laborers and area farmed) undergo changes in significance level after the spatial filtering is implemented, although no such change is observed in the other spatially autocorrelated variable. Household size, for example, is spatially autocorrelated (|z| scores > 1.96) under all four selected neighborhood definitions, but shows little change in significance level from incorporating ESF. In addition, a variable (education) with little spatial autocorrelation changes its significance level after ESF is applied (i.e., loses its statistical power). This may mean that the population diversity of our data set is not sufficient to capture neighborhood effects well, due to the small area and fairly homogeneous population.

However, our results do suggest that employing appropriate modeling methods and including tests on the effects of various measures of neighborhood size can help to identify and capture the impacts of "neighborhood" and thereby generate more accurate estimates of the coefficients of variables that are often subject to neighborhood effects. At the same time, the methods demonstrated here will often, at minimum, increase the accuracy of measurement of effects of some variables (e.g., at the individual or household levels) even if they are not directly subject to statistically significant neighborhood effects. As shown in many previous studies, spatial correlation among variables is often present and should not be ignored. After including the spatial filters (i.e., eigenvectors), the independent variable household farmland area changes from being statistically insignificant to significant under several definitions of neighborhood distance, while the number of household members of working age changes from significant to insignificant. It is important to note that when a data-mining approach is used, neighborhood effects (e.g., the distances at which they are most prominent, and their levels of significance) can vary depending on the dependent variable and model in question. For example, Zhang et al. (2021) detected neighborhood effects at 0.002 km, 0.1 km, and 0.5 km in their model of household participation in PES programs using the same data. Both analyses illustrate that ignoring neighborhood effects can lead to misleading and even incorrect conclusions (An et al., 2016; Sullivan et al., 2017). The difference in neighborhood effects between the two models also suggests that different distances should be tested in different models, even with the same dataset.

Finally, topological-distance-based neighborhood definitions might generate even more accurate results for areas with large elevation variations, such as our study site and many other rural sites around the world. Definitions of neighborhood generated with eigenvector filtering may be more relevant than, or at least complementary to, those captured by traditional approaches, e.g., using dummy variables for each community or cluster controls. Therefore, we recommend that the ESF approach be tested in other geographic settings, especially larger or more diverse ones, to better determine an appropriate size for "neighborhood" in different contexts, according to the particular topic or decision process being studied. That size can then be incorporated in the model to correct for spatial autocorrelation and thereby lead to better results in investigations of factors influencing people's decision-making and behavior. Replicating this method in many other contexts and for different variables can further enhance understanding of what an appropriate "neighborhood" is for all manner of variables of human behavior, thereby facilitating the use of ESF or the other methods described here to effectively control for neighborhood effects when trying to understand human behavior.