Introduction

The spatial distribution of the human population is a classic and enduring topic in population geography. It was systematically studied by Friedrich Ratzel as early as 1890, and it has been one of the key fields within population geography for over a century (Hu 1935; Mera 1977; Bedford 1999; Beeson et al. 2001; Voss 2007; Feng and Li 2011; Matthews and Parker 2013; Zhu et al. 2016). In earlier studies, many spatial and statistical methods were employed to analyze population distributions, such as population centers of gravity, Lorenz curves, spatial autocorrelation analysis and multivariate regression (Chi and Zhu 2007). Recent studies have focused on the spatial mechanisms and logic behind population distribution by integrating statistical and GIS spatial analyses (Zhu et al. 2016). Meanwhile, advanced GIS techniques are used to simulate population distribution and thereby overcome the spatiotemporal resolution limitations of population census data by drawing on various data sources, such as remote sensing data (Lung et al. 2013), land use data (Luo and Wei 2006; Gallego 2010), and mobile phone data (Deville et al. 2014).

Population distribution in China is particularly important because it is the world’s most populous country, having a vast territory and huge regional differences. China is undergoing the largest migration in human history, with the population flocking to developed areas in the east of the country (Fan 2008; Ye Liu et al. 2014). Revealing the cause of population distribution and determining the dominant factors that promote or inhibit population growth represent foundational undertakings that provide a scientific basis for creating population policies that promote orderly immigration and optimize population distribution.

The past two decades have witnessed the rapid development of GIS technology and the implementation of China’s population census, both of which have provided technical and data support for quantitative research on population distribution and its influencing factors. Many excellent studies have been conducted (Table 1). Some researchers (Dong et al. 2002; Chen et al. 2007; Feng et al. 2008; Y. Fang et al. 2012; Ying Liu et al. 2015) have focused on natural influencing factors. They found that population density was positively correlated with water sources, precipitation, temperature and net primary productivity (NPP) but was negatively correlated with the relief degree of land surface (RDLS), elevation, and distance to rivers and coastlines. Other researchers (Song et al. 2007; Lv et al. 2009; J. Wang et al. 2012; Bai et al. 2015; L. Wang et al. 2015) have examined both natural and socioeconomic factors. Their findings indicated that the influence of socioeconomic factors cannot be ignored. Due to the complexity of these factors, researchers may choose different indicators for different study areas and scales. However, it is commonly believed that complex topography and harsh climate hinder population agglomeration, and that people prefer to live in places with long development histories, advanced economies, and good public services and infrastructure. These findings informed the selection of influencing factors in our paper.

Table 1 Summary of the factors influencing China’s human population distribution

Most previous studies (Chen et al. 2007; Song et al. 2007; J. Wang et al. 2012) have used traditional global regression models to test the relationship between population distribution and influencing factors. A global regression model assumes that the relationships between the variables are homogeneous across space. However, spatial dependences are often not homogeneous across large geographical regions (Matthews and Parker 2013). For example, water resources are a particularly important influencing factor for population distribution in arid areas, but may not be the main factor in humid areas. To capture these differences, some researchers (Bai et al. 2015; L. Wang et al. 2015) divided their study areas into several sub-districts and then fitted a traditional global regression model in each sub-district. This approach helps, but it also isolates each sub-district: the interaction between samples in different sub-districts is neglected, even when the samples are geographically adjacent.

To overcome this weakness, we employed the Geographically Weighted Regression (GWR) model to explore the spatially varying relationships between human population distribution and potential influencing factors in mainland China at the prefectural level. The GWR model is a local spatial statistical technique for exploring spatial nonstationarity (Brunsdon et al. 1996, 1998) and is widely used in spatial statistical analyses of land and house prices (Dziauddin et al. 2015; S. Li et al. 2016), migration (Helbich and Leitner 2009; Jivraj et al. 2013; Villarraga et al. 2014), climate change and carbon emissions (Samson et al. 2011; S. Wang et al. 2014) and health and epidemiology (Black 2014). Thus, the GWR is expected to be used as a powerful tool for identifying the factors influencing China’s human population distribution and spatial heterogeneity.

The rest of the study is structured as follows: We first introduce the data and methods. The results indicate that road density, GDP, temperature and arable land proportion are identified as the key factors influencing population distribution and that the influence of each factor varies in different regions. We argue that regional population and development policies should be made according to the specific factors influencing the population distribution in each region. Finally, we discuss the meaning of our findings for policymaking as well as the limitations in this paper.

Data and Methods

Data

Population density, which is the population size divided by the land area in a region, is used as the dependent variable. Due to China’s large population and area, the administrative divisions, including the provincial, prefecture, county and township levels, are complicated. The prefecture level was chosen as the analysis scale of this study because it shows more detailed differences than the provincial level and because data at this level are accessible. In total, 342 prefecture-level administrative districts in mainland China were studied, including 283 prefecture-level cities, 17 prefectures, 30 autonomous prefectures, 3 leagues, 4 municipalities, and 5 regions directly governed by provinces. Permanent population data were derived from the 2010 Population Census of the PRC. The spatial distribution of population density in mainland China in 2010 is presented in Fig. 1.

Fig. 1

Population densities of various prefectures in mainland China in 2010. Note: The Hu line was proposed by the Chinese demographer Hu Huanyong in 1935 and stretches from the city of Heihe to Tengchong County. At that time, land to the east of this line contained 96% of the population, and land to the west contained the remaining 4%. This marks a striking difference in the distribution of China’s population at that time, a pattern that still remains after nearly 70 years (Ge and Feng 2008).

Twelve explanatory variables were selected (Table 2) according to previous studies and on the basis of their availability (Chen et al. 2007; Song et al. 2007; Lv et al. 2009; L. Wang et al. 2015). Elevation and terrain slope were selected to indicate topographical factors. Similarly, precipitation and temperature were chosen to indicate climatic factors. As an agriculture-dominated country throughout history, China has witnessed a population agglomeration process accompanied by the process of land reclamation; accordingly, arable land proportion was chosen to indicate the development history. Gross Domestic Product (GDP), per capita GDP and per capita income are used to indicate both the size and level of the economy. The public services and infrastructure in China are mainly provided by the government. Therefore, the local government general budgetary expenditure (LGGBE) was chosen to indicate the general level of public services and infrastructure. In addition, educational resources, medical and health services and road density were considered separately. Both the dependent and independent variables were normalized using the z-score method so that values measured in different units could be compared.
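The z-score normalization mentioned above can be sketched as follows. This is a minimal illustration, not the authors' processing script: the function name `zscore`, the use of the population standard deviation (`ddof=0`) and the sample values are all assumptions, since the paper does not state these details.

```python
import numpy as np

def zscore(values):
    """Standardize a 1-D array to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    # Assumption: population standard deviation (ddof=0); the paper does
    # not say whether the sample or population form was used.
    std = values.std(ddof=0)
    if std == 0:
        raise ValueError("z-scores are undefined for a constant variable")
    return (values - values.mean()) / std

# Hypothetical values for one variable, not data from this study.
gdp = np.array([120.0, 340.0, 95.0, 860.0, 410.0])
gdp_z = zscore(gdp)
```

After this transformation each variable is expressed in standard-deviation units, so regression coefficients of variables with different original units can be compared directly.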

Table 2 Potential factors influencing population density and their data sources

Methods

The Pearson’s correlation coefficients between the dependent variable and independent variables were calculated. The OLS (ordinary least squares) regression was then conducted to investigate the influence of the 12 independent variables. The stepwise method was conducted in IBM SPSS Statistics 19.0. Considering the classical theory of demographic geography and the results of previous empirical research, the model’s suggestions were checked and modified. Finally, the GWR model was used in ESRI ArcGIS 9.3 to explore the spatially varying relationships between population density and the potential influencing factors. As mentioned above, the GWR model improves on the OLS regression model by taking spatial structure into account (Brunsdon et al. 1996, 1998; Fotheringham et al. 1998). The model can be expressed as follows.

$$ y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i $$
(1)

where

$y_i$ is the population density of the $i$th region;

$(u_i, v_i)$ are the central spatial coordinates of the $i$th region;

$\beta_0(u_i, v_i)$ is the local estimated intercept for the $i$th region;

$\beta_k(u_i, v_i)$ is the local effect of the $k$th independent variable for the $i$th region;

$x_{ik}$ is the value of the $k$th explanatory variable, associated with $\beta_k$, in the $i$th region;

$\varepsilon_i$ is a random component assumed to be independently and identically distributed.
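To make the local nature of the model concrete, the sketch below estimates the coefficients of Eq. (1) at a single region by weighted least squares, with weights that decay with distance from that region. It is an illustration under stated assumptions, not the ArcGIS implementation used in this paper: the Gaussian kernel, the function name `gwr_local_fit` and the synthetic data are all hypothetical.

```python
import numpy as np

def gwr_local_fit(coords, X, y, i, bandwidth):
    """Estimate local coefficients at region i by weighted least squares.

    Assumption: Gaussian kernel, w_ij = exp(-0.5 * (d_ij / bandwidth)**2),
    where d_ij is the distance between regions i and j.
    Returns [intercept, beta_1, ..., beta_k] for region i.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    Xd = np.column_stack([np.ones(len(y)), X])  # add intercept column
    XtW = Xd.T * w                              # equals X^T W for diagonal W
    return np.linalg.solve(XtW @ Xd, XtW @ y)

# Synthetic data: the true slope rises from about 1 in the west to 3 in the east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(200, 2))
x = rng.normal(size=200)
true_slope = 1.0 + coords[:, 0] / 50.0
y = true_slope * x + rng.normal(scale=0.1, size=200)

west = int(np.argmin(coords[:, 0]))
east = int(np.argmax(coords[:, 0]))
b_west = gwr_local_fit(coords, x[:, None], y, west, bandwidth=15.0)
b_east = gwr_local_fit(coords, x[:, None], y, east, bandwidth=15.0)
```

With a small bandwidth the estimated slope differs between the western and eastern regions, whereas a very large bandwidth gives every region nearly equal weight and reproduces the global OLS estimate.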

Bandwidth is an important parameter of the GWR model. In this paper, the corrected Akaike Information Criterion (AICc) method is adopted to determine the optimal bandwidth (Fotheringham et al. 2002). The Akaike Information Criterion (AIC) (Akaike 1974) was used to compare OLS with GWR. If the AIC values of the two models differ by more than 3, the model with the smaller value is considered to provide a significantly better fit. Moran’s I is used to examine the spatial autocorrelation of the standardized residuals. A Moran’s I close to zero means that the residuals are randomly distributed in space, which indicates that the model fits well.
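As an illustration of the residual check described above, the following sketch computes global Moran’s I for a set of values and a spatial weights matrix. The binary adjacency weights and the six-region chain are assumptions made for the example; the paper does not state which spatial weights were used.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z the centered values."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    W = np.asarray(weights, dtype=float)
    n = z.size
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Hypothetical layout: six regions in a row, each adjacent to its neighbors.
n = 6
W = np.zeros((n, n))
for k in range(n - 1):
    W[k, k + 1] = W[k + 1, k] = 1.0

clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # neighbors always differ

i_clustered = morans_i(clustered, W)      # positive (0.6)
i_alternating = morans_i(alternating, W)  # negative (about -1.0)
```

Residuals that behave like the clustered case would suggest that the model is missing spatial structure, while a value near zero is consistent with spatially random residuals.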

Results

All 12 independent variables were significantly correlated with population density (Table 3). Of these, the correlation coefficients for elevation and terrain slope were negative, while those of the other 10 factors were positive. Those factors with correlation coefficients larger than 0.4 were all socioeconomic factors, such as the road density, GDP, per capita income, LGGBE, and medical and health services. In addition, the correlation coefficients among the independent variables were significant, and some were even larger than 0.8, indicating that multicollinearity should be considered in the next step. We tested for multicollinearity utilizing the variance inflation factor (VIF) (Menard 2002).

Table 3 The Pearson’s correlation coefficients between the dependent variable and 12 independent variables

The stepwise OLS tests suggested that road density, LGGBE, temperature and arable land proportion were the key factors influencing population distribution. As the classical theory of population migration and many related empirical studies have shown, the major causes of migration are economic (Ravenstein 1885, 1889; Lee 1966; Grigg 1977). GDP, which is the most basic and commonly used economic indicator in empirical studies (Fan 2005; C. Liu et al. 2007; L. Li and Clarke 2012; Jiao et al. 2016), therefore deserves particular attention in this paper. However, the Pearson correlation between GDP and LGGBE was 0.913, which means that these two indicators contain much of the same information. To prevent serious bias in the model estimates caused by multicollinearity, we eventually replaced LGGBE with GDP. The other selected independent variables were also checked. For example, road density has the strongest Pearson correlation with population density in this paper. Some studies have also shown that improved transport facilities can reduce travel costs and bring convenience to residents (Kotavaara et al. 2011; T. Li et al. 2012), so road density plays a very important role in population agglomeration. Finally, road density, GDP, temperature and arable land proportion were chosen as the 4 key factors influencing population distribution. The highest VIF value was 2.10, which was well below the common cut-off point of 10 (Menard 2002), indicating that multicollinearity was not biasing the OLS estimations.
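The VIF check can be illustrated with a short sketch. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The function name `vif` and the synthetic predictors below are assumptions for illustration and do not reproduce the data of this study.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: b nearly duplicates a, while c is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vif_all = vif(np.column_stack([a, b, c]))      # a and b are strongly inflated
vif_reduced = vif(np.column_stack([a, c]))     # dropping b removes the problem
```

Dropping one of two near-duplicate predictors, as was done here when LGGBE was replaced with GDP, brings the remaining VIF values back toward 1.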

The GWR analysis showed that the bandwidth was 1205 km, the adjusted R² was 0.788 (larger than OLS’s 0.753), the AICc was 459 (lower than OLS’s 500) and Moran’s I of the standardized residuals was 0.010 (nearly zero), indicating that the GWR had an improved model fit. The regression coefficients from the GWR analysis are summarized in Table 4, and the visualized maps are shown in Fig. 2. The solid white areas indicate where the parameter estimates were not significant at the 95% level according to their t-values. The shaded areas display where the spatially varying effects were significant. It was found that the influence of each factor on the population distribution varied in different regions. The mean regression coefficient between population density and road density was the largest, and the regional coefficients were highest in Southwest China and lower in Northeast China (Fig. 2a). The regression coefficients between population density and GDP decreased from Southeast to Northwest China (Fig. 2b). The regression coefficients between population density and temperature were higher in southeastern coastal areas but lower in inland areas of China (Fig. 2c). The mean regression coefficient between population density and the arable land proportion was the lowest, and the coefficients in Xinjiang (short for Xinjiang Uyghur Autonomous Region) in Northwest China were the most significant (Fig. 2d). The local r-square values of the GWR model are mapped in Fig. 3 and range from 0.717 to 0.960. Hence, the model fits well in most parts of the country.
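The model comparison above follows the rule that a difference of more than 3 units in the information criterion indicates a significantly better fit. A minimal sketch of that rule, applied to the AICc values reported in the text, is given below; the function name `compare_aicc` and its `threshold` parameter are illustrative assumptions.

```python
def compare_aicc(aicc_ols, aicc_gwr, threshold=3.0):
    """Return the preferred model under the 'more than 3 units smaller' rule."""
    diff = aicc_ols - aicc_gwr
    if diff > threshold:
        return "GWR"
    if diff < -threshold:
        return "OLS"
    return "no clear difference"

# AICc values reported in the text: OLS 500, GWR 459.
preferred = compare_aicc(500, 459)
print(preferred, 500 - 459)  # prints "GWR 41"
```

The difference of 41 units is far above the threshold of 3, which is why the GWR model is regarded as the better-fitting one.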

Table 4 Comparison of parameter estimates from global (OLS) and local (GWR) models
Fig. 2

The GWR regression coefficients between population density and (a) road density, (b) GDP, (c) temperature, and (d) arable land proportion. Note: Shaded areas are significant at the 95% level

Fig. 3

Local r-square values of GWR model

Discussion and Conclusion

Explaining Spatially Varying Population Densities

The abovementioned results showed that road density, GDP, temperature and arable land proportion were the 4 key factors influencing population distribution, and that socioeconomic factors influenced population density more significantly than natural ones. This result can also be obtained by traditional global regression methods and is roughly the same as in many other studies (Lv et al. 2009; L. Wang et al. 2015). However, few studies have estimated the spatial heterogeneity of these influences.

Complex terrain and a lack of transportation infrastructure in Southwest China seriously restrict economic development and population agglomeration in this region. This explains why the regression coefficients between population density and road density were the highest there. In contrast, population density was not very sensitive to road density in Northern and Eastern China due to better transportation infrastructure. Y. Zhang and Ren (2012) and Tang et al. (2015) have also found that in mountainous areas in China, the better the transportation infrastructure was, the less it influenced population distribution. That is, the marginal influence of road density on population density diminishes as road density increases.

The regression coefficients of GDP were almost all positive and were higher in southeastern coastal areas than inland. Southeastern coastal areas are among the more developed areas in China, where there are more employment opportunities, higher wages and better living standards, attracting a large number of migrants from the interior. This result conformed to Ravenstein’s classic population migration theory and was also consistent with many scholars’ research results. For example, Fan (2002) documented the rising population in the eastern region as a result of migration in response to the widening economic gap between coastal and interior China.

The regression coefficients of temperature were almost all positive and decreased from southeast to northwest. This finding was consistent with Samson et al. (2011), who developed the first global index to predict impacts of climate change on human population density using a GWR model, and found that human population density tended to be positively related to average annual temperature in the high latitudes of the world. However, their research scope was global and their model bandwidth was nearly 3 times that of ours, so their results did not show the detailed differences that were apparent from our model. In addition, as mentioned above, China has been an agriculture-dominated country throughout history. Agricultural production efficiency is an important factor in population agglomeration. Southeast China lies in the subtropical monsoon climate zone, and the abundant rainfall and sufficient heat there are favorable to agricultural production. Therefore, the influence of temperature is significant. Northwest China is arid, with annual precipitation mostly below 500 mm. The higher the temperature, the greater the evaporation, which aggravates drought and reduces the suitability for human habitation. Northeast China has a cold climate that is unfavorable for settlement, but there are abundant forest resources. To exploit these resources, some forestry cities like Yichun have been established and have attracted a certain amount of population influx (J. Zhao 1992). Southwestern China is hot and humid. Malaria, which is transmitted through mosquito bites, is strongly determined by temperature (X. Zhao et al. 2014) and to some extent hinders the immigration of Han Chinese and affects the population distribution patterns of local ethnic minorities (Cang 2004). Thus, the effects of temperature are not very significant in these areas.

The significant regression coefficient between population density and arable land proportion in Xinjiang (in the extreme northwest) might be due to the Xinjiang Production and Construction Corps (XPCC) project, a special frontier migration project aimed at cultivating wilderness and guarding the frontier (Cappelletti 2015). The XPCC has reclaimed more than 10,000 km² of fertile land since 1954, and now has 176 farming and herding regiments with 2.5 million people (almost 10% of the population of Xinjiang). Therefore, population distribution is highly correlated with the proportion of arable land in Xinjiang.

Implications for Population Policy Making

One of the Chinese central government’s long-term population development strategies is to promote orderly migration, achieve a reasonable population distribution, and maintain a balance between population, social development, resources and environmental protection (State council of the PRC 2000). The Hu line still remains, and high population density areas are concentrated mainly in the North China Plain, the Yangtze River Delta, the Pearl River Delta and Sichuan Basin. We argue there is no “one size fits all” policy. Regional population and development policies should be made according to the specific factors influencing the population distribution in each region. Our results provide some useful insights:

First, different regions have different driving factors that promote or restrict population agglomeration. Therefore, suitable regional population development policies should be made according to different local conditions. For example, poor transport infrastructure is one of the main limiting factors for population agglomeration in Southwest China. Therefore, accelerating the construction of transport infrastructure is key to enhancing population agglomeration in these areas.

Second, the regression coefficients of GDP were higher in the southeastern coastal areas than inland. This means that, even if the speed of economic growth in the coastal and inland areas were equivalent, the attraction of migrants to coastal areas will be higher than that of the inland areas. That is, the population will continue migrating to coastal areas inertially. To reverse the trend of people flocking to the eastern coastal cities and build a population-balanced society, China must promote the Great Western Development Strategy and speed up the economic development of Midwest China.

Third, food transport has historically cost a great deal; therefore, local land reclamation is an important measure to promote the development and prosperity of a region. However, this relationship has been gradually weakened with the development of food production, storage and transport technology. Our results show that, except in Northwest China, the regression coefficients of arable land proportion are lower. This reminds us to take a new look at the relationship between agricultural development policies and population agglomeration at present and in the future.

We also found an interesting phenomenon: As the capital city, Beijing’s GDP ranks second among Chinese cities, only behind Shanghai. However, Beijing’s GDP regression coefficient was not particularly high (Fig. 2b). As mentioned above, the GWR model performs local regressions considering not only the region itself but also the surrounding regions, and the closer the distance the greater the weight. Unlike urban agglomerations in the Yangtze River Delta and Pearl River Delta, the cities in Beijing-Tianjin-Hebei region are polarized (Sun and Yuan 2014). Except for Beijing and Tianjin, other cities in Hebei province are relatively small. Therefore, Beijing’s value is lowered by its surrounding cities. Currently, Beijing is suffering a series of “big city diseases”, such as population over-expansion, high housing prices, traffic congestion and so on (Xinhua 2015). Our results, from one aspect, demonstrate the importance of regionally coordinated development and support the execution of the Beijing-Tianjin-Hebei coordinated development strategy.

Study Limitations

While this study has several strengths, it is not without limitations. The GWR model is a useful technique for exploring the spatially varying relationships between variables, and we successfully revealed the spatial heterogeneity of these relationships. However, this method alone cannot establish strict causality between the independent variables and the dependent variable. For example, one could ask whether population agglomeration caused the increase in road density, or the increase in road density caused the population agglomeration. Though we cited a number of previous researchers’ findings to support the explanation of our results, more work is needed to reveal the mechanisms through which these factors affect population distribution.

Moreover, there are some limitations in the data. Owing to data accessibility, prefecture-level units were used in our final analysis. However, population distribution is scale-dependent, and the results may differ somewhat at other scales. Additionally, there are potential concerns about the selection of independent variables. For example, we cannot directly quantify the level of public services and infrastructure, so we elected to use LGGBE as a proxy instead.

Third, in a country like China where the government has strong control over resource allocation and macro-economic dynamics, the effects of policy on population distribution cannot be ignored. For example, the household registration system restricts population migration (C. Fang 1995; Y. Wang 2013), and the family planning policy actually varies between regions and ethnic groups (Xu 2010; Z. Zhang 2016; J. Zhang 2017). In addition, some historical “movements” or “national development strategies” may also have had a great influence on population distribution. These include the “Up to the Mountains and Down to the Countryside Movement” (Glassman 1978) and the “Third Front Movement” (Meyskens 2015) during the Maoist era, when a large number of people and factories moved to Midwest China. During the Chinese Economic Reform in the post-Mao era, people returned to coastal areas. However, it was very difficult to quantify the effects of policy in this paper. A detailed analysis of these issues will be a focus of our further studies.