1 Introduction

Since the early 2000s, the number of students in Germany has increased significantly more than forecasts predicted (see Nutz 1991; KMK 2005; Gösta and von Stuckrad 2007; Wissenschaftliche Dienste des Deutschen Bundestages 2006; Multrus et al. 2017). Reasons for this are the politically desired expansion of higher education offerings, the rising high school graduation rate, the introduction of the bachelor’s/master’s system, the abolition of compulsory military service, and double high school graduation cohorts. While deviations in the predicted totals are often hard to explain because such unpredictable political decisions have large effects, it is of great importance for decision makers to forecast spatial patterns of student mobility, since basic funding is strongly related to enrolment (HMWK 2015).

Gravity models are typically used to forecast student migration (Sá et al. 2004; Alm and Winters 2009; Cooke and Boyle 2011; Faggian and Franklin 2014). However, these models have some weaknesses; for example, empirical data are needed to fit them (Viboud et al. 2006; Balcan et al. 2009; Kaluza et al. 2010; Krings et al. 2009; Simini et al. 2012). For this reason, Simini et al. (2012) developed the classical radiation model as an alternative approach to estimate mobility between two sites. The advantage of this approach is twofold: it requires only a small amount of data and it is parameter-free. In recent years, other models have taken up the approach of the classical radiation model and have developed the original idea into promising strands (Masucci et al. 2013; Yan et al. 2014; Ren et al. 2014; Kang et al. 2015; Lenormand et al. 2012; Liu and Yan 2019; among others). Nonetheless, these models often do not accurately describe mobility flows (Litmeyer et al. 2023) because, besides distance and the attractiveness of the study location (proxied by current enrolment), a variety of regional characteristics, i.e. hard and soft location factors such as employment rates (Cooke and Boyle 2011; Dotti et al. 2013), play a role in the choice of higher education location. Incorporating co-variates into regression equations usually improves model performance greatly. However, as there are typically complex non-linear relationships among the variables and among the observations in spatial interactions, a complex approach is needed to arrive at a valid goodness-of-fit with formal models.

Machine learning algorithms are capable of handling this complexity and non-linearity but are often criticized for the black-box character of the estimation procedure and, more importantly, for their limited capability to provide details on the effect sizes of individual variables. However, there have been significant improvements in providing transparent, interpretable, and explainable machine learning methods in recent years (see Miller 2019 and Roscher et al. 2020 for comprehensive reviews on the requirements for explainable machine learning). Recently, Morton et al. (2018) and Spadon et al. (2019) used the machine learning algorithm XGBoost to show for the USA and Brazil that this algorithm is particularly well suited to predicting commuting as one case of spatial interaction, and concluded that this method may also be a significant improvement for other cases of spatial interaction. Moreover, the combination with additional calculations of coefficients such as the so-called SHAP values opens a consistent way to interpret the influence of individual variables on the estimate (Morton et al. 2018; Spadon et al. 2019). In this article, the XGBoost algorithm is applied to the transition from high-school graduation to higher education, which is frequently associated with migration. To this end, we estimate the number of first-year students per German county, based on a comprehensive set of hard and soft factors of college location choice. The following questions are answered:

  • How can we employ machine learning algorithms such as the XGBoost algorithm in order to deliver comprehensible results that are transparent, interpretable and explainable, when extended with model specific and variable specific indicators and visualizations?

  • How do results compare to those presented in the literature, typically based on formal regression models, and what are additional insights into the knowledge domain of student migration from the machine learning approach, leading to an original contribution of such techniques?

The second chapter introduces the higher education landscape in the case study area of Germany. This is followed by an overview of the regional characteristics relevant to the migration of first-year students, derived from the international literature and the state of the art. The methodology is then presented, followed by the results. The article closes with a discussion and an outlook.

2 Germany as a higher education location

The German higher education system has changed considerably in the 30 years since German reunification. First, new universities were founded in the 1990s in the territories of the former German Democratic Republic (GDR; 64 in total) (Erhart 2002). The institutions established since 2000 (91 in total) are often private schools, private universities, satellite campuses, regional offshoots of existing vocational academies, or spin-offs from universities and research institutions (e.g., the Baden-Württemberg Cooperative State University or the Karlsruhe Institute of Technology (HRK 2019; KIT 2018)). In addition, the Bologna Declaration of June 13, 1999, led to the harmonization of study structures with bachelor’s and master’s degrees and thus to greater compatibility and comparability in the European Higher Education Area (EHEA 2016). Besides the changes to the degrees, other changes, both internal and external to higher education institutions, were made. Between 2006 and 2007, tuition fees were initially introduced in all western German states (except Bremen, Rhineland-Palatinate and Schleswig-Holstein) and abolished again by 2014 due to political changes and changes in government (Kauder and Potrafke 2013). The switch from a nine-year to an eight-year Abitur (KMK 2012) and the suspension of compulsory military service in 2011 (Deutscher Bundestag 2011) led to significant increases in student numbers. The average annual growth rate in the number of first-semester students (German and foreign first-semester students in all types of higher education institutions) from 2008 to 2016 was 2.82%. In total, enrolment increased from 396,800 to 509,760 during that time. The highest number of first-semester students, 518,748, was recorded in 2011 (Statistisches Bundesamt 2008–2017).
Overall, Germany is particularly well suited as a study area since universities and universities of applied sciences are (relatively) evenly distributed throughout the country for historical and political reasons. Moreover, budgets in the German higher education system are rather homogeneous compared with the strong disparities in the Anglo-American higher education system. This means that, unlike in the USA, there are no “educational deserts” in Germany (Hillman 2016) and, in the absence of tuition fees, there are no large distorting effects from fees.

3 Motives for student mobility

The most important aspect of a study location decision is the spatial proximity between origin and destination (Alm and Winters 2009; Dotti et al. 2013; Gibbons and Vignoles 2012). In general, student migration intensity decreases exponentially with increasing distance (e.g., Montgomery 2002; Sá et al. 2004; Frenette 2004, 2006; Spiess and Wrohlich 2010; Alm and Winters 2009; Dotti et al. 2013; Gibbons and Vignoles 2012). With increasing distance, emotional costs, e.g. giving up social ties, become a barrier to student mobility in addition to higher relocation and transportation costs (Winters 2011; Dotti et al. 2013, 2014). However, factors such as marketing activities of universities (Vrontis et al. 2007) or strong collaboration between universities and schools reduce the negative effects of geographical distance (Raab et al. 2018). Regions with a higher degree of urbanization and higher population density are more attractive and often draw in students (Sá et al. 2004; Cullinan and Duggan 2016; Weisser 2019). The same is true for regions with high employment rates (Cooke and Boyle 2011; Dotti et al. 2013). Furthermore, especially in agglomeration areas, financial aspects such as rent levels (Dotti et al. 2013) and future earning potential in the home and destination regions play a role in study choices (Sá et al. 2004). The direction of the impact of tuition fees, on the other hand, is not always clear and depends on the level of fees charged (Spiess and Wrohlich 2010; Dwenger et al. 2012; Dotti et al. 2013). In Italy (Ciriaci 2014), the U.S. (Cooke and Boyle 2011), and Ireland (Walsh et al. 2018), a significant impact of the quality of higher education teaching on student mobility could also be measured, while Sá et al. (2004) detected no impact for the Netherlands. The employment rate of graduates (Sá et al. 2011), faculty and student ratios (Sá et al. 2004), expenditure per student (Cullinan and Duggan 2016), research intensity (Adkisson and Peach 2008) or the place in international rankings (Ciriaci 2014) served as indicators of quality. Both the educational background of parents (Lörz 2008) and gender (Belfield and Morris 1999; Ciriaci 2014) influence student mobility. In addition, different studies suggest that potential students often migrate to student-dominated regions or regions with a high share of highly educated people due to similar lifestyles, as well as to regions with strong cultural proximity (Buenstorf et al. 2016; Haussen and Uebelmesser 2018). Finally, location- and weather-related amenities are also important in the choice of study location (Kodrzycki 2001). This multitude of indicators explaining student mobility shows that a very large number of highly individual factors can play a role in the decision-making process for and against a particular university.

The migration patterns of first-year students cannot be discussed in complete isolation from the whereabouts of students after graduation, as universities contribute greatly to regional economic activities (e.g., Kodrzycki 2001; Marinelli 2013; Dotti et al. 2013; Krabel and Flöther 2014; Kitagawa et al. 2022). In this context, universities and colleges, as centers for research and development as well as for teaching and training students, occupy a special position in the (regional) innovation system (Geissler and König 2021). On the one hand, they generate knowledge, make it available to other stakeholders and promote the development of the next generation of scientists. It has been observed for years that more and more people are pursuing a doctorate and working at universities after completing it (e.g. Briedis et al. 2014; Buenstorf et al. 2023). On the other hand, the training of students is important for the labor market. The private sector benefits from well-trained graduates as well as from the corresponding knowledge of the universities and can thus improve its innovative capacity (Fritsch and Slavtchev 2007).

Another aspect is that cooperation between private-sector companies and universities in the form of scientific publications, seminars, workshops and informal relationships has a positive influence on the transfer of academic knowledge to industry (Fritsch and Slavtchev 2007). However, academic knowledge is relatively immobile in this context, so geographical proximity and graduate ties play a vital role. This offers the advantage of directly increasing a region’s human capital endowment and thus having an impact on its innovation potential in the medium to long term.

Accordingly, the retention of graduates in a region is relevant in its own right, even before higher education policy measures such as scholarships are considered. The effectiveness of scholarships, however, is controversial (Groen 2004; Busch and Weigert 2010; Geissler and König 2021). For Germany, Busch and Weigert (2010) and Buenstorf et al. (2016) showed that more than half of graduates take up employment in the university region and the corresponding state or return to their home region.

Furthermore, it can be observed that graduates and scientists often work near their home university and that newly founded innovative companies also actively seek spatial proximity to universities. There are, however, regional differences between urban and non-urban areas as well as between fields of study (Marinelli 2013; Buenstorf et al. 2016; Kitagawa et al. 2022). Krabel and Flöther (2014) and Kitagawa et al. (2022) were able to demonstrate that urban areas or metropolitan regions have a high retention of university graduates, while rural areas are characterized by a higher mobility requirement for graduates. In non-urban areas, the establishment of a company at the university location seems to increase retention in the region. The retention rate of graduates in the natural sciences is significantly higher in urban regions. One reason for this is that labor markets in agglomeration areas increase the match between STEM graduates and STEM professions (Kitagawa et al. 2022). Krabel and Flöther (2014) and Teichert et al. (2020) were able to show that graduates are more likely to stay in the university region if they gain work and professional experience there during their studies.

This wide range of indicators explaining student mobility highlights that a large number of highly individual factors can play a role in the decision for or against a particular university.

4 Methodology

We seek to predict the number of students in any German county that hosts a higher education institution, based on the aggregate of dyadic migration decisions. In order to predict the weight of each connection more reliably, we employ three XGBoost regressors, each of which predicts the number of first-year students who migrate from their home county i, i.e. the place of high-school graduation, to the university location j.

XGBoost is a gradient boosting method that builds on decision trees. Decision trees are non-parametric and often used in supervised machine learning. Boosting combines many trees into one model and therefore belongs to the class of ensemble learning methods; a loss function is used to evaluate the gradual improvement of the predictions during the learning process. The procedure starts with an initial, simple model (a tree), which is used to predict the training data. The error of these predictions compared to the actual values is then determined by a loss function, and another tree is created to reduce these errors (gradient descent optimization). This process is repeated, and each new tree corrects the error of the preceding ones. Since the procedure contains stochastic elements (e.g. the random subsampling of features), model outputs may not be deterministic, in contrast to formal regression modeling. Thus, the whole procedure is usually repeated, and the converged predictions of the resulting ensembles are then averaged.
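The boosting loop described above can be illustrated with a deliberately simplified sketch. It assumes squared-error loss, a single feature and one-split trees (stumps); it is not the XGBoost implementation, which additionally uses regularization, second-order gradient information and subsampling. The function names `fit_stump` and `boost` and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree: pick the threshold on x that minimizes the
    squared error of the residuals (assumes x has at least two distinct values)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=50, eta=0.3):
    """Gradient boosting for squared-error loss with learning rate eta."""
    base = sum(y) / len(y)          # initial simple model: the mean
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # For squared error, the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + eta * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + eta * sum(t(xi) for t in trees)


# Toy data (hypothetical): a step-shaped, non-linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
model = boost(x, y)

mean_y = sum(y) / len(y)
mse_base = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boost = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_base, mse_boost)
```

Each additional tree is fitted to what the current ensemble still gets wrong, which is why the training error of the boosted model falls well below that of the initial mean prediction.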

In the basic model 1, we consider the first-year student numbers \(m_i\) and \(n_j\) at the home and university locations, and the distance \(r_{ij}\) between the locations. This is equivalent to a gravity model, but uses non-linear estimation techniques from machine learning.

A second model is based on the seminal idea of Simini et al. (2012), who introduced a radial “opportunity” component. This led to a significant improvement in the forecast of commuter movements for the U.S. at the municipality level, using a very reduced set of variables and no parameters: the number of inhabitants in the origin and destination regions (\(m_i\), \(n_j\)), \(s_{ij}\), defined as the inhabitants of all locations within radius \(r_{ij}\) around i, and the total number of commuters \(T_i\) in the system. Formally, it follows that

$${T}_{ij}^{\text{radial}}=T_{i}\frac{m_{i}n_{j}}{\left(m_{i}+s_{ij}\right)\left(m_{i}+n_{j}+s_{ij}\right)}=\frac{\vartheta }{M}\frac{{m}_{i}^{2}n_{j}}{\left(m_{i}+s_{ij}\right)\left(m_{i}+n_{j}+s_{ij}\right)}$$

Transferring these considerations to the mobility of first-year students, \(m_i\) is defined as the number of high-school graduates at place i. For the university location j, \(n_j\) is the number of students in the first semester at time t‑1. Basically, it is assumed that prospective students compare university locations under different conditions. Accordingly, \(s_{ij}\) describes the total number of freshmen at time t‑1 within a circle around the home county whose radius is the distance between the home county and the future university location, and thus represents all potential university locations in the vicinity. To calculate the average freshman migration \(T_{ij}\) from location i to j, the average freshman migration rate at time t‑1 for the entire country is also calculated: \(\vartheta\) is the total number of all mobile students (excluding students whose home region corresponds to the university region) and M is the number of all first-year students in Germany. Thus, this second model extends model 1 by adding \(s_{ij}\) as the representation of all other opportunities within the distance \(r_{ij}\) around i. Again, the XGBoost regressor allows for non-linear relationships among the variables and observations. The calculation is performed using the R package ‘xgboost’ (Yuan 2023).
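As a worked illustration of the radiation formula above, the following sketch computes the opportunity term and the resulting flow for one dyad. It is a minimal example under stated assumptions: the helper names, the toy counties and the mobile share of 0.6 are hypothetical, and excluding the origin and destination themselves from the opportunity term follows the convention of Simini et al. (2012) rather than a detail stated in this text.

```python
def intervening_opportunities(dist_from_i, n, i, j):
    """s_ij: first-year students (t-1) at all locations no farther from i
    than j is, excluding i and j themselves (assumed convention)."""
    r_ij = dist_from_i[j]
    return sum(n_k for k, n_k in enumerate(n)
               if k not in (i, j) and dist_from_i[k] <= r_ij)


def radiation_flow(m_i, n_j, s_ij, mobile_share):
    """T_ij = T_i * m_i * n_j / ((m_i + s_ij) * (m_i + n_j + s_ij)),
    with T_i = mobile_share * m_i and mobile_share = theta / M."""
    t_i = mobile_share * m_i
    return t_i * (m_i * n_j) / ((m_i + s_ij) * (m_i + n_j + s_ij))


# Hypothetical toy system with four counties; county 0 is the home county.
n = [800, 2000, 500, 3000]       # first-year students at t-1 per county
dist_from_0 = [0, 50, 30, 120]   # distance (or travel time) from county 0
m_0 = 1000                       # high-school graduates in county 0

s_01 = intervening_opportunities(dist_from_0, n, i=0, j=1)
flow_01 = radiation_flow(m_0, n[1], s_01, mobile_share=0.6)
print(s_01, round(flow_01, 2))
```

In this toy case only county 2 lies within the radius, so the opportunity term is 500; adding further opportunities inside the radius lowers the predicted flow to county 1.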

The final model 3 adds 28 co-variates to model 2 to control for important aspects in the study location choice of high-school graduates. A student decision is modeled as

$$U_{k}:=\left\{u_{k}^{1},\ldots ,u_{k}^{28}\right\},\quad \text{for } k\in \left\{i,j\right\} \text{ and } u_{k}^{m}\in \mathbb{R} \text{ for } m\in \mathbb{N}$$

This set comprises the most relevant motives derived from the literature, for the home and the university location respectively (cf. Table 1). That is, for each interaction and the set

$$\mathbb{R}^{\left| S\right| }\ni S_{ij}:=\left\{r_{ij},s_{ij},U_{i},U_{j}\right\},$$

the following function is sought (Spadon et al. 2019):

$$\textit{weight}\colon \mathbb{R}^{\left| S\right| }\longrightarrow \mathbb{N}$$

Table 1 Overview of the variables used

The co-variates control for infrastructure (e.g. accessibility of IC/EC/ICE stations), supporting/soft location factors (e.g. guest overnight stays), and environmental aspects (e.g. average temperature); they are defined and statistically described in Table 1.

Due to the choice of methodology, it is not necessary to normalize the variables. To fit the regressor, the data are first divided into a training data set (70%) and a test data set (30%). This is followed by a 5-fold cross-validation and the tuning of the hyperparameters. For this purpose, a grid of the hyperparameters (eta, gamma, min_child_weight, max_depth) is formed and all combinations are tested so that the Soerensen index is maximized. The hyperparameter eta corresponds to the learning rate, i.e. the step size shrinkage applied at each update to prevent overfitting. The parameter gamma is the minimum loss reduction required to split a node further; the larger gamma is, the more conservative the algorithm becomes. In addition, max_depth specifies the maximum depth of a tree with the aim of controlling overfitting. The fourth parameter, min_child_weight, also aims to limit overfitting; here, too, the larger the value, the more conservative the algorithm becomes.

In addition, 70% of the regional features (columns) are used in the construction of each tree to counteract possible endogeneity problems; that is, each tree is built with only 70% of the columns. Then, the tuned regressor is applied to the test data set and the goodness of fit is evaluated using various indicators (Spadon et al. 2019).
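The tuning procedure described above (70/30 split, 5-fold cross-validation, exhaustive grid over four hyperparameters) can be sketched schematically as follows. This is an illustration under assumptions, not the code used in the study: the grid values are hypothetical, and `toy_score` is a stand-in for fitting a regressor on the training folds and returning the Soerensen index on the validation fold.

```python
import itertools
import random


def train_test_split(n_rows, train_share=0.7, seed=0):
    """Shuffle row indices and split them into training and test sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_share * n_rows))
    return idx[:cut], idx[cut:]


def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def grid_search(train_idx, grid, score_fn, k=5):
    """Test every grid combination; keep the one with the highest mean CV score."""
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_folds(train_idx, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid; the values actually searched are not given in the text.
grid = {"eta": [0.05, 0.1, 0.3], "gamma": [0, 1],
        "min_child_weight": [1, 5], "max_depth": [3, 6]}


def toy_score(params, train_rows, val_rows):
    # Placeholder for: fit on train_rows, return Soerensen index on val_rows.
    return (1.0 - abs(params["eta"] - 0.1) - 0.01 * abs(params["max_depth"] - 6)
            - 0.01 * params["gamma"] - 0.001 * params["min_child_weight"])


train_idx, test_idx = train_test_split(1000)
best_params, best_score = grid_search(train_idx, grid, toy_score)
print(best_params)
```

The test rows are never touched during tuning; they are reserved for the final evaluation of the tuned regressor.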

A widely used measure to assess spatial interaction models is the Soerensen index (Soerensen 1948). It is used as a similarity measure and indicates the proportion of commuter flows that is correctly reproduced in simulated networks. The similarity measure takes values between 0 and 1. A value of 0 means that there is no correspondence with the original commuter flows; for a value of 1, the simulated network fully corresponds to the empirical network. The advantage of the Soerensen index is that it maintains sensitivity in more heterogeneous data sets and is less sensitive to outliers (McCune and Grace 2002). The indicator is calculated as follows, where \({T}_{ij}^{\text{empiric}}\) represents the empirical and \({T}_{ij}^{\text{model}}\) the calculated flows (Soerensen 1948):

$$SI=\frac{2{\sum }_{i=1}^{N}{\sum }_{j=1}^{N}\min \left({T}_{ij}^{\text{empiric}},{T}_{ij}^{\text{model}}\right)}{{\sum }_{i=1}^{N}{\sum }_{j=1}^{N}{T}_{ij}^{\text{empiric}}+{\sum }_{i=1}^{N}{\sum }_{j=1}^{N}{T}_{ij}^{\text{model}}}$$
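The index can be computed directly from two flow matrices. The sketch below is a minimal illustration with hypothetical toy matrices; the function name `soerensen_index` and the handling of two empty networks (returned as 1.0) are choices made here, not taken from the source.

```python
def soerensen_index(empirical, modelled):
    """Soerensen index between two N x N flow matrices (lists of lists):
    twice the shared flow volume divided by the total flow volume."""
    shared = sum(min(e, m)
                 for row_e, row_m in zip(empirical, modelled)
                 for e, m in zip(row_e, row_m))
    total = sum(map(sum, empirical)) + sum(map(sum, modelled))
    # Convention chosen here: two empty networks count as identical.
    return 2 * shared / total if total else 1.0


# Hypothetical flows between three counties (rows: origin, columns: destination).
empirical = [[0, 10, 5],
             [3, 0, 7],
             [8, 2, 0]]
modelled = [[0, 8, 6],
            [3, 0, 9],
            [4, 2, 0]]

si = soerensen_index(empirical, modelled)
print(round(si, 4))
```

For these toy matrices the shared volume is 29 out of 35 empirical and 32 modelled migrations, which gives 58/67.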

The evaluation is complemented by the Mean Squared Error (MSE), the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), the Pearson correlation coefficient and the adjusted \(R^2\) value.
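For reference, these complementary indicators can be written out in a few lines. The sketch uses the textbook definitions; the function names, the toy values and the choice of one predictor for the adjusted \(R^2\) are hypothetical and only serve the illustration.

```python
import math


def mae(y, pred):
    return sum(abs(a - b) for a, b in zip(y, pred)) / len(y)


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def rmse(y, pred):
    return math.sqrt(mse(y, pred))


def pearson(y, pred):
    mean_y, mean_p = sum(y) / len(y), sum(pred) / len(pred)
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, pred))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in pred)
    return cov / math.sqrt(var_y * var_p)


def adjusted_r2(y, pred, n_predictors):
    """1 - (1 - R^2) * (n - 1) / (n - p - 1), with R^2 = 1 - SS_res / SS_tot."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical observed and predicted flows for six dyads.
y = [3, 5, 2, 7, 4, 6]
pred = [2, 5, 3, 8, 4, 5]
print(mae(y, pred), rmse(y, pred), pearson(y, pred), adjusted_r2(y, pred, 1))
```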

For the evaluation of the XGBoost regressor, a reconstruction of the weighted structure of the mobility network is also performed. For this purpose, the influence of the characteristics on the forecast is determined using the SHAP value (Shapley Additive Explanations value) (Lundberg and Lee 2017; Lundberg et al. 55,56,a, b). As early as 1953 Lloyd Shapley introduced Shapley values in the context of game theory (Shapley 1953). The basic idea is that the prediction of the model is made with and without the feature in question. It should also be noted that the order in which new features are added has an effect on the model, so all permutations of the feature orders must be calculated.
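The permutation idea behind Shapley values can be made concrete with a small example. The sketch below enumerates all feature orderings and averages each feature's marginal contribution; it is an illustration of the game-theoretic definition only, not of how SHAP is computed for tree ensembles in practice (tree-specific algorithms avoid the exhaustive enumeration). The function `shapley_values` and the toy value function are hypothetical.

```python
from itertools import permutations


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution of each feature
    over all orders in which features can be added to the model."""
    contrib = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            contrib[f] += after - before
    return {f: c / len(orders) for f, c in contrib.items()}


def toy_prediction(coalition):
    """Hypothetical prediction when only the features in `coalition` are known:
    n_j adds 10, r_ij adds 4, both together add a further 6; s_ij adds nothing."""
    value = 0.0
    if "n_j" in coalition:
        value += 10.0
    if "r_ij" in coalition:
        value += 4.0
    if "n_j" in coalition and "r_ij" in coalition:
        value += 6.0
    return value


phi = shapley_values(["n_j", "r_ij", "s_ij"], toy_prediction)
print(phi)
```

The interaction effect of 6 is split evenly between the two interacting features, and the contributions sum to the difference between the full prediction and the empty baseline. The number of orderings grows factorially with the number of features, which is why exhaustive enumeration is only feasible for toy cases.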

The advantage of this approach is that the effects of the characteristics on the prediction for individual observations become visible, since in the case presented here, the SHAP value measures the contribution or importance of a county to the forecast. For this purpose, a graph (e.g., Figs. 2 and 3) is created from the SHAP values using the R package SHAPforxgboost to better interpret the results (Liu et al. 2021). The most important characteristic is placed at the top. Because the SHAP values are displayed graphically, the effect of each feature can be recognized immediately. For each feature, a dot is drawn for every predicted association. Thus, it is possible to determine the distribution of the impact of each feature across all interactions. Points in the negative range indicate that this predicted association has a negative impact on the model’s prediction performance. Conversely, a point in the positive range indicates that the prediction improves. The colors also represent the SHAP value and vary from low (yellow) to high values (purple). In each row, the mean of the absolute SHAP values for the respective variable is given. In contrast to standardized beta coefficients in traditional regression analysis, visual representations of SHAP values can help to differentiate non-linear relations between the dependent and the independent variables in an explorative way.

Overall, only a few studies so far use the full range of options of feature extraction and SHAP analysis along with an informative visualization; one notable exception is Li (2022). Most authors still use XGBoost as a black box for prediction without addressing the contributions of the features (e.g. Rahman and Chowdhury 2022), thereby falling short of the requirements for transparent, interpretable and explainable machine learning applications, as discussed in Miller (2019) or Roscher et al. (2020). The approach presented here comes intuitively close to formal regression analysis and its interpretation. Through the explicit feature extraction, the detailed effect size analysis via SHAP values and the subsequent visualization of the parameter influence, analysts are able to present important drivers of the phenomenon under investigation in a comprehensible way.

5 Empirical results

The dataset includes variables for the 401 counties and independent cities in Germany, which serve as the study area for the influence of study motivation on migration behavior in 2016. We omitted all dyads that include counties that do not host a higher education institution, because there is no option for studying there and, thus, no migration potential. The data for the regional characteristics (see Table 1) were taken from the database “Indicators and Maps of Spatial and Urban Development” of the Federal Institute for Research on Building, Urban Affairs and Spatial Development (INKAR 2023). In addition, travel times in minutes between all district cities of the counties were obtained from BBSR (2023), and climatic indicators representing county mean values were calculated on the basis of raster data of the German Weather Service (Deutscher Wetterdienst 2018, 2022, 2023). The migration flows of students from the home county to the university location were queried from the research database Frankfurt (FDZ 2019). For data protection reasons, migration flows of fewer than three students are recorded as 0 migrations.

All three models can be evaluated using the proposed model diagnostics. Table 2 reports a Soerensen index of 0.78, an MAE of 1.57 and an adjusted \(R^2\) value of 0.81 for model 3, which qualifies it as the best model. Overall, results improve gradually from model 1 to model 3 as more information is incorporated, which is in line with standard procedures in classical regression analyses. There is a great improvement from model 2 to model 3, which emphasizes the importance of including larger numbers of conceptually important co-variates. Thus, in contrast to the experience with formal regression models, acknowledging the complex interactions between the observations and the variables, together with the non-linear learning procedure, led to an increasingly good fit of the machine learning model. In addition, the introduction of co-variates greatly improves interpretability and transparency.

Table 2 Results of the evaluation indicators

Figure 1 shows that accessibility and average distances to existing infrastructure facilities (e.g. supermarkets) are particularly strongly negatively correlated with population density, the number of first-year students and the share of employees with academic degrees. There is a strong positive correlation between the number of first-year students (n) and both long-term university expenditures and the share of employees with academic degrees. A strong positive correlation can also be seen between travel time and the number of first-year students between the home region and the university region (s).

Fig. 1 Correlation of variables used. Source: own representation based on own calculations (BBSR 2023; Deutscher Wetterdienst 2018, 2022, 2023; FDZ 2019; INKAR 2023)

Looking at the SHAP values for the XGBoost regressor (see Fig. 2), it is clear that the number of first-year students at the university location is the most important characteristic in all three models. Model 3 demonstrates that university locations with a high number of first-year students have a positive effect on predicting student mobility flows. Locations with low numbers of first-year students have a negative effect; thus, the attractiveness of a study location is self-reinforcing and strongly path-dependent. Moreover, very skewed distributions and outliers seem to induce more extreme SHAP values, since all SHAP values for variable n that are below −5 belong to locations with very low spatial interactions. Another relevant aspect is travel time: short travel times in particular have a high impact on the predictive performance of the models, while long travel times decrease the predictive performance. Furthermore, the parameter s introduced by Simini et al. (2012), which represents the alternative opportunities for study location selection within the radius given by the distance between the home region and the future university location, can be identified as another important regional characteristic in models 2 and 3 (cf. Figs. 2 and 3). Locations with a large number of first-year students in their vicinity, on the other hand, have a negative influence on the prediction of migration. It becomes clear that the number of first-year students at the place of residence plays a role in all three models and that large locations benefit from their local pool of high school graduates.

Fig. 2 SHAP values for the XGBoost regressor (model 1 & 2). Source: own representation based on own calculations (BBSR 2023; INKAR 2023)

Fig. 3 SHAP values for the XGBoost regressor (model 3). Source: own representation based on own calculations (BBSR 2023; Deutscher Wetterdienst 2018, 2022, 2023; FDZ 2019; INKAR 2023)

Looking at model 3 (cf. Fig. 3), it becomes apparent that the regional characteristics in the respective university regions are of high importance. Among other things, long-term university funding measures have a surprisingly large influence on the forecast, although a clear direction of effect is not discernible here. Funding may also be interpreted as a proxy for other quality-related aspects of the students’ decision that are otherwise hard to capture, such as the quality of teaching and research. Furthermore, it becomes clear that low average distances to the nearest supermarket and high population densities at university locations have a positive effect on the calculation of student flows. Low population densities, on the other hand, have the opposite effect. Also, a high proportion of people employed in the IT sector, high childcare rates for young children and small households improve the prediction. Similar observations can be made for overnight guest stays: locations with high numbers of overnight stays in the home and university regions positively influence the forecast.

Counties with an old age structure cause a deterioration of the predicted mobility flows. This could also be determined for the population structure in the home county. In addition, high global radiation in the university region and a low proportion of local recreation areas are also of positive significance for the forecast. In general, high employment rates at the place of residence and at the place of work tend to lead to a deterioration of the forecast. A more differentiated look at the proportion of employees with an academic vocational degree at the university location shows that a high proportion has both positive and negative effects. No clear statements can be made for the remaining characteristics, such as household income.

6 Discussion and conclusion

This article explored what machine learning methods can offer for spatial interaction modeling. We provided evidence that algorithms such as XGBoost can deliver comprehensible results that are transparent, interpretable and explainable, and we suggested a set of model diagnostics and visualizations to support the interpretability of the results. What seems most important from the knowledge domain perspective is the need for a comprehensive consideration of independent variables and co-variates. Machine learning techniques already provide good model fits to the empirical data with few parameters, as presented in models 1 and 2 in this study. However, the predictions would be less explainable without a decent number of additional conceptually derived variables. It must also be noted that prediction with the XGBoost algorithm has an explorative character due to the hyperparameter tuning. The prediction of the migrations can be optimized through the targeted control and coordination of the parameters, with loss functions that are minimized by gradient descent.

Nevertheless, the approach also offers advantages. In particular, the visualization by means of SHAP values offers a deeper insight into the black box of the algorithm. It contributes to the understanding of the individual positive or negative influence of each region. This also makes it possible to measure the respective influence of counties or municipalities in other areas of interest for regional phenomena. This is something traditional regression methods cannot offer.
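The idea behind SHAP values can be illustrated with an exact Shapley value computation for a small toy model. The sketch below is an assumption-laden illustration, not the procedure used in this study: the model toy_model, its three inputs and the all-zero baseline are hypothetical, and features outside a coalition are simply replaced by baseline values, whereas SHAP implementations for tree models use more efficient algorithms and background data.

```python
# Minimal sketch: exact Shapley values for one observation of a toy model.
# Assumptions: hypothetical model and inputs; "absent" features are set to
# a fixed baseline value. Enumerating all coalitions is only feasible for
# a handful of features.
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Return one contribution per feature; they sum to f(x) - f(baseline)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their observed value, others the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print("contributions per feature:", phi)
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for the observation and the prediction for the baseline, and the interaction term is split equally between the two features involved. This additivity is what allows a single prediction to be decomposed into positive and negative influences per feature.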

Overall, the results from the XGBoost regression compare very well to the state of the art presented in the literature for our case study on high-school graduates’ migration to their preferred place of study. The socio-economic structure of the respective university region is of great relevance. The most important aspects, as discussed in the literature, are the number of first-year students in the previous year, which can be interpreted as the attractiveness of the location for prospective students, and the travel times to the university location. Locations with many students have a positive influence on the forecast.

In addition, interactions with very small migration movements (< 10) have a strong negative influence on the forecast. Migratory movements that are somewhat larger (≥ 20) also have a negative influence on the forecast performance, but this effect is considerably smaller.

Besides the size of the university location, the proximity to the home region is of particular relevance in the forecast. It becomes evident that the choice of university location depends strongly on the number of opportunities in the surroundings of the home location. In regions of origin where there are many first-year students and opportunities, the likelihood of choosing a particular university location decreases. In contrast, where there are few universities in the immediate vicinity, it can be assumed that these universities will take in many high school graduates.

Among the regional characteristics, aspects related to agglomeration effects are very important. High population densities, well-developed infrastructure and basic services (e.g. proximity to the nearest supermarket), a high proportion of employees, large numbers of overnight stays, small household sizes and a high rate of childcare lead to an improvement in the forecast. This is particularly interesting since these regional characteristics are mainly aspects that also play a role for graduates. In other words, regions with such a structure benefit more than average from the immigration of first-year students. This may also contribute to the recent observation of increasing employment opportunities in academia (Buenstorf et al. 2023).

The present analysis does not consider individual factors of the high-school graduates in the decision-making process, such as gender, educational background, and family ties, due to a lack of data. Likewise, indicators that represent qualitative aspects of a study location, such as the quality of teaching or the structure of the study program, were excluded. Thus, the attractiveness of the university location remains obscure and hidden behind the residual variable n in our case study. That said, there is a need to further explore the capacities of machine learning for developing new indicators that capture fuzzy concepts such as the attractiveness of a region. One promising avenue in this respect is discussed in Kriesch (2023), who suggests using machine learning and large language modeling for the classification of website data and the production of new regionalized variables in empirical studies in the field of economic geography and regional sciences. Given the encouraging results of the analysis presented here, it seems worthwhile to further explore what the dynamic field of machine learning has to offer for our knowledge domain of economic geography and regional sciences.