1 Introduction

The use of massive data in a digital environment has led to a disruptive change in the developed economies of the world. Even before the Big Data concept appeared, the amount of data collected already exceeded the capacity to process and analyze it. The generation of massive data by millions of device users, together with data analysis, has created a digital economy that was unimaginable decades ago [1].

The “Tourism Industry” [2] generates a large quantity of data to be analyzed. This sector has an increasing weight in Gross Domestic Product (GDP) and in turn generates externalities for economic agents [3].

This paper introduces a novel, previously unexplored analysis of the data generated on the Internet for the Spanish tourism accommodation market by country of origin. An innovative modelling approach is introduced that combines primary data sources (official sources) with secondary sources from Big Data (Google Trends, GT), following four basic principles of analysis: volume, velocity, variety and veracity. GT tracks the shift of searches over time and reveals consumer intentions.

The main objective of this paper is to forecast Hotel Overnight Demand in Spain (HODS) from January 2018 to June 2019 by establishing a causality model for monthly data. The multivariate method developed, Autoregressive Distributed Lags with seasonal variables (ARDL + seasonality), uses a search interest index (generated by GT) and monthly seasonal dummy variables as explanatory variables for HODS by country of origin. This second contribution is relevant because it allows tourism agents to make efficient decisions in the tourism market. To explain causal relations, a Granger-causality test extended with seasonality is developed; with this modelling we are able to identify when consumer interest occurs. Finally, a recently developed criterion for the selection of models, the Matrix U1 Theil, is applied in this paper [4]. The forecasts are compared with univariate techniques such as Seasonal Autoregressive Integrated Moving Average (SARIMA) and the relatively new non-parametric technique Singular Spectrum Analysis (SSA).

The remainder of this paper is organized as follows. Sect. 2 reviews the existing literature on the forecasting of Tourism Demand, influenced by the techniques of each period. Sects. 3 and 4 present the methodological development and information criteria together with the data analysis. The use of the criterion for the selection of predictive models based on Theil’s index is considered a great contribution to the literature. In Sect. 5 an empirical analysis is carried out to verify the application of the proposed methodology. Section 6 presents the conclusions, future lines of research for Data Scientists and some economic implications. Finally, the bibliographical references are listed.

2 Literature Review

Data science is a fundamental field for the exploitation and generation of knowledge to make efficient decisions. The bibliographic research carried out suggests that the appearance of new datasets from open data sources such as Google could modify culture and business in the Tourism field [5].

Tourism Demand is caused by multiple exogenous factors, and techniques have focused on obtaining robustness, dynamic modelling, scalability and granularity [6]. A variety of Big Data studies have been applied to Tourism research, bringing a great improvement to the area [7]. Traditionally these studies have been influenced by the techniques of the moment [8,9,10,11]. However, researchers have found the need for greater integration between computational and scientific fields [12].

In our study, we carry out an analysis with novel techniques and compare it with the most widely used techniques. A contribution of this study is the use of Big Data tools [13], summarized in an index of relevance provided by GT.

2.1 Forecasting Methods Using Google Search Engines (Google Trends)

Previous researchers such as Lu and Liu [14] found correlations between Internet search behaviour and the flows produced by tourists. Shimshoni et al. [15] concluded that 90% of the categories analyzed are predictable (categories: socio-economic fields), making a great contribution to the scientific literature.

Choi and Varian [16] used R to develop several examples with the GT tool, including an analysis of tourism demand in Hong Kong. They obtained models with high explanatory capacity (on average \( R^{2} = 73{\% } \)) using ARDL. Gawlik et al. [17] concluded that the evolution of GT search popularity is a useful predictor of tourism rates for a series of arrivals to Hong Kong. For the Charleston region (USA), practical and interesting applications of search engine data were found; the main limitation is that the study covered only one city [18].

To forecast Chinese tourist volumes, Yang, Pan et al. [19] proposed and demonstrated the value of web search data, comparing the Baidu search engine with GT. Similarly, using GT data and comparing purely autoregressive models with ARDL models with seasonal dummy variables, short-term results were obtained for Vienna with data from image, word or YouTube video searches [20].

Studies using GT have improved predictions for the Caribbean area. Autoregressive Mixed-Data Sampling models represent an improvement over SARIMA (Seasonal Autoregressive Integrated Moving Average) and AR models for 12-month predictions [21].

Tourist flows from Japan to South Korea have been examined by constructing a Google variable and selecting the model with the lowest Mean Square Error (MSE) or mean absolute forecast error for monthly data; the best results were found for the model that uses Google data [22].

In the case of tourist flows from Spain, Germany, UK and France, Google data was used with the construction of indicators through Dynamic and SARIMA models [23]. For tourist arrivals in the city of Vienna [24], Google Analytics data was extracted using Bayesian methods. In the case of Puerto Rico, the volume of searches has been studied to predict the hotel demand of non-residents with a Dynamic Linear Model. The results showed improvements in forecasting time horizons greater than 6 months [25]. Google data has been used for the flow of tourists in Portugal [26] and tourists flow in Spain [27].

Irem Önder [28] compared forecasting models with web and/or image search indices for two cities (Vienna and Barcelona) and two countries (Austria and Belgium). Zeynalov [29] analyzed tourist arrivals in Prague with the objective of assessing whether GT is useful for forecasting tourist arrivals and overnight stays in Prague with weekly data. The results confirm that predictions based on Google searches are advantageous for policymakers and businesses operating in the Tourism sector.

The online behaviour of hotel consumers for the United States of America was researched with Discrete Fourier Transformation using data from GT, with empirical evidence for its use in marketing strategies [30].

For Amsterdam, Rödel [31] investigated the forecasting of Tourism Demand using keywords related to “Amsterdam” in GT. With the development of Big Data technology in recent decades, collaborative economy companies have emerged; a study of a vacation rental company that operates worldwide has been carried out, restricted to results for the Iberian Peninsula [32]. In 2018, a study was published on the online and offline behaviour of consumers of US restaurants using Google and Baidu search engine data [33].

The data provided by Google use an index that summarizes the interest in the search words. In the case of Baidu, Li et al. [34] developed an index of interest from Baidu data, demonstrating the capacity of the Generalized Dynamic Factor Model (GDFM) to forecast tourist demand at a destination, using monthly Beijing tourist volumes from January 2011 to July 2015. A relevant study using Machine Learning algorithms is that of Sun et al. [35], which uses model selection criteria such as Normalized Root Mean Squared Error (NRMSE) and MAPE, together with the Diebold-Mariano test to determine whether the differences between predictions are significant.

Measures of forecasting. As observed above, the Tourist Industry has attracted interest in the past and present and will continue to do so, mainly because it is a signal of the evolution of the service economy. The modelling used is very diverse, and one aspect to take into account is the criterion used for model selection. The literature review shows the use of Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE); Theil’s index [36,37,38,39]; and Symmetric Mean Absolute Percentage Error (SMAPE) [40]. Some authors developed the RMSE ratio [41, 42]; in this article, we apply the Matrix U1 Theil as a criterion for the selection of forecasting models [4]. This method quantifies the gain from using one methodology versus another.

To summarize the review of the literature, new models have been used in Data Science. In this work, new methodologies are developed, such as the improved Granger causality test for seasonal data. Dynamic models are developed to analyze forecasting capacity in the short and long term. Big Data tools from one of the largest search engines in the world are used, and a decision matrix on predictive capacity is developed for different time horizons.

3 Methodology

In this section, the scheme of the cycle between supply and demand in tourism (see Fig. 1) is developed under four basic principles of Big Data. The objective of our paper is modelling and forecasting; we therefore take the data from the Data Warehouse as given [43]. The data come from the official sources of the INE and from Google, so all of the Extraction, Transformation and Loading (ETL) work [44] is done by the data engineering of these entities. The main objective is to make efficient, knowledge-based predictions to improve the user experience on the Tourism Demand side and the offers of the stakeholders.

Fig. 1 Data life cycle and efficiency decision scheme. Own elaboration

3.1 Modelling and Forecasting Evaluation

In this paper, an ARDL + seasonality model is proposed and its application with data from Big Data architectures is analyzed. This modelling shows how HODS is generated through the searches of Google users (by country of origin). The purpose of the model is to establish the causality relationship and to make forecasts. A test is developed to analyze Granger causality in the presence of seasonality. To evaluate forecasting capacity, the Matrix U1 Theil is developed by country of origin; it provides a comparative dimensionless measure among models. For more detail on the predictions made, the reader can refer to the references on SARIMA [45] and Singular Spectrum Analysis [46]. All models are estimated for different scenarios, and forecast comparisons are made for time horizons h = 3, 6, 12, 18.

Granger causality and seasonality testing: ARDL and ECM. We develop the test proposed by Granger [47] and discussed by Montero [48] to detect causality, since causality cannot be observed with a simple correlation analysis.

The model considered by Granger is for two variables \( (y_{t}, x_{t}) \). Due to the great influence of seasonality in the Tourism sector [49], the following equation is proposed, estimated with the HAC covariance method, which provides robust standard errors for the estimated parameters:

$$ \ln \left( {y_{t} } \right) = \beta_{0} \ln \left( {x_{t} } \right) + \sum\limits_{j = 1}^{m} {\beta_{j} \ln \left( {x_{t - j} } \right)} + \sum\limits_{j = 1}^{m} {\alpha_{j} \ln \left( {y_{t - j} } \right)} + \sum\limits_{i = 1}^{12} {\delta_{i} } w_{i} + \varepsilon_{t}^{\prime } $$
(1)

where \( w_{i} \) is a deterministic seasonal dummy (i = 1, …, 12) component and for monthly data is defined as follows:

$$ \begin{aligned} & w_{1} = - 1, \; w_{i} = 0 \text{ for others} \\ & w_{1} = - 1, \; w_{2} = 1, \; w_{i} = 0 \text{ for others} \\ & w_{1} = - 1, \; w_{3} = 1, \; w_{i} = 0 \text{ for others} \\ & \quad \vdots \\ & w_{1} = - 1, \; w_{12} = 1, \; w_{i} = 0 \text{ for others} \end{aligned} $$
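The dummy coding above can be sketched in a few lines of Python. This is only an illustration under an assumed reading of the scheme: \( w_{1} \) equals -1 for every observation, and for months 2 to 12 the dummy of the corresponding month equals 1. The function name `seasonal_dummies` is hypothetical and does not come from the paper.

```python
def seasonal_dummies(month):
    """Return [w_1, ..., w_12] for a calendar month in 1..12.

    Assumed reading of the scheme: w_1 = -1 in every month, and for
    months 2..12 the dummy of that month equals 1; all others are 0.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    w = [0] * 12
    w[0] = -1              # w_1 is -1 for every observation
    if month >= 2:
        w[month - 1] = 1   # w_i = 1 only in month i (i = 2, ..., 12)
    return w


# Dummies for January, February and December
for m in (1, 2, 12):
    print(m, seasonal_dummies(m))
```

Under this reading, \( w_{1} \) behaves like a (negative) constant term, so January acts as the reference month and the remaining coefficients measure monthly deviations from it.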

The HAC covariance method provides standard errors for the estimated parameters that are robust to heteroskedasticity and autocorrelation. The resulting residual \( \varepsilon_{t}^{{\prime }} \) is assumed to be distributed as white noise.

The test statistic for causality with seasonal effects (testing linear restrictions on the parameters of \( x_{t - j} \) and \( w_{i} \)) is asymptotically (\( T \ge 60 \)) distributed as Chi-squared [50].
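One common way to write such a test of linear restrictions is the Wald form below. It is given only as a sketch: the symbols \( \theta \), \( R \), \( q \) and \( \hat{V}_{HAC} \) are notation introduced here for illustration, and the exact set of restrictions used in the paper is not spelled out in the text.

$$ H_{0} :R\theta = 0,\quad W = \left( {R\hat{\theta }} \right)^{\prime } \left[ {R\hat{V}_{HAC} R^{\prime } } \right]^{ - 1} \left( {R\hat{\theta }} \right)\mathop \to \limits^{d} \chi_{q}^{2} $$

Here \( \theta \) stacks the parameters of Eq. (1), \( R \) is a \( q \times k \) matrix that selects the coefficients under test (for instance \( \beta_{0} , \ldots ,\beta_{m} \) for the absence of Granger causality from \( x_{t} \) to \( y_{t} \)), and \( \hat{V}_{HAC} \) is the HAC estimate of the parameter covariance matrix.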

The most general expression of a dynamic ARDL (m, n) model with seasonal components is as follows [51, 52]:

$$ \gamma \left( L \right)\ln \left( {y_{t} } \right) = \delta \left( L \right)\ln \left( {x_{t} } \right) + \sum\limits_{i = 1}^{12} {\alpha_{i} w_{i} } + \varepsilon_{t} $$
(2)

To evaluate the dynamic persistence of an effect of the exogenous variable at a certain moment, the Error Correction Model (ECM regression or ARDL Error Correction Regression) is constructed. The ECM regression is as follows:

$$\begin{aligned} \Delta \ln \left( {y_{t} } \right) = & \delta _{0} \Delta \ln \left( {x_{t} } \right) + \sum\limits_{{j = 1}}^{n} {\lambda _{j} \Delta \ln \left( {x_{{t - j}} } \right)} + \sum\limits_{{j = 1}}^{m} {\delta _{j} \Delta \ln \left( {y_{{t - j}} } \right)} \\ & - \gamma \left( L \right)\left[ {\ln (y_{{t - 1}} ) - \beta \ln (x_{{t - 1}} )} \right] + \sum\limits_{{i = 1}}^{{12}} {\alpha _{i} w_{i} } + \varepsilon _{t} \\ \end{aligned} $$
(3)

In this model, short-term effects are represented by the parameters of the first-differenced variables, while long-term effects (\( |\gamma (L)| < 1 \)) are represented by the Error Correction term. According to Zivot [53], if the long-term effect is not statistically significant, cointegration does not exist. The long-run multiplier is defined as \( \beta = \frac{\delta \left( L \right)}{\gamma \left( L \right)} \).
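As a hedged illustration of the long-run multiplier, the sketch below evaluates both lag polynomials at \( L = 1 \). It assumes the usual ARDL parametrization, \( \gamma (L) = 1 - \sum_{j} \alpha_{j} L^{j} \) and \( \delta (L) = \sum_{j} \delta_{j} L^{j} \), which the text does not state explicitly. The function name and the coefficient values are hypothetical.

```python
def long_run_multiplier(delta, alpha):
    """Long-run multiplier beta = delta(1) / gamma(1) of an ARDL model.

    delta: coefficients [delta_0, ..., delta_n] on ln(x_t), ..., ln(x_{t-n})
    alpha: coefficients [alpha_1, ..., alpha_m] on ln(y_{t-1}), ..., ln(y_{t-m})
    Assumes gamma(L) = 1 - sum_j alpha_j L^j, evaluated at L = 1.
    """
    gamma_at_one = 1.0 - sum(alpha)
    if gamma_at_one == 0.0:
        raise ValueError("gamma(1) = 0: the long-run multiplier is undefined")
    return sum(delta) / gamma_at_one


# Hypothetical coefficients, chosen only to illustrate the arithmetic:
# delta(1) = 0.25 + 0.125 = 0.375 and gamma(1) = 1 - 0.5 = 0.5
beta = long_run_multiplier(delta=[0.25, 0.125], alpha=[0.5])
print(beta)
```

With these made-up values a permanent 1% rise in the search index would be associated with a 0.75% rise in overnight stays in the long run, since both variables enter in logarithms.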

Forecasting Evaluation: Theil’s measures. To verify the forecasting accuracy of the different models, we adopt an evaluation criterion that compares out-of-sample forecasting performance. We work with the inequality index of Theil [36]:

$$ U_{1} = \frac{{\left[ {\frac{1}{h}\sum\limits_{k = 1}^{h} {\left( {y_{T + k} - \hat{y}_{T + k} } \right)^{2} } } \right]^{1/2} }}{{\left[ {\frac{1}{h}\sum\limits_{k = 1}^{h} {\left( {y_{T + k} } \right)^{2} } } \right]^{1/2} + \left[ {\frac{1}{h}\sum\limits_{k = 1}^{h} {\left( {\hat{y}_{T + k} } \right)^{2} } } \right]^{1/2} }} $$
(4)
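The U1 index in Eq. (4) can be computed directly from the out-of-sample observations and their forecasts. The following is a minimal sketch in plain Python; the function name `theil_u1` and the sample numbers are illustrative assumptions, not data from the paper.

```python
import math


def theil_u1(actual, forecast):
    """Theil's U1 inequality index over a forecast horizon.

    Both sequences must have the same non-zero length h. The index lies
    between 0 (perfect forecast) and 1.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    h = len(actual)
    rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h)
    rms_actual = math.sqrt(sum(a ** 2 for a in actual) / h)
    rms_forecast = math.sqrt(sum(f ** 2 for f in forecast) / h)
    denominator = rms_actual + rms_forecast
    if denominator == 0.0:
        raise ValueError("U1 is undefined when both series are identically zero")
    return rmse / denominator


# Made-up values for a horizon of h = 3 months (not taken from the paper)
observed = [12.0, 15.0, 20.0]
predicted = [11.0, 16.0, 19.0]
print(theil_u1(observed, predicted))
```

Because numerator and denominator are in the same units, the index is dimensionless, which is what makes it comparable across countries of origin and horizons.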

The Ratio Theil’s (RT’s) is designed to compare the predictions of pairs of models for horizons h = 3, 6, 12, 18.

$$ RT's_{{y_{it} ,y_{jt} }} = \frac{{U_{1}^{{y_{it} }} }}{{U_{1}^{{y_{jt} }} }} $$
(5)

In the mathematical interpretation of the RT’s, three situations are described according to the predictive capacity of models: if the RT’s is equal to one, both models have the same explanatory capacity; if the ratio is greater than one, this would indicate that the denominator’s model has a better explanatory capacity than that of the numerator; if the ratio is less than one, the numerator’s model has better predictive results than the denominator.
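The pairwise comparison described above can be sketched as a small matrix of ratios. The example below is an assumption-laden illustration: the helper `theil_ratio_matrix` is hypothetical and the U1 values are invented, so the resulting numbers say nothing about the models evaluated in this paper.

```python
def theil_ratio_matrix(u1_by_model):
    """Build the matrix of Theil ratios RT = U1(numerator) / U1(denominator).

    u1_by_model maps a model name to its (strictly positive) U1 value.
    The result is a nested dict: matrix[numerator][denominator].
    """
    names = list(u1_by_model)
    return {
        num: {den: u1_by_model[num] / u1_by_model[den] for den in names}
        for num in names
    }


# Invented U1 values for one horizon (illustration only)
u1_values = {"ARDL": 0.02, "SARIMA": 0.04, "SSA": 0.05}
matrix = theil_ratio_matrix(u1_values)

for numerator, row in matrix.items():
    for denominator, ratio in row.items():
        if numerator != denominator:
            better = numerator if ratio < 1 else denominator
            print(f"RT({numerator}/{denominator}) = {ratio:.2f} -> {better} forecasts better")
```

A ratio below one favours the numerator's model, and one minus the ratio can be read as the relative gain in U1 from using it instead of the denominator's model.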

4 Data

The data on the number of HODS were collected by INE. For the number of tourists in Spain by country of origin, the dataset covers January 2010 to June 2019. In the grouping of nationalities, the category “Residents abroad” should be noted: it includes all foreign nationalities except the six main nationalities described in the table (Germany, France, Italy, Netherlands, UK, USA).

According to the data represented in Fig. 2, the average for Residents abroad was 16,180,005.75 in the period cited. The maximum hotel occupancy was recorded in August 2017, with 29,594,071, and the minimum, 11,887,105, in January 2010.

Fig. 2 Number of HODS and keyword “visit Spain” for Residents abroad (Jan. 2010–June 2019). Own elaboration

To obtain data from Google, the Big Data tool GT has been used. GT has previously been used to make forecasts, as cited in the literature review. The lowest interest occurred in December 2010. For the keyword or Google Query (GQ) “visit Spain”, the greatest worldwide interest was recorded in May 2017, three periods ahead of the historical maximum of overnight stays in Spain.

Comparing the maximum and minimum values of both series, it can be seen graphically that searches on the Internet are made at least one period in advance.
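A simple descriptive way to look for such a lead is to correlate searches at time \( t - k \) with overnight stays at time \( t \) for several values of \( k \). The sketch below uses synthetic series in which stays repeat the search values two periods later; the function names and data are hypothetical, and this check is not the Granger test developed in Sect. 3.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def lead_correlation(searches, stays, lag):
    """Correlation between searches at t - lag and stays at t."""
    if lag == 0:
        return pearson(searches, stays)
    return pearson(searches[:-lag], stays[lag:])


# Synthetic data: stays echo the search index two periods later
searches = [3, 7, 2, 9, 4, 8, 1, 6, 5, 10, 2, 7]
stays = [0, 0] + searches[:-2]

correlations = {k: lead_correlation(searches, stays, k) for k in range(5)}
best_lag = max(correlations, key=correlations.get)
print(best_lag)
```

On real monthly data the strong common seasonality of both series can inflate such correlations, which is one reason a causality model with seasonal dummies is used rather than correlation alone.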

Table 1 displays a summary of the variables selected by nationality: hotel demand and GQ. For the two series selected, only the “Google Queries” variable for Residents abroad (and the USA HODS series) meets the normality hypothesis at the 95% confidence level (Jarque-Bera). As for stochastic trends (ADF test), hotel demand has a unit root for all nationalities, whereas for Google Queries evidence of a unit root is found in only three cases: Residents abroad, UK and USA. Regarding stationarity in variance (KPSS), more stationary behaviour is observed in the hotel demand variable for all nationalities, including Residents abroad. In the Google Queries variable, by contrast, the series for Residents abroad, UK and USA are clearly non-stationary.

Table 1 Mean and stationary analysis of HODS and keyword “visit Spain” sample period Jan. 2010–December 2017. P-values in brackets. Own elaboration

5 Empirical Results

The empirical results obtained by applying the methodology proposed above are briefly summarized in this section. Among the predictive techniques considered, we focus expressly on the dynamic model with Internet searches (“visit Spain”) and seasonal factors as explanatory variables. The Granger-causality test extended to seasonality confirms this relationship at a confidence level of at least 95%. As usual in the literature, forecasting is carried out for time horizons h = 3, 6, 12, 18 months. The training period runs from January 2010 to December 2017 and the out-of-sample period from January 2018 to June 2019.

The results obtained through the Granger causality test including seasonal factors have determined that the number of HODS could be explained by the number of searches generated on the internet and by a systematic seasonality (Fig. 3).

Fig. 3 Out-sample forecast HODS h = 18 (Jan. 2018–Jun. 2019). Own elaboration

The ECM with seasonality obtained for residents abroad is as follows (lags selected under Akaike Info Criterion):

$$ \begin{aligned} \Delta \ln \left( {\hat{y}_{t} } \right) = \mathop { - 0.28}\limits_{(0.00)} \Delta \ln \left( {x_{t} } \right) - \mathop {0.13}\limits_{(0.03)} \left[ {\ln \left( {y_{t - 1} } \right) - \mathop {0.55}\limits_{(0.00)} \ln \left( {x_{t} } \right)} \right] + \sum\limits_{i = 1}^{12} {\hat{\alpha }_{i} w_{i} } + \hat{\varepsilon }_{t} \hfill \\ Sample: \, 2010M1 \, 2017M12 \, R^{2} = 0.9888 \hfill \\ \end{aligned} $$
$$ \begin{aligned} \sum\limits_{{i = 1}}^{{12}} {\hat{\alpha }_{i} } w_{i} = & \mathop { - 22.41}\limits_{{(0.03)}} w_{1} \mathop { + 1.86}\limits_{{(0.02)}} w_{2} + \mathop {2.05}\limits_{{(0.01)}} w_{3} + \mathop {2.08}\limits_{{(0.01)}} w_{4} + \mathop {2.23}\limits_{{(0.00)}} w_{5} + \mathop {2.13}\limits_{{(0.01)}} w_{6} \\ & + \mathop {2.11}\limits_{{(0.01)}} w_{7} + \mathop {2.01}\limits_{{(0.02)}} w_{8} + \mathop {1.81}\limits_{{(0.04)}} w_{9} + \mathop {1.65}\limits_{{(0.06)}} w_{{10}} + \mathop {1.16}\limits_{{(0.18)}} w_{{11}} + \mathop {1.48}\limits_{{(0.08)}} w_{{12}} \\ \end{aligned} $$

In the model defined for the HODS of residents abroad, two aspects stand out (p-values in brackets): first, the existence of a cointegration relationship; second, the strong influence of seasonality. Table 2 shows the models and results for HODS by country of origin.
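As a rough arithmetic reading of the estimated adjustment coefficient (-0.13), one can ask how long it takes for half of a deviation from the long-run relationship to be corrected. The sketch below makes a simplifying assumption that is not in the paper: short-run terms are ignored, so a deviation shrinks by the factor \( 1 - 0.13 \) each month. The function name `half_life` is hypothetical.

```python
import math


def half_life(adjustment):
    """Periods needed to close half of a disequilibrium gap.

    adjustment: error-correction coefficient, expected in (-1, 0).
    Assumes the gap decays geometrically by (1 + adjustment) per period.
    """
    if not -1.0 < adjustment < 0.0:
        raise ValueError("adjustment coefficient must lie in (-1, 0)")
    return math.log(0.5) / math.log(1.0 + adjustment)


# Adjustment coefficient reported in the ECM for residents abroad
months = half_life(-0.13)
print(round(months, 2))
```

Under this simplification, half of a deviation would be corrected in roughly five months, which gives an order of magnitude for the speed of adjustment rather than a result established by the paper.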

Table 2 Summary of ARDL + seasonality models by country of origin for HODS. Sample Jan. 2010–December 2017. The table shows no relevant seasonality (months). Own elaboration

On the one hand, all models show a long-term relationship (except for the UK) at the 95% confidence level (USA at 90%). On the other hand, all models are affected by monthly seasonality; notably, for Germany as country of origin the seasonal coefficient of every month is significantly different from zero.

Once the results of the three forecasting models cited in the methodology section have been obtained by nationalities of tourists who visit Spain, the RT’s can be applied to quantify which model is better in predictive terms.

The forecasting accuracy results (see Table 3) depend on the time horizon used and the country of origin analyzed.

Table 3 Matrix U1 Theil forecasting evaluation (Jan. 2018–June 2019): RT’s by country of origin. Own elaboration

In general, SARIMA models have obtained better results than SSA models (except for the Netherlands with h = 12, 18). When comparing with the ARDL causal models with seasonality, however, the diversity of the results does not allow us to conclude which model has the best forecasting capacity. With a time horizon of 3 months, SARIMA presents the best results for three origins (Residents abroad, France, UK); for the rest, ARDL with seasonality obtained better forecasting results. For a 6-month time horizon, ARDL with seasonality outperformed SARIMA for France and the Netherlands. For the 12-month and 18-month time horizons, gains from using ARDL models with seasonality are observed for Germany and the Netherlands. In the remaining cases, the SARIMA models are superior to the other models analyzed in this paper.

6 Conclusions

In this paper, the importance of forecasting modelling and of the historical analysis carried out in the literature review has been highlighted. The four dimensions of Big Data have been discussed. Volume: the technologies coming from Google tools for data ETL have made it possible to analyze the main origin markets of tourism in Spain. Velocity: related to the volume of data, the data engineering provided by Google technologies allows us to monitor the Tourism Demand search intentions of the main nationalities who visit Spain. Variety: the use of a primary data source (INE) and a secondary one (Google) has made it possible to build knowledge based on the data; this is a novel aspect of the analysis, since users show their interest by searching for information on the Internet. Veracity: the data have been verified through the cointegration tests carried out. Together, these have made it possible to model the forecasts of Spanish hotel demand by country of origin.

In addition, this article has used more common techniques (SARIMA or ARDL) alongside a novel technique named SSA. The contribution, in particular, can be divided into the following points:

  1. A Granger causality test extended to seasonality has been developed. In the literature, it was usual to perform only the test between endogenous and exogenous variables.

  2. A model selection criterion based on the predictive capacity of the models has been developed (RT’s). In previous literature, the gain from the use of one model over another has not been quantified. The Theil ratio quantifies the gain between pairs of models.

  3. Related to the previous point, econometric modelling with data from Big Data technologies does not guarantee an improvement in forecasting capacity. This has been shown for the main nationalities that visit Spain.

  4. Concerning the dynamic models with seasonality, we have empirically shown that hotel demand decisions are made at least one period in advance.

  5. A cointegration relationship has been revealed, expressed in the ECM model.

We can conclude that the models used in this work improve the explanatory capacity of causality (\( R^{2} \) close to 1), demonstrate cointegration relationships and provide seasonal knowledge for decision making on Spanish Tourism Demand. According to the results obtained, it is not possible to conclude that there is a gain in forecasting from the use of tools from Big Data engineering, in contrast to what some authors claim [35]. The econometric interpretation of the causality models, together with the economic interpretation, can facilitate an adjustment of the offer in terms of prices or even advertising aimed at the agents interested in visiting Spain. This article provides a basis for future research in which data from Big Data technologies are used to make efficient decisions. The theoretical framework could be developed in fields where online markets are relevant. Suitable fields for this type of analysis could be Finance, Automotive, Insurance or any market in which consumers search on the Internet and those searches translate into a quantifiable final decision.