Multivariate Cuban Consumer Price Index Database, Statistic Analysis and Forecast Baseline Based on Vector Autoregressive

Rosado, Reynaldo; González Diéz, Héctor; Toledano-López, Orlando Grabiel; Hernández Heredia, Yanio

doi:10.1007/978-3-031-49552-6_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14335)

Included in the following conference series:

International Workshop on Artificial Intelligence and Pattern Recognition

Abstract

The global Consumer Price Index (CPI) is a monthly multivariate time series that measures the variation in the final consumer prices of a given set of goods and services for households living in a given geographic region, city or country. The present work addresses the construction of a multivariate time series database of Cuba’s CPI and a corresponding forecasting model based on Vector Autoregression (VAR) to establish a baseline for this dataset. A statistical analysis of the data characterizes each variable of the series in terms of its relevance to the multivariate problem and its causal relationships, together with a stationarity analysis to select the best lag for the forecasting model. The main statistical evidence from each test is reported as a starting point for future research in the field of deep learning.

Index terms—Consumer Price Index || Multivariate Time Series Forecasting || Vector Autoregression

1 Introduction

The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. The CPI is a widely used indicator of inflation and is managed in Cuba by the National Office of Statistics and Information (ONEI), and by similar organizations and institutes in other countries. Governments and financial entities systematically take the CPI as a reference for monetary policy decisions. It is also used in various aspects of social finance, such as retirement, unemployment and government financing [25]. The CPI is typically calculated using a Laspeyres index formula [5], which holds the basket of goods and services constant over time. However, the ONEI also calculates a multivariate CPI, which accounts for changes in the quality of goods and services over time. This is done using a hedonic regression model, which estimates the value of different product characteristics (such as Food and Drinks, Health, Houses, Transportation, among others) and adjusts prices accordingly [18, 24].
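For reference, the Laspeyres index mentioned above is usually written as follows. This is the standard textbook form, not a formula taken from the ONEI methodology; the symbols $p_{i,t}$ (price of item $i$ in period $t$), $q_{i,0}$ (base-period quantity) and $s_{i,0}$ (base-period expenditure share) are notation assumed here for illustration.

$$P_L(t)=\frac{\sum_{i} p_{i,t}\,q_{i,0}}{\sum_{i} p_{i,0}\,q_{i,0}}=\sum_{i} s_{i,0}\,\frac{p_{i,t}}{p_{i,0}},\qquad s_{i,0}=\frac{p_{i,0}\,q_{i,0}}{\sum_{j} p_{j,0}\,q_{j,0}}$$

The second form shows the index as a weighted average of price relatives with fixed base-period weights, which is why the basket is said to be held constant over time.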

The multivariate CPI is generally considered to be a more accurate measure of inflation than the traditional CPI, as it better accounts for changes in quality. However, it is also more complex to model, because the relevance of each good and service must be taken into account (the weights normally need to be defined manually by experts). The multivariate CPI accounts for changes in the quality of goods and services over time, whereas the univariate CPI assumes that the quality of the basket of goods and services remains constant. This means that the multivariate CPI provides a more accurate measure of inflation, since it adjusts for changes in quality and captures the true cost of living. The work in [15] provides an overview of identification problems in macroeconomics, including those related to constructing price indexes, and of the main advantage of the multivariate approach.

In the case of Cuba, the weighting reflects the data obtained in the National Survey of Household Income and Expenditures (ENIGH), which was conducted between August 2009 and February 2010. The weights of goods and services are therefore based on the consumption expenditures of households at that time. The goods and services that affect Cuba’s CPI are: 01 Food and non-alcoholic beverages; 02 Alcoholic beverages and tobacco; 03 Clothing; 04 Housing services; 05 Furniture and household items; 06 Health; 07 Transportation; 08 Communications; 09 Recreation and culture; 10 Education; 11 Restaurants and hotels; 12 Miscellaneous personal care goods and services [6].

The Monthly Publication of the CPI from the National Office of Statistics and Information (ONEI) reports the average variation in the prices of a basket of goods and services that is representative of the consumption of the population in a given period. Approximately 33 596 prices are collected monthly in 8 607 establishments located in 18 municipalities throughout Cuba (the urban areas of the head municipalities of 14 provinces and 4 municipalities of Havana province), which gives national coverage. This means that the index is representative only at the country level; it does not exist at the level of regions or municipalities. The basket of goods and services includes 298 items that represent more than 90.0% of household expenditure. The data are published as reports in PDF format, which makes them difficult to process and analyse because there is no integrated view of the database [18].

Both the prices of the products and services from which the CPI estimate originates and the CPI itself are calculated systematically, so they are time series data. Because a CPI forecast helps to estimate future trends, it is key for decision-making. Moreover, it allows the application of price stabilization policies to reduce the economic impact on the prices of products and services demanded by consumers. In economies that are unstable, CPI data fluctuate over time, which translates into non-linear and non-stationary behaviour [19].

In general, most approaches in the CPI forecasting field model the problem as a univariate time series and concentrate only on the global indicator. The multivariate formulation, which takes into account the price variation of each good or service in the basket, has received little attention, since the global index is a weighted aggregation of the prices of the individual products. The most widely used statistical methods for forecasting the CPI as a univariate time series have been the family of autoregressive models [2, 7, 9, 14, 16, 17]. Recently, deep learning techniques for time series forecasting have improved the performance of CPI prediction. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) architectures can capture time dependence in data while handling more than one output variable to estimate more than one time instant. Three examples that show good performance with a simple LSTM model [25] and temporal data at different time intervals come from Mexico [11], Ecuador [20, 21] and Indonesia [13]. Despite the strong performance of RNN models for time series forecasting, particularly for financial series, the problem has mostly been studied on the basis of autoregressive models such as VAR. This is due to the limited availability of data in this scenario, which is also reflected in the CPI, as described in the previous works.

The aim of this paper is to propose a new multivariate time series database of Cuba’s monthly CPI and a corresponding forecasting model based on Vector Autoregression as a baseline, following standard statistical analysis methodology. A statistical analysis of the data characterizes each variable of the series in terms of its relevance to the multivariate problem and its causal relationships, together with a stationarity analysis to select the best lag for the forecasting model.

2 Multivariate Analysis

2.1 Definitions and Notation

A multivariate time series is defined as a collection of multiple variables that are spatially related and that individually show a temporal relationship. Classical statistical or machine learning models need to treat the univariate and multivariate problems differently, whereas deep learning models can handle both interchangeably with high accuracy. Time series are usually characterized by three components: trend, seasonality and residuals [23]. In real-world time series, and in the CPI problem in particular, seasonality can be affected by external factors such as economic and financial crises, the prices of the main products in the world market, and emerging situations such as the COVID-19 pandemic.

In a more formal definition of a multivariate time series, we have m variables, each of which is observed as a time series. These variables are correlated in such a way that the value of each of them at time t is related to a temporal window of the p previous values of all variables, including its own past values. We can represent each forecast in the set of variables at time t as a linear combination:

$$\begin{aligned} \hat{y}^1_t=\,&w^1_{0}+w^1_{11}y^1_{t-1}+\ldots +w^1_{m1}y^m_{t-1}\\ & +\ldots +w^1_{1p}y^1_{t-p}+\ldots +w^1_{mp}y^m_{t-p}+\epsilon ^1_t\\ & \vdots \\ \hat{y}^m_{t}=\,&w^m_{0}+w^m_{11}y^1_{t-1}+\ldots +w^m_{m1}y^m_{t-1}\\ & +\ldots +w^m_{1p}y^1_{t-p}+\ldots +w^m_{mp}y^m_{t-p}+\epsilon ^m_t \end{aligned}$$

(1)

Finally, the corresponding time series forecasting problem consists of estimating a predictor $F: \mathbb {R}^{(m+1)}\,\times \, \mathbb {R}^m\,\times \,\mathbb {R}^p\rightarrow \mathbb {R}$ such that the expected deviation between true and predicted outputs is minimized for all possible inputs. The model associated with Eq. (1) is known as Vector Autoregression (VAR). In the context of the CPI, this scenario makes it possible to forecast the price of the goods and services that contribute to the overall or general CPI.
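The linear combination in Eq. (1) can be sketched as a one-step-ahead forecast in plain Python. This is only an illustration under stated assumptions: the function name `var_forecast`, the nested-list layout of the coefficients and all numbers are hypothetical, the noise term is omitted, and the coefficients are taken as given rather than estimated from data.

```python
def var_forecast(intercepts, lag_coefs, history):
    """One-step-ahead VAR(p) forecast following Eq. (1), without the noise term.

    intercepts: m values, the w^i_0 of each equation i
    lag_coefs:  p matrices (m x m); lag_coefs[k-1][i][j] is w^i_{jk},
                the weight of variable j at lag k in equation i
    history:    at least p past observations, oldest first, each of length m
    """
    p = len(lag_coefs)
    m = len(intercepts)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    forecast = list(intercepts)
    for k in range(1, p + 1):
        past = history[-k]  # observation at time t-k
        for i in range(m):
            forecast[i] += sum(lag_coefs[k - 1][i][j] * past[j] for j in range(m))
    return forecast


# Toy example with m = 2 variables and p = 2 lags (all numbers are made up).
w0 = [0.5, -0.25]
W = [
    [[0.5, 0.25], [0.0, 0.5]],  # lag 1
    [[0.25, 0.0], [0.5, 0.0]],  # lag 2
]
y_hist = [[2.0, 4.0], [4.0, 8.0]]  # y_{t-2}, y_{t-1}
y_hat = var_forecast(w0, W, y_hist)
```

In practice the coefficients would be estimated, for example by least squares on the lagged observations, and multi-step forecasts would be obtained by appending each forecast to the history and repeating the call.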

2.2 Statistical Analysis in Multivariate CPI

Granger’s Causality Test

The Granger causality test [8] is a statistical hypothesis test used to determine whether one time series is useful in forecasting another time series. The test is based on the idea that if a time series x “Granger-causes” another time series y, then past values of x should contain information that helps predict future values of y, beyond what can be predicted using past values of y alone.

The Granger causality test is commonly used in econometrics, finance, and other fields to investigate causal relationships between time series. Several applications to the multivariate CPI can be found in recent research [1, 10, 12, 22].

It is also worth noting that the Granger causality test assumes that the time series are stationary, so it is often preceded by a test for stationarity such as the Augmented Dickey-Fuller (ADF) test. Additionally, the test is sensitive to the choice of lag length and model specification, so it is important to carefully choose these parameters based on the data and the research question at hand.
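As a point of reference, a common bivariate formulation of the test described above compares a regression of y on its own lags with one that also includes lags of x. The notation below ($a_i$, $b_i$, $u_t$, $RSS_r$, $RSS_u$, $T$) is assumed here for illustration and is not taken from the paper.

$$y_t = a_0 + \sum_{i=1}^{p} a_i\,y_{t-i} + \sum_{i=1}^{p} b_i\,x_{t-i} + u_t, \qquad H_0: b_1=\ldots =b_p=0$$

$$F=\frac{(RSS_r-RSS_u)/p}{RSS_u/(T-2p-1)}$$

Here $RSS_r$ and $RSS_u$ are the residual sums of squares of the restricted model (without the lags of x) and the unrestricted model, and $T$ is the number of observations used in the regression. Under the null hypothesis, $F$ follows an $F(p,\,T-2p-1)$ distribution, so a large value of $F$ is evidence that x Granger-causes y.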

Augmented Dickey-Fuller Test

The multivariate Augmented Dickey-Fuller (ADF) test is an extension of the standard ADF test that allows multiple time series to be analysed simultaneously, taking into account possible relationships between them. The multivariate ADF test involves estimating a vector autoregressive (VAR) model for the set of time series and testing for the presence of unit roots in the model. The test examines whether the residuals of the VAR model are stationary, which is equivalent to testing for stationarity of each individual time series after controlling for the other time series in the model [4].

The multivariate ADF test is useful in identifying whether a set of time series are stationary in a joint sense, which can be important for modelling and forecasting purposes. For example, if a set of economic variables are jointly non-stationary, it may be difficult to develop accurate forecasting models that account for the interrelationships between variables. It is important to note that the multivariate ADF test has some limitations and assumptions. For example, it assumes that the VAR model is correctly specified and that the residuals are normally distributed and free from serial correlation. Additionally, the test can be sensitive to the lag length of the VAR model and the number of variables included in the model [3]. Therefore, it is important to carefully select the appropriate model specification based on the data and the research question at hand.
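For orientation, the standard univariate ADF regression on which the test builds is usually written as below. This is the textbook single-series form, given here as an assumption for illustration; it is not the exact multivariate specification used in the paper, and the symbols $\alpha$, $\beta$, $\gamma$, $\delta_i$ and $k$ are notation introduced here.

$$\Delta y_t=\alpha +\beta t+\gamma \,y_{t-1}+\sum_{i=1}^{k}\delta _i\,\Delta y_{t-i}+\epsilon _t,\qquad H_0:\gamma =0\quad \text{vs.}\quad H_1:\gamma <0$$

The null hypothesis $\gamma =0$ corresponds to a unit root (non-stationarity), and the number of lagged differences $k$ is the lag length to which the test is sensitive.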

3 Results and Discussion

3.1 Exploratory Analysis and Dataset

The Cuban Consumer Price Index database was collected from the official website of the National Office of Statistics and Information (ONEI) [18]. This is a monthly time series from January 2010 to December 2020 with very low variability in the data, as shown in Table 1.

Table 1. Characteristics of the Cuban Consumer Price Index dataset.

The values are the overall averages of the Cuban CPI for 11 category groupings and almost 298 goods and services. It should be noted that CPI data sets are very short series, so learning models that require a lot of data are not effective in this context. Under these conditions, we model the problem as a time series forecasting problem. Fig. 1 shows the trends and seasonality of the series for each category (dashed line) with respect to the overall index.

3.2 Statistical Analysis

Overall, Granger’s causality test is a useful tool for analysing causality between two time series, but it should be used in conjunction with other methods and with careful interpretation of the results. Interpreting the results of the Granger causality test involves assessing the statistical evidence for causality, determining the direction of causality, assessing the strength of causality, and considering the context and theoretical implications of the result. It is important to be cautious in interpreting Granger causality results and to consider other evidence and methods when assessing causality in time series data.

The null hypothesis is that the past values of x do not help in predicting y, while the alternative hypothesis is that the past values of x do help in predicting y. In Fig. 2 we can see that series such as Transportation and Food and Non Alcohol Drinks have very low predictive power with respect to the other series. Also, these two series have a similar causality relation with respect to the General CPI.

The Augmented Dickey-Fuller (ADF) test is a powerful tool used to check the stationarity of a time series. This test can help to choose parameters such as the optimal lag or the differencing order needed to transform the multivariate series into a stationary one. The null hypothesis of the ADF test is that the time series is non-stationary. Therefore, if the p-value of the test is below the significance level (0.05), the null hypothesis is rejected and the time series is considered stationary. For our time series, the result of the ADF test can be found in Table 2. The test shows that the original series is non-stationary, while its first difference is stationary. Table 2 also reports the best lag for each series.
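The first-difference transformation referred to above can be sketched in a few lines of plain Python. The helper names `difference` and `undifference` and the toy values are hypothetical; the second helper shows how forecasts made on the differenced scale can be mapped back to index levels.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: d_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def undifference(first_level, diffs):
    """Invert one round of differencing by cumulative summation."""
    levels = [first_level]
    for d in diffs:
        levels.append(levels[-1] + d)
    return levels


# Toy index values (made up, not taken from the Cuban CPI dataset).
cpi = [100.0, 101.0, 103.0, 106.0]
d1 = difference(cpi)            # first difference
d2 = difference(cpi, order=2)   # second difference
restored = undifference(cpi[0], d1)
```

Each round of differencing shortens the series by one observation, which matters for series as short as the monthly CPI.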

Table 2. Multivariate ADF test over the original series and the first difference.

Lag Selection in VAR

The choice of the best metric for lag selection in time series analysis depends on the specific modelling approach and the characteristics of the data.

Akaike Information Criterion (AIC): The AIC is a measure of the relative quality of statistical models for a given set of data. It balances the goodness of fit of the model with the number of parameters used. The lower the AIC, the better the model.

Bayesian Information Criterion (BIC): The BIC is similar to the AIC but places a greater penalty on the number of parameters used in the model. The BIC tends to favour simpler models with fewer parameters.

Hannan-Quinn Information Criterion (HQIC): The HQIC is another model selection criterion that balances the goodness of fit with the number of parameters used. It is similar to the AIC, but it places a greater penalty on the number of parameters than the AIC [4].
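The three criteria can be compared side by side with a small sketch. It uses their general likelihood-based definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, HQIC = 2k ln ln n - 2 ln L), which may differ by scaling from the form a particular VAR implementation reports. The function names and all numeric inputs are hypothetical and are not the values of Table 3.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, HQIC) in their general likelihood-based form."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    hqic = 2 * n_params * math.log(math.log(n_obs)) - 2 * log_likelihood
    return aic, bic, hqic


def select_lag(scores):
    """Pick the candidate lag with the smallest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical fit: 132 monthly observations, 10 parameters, log-likelihood -50.
aic, bic, hqic = information_criteria(-50.0, 10, 132)

# Hypothetical AIC values per candidate lag (made up for illustration).
aic_by_lag = {1: -12.1, 2: -12.4, 3: -12.9, 4: -12.6}
best_lag = select_lag(aic_by_lag)
```

For the same fit the penalties are ordered AIC < HQIC < BIC once the sample has more than about 15 observations, which matches the description of BIC and HQIC as stricter than AIC.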

In Table 3, the AIC drops to a local minimum at lag 5, behaves unstably at lag 6, and then continues to drop.

Table 3. Lags selection in vector autoregressive.

Table 4. Report of the forecasting metrics applied to the performance of the VAR method.

Forecasting Measures

As in other similar papers, we use the most common metrics for CPI time series forecasting. The Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), among other metrics, are reported in Table 4. It is important to note that no single metric is universally better than the others, and the choice of metric depends on the specific problem being solved and the context in which the forecasting is being applied. For example, in some cases, minimizing the overall error (as measured by MSE) may be more important than accurately predicting individual values (as measured by MAE or RMSE). Conversely, in other cases, accurately predicting individual values may be more important than minimizing overall error, such as in financial forecasting.
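The metrics named above have simple closed forms, sketched here in plain Python. The function names and the toy values are hypothetical and are not the results of Table 4. MAPE is expressed in percent and is undefined when a true value is zero, which is not an issue for index levels close to 100.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the series."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values (made up for illustration).
actual = [100.0, 200.0, 400.0, 400.0]
predicted = [125.0, 150.0, 400.0, 400.0]
scores = {
    "MAE": mae(actual, predicted),
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

Because the squared-error metrics weight large deviations more heavily than MAE does, reporting several of them together gives a fuller picture of forecast quality.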

In general, the VAR model for the multivariate Cuban CPI shows a very good fit on the test set (the last two years), with a mean MAPE of about 1.4% for the general CPI. Fig. 3 shows the forecast results over the test set considered in this analysis as a baseline for future research.

4 Concluding Remarks and Further Work

A new dataset has been proposed for the study of the CPI in Cuba with a multivariate approach, for which there are no references to previous research in the field of forecasting. In this sense, this work has followed a standard statistical analysis methodology that has allowed a baseline to be established in terms of VAR models, which are reported in the literature as the starting point for the multivariate CPI problem. Additionally, statistical tests for the study of causality have been performed, showing in general strong relationships between the main components of the dataset. Likewise, the ADF test used to study stationarity showed that the first difference of the series achieved stationarity at the 95% confidence level.

Future work is planned in several directions with a view to extending this contribution. On the one hand, work is being done on an in-depth study of deep learning methods based on RNN and auto-encoder models to improve forecasting metrics and to take advantage of the capability of deep learning to handle non-linearity in feature engineering. In another direction, variable selection should be exploited to achieve learning schemes with greater generalization.

References

1. Akin, A.C., Cevrimli, M.B., Arikan, M.S., Tekindal, M.A.: Determination of the causal relationship between beef prices and the consumer price index in Turkey. Turk. J. Vet. Anim. Sci. 43(3), 353–358 (2019)
2. Banerjee, A.: Forecasting price levels in India-an ARIMA framework. Acad. Mark. Stud. J. 25(1), 1–15 (2021)
3. Cheung, Y.-W., Lai, K.S.: Lag order and critical values of the augmented Dickey-Fuller test. J. Bus. Econ. Stat. 13(3), 277–280 (1995)
4. Cromwell, J.B.: Multivariate Tests for Time Series Models. Number 100. Sage (1994)
5. Diewert, W.E.: Index number issues in the consumer price index. J. Econ. Perspect. 12(1), 47–58 (1998)
6. García Molina, J.M.: La economía cubana a inicios del siglo XXI: desafíos y oportunidades de la globalización. CEPAL (2005)
7. Ghazo, A., et al.: Applying the ARIMA model to the process of forecasting GDP and CPI in the Jordanian economy. Int. J. Financ. Res. 12(3), 70 (2021)
8. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econom.: J. Econom. Soc. 424–438 (1969)
9. Jere, S., Banda, A., Chilyabanyama, R., Moyo, E., et al.: Modeling consumer price index in Zambia: a comparative study between multicointegration and ARIMA approach. Open J. Stat. 9(02), 245 (2019)
10. Korkmaz, S., Abdullazade, M.: The causal relationship between unemployment and inflation in G6 countries. Adv. Econ. Bus. 8(5), 303–309 (2020)
11. Anaya, L.M.L., Moreno, V.M.L., Aguirre, H.R.O., López, M.Q.: Predicción del IPC mexicano combinando modelos econométricos e inteligencia artificial. Rev. Mexicana Econ. Finanzas 13(4), 603–629 (2018)
12. Mallick, L., Behera, S.R., Dash, D.P.: Does CPI Granger cause WPI? Empirical evidence from threshold cointegration and spectral Granger causality approach in India. J. Dev. Areas 54(2) (2020)
13. Manik, D.P., et al.: A strategy to create daily consumer price index by using big data in Statistics Indonesia. In: 2015 International Conference on Information Technology Systems and Innovation (ICITSI), pp. 1–5. IEEE (2015)
14. Mohamed, J.: Time series modeling and forecasting of Somaliland consumer price index: a comparison of ARIMA and regression with ARIMA errors. Am. J. Theor. Appl. Stat. 9(4), 143–53 (2020)
15. Nakamura, E., Steinsson, J.: Identification in macroeconomics. J. Econ. Perspect. 32(3), 59–86 (2018)
16. Nyoni, T.: Modeling and forecasting inflation in Kenya: Recent insights from ARIMA and GARCH analysis. Dimorian Rev. 5(6), 16–40 (2018)
17. Nyoni, T.: ARIMA modeling and forecasting of consumer price index (CPI) in Germany (2019)
18. ONEI: Índice de precios al consumidor base diciembre 2010 (2022)
19. Qin, X., Sun, M., Dong, X., Zhang, Y.: Forecasting of China consumer price index based on EEMD and SVR method. In: 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), pp. 329–333. IEEE (2018)
20. Riofrío, J., Chang, O., Revelo-Fuelagán, E.J., Peluffo-Ordóñez, D.H.: Forecasting the consumer price index (CPI) of Ecuador: a comparative study of predictive models. Int. J. Adv. Sci. Eng. Inf. Technol. 10(3), 1078–1084 (2020)
21. Rosado, R., Abreu, A.J., Arencibia, J.C., Gonzalez, H., Hernandez, Y.: Consumer price index forecasting based on univariate time series and a deep neural network. In: Hernández Heredia, Y., Milián Núñez, V., Ruiz Shulcloper, J. (eds.) IWAIPR 2021. LNCS, vol. 13055, pp. 33–42. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89691-1_4
22. Sünbül, E.: Linear and nonlinear relationship between real exchange rate, real interest rate and consumer price index: an empirical application for countries with different levels of development. Sci. Ann. Econ. Bus. 70(1), 57–70 (2023)
23. Torres, J.F., Hadjout, D., Sebaa, A., Martinez-Alvarez, F., Troncoso, A.: Deep learning for time series forecasting: a survey. Big Data 9(1), 3–21 (2021)
24. Triplett, J.: Handbook on Hedonic Indexes and Quality Adjustments in Price Indexes: Special Application to Information Technology Products (2004)
25. Zahara, S., Ilmiddaviq, M.B., et al.: Consumer price index prediction using long short term memory (LSTM) based cloud computing. J. Phys.: Conf. Ser. 1456, 012022 (2020)

Acknowledgement

This work has been partially funded by FONCI through the project: Plataforma para el análisis de grandes volúmenes de datos y su aplicación a sectores estratégicos.

Author information

Authors and Affiliations

Universidad de las Ciencias Informáticas (UCI), La Habana, Cuba
Reynaldo Rosado, Héctor González Diéz, Orlando Grabiel Toledano-López & Yanio Hernández Heredia

Corresponding author

Correspondence to Héctor González Diéz.

Editor information

Editors and Affiliations

Universidad de las Ciencias Informáticas, Havana, Cuba
Yanio Hernández Heredia
Universidad de las Ciencias Informáticas, Havana, Cuba
Vladimir Milián Núñez
Universidad de las Ciencias Informáticas, Havana, Cuba
José Ruiz Shulcloper

Cite this paper

Rosado, R., González Diéz, H., Toledano-López, O.G., Hernández Heredia, Y. (2024). Multivariate Cuban Consumer Price Index Database, Statistic Analysis and Forecast Baseline Based on Vector Autoregressive. In: Hernández Heredia, Y., Milián Núñez, V., Ruiz Shulcloper, J. (eds) Progress in Artificial Intelligence and Pattern Recognition. IWAIPR 2023. Lecture Notes in Computer Science, vol 14335. Springer, Cham. https://doi.org/10.1007/978-3-031-49552-6_3

DOI: https://doi.org/10.1007/978-3-031-49552-6_3
Published: 20 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49551-9
Online ISBN: 978-3-031-49552-6
eBook Packages: Computer Science (R0)
