
1 Problem Description

Vivre is a leading online retailer for Home and Lifestyle in Central and Eastern Europe, currently operating in 8 countries in the region. It offers its customers both limited-time discount events (“flash sales”), called “campaigns”, which group products from similar fields (such as “entrance and bath mats”, “sun glasses” or “LEGO accessories”), and long-term available products, which are grouped in the “product catalog”.

Since its launch in 2012, Vivre has accumulated a substantial amount of historical data regarding sales, providers and customers, information that could be used to improve the company’s activity.

The problem described in this paper originated as an attempt to forecast the daily quantity of each product sold by Vivre. These numbers could be very valuable for the company, as they would help keep in the warehouse only the right stock of each product, under the constraints imposed by the restricted storage area. Another benefit would be faster product delivery to clients (as the products are readily available in the warehouse and one does not have to also wait for their arrival from the providers), which translates into increased customer satisfaction and a better positioning of the brand on the market. Finally, a good estimation of the daily sold quantities of each product could help improve the flow of incoming and outgoing deliveries, thus optimizing the usage of the warehouse and eliminating the time lost when trailers are available but there is no storage space in the warehouse or there are not enough products to be shipped. This may ultimately translate into larger profits for the company.

However, even though these numbers are extremely important, deriving them is not trivial even when enough historical data is available, due to several factors. First, there are over 1 million different product ids in the database, making the task of predicting the daily sales for all of them time- and resource-consuming, especially since this is a recurrent (daily) task. Secondly, at any moment there are no more than about 50,000–60,000 products available on the website, so predicting the sales of any of the other products would be meaningless, as they cannot be seen by the user (and therefore cannot be bought). Thirdly, many products are or were available on the website only for a very limited amount of time (during a campaign, for less than 30 days, within a history spanning more than 6 years), thus not offering enough information for predicting their sales. Fourthly, there are products that can be replaced one for the other (e.g. products having different colors or sizes, but serving the same purpose) and thus their sales are tightly connected. Finally, the sale of a product is highly influenced by seasonality, the marketing budget, the campaigns existing on the website, the providers whose goods are available for buying and many other factors, some of which are logged in the system, while for others there is no information at all.

Considering all the above problems, a design decision was made to alleviate some of these issues: instead of trying to forecast the daily quantity to be sold for each product in the database, similar products were first grouped together and the forecast of the daily quantity to be sold was then generated for each such group (called a “generic name”). This decision was intended to solve the first four issues, as it considerably reduced the number of (groups of) products for which predictions had to be provided (from over a million to around 27,000). Moreover, it solved the problem of products not being available on the website, since all the obtained groups had at least one representative on the site. Finally, since similar products were grouped under the same generic name, the problems with rarely available products and with replaceable products were also solved, as these products simply became part of the same larger distribution. Another effect of this decision was to diminish the scarcity and variability in the data, which might ultimately lead to better predictions. On the other hand, there was also a drawback: since the information related to multiple products is joined into the same generic name, after prediction the obtained data must be disentangled to recover the quantities for the initial products.
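In practice, this grouping step amounts to a simple aggregation of the order history. Below is a minimal sketch of it, assuming a hypothetical order export with the columns order_date, product_id, generic_name and quantity (the file name and column names are illustrative, not the actual Vivre schema):

import pandas as pd

# Hypothetical order export; in reality the data would come from the company database.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Sum the quantities of all products sharing a generic name, per day, and make the
# series equally spaced by filling the days with no sales with zero.
daily = (orders
         .groupby(["generic_name", pd.Grouper(key="order_date", freq="D")])["quantity"]
         .sum()
         .unstack("generic_name", fill_value=0)
         .asfreq("D", fill_value=0))

painting_series = daily["painting"]   # one daily time series per generic name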

Thus, the task described in this paper is to forecast the daily quantity to be sold for each generic name identified starting from the product ids in the database. Using the historical data aggregated for each such generic name, we applied time series analysis (described in the next section) to obtain the predictions. Section 3 presents a case study for one of the generic names, along with the problems encountered, the decisions that were made and the obtained results, while the paper concludes in Sect. 4 with our observations regarding the methodology used and with some proposed directions for extending this research.

2 Time Series Analysis and Some of Its Applications

A time series is “a set of well-defined data items collected at successive points at uniform time intervals” [1]. Time series analysis represents a class of methods for processing these items to determine the main patterns of the dataset, which can then help predict future values based on the identified patterns. One of the simplest methods in this class is the autoregressive (AR) model, which “specifies that the output variable depends linearly on its own previous values” [2]. The notation AR(p) – autoregressive model of order p – means that the current value depends on the previous p values of the time series. The expression of AR(p) is given by (1), where γ1, γ2, …, γp are the parameters of the model, c is a constant, and εt is a white noise error term.

$$ X_{t} = c + \mathop \sum \limits_{i = 1}^{p} \gamma_{i} X_{t - i} + \varepsilon_{t} $$
(1)
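As a small numerical illustration of Eq. (1), the sketch below simulates an AR(2) process and recovers its parameters by ordinary least squares on the lagged values; the series is synthetic, not the Vivre data:

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 2
c, true_gamma = 1.0, np.array([0.6, 0.3])

# Simulate X_t = c + gamma_1 * X_{t-1} + gamma_2 * X_{t-2} + eps_t
x = np.zeros(n)
for t in range(p, n):
    x[t] = c + true_gamma @ x[t - p:t][::-1] + rng.normal()

# Regress X_t on [1, X_{t-1}, ..., X_{t-p}] to estimate c and the gammas.
X = np.column_stack([np.ones(n - p)] + [x[p - i:n - i] for i in range(1, p + 1)])
coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
print("estimated (c, gamma_1, gamma_2):", coef)   # should be close to (1.0, 0.6, 0.3)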

In 1951, the AR model was extended by Whittle [3] into the autoregressive-moving-average (ARMA) model, which has two parts: an autoregressive (AR) part, consisting of the regression of the variable on its own past values, and a moving-average (MA) part, modelling the error term as a linear combination of the current error term and some of the previous ones. The new model, denoted ARMA(p, q), has p autoregressive terms and q moving-average terms and is given by (2), where γ1, γ2, …, γp are the AR parameters, θ1, θ2, …, θq are the MA parameters, c is a constant, and ε1, ε2, …, εt are white noise error terms:

$$ X_{t} = c + \varepsilon_{t} + \mathop \sum \limits_{i = 1}^{p} \gamma_{i} X_{t - i} + \mathop \sum \limits_{i = 1}^{q} \theta_{i} \varepsilon_{t - i} $$
(2)

This model was further improved by Box and Jenkins [4] into the Autoregressive Integrated Moving Average (ARIMA) model. The difference between ARMA and ARIMA is that the latter first converts non-stationary data into stationary data by replacing the actual values with the differences between these values and previous ones (a process called “differencing”, which may be performed multiple times). The ARIMA model has three parameters, ARIMA(p, d, q), where p is the autoregressive order, d is the degree of differencing (the number of times the data have had past values subtracted) and q is the order of the moving-average model. Its formula is the same as (2), with the only difference that instead of the actual values Xt we work with the differences between Xt and its past values. Note that ARIMA(p, 0, q) is in fact the ARMA(p, q) model, while ARIMA(p, 0, 0) is the AR(p) model.
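The differencing step itself is straightforward; a minimal sketch on an assumed toy series is shown below (ARIMA(p, d, q) applies first-order differencing d times before fitting the ARMA(p, q) part):

import numpy as np

x = np.array([10., 12., 15., 19., 24.])

d1 = np.diff(x, n=1)   # X_t - X_{t-1}              -> [2., 3., 4., 5.]
d2 = np.diff(x, n=2)   # differencing applied twice -> [1., 1., 1.]
print(d1, d2)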

Given the three parameters (p, d, q) and the actual data X, the ARIMA model uses the Box-Jenkins method [4] to find the best fit of a time-series model to the past values of the series. The parameters identified during fitting may later be used to generate predictions of the future behavior of the time series. However, choosing the best values of p and q is not easy, and several options have been proposed by researchers. One option comes from Brockwell and Davis [5], who state that “our prime criterion for model selection will be the AICc”, which stands for the Akaike information criterion with correction [6]. Another option was given by Hyndman & Athanasopoulos [7], who suggest how to determine the values of p and q using ACF (autocorrelation function) and PACF (partial autocorrelation function) plots.
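A sketch of the latter diagnostic, assuming a pandas Series y holding the observed values (a placeholder name), could look as follows; significant spikes in the PACF suggest candidate AR orders p, while significant spikes in the ACF suggest candidate MA orders q:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=40, ax=axes[0])    # autocorrelation function -> hints for q
plot_pacf(y, lags=40, ax=axes[1])   # partial autocorrelation function -> hints for p
plt.show()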

Since their inception, ARIMA models have been used to make predictions in various fields: from estimating the monthly catches of pilchard in Greek waters [8], to forecasting the Irish inflation [9], predicting next-day electricity prices in the Spanish and Californian markets [10], estimating the incidence of hemorrhagic fever with renal syndrome (HFRS) in China during 1986–2009 [11], foretelling the sugarcane production in India [12], and prognosticating the energy consumption and greenhouse emissions of a pig iron manufacturing organization [13]. However, many ARIMA applications are in the field of stock forecasting [14, 15], where the goal is to find the best parameters for estimating the prices of a particular stock.

In the following section, we present a case study of applying different ARIMA models for predicting not stock prices, but the quantities to be sold from different groups of products by Vivre. We could have tried to estimate the sales amounts instead, but since the product prices vary a lot over time, we opted for the more stable target of product quantities.

3 Case Study and Obtained Results

In this section, we present the experiments that we undertook using the above-mentioned models with the purpose of predicting the daily quantity to be sold for one of the generic names identified starting from the product ids in the database. We chose the “painting” generic name, whose distribution of daily quantities sold is presented in Fig. 1. We chose this generic name for two reasons: there was enough data available to enable predictions (only a few days had zero counts), and the distribution shows some seasonality while at the same time featuring some spikes that could be interpreted as outliers. Given this distribution, our task was to use the historical data from January 1st, 2014 to December 31st, 2017 (1461 training samples) for training the model, and the rest of the data (from January 1st to May 7th, 2018 – 127 samples) for testing the quality of the forecasting. The quality of the trained models was assessed using three measures: the RMS (root mean square) error on both the training and testing sets, and the MAPE (mean absolute percentage error).

Fig. 1. The distribution of the daily quantity sold for the generic name “painting” by Vivre Deco. The values from Jan 2014 to Jan 2018 were used for training, while the ones from Jan 2018 to May 2018 were used for testing.
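A minimal sketch of this evaluation setup, assuming the aggregated daily series painting_series from the earlier grouping sketch; how the few zero-count days are handled in MAPE is our assumption, as the exact treatment is not specified:

import numpy as np

train = painting_series["2014-01-01":"2017-12-31"]   # 1461 samples
test  = painting_series["2018-01-01":"2018-05-07"]   # 127 samples

def rms(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = y_true != 0                     # skip the few zero-count days (assumption)
    return 100 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))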

We therefore built several ARIMA models with different parameters and tested the forecasting accuracy of each. To start with, we used AR with different orders (p) to generate some initial predictions. The values of p used in these tests were driven by some basic assumptions about the data: we assumed that the current value might be influenced by the value from the previous day (AR(1)), from the previous week (AR(7)), month (AR(31)), or year (AR(365)). The next step was to apply various differencing lags to see whether accounting for seasonality improves the results. However, since the results of AR(365) were very poor, we stopped using it in further tests. Thus, we tested monthly (differencing of 31), weekly (differencing of 7) and daily seasonality (differencing of 1) combined with the remaining AR orders. We should mention that in all our experiments, by a differencing of d we mean a single differencing (not repeated differencing) with lag d (instead of Xi − Xi−1, we used Xi − Xi−d). The obtained results are reported in Table 1.
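In pandas terms, this lag-d differencing is a single call on the assumed painting_series from the sketches above, not d successive first-order differences:

daily_diff   = painting_series.diff(1).dropna()    # X_i - X_{i-1}
weekly_diff  = painting_series.diff(7).dropna()    # X_i - X_{i-7}
monthly_diff = painting_series.diff(31).dropna()   # X_i - X_{i-31}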

Table 1. Results obtained using the AR model. The first value represents the order of the AR model, while the second represents the differencing that was used (e.g. 31, 7 means AR(31), applied not to the actual values, but to the differences Xi − Xi−7). A value of 0 means that no differencing was used (the actual values were used to train the AR model).

Afterwards, we moved on to ARMA and tested models with different values of p (2–7) and q (1, 2, 3). To estimate the best values of p and q, we used the Hyndman & Athanasopoulos [7] methodology based on ACF and PACF (see Table 2). During training, ARMA can use multiple solvers (lbfgs – limited-memory Broyden-Fletcher-Goldfarb-Shanno; newton – Newton-Raphson; nm – Nelder-Mead; cg – conjugate gradient; ncg – Newton conjugate gradient; and powell) along with 3 different methods (css – maximize the conditional sum of squares likelihood; mle – maximize the exact likelihood via the Kalman filter; and css-mle – maximize the conditional sum of squares likelihood and then use these values as starting values for the computation of the exact likelihood via the Kalman filter) for determining the model parameters. We therefore used a grid search to find the best combination of solver and method for fitting the most promising model (ARIMA(7, 0, 2)). The results are reported in Table 3.
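A sketch of this grid search is shown below; it assumes a statsmodels version that still ships the statsmodels.tsa.arima_model.ARIMA class used in this work (the class has been removed from recent statsmodels releases):

import itertools, time, warnings
from statsmodels.tsa.arima_model import ARIMA

solvers = ["lbfgs", "newton", "nm", "cg", "ncg", "powell"]
methods = ["css", "mle", "css-mle"]

results = {}
for solver, method in itertools.product(solvers, methods):
    try:
        start = time.time()
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            fit = ARIMA(train, order=(7, 0, 2)).fit(method=method, solver=solver, disp=0)
        results[(solver, method)] = (fit.aic, time.time() - start)
    except Exception:
        results[(solver, method)] = None   # some combinations fail to converge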

Table 2. Results obtained using the ARMA model. Except for the last column, the results were obtained using an ARIMA(p, 0, q) model, where p = the first value (AR order) and q = the second value (MA order). The last value represents the differencing that was used (e.g. (3, 2), 365 means ARIMA(3, 0, 2), applied not to the actual values, but to the differences Xi − Xi−365). A value of 0 means that no differencing was used (the actual values were used to train the model). The last column corresponds to the results obtained using ARIMA(5, 1, 1).
Table 3. Solver and method influence on the results obtained using the ARIMA(7, 0, 2) model.

Finally, we used the Augmented Dickey–Fuller test to verify whether our time series was stationary and thus whether differencing could improve the results. Even though the test showed that the series was stationary, meaning that integration should not help, we still evaluated a few integrated models (with d = 1) to see the influence of seasonality for the reference lags that were used. The results are presented in Table 4. In most of these tests the differencing was performed explicitly before running ARIMA, so d = 0. However, we also ran a test on a random combination of p and q (5, 1) to see whether the implicit integration of ARIMA(5, 1, 1) would yield different results; these are reported in Table 2.
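The stationarity check is a single statsmodels call on the training series (variable name assumed from the earlier sketches):

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(train)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")
# A p-value below 0.05 rejects the unit-root hypothesis, i.e. the series can be
# treated as stationary and extra integration (d > 0) should not be necessary.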

Table 4. Lag influence on the results obtained using the ARIMA(7, 0, 2) model.

For each of the above models, we used the ARIMA class from the statsmodels.tsa.arima_model module in Python. This class offers two types of predictions, and we used both in our experiments: forecast, which predicts the values for every day of the testing set in a single run, based on the parameters obtained after training; and predict, which predicts only the next value in the time series and was run iteratively for each day of the testing set, interleaved with re-fitting the model using the true value observed for the previous day. The iterative method always yielded better results than the 1-step forecasting, as it uses more information. The results of the best model (ARIMA(7, 0, 2)), using the powell solver and the css-mle optimization method, are presented in Fig. 2.
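The two prediction modes can be sketched as follows (same deprecated statsmodels API and the assumed train/test split from above; the iterative loop uses forecast(steps=1) to obtain the one-step-ahead value before re-fitting):

from statsmodels.tsa.arima_model import ARIMA

# Single run: fit once, forecast the whole 127-day horizon at once.
fit = ARIMA(train, order=(7, 0, 2)).fit(method="css-mle", solver="powell", disp=0)
forecast_all, _, _ = fit.forecast(steps=len(test))

# Iterative run: forecast one day, then re-fit with the observed value appended.
history, iterative = list(train), []
for observed in test:
    model = ARIMA(history, order=(7, 0, 2)).fit(method="css-mle", solver="powell", disp=0)
    iterative.append(model.forecast(steps=1)[0][0])
    history.append(observed)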

Fig. 2. The best results obtained by ARIMA(7, 0, 2). The triangles represent the original values, the circles depict the 1-step forecasting and the ‘x’-s show the iterative predictions.

Since some of the trained models did not converge (such as ARIMA(5, 0, 2)), we also report this fact, along with the phase in which they failed to converge (during training or during iterative testing). If a model did not converge during iterative testing, it was unable to generate predictions for all 127 samples of the testing set, so we also report the number of generated predictions.

Finally, as the computation performed for this generic name (“painting”) has to be done for all the generic names, the whole process must be repeated thousands of times. Thus, another important element is the time needed to obtain the predictions and, therefore, for the most promising runs this information is also presented.

Besides the ARIMA models, we also tested FBProphet [16], a tool “for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth”. Prophet may be run with or without daily seasonality. Both variants generated very similar predictions (RMS 1 step 33.18/33.21; MAPE 1 step 37.85/37.77; RMS iterative 32.26/32.25; MAPE iterative 41.29/41.35; runtime 1668 s/1380 s), which were poorer than the ones obtained using ARIMA(7, 0, 2). We also tried to model the time series spikes using a different distribution with the help of the holiday option from FBProphet, but the results did not improve much (RMS 1 step 33.04; MAPE 1 step 37.43), remaining worse than those of ARIMA.
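A sketch of this baseline, assuming the training series reshaped into the two-column frame (ds, y) that Prophet expects; the spike dates passed as “holidays” are purely hypothetical placeholders:

import pandas as pd
from fbprophet import Prophet

df = train.rename("y").rename_axis("ds").reset_index()   # columns: ds, y
spikes = pd.DataFrame({"holiday": "spike",
                       "ds": pd.to_datetime(["2016-11-25", "2017-11-24"])})  # hypothetical dates

m = Prophet(daily_seasonality=True, holidays=spikes)
m.fit(df)
future = m.make_future_dataframe(periods=len(test))
prophet_pred = m.predict(future)["yhat"].tail(len(test))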

4 Discussion and Conclusion

The AR results showed that the best model was the one involving the previous 7 days. They also revealed that, except for AR(1) with daily seasonality, including differencing according to the different seasonalities worsened the results.

The table presenting the ARMA scores shows that the methodology based on ACF and PACF does not work in our case: the model with p and q suggested by the ACF and PACF (ARIMA(5, 0, 2)) was the only one that did not converge during training and, moreover, none of the models with p and q chosen this way converged during testing. Although ARIMA(5, 0, 2) seems to have the best results (MAPE 24.22), it should be noted that the model managed to provide only 27 of the 127 required predictions. The best model was ARIMA(7, 0, 2), which converged during both training and testing.

The investigation of different solvers and optimization methods showed that they have only a small influence on the obtained results. However, since some of the combinations did not converge, for further experiments we chose the combination leading to the second-best result (the powell solver with the css-mle method). This combination also had a reduced running time, which matters when the process has to be repeated 27,000 times.

Finally, choosing different lags only worsened the results, showing that the best solution is to work directly with the data, without differencing.

Even though the obtained results are promising, some of them being even better than FBProphet’s, we believe there are still ways to improve them. They were obtained using only the historical information about the daily sold quantities of a single generic name. However, similar information is available for the other generic names and could be used to improve the predictions, as the sales of some products also influence the sales of others. To use this additional information, the ARIMA model must be replaced with one that allows multivariate dependencies. If such a change happens, other additional information may also be used: marketing budget, sales events, the number and type of products on sale in each sale event, the number of page views, product availability, etc. In the future, we intend to advance the work presented here by inspecting several such models (logistic regression, random forest, neural nets and deep learning) and including some of this supplementary information. Another possibility to improve the results is to create an ensemble from different models built using the above-mentioned techniques, and thus to benefit from the fact that they generate predictions in different ways, which might help eliminate some of the prediction errors.