1 Introduction

The world population is experiencing an acceleration of its growth rate. It will reach around 8.5 billion in 2030, increase to 9.7 billion in 2050 and 11.2 billion in 2100 [1]. Cities today bring together half of humanity. Coupled with the massive rural–urban migration, urban infrastructure and services are pushed to their limits regarding environment, scalability and security. Municipalities are looking for a sustainable economy to improve energy efficiency, optimize natural resources, effectively manage urban infrastructure and minimize carbon emission levels. They are therefore obliged to capitalize on the fast progress of information and communication technologies and evolve toward a “smart city” [2].

This concept is defined as a city that performs prospectively in its fundamental components like governance, mobility, environment, lifestyle, etc. It connects the IT, physical, social and business infrastructures to leverage the city's collective intelligence [3]. Many cities in Asia, Europe and North America have led their own smart city initiatives, citing for example Yokohama Smart City, SmartSantander, LIVE Singapore and CITYKEYS. These and other projects were realized thanks to the emergence of the internet of things (IoT). It allows easy access and interaction with a wide variety of equipment including household appliances, surveillance cameras, monitoring sensors, parking meters, actuators, displays, vehicles and so on [4]. It generates, as a result, a huge amount of varied data that can be useful in the development of several applications like smart grids, mobile healthcare, traffic management and wearable.

Intelligent transportation systems (ITS) [89], a potential application of IoT, remain a pressing need to manage parking facilities as well as traffic and therefore avoid the challenges caused by their mishandling. ITS improve travel safety, increase transportation system effectiveness, reduce traffic congestion, ensure daily mobility sustainability and enhance drivers’ experiences. They cover a wider range of technological solutions, the main ones being parking assistance, guidance systems and information service.

Finding a parking space in large cities is a real struggle that causes frustration, waste of money, air pollution and greenhouse gas emissions. Cruising for parking contributes to 30% of traffic on average [5] and to about 40% in peak demand [6]. Drivers spend between 3.5 and 14 min seeking for a curbside parking spot [5]. In a large city like Chicago, the number of curbside parking spaces is more than 35 000, resulting in 172 million vehicle miles traveled per year in search for parking [7]. This is equivalent to 8.37 million gallons of gasoline consumption and more than 129 000 tons of CO2 emissions. All these tricky problems have arisen due to missing detailed information on available parking spaces at a given time. This means that their resolution simply requires collecting necessary data from the already installed infrastructure (cameras, sensors, parking meters, internauts, connected vehicles), processing obtained records and communicating results to drivers. However, in order to help drivers identify appropriate parking spots at the beginning of their trip, availability prediction is the only practical way. It effectively taps into real-time data to produce short- or long-term forecasts anticipating available parking spaces.

1.1 Research questions

With our suggestion, we tried to tackle three questions: Does each predictive method require exactly the same feature set to achieve the best accuracy? If not, how does these attributes' contribution to get the outcome variable depend on the applied algorithm? What effect do their type (ensemble versus individual) and principle (trees, neurons or lines) have on the quality of the computed predictions? What elements render decisions made by a model more appropriate than those taken by another one?

1.2 Proposed solution

To answer clearly and precisely these matters, the parking occupancy rate is modeled depending on four different time-series feature sets. Our objective is to assess how the nature, number and precision of independent variables affect the performance of machine learning models. It is forecasted by four nonparametric algorithms belonging to two distinct categories in order to compare the efficiency of the ensemble methods recognized for their broad range of applications versus single ones characterized by their diversity. Furthermore, a simpler explanation model, based on SHAP values [90], is implemented to interpret outputs from the best-performing model, reveal the predominant features and clarify their individual contribution. This is possible because it allows investigating the decision-making process of Artificial intelligence algorithms.

1.3 Research contributions

Our major contribution is to parse the impact that the quality of insights, obtained from the features, has on the quality of predictions. We examined whether there is a dependency between the model type employed and the number of explanatory variables. We also tested if an excess of attributes can reduce the method's capabilities. We were the first construing results generated by a model-based parking occupancy prediction framework using SHAP values, to our best knowledge. We sought to assimilate the inner working of these algorithms perceived as black boxes difficult to explain. We focused on determining which factors most influence a parking lot’s occupancy relative to its peers.

1.4 Outline

The reminder of this document is structured as follows: Sect. 2 summarizes many intelligent parking proposals published in the literature; Sect. 3 defines our overall system framework and lists its various components; Sect. 4 formalizes the problem, decrypts our prediction mechanism architecture and concisely presents the developed machine learning methods; Sect. 5 introduces the real database used, describes the occupancy trend of its car parks and analyzes the calculated error metrics values; Sect. 6 explains the permutation feature importance measurement and assesses explanatory variables' predictive power; Sect. 7 reviews in depth SHAP approach and outlines a comprehensive features interpretation; Sect. 8 briefly recaps our findings and discusses possible improvements.

2 Related works

Intelligent parking systems benefit fully from the technological development and therefore offer a variety of real-time parking space availability information. Parking solutions proposed in the literature can be categorized into two main types: “sensor-based” and “crowdsourcing-based” systems.

Most research works put forward intelligent parking platforms based on different types of sensor networks. The authors of [8] detected parking space occupancy using convolutional neural networks to directly process real-time pictures filmed by a Raspberry Pi camera. Adopting the same principle, the paper [9] developed a decentralized and efficient system for visual parking lot occupancy detection where each smart camera can simultaneously monitor up to fifty parking spaces. PGS is a parking guidance system based on wireless sensor network which guides a driver to an available parking lot [10]. Its architecture contains a Parking information server, T-Sensor nodes (magnetic sensor), T-Sink nodes and a TBS. The authors of [11] conceived a mobile sensing unit that is an ultrasonic sensor mounted on the passenger side of a car to measure the distance from the vehicle to the nearest roadside obstacle. They estimate roadside parking occupancy by a supervised learning algorithm which analyzes the structure of the sonar trace. A smart parking solution that makes use of a protected wireless network and sensor communication is proposed by Yan et al. [12]. It is designed as a continuous-time Markov chain, precisely as a birth–death stochastic process. It broadcasts new business promotions to all vehicles passing by the parking site through wireless networks. An integrated smart parking system is introduced in [13]. It brings multiple parking service providers together under a unified platform using blockchain technology. In [14], two different ways to manage availability information in parking facilities are evaluated: posting a warning on variable message sign panel to indicate that there are no free spaces when occupancy percentages are above 90% or 95% and zoning that places vehicle detection systems at intermediary points around the facility to separate it into internal zones.

Crowdsourcing is a model that consists to solve, in a distributed manner, a complicated problem by involving a crowd of indefinite size [15]. This practice proved its ability to help drivers find appropriate parking spaces through several successful experiences. ParkJam [16] is an Android application exploiting publicly accessible geographical data and car park availability information collected through crowdsourcing. PocketParker is a crowdsourcing system using smartphones to predict parking lot availability [17]. It detects arrivals and departures by leveraging existing activity recognition algorithms. UW-ParkAssist application [18] combines data collected passively using vector-based location mapping on users' phones with crowdsourced information actively supplied by users related to their observed density of parking. It offers real-time information about parking availability in UW-Parkside’s campus lots. CroPark is a user-engaged crowdsourcing parking monitoring system that builds a parking occupancy map [19]. It employs ultrasonic sensors to measure the distance from the vehicle to roadside and uses a supervised learning algorithm to estimate the number of available parking spaces. Pick N Park application gathers and delivers real-time information about a number of car parks in crowded urban areas [20]. It is supplemented by a car park estimation model which provides a good approximation in terms of parking search time. CrowdPark [21] enables users to loosely reserve parking spots. It achieves parking reservation by crowdsourcing information about when parking resources will be available.

However, such collected real-time data is not practical enough as the target places may be occupied just before the drivers arrive. Consequently, they compete for the desired space and, out of obligation, behave violently by verbally arguing or parking in a forbidden space. With the aim to fill this gap, various articles model parking space occupancy by different approaches like Markov chain variants, regression methods and machine learning techniques (Table 1). The authors of [22] employed a discrete Markov chain model to demystify the future state of a parking lot, by the time a vehicle is expected to reach it. Similarly, the writers of [23] described the stochastic occupancy change of a parking facility by employing a continuous-time Markov queue. Such methods, however, can only be efficient for car parks equipped with access control. They also generate long-term forecasts with a high variance and are therefore reinforced by other methods. A hybrid approach combining agent-based with dynamic time varying Markov chain in [24] and with Markov chain Monte Carlo in [25] is proposed. In [26], a short-term prediction of parking space availability is suggested and relied on the wavelet neural network and the largest Lyapunov exponents method. The subsequent occupancies of parking spaces are forecasted, in [27], by the naïve Bayes, C5.0 decision tree, random forest and regression analysis founded on the spatial–temporal features extracted from parking data.

Table 1 Intelligent parking solution for availability prediction

Some researchers seek, besides predicting parking availability, to reduce both the number of required models and training examples by studying the database's characteristics. The high temporal autocorrelation of the on-street parking dynamics is analyzed in [28] and non-Euclidean spatial autocorrelation among parking lots in [29]. For the first paper, recurring patterns in the collected data are exploited by means of clustering and training set reduction techniques. For the second one, global spatial dependencies between parking lots are captured by a contextual graph convolution block and a soft clustering graph convolution and incorporated by a recurrent neural network. Richter et al. [30] presented a back-end-based approach to learn historical models of parking availability per street by utilizing five spatiotemporal clustering strategies. Their goal is to significantly lower the space needed to store these models and it is shared with the authors of [31] who proposed a 2-step approach to predict parking space availability. In the first step, the raw parking data are smoothed by SVR in combination with a specifically defined technique to tune parameters. In the second, a multidimensional SVR model is trained on top of this smoothed trend curve.

Unlike previous research, ParkPGH is a smart parking application designed to supplement the predicted number of available parking spaces at a particular time by a binary dependent variable denoting whether the garage is full or not [32]. The authors in [33] investigated the extension of estimating parking information from areas equipped with sensors to areas that are missing them. They built a parking demand profile, which reflects the time of the day where parking occurs and its duration for a given area, by using complementary city data. Rong et al. [34] estimated the real-time parking availability throughout a city by proposing a deep-learning-based approach, called Du-Parking, that relies on the historical parking availability data and a variety of datasets (meteorology, events, map mobility trace data and navigation data).

According to the literature surveyed, there remain unexplored research topics in relation to parking availability prediction. There are only works dedicated to searching for or proposing very exact algorithms. Moreover, the interaction between a machine learning model's response and a feature combination is rarely approached. In this study, we experimented with several sets of inputs to find the optimal one for each model. We demonstrated that there is no optimal feature subset that works for every method. We aimed to build a general idea about how much input information is required by each model type to get the best performance. We deciphered, for the first time in this context, why certain predictions are made using Shapley additive explanation method. It provides an overview of possible improvements to a model as it clusters the data points depending on the feature importance.

3 Our system architecture

To predict parking occupancy, we adopted the intelligent parking system framework detailed in Fig. 1. It includes diverse data components and combines various information providers. As such, it aims to guarantee higher data precision, continual updating and diversified types of records. It comprises four layers: data collection, communication, middleware and user layer.

Fig. 1
figure 1

Smart parking architecture

The first layer gathers information on parking occupancy using two different manners. It detects vehicle absence or presence thanks to several types of sensors installed at each parking space (magnetic, ultrasonic, infrared and acoustic) or cameras monitoring the entire car park. It determines the number of occupied spaces by means of access control barriers or parking meters. If a parking lot isn't equipped with these stationary sensors, it is possible to make use of mobile sensors as smart drivers, connected vehicles, smartphones and urban authorities. After parking or before moving off, drivers answer a question about parking availability based on their observations and with a manual entry on their device. Smartphones communicate to the server users' geographical position with the help of their geolocation system. These devices can also scan the road being traversed, identify vacant spots and distinguish parking places from other zones. Connected vehicles with all their numerous sensors allow to ascertain when they are running or stationary, to pinpoint unoccupied parking spaces and to recognize congested road stretches. They aid in defining street status by considering a road close to the driver's destination as occupied if it is crossed at low speed without finding a place to park. Urban authorities, in particular police forces and car park supervisors, may adjust the real-time records in the database according to their own data, e.g., patrol observations.

In communication layer, multiple communication protocols can be deployed in order to maintain message exchange between all system constituents: between sensors, from sensors to gateway and from gateway to web server as well as to database. All these entities interconnected or maintained in connection constitute a computer network. Its typology, as its topology, is chosen mainly according to the network size (number of machines), the required data transfer speed and its extent. Three computer network categories are perfectly adapted for this type of application, namely local area network (for parking lots spread over a few meters to a kilometer), a metropolitan area network (for car parks extended over a dozen kilometers) and a wide area network (for parking spaces dispersed over large geographical distances). Several physical configurations are considered to connect the network's different nodes, the relevant ones being the star, mesh, tree, ring, point-to-point and circular topologies.

Middleware layer is responsible for processing collected data of varied nature (coordinates, images, sentence, number, address, date…), filtering misleading detections and deducting consequently the parking availability in certain streets. It directs all the software intelligence embedded in the described parking solution utilizing machine learning algorithms, data mining models and powerful graphics techniques. It streamlines all captured data into a single database which is then hosted with the related servers.

The last layer concerns interaction with users and usage of all services provided via a mobile application or website. It displays online an interactive map showing the real-time occupancy of a city's parking spaces. It guides drivers to designated spots from their current position by sketching out the route to be taken or issuing voice instructions. It also lets users reserve one or more spaces while taking into account fees varying with location, time and parking duration. It proposes promotions to incite drivers for greater involvement depending on their participation rate in data collection.

4 Modeling parking availability

4.1 Methodology

Our real-time approach for forecasting parking space availability provides, as output, the occupancy rate calculated simply by dividing the number of vehicles parked in a car park by its capacity. It considers, as explanatory variables for the algorithms, four sets containing different attributes:

  • X0 = {Parking identifier, year, month, day, day of week, time}

  • X1 = {Parking identifier, year, month, day, time}

  • X2 = {Parking identifier, day of week, time}

  • X3 = {Parking identifier, weekday or weekend, time}

Intuitively, these predictors represent the most influential factors in parking usage. But to demonstrate their statistical significance and to catch the pairwise degrees of relationship between each independent variable and the dependent variable, we elaborated the correlation matrix that refers to the symmetric array of Pearson correlation coefficients (Table 2). This coefficient is measured on a unitless scale and ranges from −1 to + 1 through 0 [48]. A negative sign in front of the coefficient means that the two variables vary inversely, and this is the case for Day and Weekday or weekend. As the coefficient is closer to 1 (in absolute value), the relationship between the variables is stronger. The variables Parking identifier, Day of week and Time explain, respectively, 94.349%, 78.082% and 81.201% of the variance of the occupancy rate variable.

Table 2 Correlation matrix

Figure 2 depicts the computational architecture of our parking availability prediction methodology, which comprises four components:

  • Data preparation which eliminates noise altering collected data, reduces their variability, divides the database into two separate sets (four-fifths for training against one-fifth for test) and partitions randomly the training set into 10 equally sized subsamples by means of k-fold cross-validation;

  • Models training which runs, for every algorithm to be built, 10 iterations of training and validation so that in each iteration a different fold of the data is considered as the validation set while the remaining 9 folds make up the learning one;

  • Model performance testing which generates predictions on the test set and measures the predictive accuracy of each machine learning algorithm by three error metrics that are R2, MAE and RMSE;

  • Model explanation which computes how much each predictor contributes, positively or negatively, to the target variable by plotting four different graphs (local interpretability, standard summary, summary and dependency plots).

Fig. 2
figure 2

The framework of our proposed parking availability prediction system

4.2 Developed algorithms

To forecast the availability of car parks, four machine learning techniques are selected, namely multivariate adaptive regression splines (MARS), support vector regression (SVR), artificial neural network (ANN) and extremely randomized trees (ERT). These algorithms are chosen primarily because they are judged as better in the literature [35,38,41,47,52], their internal principles are largely dissimilar, their parameters are significantly different and their learning samples are noticeably distinct.

SVR is a nonlinear kernel-based regression method which locates a regression hyperplane with smallest structural risk in a so-called high dimensional feature space [49,50]. Given a set of training data \(\left\{\left({x}_{1},{y}_{1}\right),\dots ,\left({x}_{p},{y}_{p}\right)\right\}\), where \({x}_{i}\subset {\mathbb{R}}^{n}\), \(i=1,\dots ,p\), denotes the input vector and \({y}_{i}\subset {\mathbb{R}}\), \(i=1,\dots ,p\), designates the corresponding target value, the SVR estimating function takes the following form:

$$f\left(x\right)=\sum_{i=1}^{p}\left({\alpha }_{i}-{\alpha }_{i}^{*}\right)\langle \phi \left({x}_{i}\right)\bullet \phi \left(x\right)\rangle +b=\sum_{i=1}^{p}\left({\alpha }_{i}-{\alpha }_{i}^{*}\right)k\left({x}_{i},x\right)+b$$
(1)

where \(b\subset {\mathbb{R}}\) is an offset, \(k\left({x}_{i},x\right)\) is a kernel function which represents the inner product \(\langle \phi \left({x}_{i}\right)\bullet \phi \left(x\right)\rangle \), \({\alpha }_{i}\) and \({\alpha }_{i}^{*}\) are nonzero Lagrange multipliers. The most commonly employed kernel function is the radial basis function (RBF) defined, in [51], as:

$$k\left({x}_{i},x\right)=\mathrm{exp}\left(-\gamma {\Vert {x}_{i}-x\Vert }^{2}\right)$$
(2)

where \(\gamma \) is the width parameter of the RBF kernel function and chosen thanks to heuristics.

ANN is a computational model inspired by the organization and functioning of biological neurons primary to the parallel nature of the human brain [54,55]. The multi-layered perceptron, which is composed of one input layer, one or multiple hidden layers and one output layer, is its widely used variant [56]. Each layer comprises several neurons and each neuron \(i\) in a layer sends outgoing signals to every node \(j\) in the next layer. Neuron \(j\) gets an effective signal \({S}_{j}\) resulting from the weighted sum of all received signals [57].

$${S}_{j}=\sum_{i=0}^{p}{w}_{ji}{x}_{i}$$
(3)

where \({x}_{0}\) is the bias and \({w}_{j0}\) is the bias weights. The output \({y}_{j}\) of neuron \(j\) in the hidden layer is set by passing the signal \({S}_{j}\) through an activation function that aims, furthermore to introduce nonlinearity into the neural network, to bound the value of the neuron so that the neural network is not paralyzed by divergent neurons [58]. The rectified linear unit function (Relu) is the most often utilized activation function and is given by the formula below.

$${y}_{j}=f\left({S}_{j}\right)=\mathrm{max}\left(0,{S}_{j}\right).$$
(4)

MARS is a nonparametric regression technique that Friedman developed in [59]. It models relationships implying interactions with less variables and those being relatively additive. It performs piecewise linear regressions for constructing basically flexible models [60]. Consequently, it fits distinct linear regression slopes within separate feature space intervals in order to estimate a model's nonlinearity. It operates a quick- but cost-consuming search process in identifying which variables to employ and the upper limits of their intervals [61]. It further looks at the interactions between the variables to take their degrees into account [62]. The equation hereafter stands for the general MARS function.

$$f\left(x\right)={a}_{0}+\sum_{m=1}^{M}{a}_{m}\prod_{k=1}^{{K}_{m}}{\left[{s}_{km}\left({x}_{v\left(k,m\right)}-{t}_{km}\right)\right]}_{+}$$
(5)

where \({a}_{0}\) and \({a}_{m}\) are parameters, \(M\) is the basis functions number, \({K}_{m}\) is the nodes number, \({s}_{km}\) specifies the corresponding step function sense by taking 1 for right and −1 for left, \({x}_{v\left(k,m\right)}\) is the dependent variable and \({t}_{km}\) states the nodes location.

MARS builds the optimal model in two steps. Firstly, several basis functions are designed to overfit the data. Secondly, they are selected in order of highest contribution by means of the generalized cross-validation (GCV) criterion expressed as follows [63]. It addresses missing value problems with dummy variable capabilities.

$$ {\text{LOF}}\left( {f_{M} } \right) = {\text{GCV}}\left( M \right) = \frac{1}{p}\mathop \sum \limits_{i = 1}^{p} \frac{{\left[ {y_{i} - f_{M} \left( {x_{i} } \right)} \right]^{2} }}{{\left[ {1 - C\left( M \right) / p} \right]^{2} }} $$
(6)

where \(C(M)\) is the cost-penalty measures of a model containing M basis functions.

ERT is a tree-based ensemble method that creates a set of unpruned regression trees according to the classical top-down procedure [64]. It splits nodes by choosing cut-points fully at random rather than computing local optimal ones and grows the trees with the whole learning sample instead of a bootstrap replica [65]. It is parameterized regarding how many trees to build, their depth, number of features picked at random for every node and smallest sample size required to divide a node [66]. The score in ERT is measured by the relative variance reduction. For a sample \(S\) and a split \(s\), it is calculated by the formula (7).

$$\mathrm{Score}\left(S,s\right)=\frac{\mathrm{var}\left\{y|S\right\}-\frac{\left|{S}_{l}\right|}{\left|S\right|}\mathrm{var}\left\{y|{S}_{l}\right\}-\frac{\left|{S}_{r}\right|}{\left|S\right|}\mathrm{var}\left\{y|{S}_{r}\right\}}{\mathrm{var}\left\{y|S\right\}}$$
(7)

where \(\mathrm{var}\left\{y|S\right\}\) is the variance of the output \(y\) in the sample \(S\), \({S}_{l}\) and \({S}_{r}\) represent the two subsets of cases from \(S\) corresponding to the two outcomes of a split \(s\).

It compensates the randomization with a forest of trees and computes the average of the constituent tree outputs [67].

4.3 Hyperparameters of exploited algorithms

Table 3 summarizes the most relevant hyperparameters of the four algorithms that should be tuned to comparatively analyze their performance. It outlines their respective meanings and specifies the values used for the dataset implementation. Model parameters are those parameters updated during the data learning process, unlike hyperparameters that determine model architecture, are defined before training [4].

Table 3 Hyperparameters of our four machine learning techniques

Hyperparameters demonstrate their interest by improving model performance, e.g., complexity or learning rate. Each model is characterized by a large number of hyperparameters but looking for the best combination leading to the best performance can be tackled as a search problem.

The best parameters for the four algorithms described above are obtained using the GridSearchCV library developed by Sklearn. This library is designed to identify the optimal parameters for any type of machine learning algorithm. It takes a dictionary of parameters to fine-tune and finds the best-performing ones according to the evaluation measure set at the computation time. Furthermore, in doing so, it leverages the k-fold cross-validation technique that divides a data set into \(K\) equal sets. Among these \(K\) sets, each set is utilized once as test data and the other sets are employed as training data. This approach determines the single most effective hyperparameter combination considering the \(K(K-1)\) training sets and \(K\) test samples.

To enhance SVR model accuracy, several parameters need to be tuned. Three major ones are: Kernel, C and Gamma. The main role of the kernel is to transform a low dimensional input space into a higher dimensional space, it is always set to 'rbf' kernel. The penalty parameter C, or regularization, stands for the misclassification or error term. Gamma, the kernel coefficient, defines to what extent the calculation of the plausible separation line is influenced. For RBF kernel, the typical values of these parameters are ranging from 0.001 to 1000. The results of a tenfold cross-validation are illustrated in the figure above.

Figure 3 clearly shows that the lowest value of MAE is obtained with the combination of 100 and 1.0 for the parameters C and gamma, respectively: {‘C’: 100, ‘gamma’: 1}.

Fig. 3
figure 3

Optimizing parameters of SVR

learning_rate, max_iter and hidden_layers_sizes are three hyperparameters that determine our MLP model convergence speed. They are optimized utilizing GridSearchCV with activation function set to “Relu” on account of its common use, implementation simplicity and effectiveness in overcoming limitations. Optimization will be run considering two learning rates (constant and adaptive), three values for max_iter (100, 200 and 400) and three different combinations of hidden_layers_sizes ((120,), (150,) and (200,)). These configurations are iterated with a tenfold cross-validation and evaluated by means of the mean absolute error score. Figure 4 represents all three plots for the three different values of the hidden_layers_sizes parameter.

Fig. 4
figure 4

Optimizing parameters of ANN

Fig. 5
figure 5

Optimizing parameters of MARS

Visualized results indicate that the best parameter combination leading to the best score consists of: {‘hidden_layer_sizes’: (150,), ‘learning_rate’: ‘constant’, ‘max_iter': 200}.

MARS as well has some parameters to tune, four of which are relevant: max_terms, max_degree, penalty and endspan. To avoid overfitting our model, only three of them are employed. The best-performing combination of the parameters is determined by exploiting once again the GridSearchCV optimizer and testing several values, from 1 to 4 for max_degree as penalty and ranging from 5 to 150 for max_terms. The results of a tenfold cross-validation are depicted in the following graphs (Fig. 5).

Recommended parameters for predicting parking availability with MARS are: {‘max_degree’: 1, ‘max_terms’: 4, ‘penalty’: 5}.

The main parameters to adjust when applying ERT method are n_estimators and max_features. The former is the number of trees in the forest. The larger the better, but also the longer it will take to compute. The latter is the size of the feature random subsets to consider when splitting a node. The smaller the size, the greater the variance reduction, but also the greater the increase in bias. Good default empirical values range from 1 to \(\sqrt{n\_features}\) for max_features and from 5 to 1000 for n_estimators. These values are generally not optimal and can result in models that require significant memory resources. The best parameter values would be cross-validated with the help of GridSearchCV, the findings of which are displayed in Fig. 6.

Fig. 6
figure 6

Optimizing parameters of ERT

Given this graph, small values for max_features and n_estimators do not fit best, validating our assumed hypothesis: {‘max_features’: 5, ‘n_estimators’: 50}.

5 Implementation and results

The suggested framework is deployed with python 3 using JupyterLab web-based interactive environment. It is evaluated by leveraging historical parking information in the city of Aarhus, Denmark.

This interpretive language is very cross-cutting, with multiple modules covering a wide range of domains. These modules are Python programs that include functions to carry out an extraordinary number of tasks. To implement the selected machine learning algorithms, three libraries are exploited:

  • Scikit-learn: A state-of-the-art tool for artificial intelligence problems incorporating large set of machine learning algorithms for supervised and unsupervised medium-scale problems [91,92]. It employs a high-level generic language with minimal dependencies. It features a clean, uniform and streamlined API while benefiting from helpful online documentation. Here, predictive models developed by SVM algorithm are implemented with the sklearn.svm module for regression (SVR). The MLPRegressor class implements a multi-layer perceptron (a type of artificial neural network) algorithm that is trained using backpropagation. The sklearn.ensemble module which contains ensemble-based learning methods addressing classification, regression and anomaly detection is utilized to deploy the ERT method. Sklearn's “pyearth” library is employed to roll out the MARS method. In addition, our models' performance measurements are performed by the sklearn.metrics module.

  • Pandas: open-source data processing and analysis tool, widely chosen for its power, flexibility, practicality and especially for its speed [53].

  • SciPy: open-source library coded in Python principally dedicated to scientific or technical computing [93]. It provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations…

Source code and datafile can be downloaded from https://github.com/hanaeers/aarhus_parking.

5.1 Experimental dataset: Aarhus parking data stream

The city of Aarhus supplied a data stream with parking data for eight parking lots over a period of five months from May 22, 2014, to November 4, 2014. These parking lots are equipped with sensors and able to indicate the number of vacant spaces [68]. Data features comprise Garage code, total spaces, vehicle count, update time and stream time (Table 4).

Table 4 Database attribute description

In total, the database contains 55 264 data points and is available from [69]. Before developing the selected machine learning models, we treated the data to restructure values, complete any missing ones and eliminate incoherent lines that distort the findings. For this purpose, we conducted multiple corrective operations:

  • Delete rows with negative numbers of occupied parking spaces, remove useless columns (features) and aggregate data per parking lot;

  • Attribute a distinct integer number, between 1 and 8, to every car park name;

  • Get the day, month, year and time from the variable “Update time”;

  • Retrieve the day of week from the same variable and allocate an integer from 1 (if Monday) to 7 (if Sunday) for each day;

  • Insert a column for a binary variable that indicates whether the day of week is a weekday or a weekend;

  • Calculate the percentage of spaces occupied by dividing the number of vehicles parked in a parking lot by its capacity and multiplying this ratio by 100;

  • Set 100% as the upper limit for parking lot occupancy rates that exceed this value;

  • Replicate, if data is missing for a certain hour, the previous hour's value if not the previous week's value (same day of preceding week).

After data pre-processing and transformation, our database contained 11 053 records relating to 8 car parks covering 167 days. Figure 7 illustrates parking occupancy distribution over the days of a month as a line plot. We can clearly notice that the occupancy rate varies greatly during the month, is quite similar for all the car parks and peaks on days 3, 14 and 19. The first few days of the month are characterized by a high occupancy of the parking spaces in contrast to its end.

Fig. 7
figure 7

Daily distribution of parking occupancy

The bar chart (Fig. 8) shows the monthly occupancy distribution of the 8 parking lots in Aarhus. The car park occupancy differs from month to month. Parking lot 8 is more occupied in month 11 than in month 5, its availability is high and at least stationary. Parking lot 5 is less available in month 8 compared to the other months and its occupancy fluctuates by a margin of 50%. Parking lots 1 and 6 are fully available in months of May and June, therefore their unavailability rate in this period is zero.

In general, the occupancy rate of parking lots is higher on weekdays rather than weekends (Fig. 9). It is significantly dissimilar from one parking lot to another, e.g., parking lot 2 is distinguished by a low occupancy rate while parking lot 5 has a high percentage. Parking lots 3, 4, 7 and 8 are filled to within 25% on weekends.

Fig. 8
figure 8

Monthly availability of parking lot

Fig. 9
figure 9

Availability of car parks on weekends and weekdays

The line chart (Fig. 10) hereunder presents the parking lot occupancy rate by time of day. Occupancy of most parking lots is subject to a major peak at 1 p.m. when demand increases. The parking supply in car park 1 is typically stable with an occupancy rate around 50%, except for a drop to 20% between 5am and 8 pm. The variation in parking availability over the day is practically significant for parking lot 5: parking demand decreases slightly in the morning, increases rapidly at noon and diminishes slowly in the evening. In contrast, the occupancy rate of parking lot 6 is less dispersed, i.e., it is almost constant.

Fig. 10
figure 10

Availability data after processing

5.2 Experimental results

To evaluate the efficiency of the developed models, the accuracy and precision of their predictions, three error metrics are calculated: root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R2).

RMSE represents the residual variance square root. This latter is calculated by averaging the squared deviations between predicted values (\(\widehat{y}\)) and observed ones (\(y\)) [70]. It evaluates all ratings inexactness, either positive or negative [71]. The lower this metric's values are, the better the prediction model performs.

$$RMSE=\sqrt{\frac{1}{p}\sum_{i=1}^{p}\Vert {y}_{i}-{\widehat{y}}_{i}\Vert }.$$
(8)

MAE measures the closeness of the prediction to the eventual outcomes [70]. It denotes the ratio of the error vector \({y}_{i}-{\widetilde{y}}_{i}\) one norm to the number of samples \(p\) (9) [72]. The deviations between the model and the observations increase with the values of this metric.

$$MAE= \frac{1}{p}\sum_{i=1}^{p}\left|{y}_{i}-{\widehat{y}}_{i}\right|.$$
(9)

R2 quantifies the proportion of variance explained by a model [73]. It examines the goodness of the technique’s fit by its ability to predict the response variable [74]. This coefficient is between 0 and 1 and increases with the regression adequacy to the model. A good fit is indicated by an R2 close to 1. Conversely, an R2 close to 0 reflects a poor fit, but does not imply that no relationship can be established between the variables.

$${R}^{2}=1-\frac{\sum_{i=1}^{p}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{\sum_{i=1}^{p}{\left({y}_{i}-{\overline{y} }_{i}\right)}^{2}}$$
(10)

where \({\overline{y} }_{i}\) is mean values.

By analyzing the values of these indicators (Table 5), it is clear that MARS is the least performing model with a R2 less than or equal to 0.061, a MAE greater than or equal to 30.31 and a RMSE larger than or equal to 1203.624 for all four input sets. ERT provides the best results with a R2 that exceeds 0.98, a MAE that reaches 2.021 and a RMSE that attains 15.756. ANN and SVR come in second rank by a R2 of 0.55, a MAE of 13.49 and a RMSE of 571.49.

Table 5 Metrics values measuring model performance

From R2, ERT based on feature set X0 is the most exact method scoring a coefficient of 0.988, followed by ANN relying on feature set X3 and SVR founded on feature set X2 with performance decreases of 43.927% and 45.242%, respectively. Considering RMSE, ERT using input set X0 performs best and surpasses ANN employing input set X3 by about 3527% and SVR utilizing input set X2 by around 3629%. In terms of MAE, ERT with explanatory variable set X0 is the most precise algorithm that is 567% better than SVR with explanatory variable set X3 and 715% better than ANN with explanatory variable set X3.

Predictions are more accurate if the model is developed with feature set X0 containing more details apart from SVR which generalizes better on unseen data with feature set X2 and ANN which gets the best performance with feature set X3. For input sets X2 and X3, the three models (ERT, SVR and ANN) give more or less the same accuracy of forecasts with a R2 that varies from 0.517 to 0.595 (fluctuation band of 27.876%), a MAE that ranges from 13.495 to 17.257 (variation margin of 13.109%) and a RMSE that goes from 518.579 to 619.345 (percentage difference of 19.431%).

Concretely, ERT with all four data sets outperforms the other regressors with any input set. Figure 11 graphically depicts predictions made by ERT versus observed values. The expected curve is a straight line through the origin frame. Using feature sets X0 and X1, ERT yields very interesting outcomes with a R2 near to 1 (\(\approx \) 0.99) and a MAE not more than 2.33%. Its efficiency increases with the accuracy, number and information quality brought by the independent variables, unlike ANN which require as little data as possible to achieve its best performance. SVR is a special case that differs from the two previous ones by necessitating moderately complete information to deliver its maximum performance. The level of detail and accuracy of the information that the independent variables provide does not have a direct effect on models' performance, which depends mainly on their tendency to model one particular distribution rather than another.

Fig. 11
figure 11

Comparative plot between ERT predictions for each input set and real values

The above results suggest that some algorithms perform better than others based on the different metrics values obtained. This means that the algorithm with the best average performance should outperform those with the worst. But to what extent is this difference true and not just the consequence of a statistical fluke? To test the significance of the average score difference, a statistical analysis is carried out, specifically the 5 \(\times \) 2 cv paired Student's t test.

This statistical hypothesis test, proposed by Dietterich [97], consists in comparing two models (classifiers or regressors). It relies on splitting the database in 2 (50% of the data for training and 50% of the data for testing) five times. In each of these five iterations, it fits two regressors R1 and R2 to the training set and evaluates their performance on the test one. Then, it alternates the training and test sets such that the training dataset becomes the test dataset and vice versa. Finally, it calculates anew each model's performance, resulting in two performance difference measures:

$${ACC}^{\left(1\right)}={ACC}_{R1}^{\left(1\right)}-{ACC}_{R2}^{\left(1\right)}$$
(11)
$${ACC}^{\left(2\right)}={ACC}_{R1}^{\left(2\right)}-{ACC}_{R2}^{\left(2\right)}.$$
(12)

It ends by estimating the mean and variance of the two differences:

$$\overline{ACC}=\frac{{ACC}^{\left(1\right)}+{ACC}^{\left(2\right)}}{2}$$
(13)
$${\sigma }^{2}={\left({ACC}^{\left(1\right)}-\overline{ACC}\right)}^{2}+{\left({ACC}^{\left(2\right)}-\overline{ACC}\right)}^{2}.$$
(14)

The t statistic is computed using the variances of the two differences calculated for the five iterations.

$$t={ACC}_{1}^{(1)}\sqrt{\frac{5}{\sum_{i=1}^{5}{\sigma }_{i}^{2}}}$$
(15)

where \({ACC}_{1}^{(1)}\) is the \({ACC}^{\left(1\right)}\) obtained from the first iteration.

The t statistic is assumed to approximately follow a t-distribution with 5 degrees of freedom, under the null hypothesis postulating that the two models R1 and R2 have equal performance. Given the t statistic, the p value can be estimated and compared to a previously chosen significance level, in our case \(\alpha =0.05\). If the p value is less than α, the null hypothesis is rejected and thus a significant difference exists between the two models.

Our four algorithms are compared 2 by 2 with two explanatory variable sets X0 and X3. X1 and X2 are not used because they are included in X0. The results are summarized in Tables 6 and 7. The p values when comparing MARS with any other algorithm are well below 0.05, which means that any discernible difference between these algorithms is most likely real. It is therefore possible to confirm that MARS is the least efficient model among the four tested. In contrast, ANN and SVR trained with either X0 or X3 result in a p value significantly greater than 0.05. Consequently, it is impossible to reject the null hypothesis. It could just as easily apply SVR or ANN, and the results would be similar. In addition, the difference between mean performance of ERT and remaining algorithms is most likely significant since the p values are less than 0.05. An obvious result is that ERT effectively outperforms the rest of the three models. An exception occurs if ERT is tested against ANN using feature set X3, where the p value informs that for this variable set, the two algorithms perform alike. This finding may be related to the algorithm stochastic nature, assessment methodology or differences in numerical accuracy. In general, the average performances computed with the three error metrics are confirmed by the statistical test executed. This shows that model selection based on mean performance alone may be sufficient in some cases.

Table 6 Results of the paired t test for X0
Table 7 Results of the paired t test for X3

Figure 12 displays the training time of each regressor for each feature set. At first glance, it is remarkable that these times belong to an interval of great length whose upper limit is 199.513 s and the lower limit is 0.449 s. ERT and MARS are characterized by very low computation times that do not exceed 5 s for the first algorithm and 1 s for the second one. Training SVR employing any input set takes 104.901 s on average with a standard deviation of 10.527 s, which is not surprising considering that all four models consistently generate estimates of roughly identical quality. ANN's training durations are highly dispersed, for example using input set X3 it needs about 200 s while with input set X1 it requires a little more than 5 s.

5.3 Cross-validation impact on prediction accuracy

Before analyzing in detail how cross-validation impacts our models' efficiency, it was necessary to start by examining the effect of fold number on their overfitting, consistency and reliability. This procedure was applied to our database with several numbers of folds, namely 3, 5, 7 and 10. Figure 13 presents the RMSE metric derived from the four models with the different cross-validations. For SVR, RMSE decreases by increasing the number of cross-validation folds. Therefore, the best results in terms of prediction accuracy are obtained with the tenfold cross-validation. In contrast, RMSE values related to MARS models increase with the number of folds. Accordingly, the best performance of MARS is achieved by dividing the data into 3 equal folds. ANN is characterized by a non-monotonic RMSE evolution. This indicator reaches its maximum value when the number of folds equals 5 and its minimum value when this number equals 7. As for ERT, RMSE trend depends on the set of explanatory variables used. For X2 and X3, RMSE increases rapidly and becomes constant from 7 folds. RMSE of ERT models based on X0 and X1 first increases, then decreases and finally stabilizes. The predictions made by ERT are more accurate if the models with X0 and X1 are validated by tenfold cross-validation. Yet, those built on X2 and X3 are evaluated by threefold cross-validation. Typically, the worse performance of ERT models is much better or at least comparable to the best performance of other algorithms. Roughly speaking, the tenfold cross-validation for most models yielded best performance with respect to the four k-fold cross-validations tested.

Fig. 12
figure 12

Training time for each machine learning algorithm

Fig. 13
figure 13

RMSE results of the different cross-validation techniques

Table 8 is mainly intended to compare predictive performance of trained models with and without cross-validation. Its values indicate that the R2, MAE and RMSE indices tended to improve for all four algorithms as the number of folds in the cross-validation increased. A first observation concerns predictions estimated by MARS, SVR and ANN, which are almost identical for all four feature sets. Thus, the three metrics' values are very close in the four cases. It seems therefore that cross-validation unifies these models' performance for any combination of explanatory variables. By contrast, ERT's effectiveness differs widely from one feature set to another.

Table 8 K-fold cross-validation results

Considering the three performance indicators, the top scoring model is X3-based ERT with an R2 of 0.852, a MAE of 7.64 and an RMSE of 171.84. This finding highlights a considerable improvement in this model's performance, which reaches an average of 56% thanks to the cross-validation. In fact, ERT performance with X2 and X3 evolved positively when they were cross-validated 3 or 5 times. However, they maintained the same accuracy level when evaluated by tenfold or sevenfold cross-validation. Even worse, ERT's efficiency with X0 decreased remarkably, as this drop is at least 100%. ERT using X0 and X1 without cross-validation showed excellent performance but became less than average when the database was divided into any number of equal parts. The most negatively influenced metric is RMSE which rose from an average of 18.5 to an average of 800. Moreover, the cross-validation did not bring anything to MARS efficacy since the metrics values are not really changed. It was unable to improve the prediction accuracy of this machine learning method and therefore improve its ranking.

Similarly, sevenfold or tenfold cross-validation did not impact significantly the results of ANN with X0, X2 and X3. Instead, it minimized the error committed by this predictor based on X1: R2 increased by 28%, MAE decreased by 23% and RMSE diminished by 24%. In contrast, ANN's prediction precision with any combination of explanatory variables is remarkably reduced due to the threefold or fivefold cross-validation. The scores obtained by SVR with X0 are slightly better thanks to the tenfold and sevenfold cross-validation. Yet, these techniques together with the fivefold cross-validation contributed to keeping the performance of the other models. Further on, threefold cross-validation has an opposite effect on all SVR models, a decrease in efficiency of at least 31% on all error measures is noted.

All these results clearly underline that cross-validation is able to improve as well as to degrade the predictive capacity of machine learning algorithms. Its effect depends mainly on several parameters: underlying model, regularization level, number of nearest neighbors… Problems also arise if the assumptions explicitly made by the cross-validation are violated:

  • When the data are dependent, the training and validation samples are no longer necessarily independent, which can induce a rather strong bias for the risk estimation of a learning rule;

  • When the time series is non-stationary, simple cross-validation does not guarantee a reliable prediction of the “future” from the “past”;

  • When estimator selection is performed from an exponential collection [95], the principle of unbiased risk estimation no longer works.

Overall, algorithm training times and efficiency are closely related and vary proportionally, i.e., a model is more accurate if its training time is longer. ERT is superior in terms of accuracy and computing time compared to other machine learning techniques. To understand why it is the best-performing algorithm and how each feature affects its outputs, counterfactual and contrastive explanations are generated by implementing SHAP method.

5.4 Prediction benchmark

Each prediction work mainly targets to obtain values as close as possible to reality. Thus, we compared our best predictions with those presented in [75]. In that paper, smart parking is modeled with multiagent system using long short-term memory neural network. It is evaluated based on the same sequential real data provided online by Aarhus City Council. Its authors have adopted a prediction accuracy metric equivalent to our R2. After 400 iterations, this indicator converges to its maximum and achieves 0.97%.

ERT with feature set X0 reaches a R2 of 0.988, implying that our model improves prediction accuracy by about 1.821%. Even better, ERT is a simple algorithm that doesn't require reinforcement by any other method to excel. In contrast, the approach proposed in [75] is considerably complex because it incorporates an advanced deep learning model into an artificial intelligence system. As a result, ERT is faster than the other paper model and needs a very low computation cost.

6 Feature importance

Machine learning models involve numerous complex models known as “black box” ones, for which understanding how they combine explanatory variables to make predictions is exceedingly challenging. Deep learning models such as artificial neural networks and ensemble models like random forests, gradient boosting learners or stacking algorithms constitute examples of highly talented black box models. These methods are known for their remarkably accurate output in a wide variety of fields ranging from urban planning to computer vision. Hence, the greatest task facing researchers is to interpret the way in which predictors influence predictions, notably with conventional statistical methods.

The following paragraphs outline an alternative approach to interpreting black box models, labeled Permutation feature importance [96]. This is a robust tool for detecting features in a data set that have predictive power, no matter which model is chosen. It relies on randomly changing each column's (predictor variable) values, one column at a time. It then evaluates the model. That is, it first calculates the baseline error rate, next computes the same indicator by neutralizing each predictor variable in turn and finally forms the ratio between the two values. These returned scores represent the change in performance of a trained model, after permutation.

This technique refers to a decrease in model score given a single attribute value is randomly shuffled. It consists in indicating to what extent the model depends on the feature by breaking its relationship with the target. It is implemented through the permutation_importance() function of Scikit library which takes as input a fitting model, a data set and a scoring function. In our case, ERT without cross-validation is used as fitting model and the three error metrics as scoring function. Not using cross-validation can be explained by our intention to highlight the algorithm's intrinsic principle without any other technique intervention. The results are plotted in the graphs below.

By inspecting the three graphs, it is easily seen that the three data-generating predictors (Garage code, Month and Hour) have relatively large values for all three scores, meaning that they have significant predictive power in our model. On the other hand, the other predictors show relatively low or zero scores, implying that they are not as relevant to model decision making. Garage code is the explanatory variable that provides the most valuable information for predicting parking occupancy. It helps to increase R2 value by 1.6, decrease MAE value by 30 and reduce RMSE value by 2000. Month and Hour also appear to be important predictors as they improve R2 score by more than 0.5, MAE score by at least 14 and RMSE score by no less than 600. The importance of the Day and Day of the Week features is relatively small given that randomly shuffling their values only degrades model performance by 0.2 in R2, 5 in MAE and 250 in RMSE. Year does not attribute to the model's decision process and therefore has no role in estimating predictions. This little or no impact on ratings prompts consideration of feature selection to remove these irrelevant or redundant predictors in future analyses, thus saving time and resources without sacrificing accuracy (Fig. 14).

Fig. 14
figure 14

Permutation feature importance results

Fig. 15
figure 15

Mean feature importance and related standard deviations

Permutation feature importance results can also be visualized by means of a detailed boxplot that presents the average feature importance and the corresponding standard deviations. The graph's X-axis indicates the importance score (RMSE, MAE or R2) and the Y-axis illustrates as many boxplots as features sorted by importance. The top variable being the most important, and the bottom being the least important. Each boxplot displays the importance values (minimum, maximum, median, first quartile and third quartile), although they are not clearly visible due to small deviations.

The three plots in Fig. 15 assert that Garage code strongly contributed to improving all three scores, an enhancement of 2089 with a standard deviation of 23 for RMSE score, a decrease of 31 with a standard deviation of 0.25 for MAE score and an increase of 1.6 with a standard deviation of 0.02 for R2 score. Second place belongs to Month feature whose contribution is estimated to be an improvement of 837 with a standard deviation of 16 for RMSE, a diminution of 13 with a standard deviation of 0.18 for MAE and an augmentation of 0.65 for R2 with a standard deviation of 0.013. For the third influencing parameter, Hour, its contribution is moderate with a decrease in RMSE values by 569 (\(\pm \) 8.5), a reduction in MAE values by 12.7 (\(\pm \) 0.13) and a small increase in R2 values by 0.44 (\(\pm \) 0.007).

As a conclusion, ERT's outstanding performance based on X0 and X1 is mainly due to modeling parking occupancy by the three most important explanatory variables, namely Garage code, Month and Hour. ERT's prediction accuracy with X0 slightly exceeds that of ERT's predictions with X1 owing to Day of Week's contribution which belongs to X0 and not to X1.

7 Model prediction interpretation

Shapley additive explanations (SHAP) is an approach that unifies methods like LIME [7677] and DeepLIFT [7879] under the class of additive feature attribution methods [8081]. It describes the performance of a machine learning model based on game theory and local explanations [82]. It distributes the total gain or payoff among players, depending on the relative importance of their contributions to the final outcome [8384].

It approximates an original model \(f\) with input variables \(x=\left({x}_{1},\cdots , {x}_{m}\right)\), where \(m\) is the number of independent variables, by an explanation model \(g\) with simplified input \({x}^{^{\prime}}\) expressed as:

$$f\left(x\right)=g\left({x}^{^{\prime}}\right)={\phi }_{0}+\sum_{i=1}^{p}{\phi }_{i}{x}_{i}^{^{\prime}}.$$
(16)

\({\phi }_{0}\) represents the constant value when all inputs are toggled off. Inputs \({x}^{^{\prime}}\) and \(x\) are related through a mapping function, \(x={h}_{x}\left({x}^{^{\prime}}\right)\). Equation 16 admits a single unique solution (Eq. 17) that satisfies three axioms: local accuracy, consistency and missingness [85]. The first property implies that the sum of individual feature attributions must be equal to the original model prediction, i.e., the explanation model should match the original model. The second one states that changing a larger impact feature will not diminish that input’s attribution and the last one guarantees that no importance is accorded to the missing features in the original input.

$${\phi }_{i}\left(f,x\right)=\sum_{{z}^{^{\prime}}\subseteq {x}^{^{\prime}}}\frac{\left|{z}^{^{\prime}}\right|!\left(p-\left|{z}^{^{\prime}}\right|-1\right)!}{p!}\left[{f}_{x}\left({z}^{^{\prime}}\right)-{f}_{x}\left({z}^{^{\prime}}\backslash i\right)\right]$$
(17)

where \(\left|{z}^{^{\prime}}\right|\) is the number of nonzero entries in \({z}^{^{\prime}}\), \({\phi }_{i}\) is the Shapely values and \(f\left({h}_{x}\left({z}^{^{\prime}}\right)\right)\approx g\left({z}^{^{\prime}}\right)\) is ensured by local methods when \({x}^{^{\prime}}\approx {z}^{^{\prime}}\).

In [85], Lundberg and Lee suggested SHAP values, which is a standardized measure of feature importance, as a solution to Eq. 17, where \({f}_{x}\left({z}^{^{\prime}}\right)=f\left({h}_{x}\left({z}^{^{\prime}}\right)\right)=E\left[f\left(z\right)|{z}_{S}\right]\) and \(S\) is the set of nonzero indices in \({z}^{^{\prime}}\).

TreeSHAP is a variant of SHAP proposed in [86] and dedicated to tree-based machine learning models. And since it is the case of ERT, we applied it in our study. SHAP values are estimated thanks to the Shap Python library and represented graphically by means of four plots: variable importance, summary, dependency and individual force.

In variable importance plot, features are vertically sorted by their average impact on model output, allowing for a global interpretability and clarifying the whole structure of the model. Based on Figs. 16, 17, 18 and 19, the most important feature for predicting parking availability with an average impact greater than 20% is Garage code followed by Hour in second position and Month in third rank, and this is for the four sets of independent variables. Year is the least powerful indicator with a mean absolute SHAP value of almost zero. Day and Day of week have more or less the same importance, varying the absolute parking occupancy rate on average by 2.5% and slightly exceeding Weekend variable which improves predictions by about 2.2%. Except for Day feature, whose average impact differs from ERT model with X0 to ERT with X1, the other variables retain the same importance.

Fig. 16
figure 16

Variable importance plot of ERT with X0

Fig. 17
figure 17

Variable importance plot of ERT with X1

Fig. 18
figure 18

Variable importance plot of ERT with X2

Fig. 19
figure 19

Variable importance plot of ERT with X3

A summary plot combines feature importance with feature effects. It orders independent variables based on their importance to propel the model's predictive capability [82]. Every dot represents the SHAP value of an input variable and an instance [87]. The color maps the value of the features, from low to high. ERT with X0 and ERT with X1 models ignored Year input, i.e., it has no effect on their outputs. Figures 20 and 21 reveal that the closer is the year-end, the higher is the SHAP values of these months, and the greater is their positive impact on the predicted parking occupancy rate. Parking lot identifier has no clear behavior regarding its values, high positive SHAP values correspond to parking lots with high identifiers and low negative ones correspond to parking lots with low identifiers but medium values refer to both parking lots with high and low identifiers. The early hours of the day negatively impact the model's output and cause low predictions contrary to the last hours which lead to high forecasts. The last-days of the week are typified by a decrease in the parking occupancy rate, the beginning and the middle of the week are characterized by an increase in this rate (Fig. 22). In other words, parking lots are less busy on weekends than on weekdays as predicted by the model (Fig. 23). The influence of Day variable (days of the month) is not evident enough, the only conclusion to be drawn is that low values (its beginning) are associated with negative impacts on parking availability forecasts.

Fig. 20
figure 20

SHAP summary plot of ERT with X0

Fig. 21
figure 21

SHAP summary plot of ERT with X1

Fig. 22
figure 22

SHAP summary plot of ERT with X2

Fig. 23
figure 23

SHAP summary plot of ERT with X3

A dependence plot is a scatter plot that shows the effect of a single feature on the predictions made by a machine learning model. It traces a feature’s value relative to its SHAP value across many samples. Each point is a single prediction in the data set. The color corresponds to a second variable that can have an interaction effect with the feature that has been plotted [88]. The following figures display the marginal effect of Garage code feature on the expected outcome of ERT models as well as its interaction with Month, Day of week and Weekend variables. The first attribute is picked since it is the most important in line with the above and we needed to further highlight the spread and variation of its SHAP values. The second features are selected automatically as they interact the most with the chosen variable. The impact of Garage code is highly nonlinear, non-monotonic and very complex for all four curves, demonstrating that examining a single parameter, whether it is positive or negative, is often not conclusive. Its Shap values are dispersed in a significant way and vary from one parking lot to another, i.e., it strongly influences predictions performed for some instances and not for others. In ERT models with X0 and with X1, its impact on the predicted output is positive for car park 5, both positive and negative for parking lots 1 and 6, and negative for the rest. When Garage code is equal to 6, SHAP values for May, June and July range from -20% to 20% and increase up to 55% for August, September, October and November (Figs. 24 and 25). Despite some noise, SHAP values of the parking lot identified by 2 are less than −20% for the last five months and under 0% for the first two ones. Disregarding graphics coloring, the variation amplitude of the SHAP values is clearly lower in ERT models with X2 and X3 than with X0 and X1. Furthermore, their nature has not changed unless for parking lot number 6 where it has become purely positive (Figs. 26 and 27). Globally, there is no explicit relationship between the SHAP values of Garage code and the values of the covariates but their interaction is always important.

Fig. 24
figure 24

SHAP dependence plot of ERT with X0

Fig. 25
figure 25

SHAP dependence plot of ERT with X1

Fig. 26
figure 26

SHAP dependence plot of ERT with X2

Fig. 27
figure 27

SHAP dependence plot of ERT with X3

A force plot illustrates how the features in the model contribute to pushing its output from the base value (mean prediction) to the actual outcome, thus providing a local interpretability [88]. Features marked in red force the prediction upward, while those in blue force the prediction downward. The graphs above explain the prediction for the first example of the testing dataset. Its base value is equal to 1%, whereas the prediction is equal to 0.65% for ERT models with X0 and X1, since their performance is practically identical, 0.98% for ERT with X1 and 0.99% for ERT with X0. The observed value for that instance is equal to 0.63%. Garage code and Hour are the features with the biggest impact as the visual size indicates the magnitude of the feature's effect. They resulted in a decrease in predictions and this is also the case for Month and Day variables. Unlike general rule established from Figs. 16 and 17, Day attribute's impact reduced when calculating the ERT model's prediction based on X0 than on X1. Day of week weakly affects ERT models with X0 and X2, it drives predictions toward higher values for the first model and lower values for the second one. The positive contribution of Weekend binary feature is very small to zero, which means that the predictions made by ERT model based on X3 are not influenced at all by this variable' values (Figs. 28, 29, 30 and 31).

Fig. 28
figure 28

SHAP force plot of ERT with X0

Fig. 29
figure 29

SHAP force plot of ERT with X1

Fig. 30
figure 30

SHAP force plot of ERT with X2

Fig. 31
figure 31

SHAP force plot of ERT with X3

Above force plots are for individual observations only, if all observations plots are combined and sorted by output value, an interactive plot is derived, called global force plot. This graph shows SHAP value make-up for the entire dataset. The X-axis of this graph represents all observations in the dataset sorted by similarity, by output value, by original sample ordering or by each feature value. The Y-axis is simply an individual force graph for that observation showing by default its output value. This chart provides a quick summary of record position, output value and relevant parameters that push the prediction up (in red) or down (in blue).

Figures 32, 33, 34 and 35 represent a feature's responsibility for changing model output while sorting the samples by output value. At a high level, the diagram indicates that the model placed a lot of emphasis on Garage code, which is consistent with each parking lot being characterized by its location-dependent attendance level, accounting for wide variation in car park occupancy rates. Hovering over the area highlighted in red, it is apparent that most observations pushing predicted values upward are for records with 5 and 6 as Garage code values. By contrast, data vectors with values below 8 for Hour tend to generally decrease predicted values, which is expected because parking lots are not very busy in night time unless they are residential parking lots. Other variables' values do not follow a given trend, thus not understanding their influence on the models' outputs. Let A, B, C and D be four randomly selected records from our database, with:

  • \(\text{A = }{(\mathrm{Garage code }= 7,\mathrm{ Month }= 9,\mathrm{ Day}=5)}^{\mathrm{t}}\)

  • \(\text{B = }{(\mathrm{Garage code }= 1,\mathrm{ Month }= 7,\mathrm{ Day}=27,\mathrm{ Hour}=14)}^{\mathrm{t}}\)

  • \(\text{C = }{(\mathrm{Garage code }= 1,\mathrm{ Day}=29,\mathrm{ Hour}=15)}^{\mathrm{t}}\)

  • \(\text{D = }{(\mathrm{Garage code }= 5,\mathrm{ Day}=1,\mathrm{ Hour}=0)}^{\mathrm{t}}\)

Fig. 32
figure 32

Global force plot of ERT with X0

Fig. 33
figure 33

Global force plot of ERT with X1

Fig. 34
figure 34

Global force plot of ERT with X2

Fig. 35
figure 35

Global force plot of ERT with X3

In the first two vectors, the variable Day drives predictions upward, while in the last two ones it drives them downward, even though their values are almost successive.

Another interesting plotting function from SHAP library permits an easy and global interpretation of feature contribution in vertically stacked 2D matrices. These are associated with a bar chart on the right displaying the feature importance in descending order and a curve on top depicting the prediction output for all instances. Each 2D matrix is a hot to cold color scheme indicating positive (red pixels) or negative (blue pixels) contribution for each feature.

To note that for explanation purpose, only 1000 data points were captured in the heatmap plot as there is a homogeneity in the overall dataset (Figs. 36, 37, 38 and 39).

Fig. 36
figure 36

Heatmap of ERT with X0

Fig. 37
figure 37

Heatmap of ERT with X1

Fig. 38
figure 38

Heatmap of ERT with X2

Fig. 39
figure 39

Heatmap of ERT with X3

The four graphs below emphasize the strong contribution of Garage code and two other features (Hour and Month). Garage code variable improves prediction for more than 200 observations, resulting in a contribution greater than 20% with SHAP values of over 40. In second place comes the contribution (10%–18%) of Hour feature with SHAP values ranging between 20 and 30. A small contribution of Month attribute not exceeding 8% follows in third rank with a SHAP value just below 20. The rest of the features' contribution is not relevant as SHAP values are close to 0.

In addition, above the maps, model output for each instance is plotted. These curves reveal that output value variation is more or less identical for the four feature sets: parking occupancy rates are high for the first few instances, decrease afterward and return to increase slightly again from the 800th instance. The heatmaps also exhibit that high predictions (high values in \(f(x)\)) are coupled with high SHAP values of the variable Garage code. This again confirms the strong correlation between this feature and parking occupancy rate.

In short, Garage code has the greatest influence on parking availability prediction and strongly interacts with Month, Day of week and Weekend features. Its SHAP values do not evolve in line with a marked trend, this may be because it is a categorical variable and its values do not have a numerical meaning. Thus, this predominant feature is the most correlated variable with the parking occupancy rate, which may explain ERT's performance. Hour and Month notably influence predictions produced by the concerned models and their correlation coefficient is close to 1, improving further their precision.

8 Conclusion

In this paper, we suggested a model-based practical framework for predicting parking occupancy rate to help drivers quickly find where to park. We particularly interested in the way input variables influence the accuracy of predictions. We used, for this purpose, four feature combinations. We compared the ability of four different machine learning algorithms in modeling our target from these features. We got as results that ERT built on X0, the least computationally intensive algorithm compared to ANN and SVR, is the most efficient method. We then tried to understand why it performs best and analyze how each variable contributes to its outputs by applying SHAP values. We discovered, through visualizations, important nonlinear and non-monotonous relationships between independent variables that heavily affect predictions.

As a constraining factor, in order to maximize the exactness of predictions, it would be preferable to shorten prediction time to less than 10 min instead of one hour. Promising improvements may also involve supplementing historical data with real-time information and events calendar. Given that an event in the vicinity will inevitably increase parking demand compared to usual. Having a general idea about parking availability is necessary but not sufficient to eliminate problems associated with searching for vacant spaces. It can be complemented by more precise information such as the probability of a place be left vacant until the driver arrival and the coordinates or identifiers of available spaces.