1 Introduction

The Internet of Things (IoT) has emerged as a new way of accessing databases and software applications. Under this concept, information is gathered from sensors and software platforms and stored on cloud servers, where it can be accessed by humans and automated systems through software tools. The concept also clarifies the relationship between smart cities (SC), information and communication technology (ICT) and the smart grid (SG). The SC renews both the horizontal and vertical services of its constituent systems and is therefore also referred to as the "system of systems". The ICT system is a set of point services for managing and controlling applications, while the SG is the system that enables all other systems in the SC to work [1].

IoT applications in the SC include the smart home, which enhances personal lifestyle by making home appliances and systems more convenient to monitor and operate. In industrial automation, machine operations, functionality and productivity rates are automatically controlled and monitored with minimal human involvement. Smart healthcare is improved by embedding sensors and actuators in patients and their medicines for monitoring and tracking. The SG improves the energy consumption of houses and buildings, and the SC makes it more convenient for residents to obtain information of interest [2]. Zhu et al. [3] identified the key elements of IoT as identification, sensing, communication, computation, services and semantics.

IoT devices in the SC, such as sensors, actuators, and smartphones, generate data as a result of activities within the city. These data are subjected to analytics to gain insight and discover new knowledge for improving the efficiency and effectiveness of the SC, and can thus be used for its further development. Data analytics can therefore be considered the starting point of the SC. SC data can be collected directly from a variety of sensors, smartphones and citizens, and integrated with city data repositories to perform analytical reasoning, generate the information needed for decision making, and support better urban governance [4]. The data come from different domains within the SC, such as transportation, energy, weather, and air pollution.

To analyze SC data and discover new insights, researchers from the machine learning community have applied artificial neural networks (ANN). The literature shows that both shallow ANN and deep learning have been applied for data analytics in the SC. Recently, however, deep learning has attracted serious attention from the research community because of its ability to handle large volumes of data, its reduced need for human intervention in data engineering, its capacity to learn complex structure in large-scale datasets, and its better performance compared with shallow ANN. Deep learning gained prominence because it can deal with large amounts of data without requiring separate algorithms for dimensionality reduction, works effectively on natural data, and avoids the tedious human effort in data engineering that conventional techniques require. The penetration of deep learning into the SC has produced remarkable achievements across the SC domains listed in the preceding paragraph.

However, despite the popularity and achievements of deep learning in the SC, no survey has been dedicated mainly to deep learning for ANN in the SC to show recent progress and directions for the future development of the research area. In this paper, we present recent developments, challenges and future research directions in the application of deep learning for ANN in the SC. Shallow ANN are also considered to show their relevance in the SC, but the emphasis is on deep learning.

The contributions of the paper can be summarized as follows:

  • The literature review presents comprehensive coverage of deep learning solutions in the SC, extracted from various projects trending in the literature.

  • The survey organizes the deep learning solutions in SC projects by deep learning architecture, for easy identification of the types of architecture already applied to SC projects. The projects are organized under the following architectures: Convolutional Neural Network; Deep Belief Network; Deep Reinforcement Learning; Recurrent Neural Network.

  • The literature review creates a new taxonomy of deep learning solutions in the SC, depicting the main deep learning architectures and their variants.

  • A taxonomy of the SC domains in which deep learning algorithms have been applied is created, giving researchers a pictorial view of the SC domains that already have deep learning solutions.

  • We depict the publication trend of research on deep learning solutions in the SC. The trend clearly indicates that the research area is attracting unprecedented attention from both industry and academia: the number of publications in the last 3 years (2017, 2018 and 2019) exceeds the number of publications pre-dating 2017.

  • The sources of the datasets, the deep learning frameworks, and the hardware used in different projects are extracted and presented in tabular form, making the literature review self-contained for new researchers intending to start work in this area.

  • The challenges militating against the future development of the research area are discussed. The major challenge identified remains the known open problem of the lack of a systematic procedure for estimating the proper parameters of deep learning algorithms. Efforts by researchers to automatically optimize deep learning parameters through nature-inspired meta-heuristic algorithms have not yet been applied in this domain to investigate their efficiency and effectiveness in the SC. Secondly, studies in which data generated and acquired from the internet of vehicles within a SC are analyzed with deep learning architectures are scarce.

  • Directions for future research opportunities that address the identified challenges are outlined, to help researchers propose novel and robust deep learning solutions in the SC.

The rest of the manuscript is organized as follows: the methodology of the literature review is presented in Sect. 2. Section 3 presents the background of deep learning for ANN and the taxonomy. Section 4 presents the concept of the SC, including its layers and architecture. Section 5 highlights the applications of deep learning for ANN in the SC. Section 6 presents the deep learning platforms. Section 7 highlights the datasets from different projects. Section 8 presents a general discussion, the domains of application, and the publication trend. Section 9 discusses the challenges and future research work, and the paper is concluded in Sect. 10.

2 Methodology

This section explains how the literature review was conducted. The main stages of the methodology are: formulation of search keywords, article identification from academic databases, screening, eligibility assessment, and application of inclusion and exclusion criteria. The keywords were formulated and refined before being used to search the databases. Examples of the keyword combinations include "smart city, deep reinforcement learning", "passenger, convolutional neural network", "air pollution, deep belief network", "parking, convolutional neural network", "traffic, neural network", "vehicle detection, convolutional neural network", and "parking, deep recurrent neural network". These keywords were used to search the following relevant academic databases: ScienceDirect, ACM Digital Library, SpringerLink, IEEE Xplore, Scopus, DBLP, and Web of Science. After retrieving articles from the databases, we excluded non-English-language articles and non-peer-reviewed items such as short communications, editorials, keynotes, book descriptions and technical reports. The articles considered for inclusion were peer-reviewed journal articles, conference papers or edited book chapters written in English that report empirical work on deep learning solutions to problems in the SC. The articles were screened based on title, abstract, duplication and conclusion, and in some cases by comprehensively going through the full content. The full texts of the remaining articles were then assessed for eligibility for inclusion in the quantitative analysis.

3 Deep learning background

This section presents background information about deep learning, including the taxonomy, motivation and benefits, the different deep learning architectures, and a discussion of hyperparameters; the advantages of deep learning over traditional approaches in the context of the SC are also discussed.

3.1 Taxonomy, motivation and benefits of deep learning

Figure 1 presents a comprehensive taxonomy of the deep learning architectures and the shallow ANN. The taxonomy is based on the deep learning and shallow ANN methods found to be applied for data analytics in the context of the SC. The basic concepts of how these algorithms operate to achieve their goals are provided in this section so that interested readers can understand how they work. The deep learning architectures and shallow ANN discussed here are those found to be applied for data analytics in SC environments. The differences between shallow ANN and deep learning, including their advantages and limitations, are also presented.

Fig. 1 Taxonomy of the deep learning for ANN in smart cities

Shallow machine learning algorithms have been adjudged to work successfully on a variety of problems. Shallow ANN architectures perform well on many common machine learning problems and remain active in the majority of machine learning applications today [5]. However, shallow machine learning algorithms have not recorded significant success on core artificial intelligence problems such as speech and object recognition [6]. Shallow ANN process raw data in a limited way, because a certain level of domain expertise is required when constructing a pattern recognition or machine learning system. In addition, shallow ANN require hand-engineered feature extraction to transform raw data into feature vectors before the data are fed to the algorithm for classification or pattern recognition. Deep learning architectures, on the other hand, do not require this extra feature extraction step: features are extracted automatically by the deep learning algorithm, and in a much better way than hand-coded feature extraction [5].

Shallow machine learning algorithms rely heavily on prior beliefs, such as the smoothness prior, to generalize well, and they typically fail to mitigate the statistical challenges involved in solving artificial intelligence tasks. Deep learning, on the other hand, introduces explicit and implicit priors that improve generalization on complex artificial intelligence tasks. Regarding the curse of dimensionality: shallow ANN work well on small datasets, but their performance degrades as the amount of data increases. Deep learning algorithms, in contrast, perform very well on large datasets, and their performance improves as the amount of data increases [6]; they do not, however, perform well on small datasets. Table 1 summarizes the strengths and weaknesses of shallow ANN and deep learning.

Table 1 Summary of the strengths and limitations of deep learning and shallow ANN

It is well known in the research community that setting the parameters of an ANN remains an open issue, as there is no systematic procedure for obtaining the optimum parameters. Typically, the ANN parameters are optimized during the training phase. The performance of an ANN depends heavily on the parameter settings, the architecture, the choice of training algorithm and the training features. Several approaches are available for estimating the optimum parameters. The first is preliminary experimentation: preliminary experiments are conducted with different ANN architectures, training algorithms, parameter settings and training features. After training, the ANN models are validated by comparing their error functions, and the architecture with the lowest error is selected as the one with the best estimated parameters; this ANN is then deployed in the large-scale experiment. The second is the application of nature-inspired meta-heuristic algorithms, such as the genetic algorithm, differential evolution, artificial bee colony, and particle swarm optimization, to train the ANN in search of the best estimated parameters, such as weights, number of hidden neurons, and biases. These algorithms start by generating a population of ANNs as initial solutions; the generated ANN models then compete among themselves, and the ANN with the optimum objective function survives to the end and is selected as the one with the best estimated parameters.
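
As an illustration of the second approach, the following minimal Python sketch uses SciPy's differential evolution to search for the weights of a tiny ANN. The network size, the XOR data and the search bounds are illustrative assumptions, not taken from any of the surveyed projects.

```python
# A minimal sketch of estimating ANN weights with a nature-inspired
# meta-heuristic, here SciPy's differential evolution on a toy XOR task.
import numpy as np
from scipy.optimize import differential_evolution

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
T = np.array([0.0, 1.0, 1.0, 0.0])                           # toy targets

H = 3  # number of hidden neurons (a hyperparameter chosen in advance)

def unpack(w):
    # Split the flat candidate vector into layer weights and biases.
    W1 = w[:2 * H].reshape(2, H); b1 = w[2 * H:3 * H]
    W2 = w[3 * H:4 * H];          b2 = w[4 * H]
    return W1, b1, W2, b2

def error(w):
    # Objective function: mean squared error of the candidate ANN.
    W1, b1, W2, b2 = unpack(w)
    hidden = np.tanh(X @ W1 + b1)
    output = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    return np.mean((T - output) ** 2)

n_params = 4 * H + 1
result = differential_evolution(error, bounds=[(-5, 5)] * n_params, seed=0)
print("best error:", result.fun)  # objective value of the surviving candidate
```

Each candidate weight vector in the evolving population is one ANN; the candidate with the lowest error function survives, mirroring the competition among generated ANN models described above.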

3.2 Convolutional neural network

The convolutional neural network (ConvNet), shown in Fig. 2, is built around a mathematical operation named 'convolution'; in practice, a convolutional layer applies many convolutions in parallel. The ConvNet is thus a special type of network that uses convolution, a linear operation, instead of general matrix multiplication in its layers. In ConvNet terminology, the first argument of the convolution is the input, the second is the kernel, and the output is termed the feature map.

Fig. 2 A typical convolutional neural network

The input is a multidimensional array of data and the kernel is a multidimensional array of parameters; both are referred to as tensors, and the elements of the input and the kernel are stored separately. The key motivation of the ConvNet rests on three essential elements: sparse interactions, parameter sharing and equivariant representations. Sparse interaction is achieved by making the kernel smaller than the input; parameter sharing means the same kernel parameters are reused across the whole input; and equivariant representation means that if the input changes, the output changes in the same way [6].

A ConvNet is built from a small number of complex layers, each of which has several stages. These layers can be viewed as an input layer, an output layer and several hidden layers, with the hidden layers further categorized into convolutional, pooling, fully connected and normalization layers. A typical convolutional layer consists of three stages: the first stage executes several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is passed through a nonlinear activation function (also referred to as the detector stage). The third stage uses a pooling function to modify the output of the layer. Pooling improves the computational and statistical efficiency of the network and reduces the amount of storage needed for the parameters, because it uses fewer pooling units than detector units. Each neuron in the ConvNet has a "receptive field", the region of the input it is connected to. Regularization modifies the algorithm to reduce the generalization error without affecting the training error, mitigating underfitting and overfitting [6]. Each neuron in a feature map is connected to its receptive field in the preceding layer. A new feature map is produced by convolving the input with a learned kernel, and the convolved result is passed through a nonlinear activation function. The kernel is shared across all spatial locations of the input to generate each feature map, and different kernels are applied to obtain the complete set of feature maps. The feature value at each location in the feature map is computed as follows:

$$ z_{i,j,k}^{l} = w_{k}^{l\,T} x_{i,j}^{l} + b_{k}^{l} $$

where \( z_{i,j,k}^{l} \) is the feature value at location \( (i,j) \) in the kth feature map of the lth layer. The \( w_{k}^{l} \) is the weight vector and \( b_{k}^{l} \) is the bias term of the kth filter of the lth layer. The input patch centered at location \( (i,j) \) is \( x_{i,j}^{l} \). The kernel \( w_{k}^{l} \) that generates the feature map is shared; this weight-sharing mechanism reduces the complexity of the ConvNet and speeds up its training. The nonlinearity of the ConvNet is introduced by the activation function, enabling the detection of nonlinear features. The activation of the convolutional feature can be expressed as:

$$ a_{i,j,k}^{l} = a\left( {z_{i,j,k}^{l} } \right) $$
$$ a_{i,j,k}^{l} = a\left( {w_{k}^{l\,T} x_{i,j}^{l} + b_{k}^{l} } \right) $$

where \( a_{i,j,k}^{l} \) is the activation value of the convolutional feature. The activation function is typically the sigmoid, tanh or ReLU; the softmax is mostly used at the output layer to solve classification problems. Pooling achieves shift invariance and is usually located between two convolutional layers.

The pooling function for each of the feature map is computed as:

$$ y_{i,j,k}^{l} = {\text{pool}}\left( {a_{m,n.k}^{l} } \right),\,\forall \left( {m,n} \right) \in \Re_{ij} $$

The local neighborhood around location \( (i,j) \) is \( \Re_{ij} \). Several pooling functions exist, but the most commonly used in the literature are max pooling and average pooling. The kernels in the first convolutional layer detect low-level features such as edges and curves; higher-level features are extracted by the stack of subsequent convolutional and pooling layers.
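
The following minimal NumPy sketch ties these stages together: the convolution of the feature-value equation above, a ReLU detector, and max pooling. The input, kernel and bias values are illustrative assumptions.

```python
# A minimal NumPy sketch of the three stages of a convolutional layer:
# convolution (z = w^T x + b over each patch), ReLU detector, max pooling.
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)   # input feature map
w = np.array([[1.0, -1.0], [0.5, 0.25]])       # shared kernel (weights)
b = 0.1                                        # bias term

def convolve(x, w, b):
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # z_{i,j} = w^T x_{i,j} + b over the local input patch
            out[i, j] = np.sum(w * x[i:i + kh, j:j + kw]) + b
    return out

z = convolve(x, w, b)
a = np.maximum(z, 0.0)                         # detector stage: ReLU

def max_pool(a, size=2):
    h, w_ = a.shape[0] // size, a.shape[1] // size
    # y_{i,j} = max over the local neighborhood R_{ij}
    return a[:h * size, :w_ * size].reshape(h, size, w_, size).max(axis=(1, 3))

print(max_pool(a))
```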

The loss function of the ConvNet can be computed as follows:

$$ \tau = \frac{1}{N}\sum\limits_{n = 1}^{N} {t\left( {\theta ;y^{(n)} ,o^{(n)} } \right)} $$

where \( N \) is the number of desired input–output pairs \( \left\{ {\left( {x^{(n)} ,y^{(n)} } \right);n \in \left[ {1, \ldots ,N} \right]} \right\} \), \( x^{(n)} \) is the nth input sequence, \( y^{(n)} \) is the corresponding target label, \( \theta \) denotes the parameters of the network, and \( o^{(n)} \) is the output of the ConvNet. The best \( \theta \) is obtained by minimizing the loss function through training [16].

3.3 Recurrent neural network

The recurrent neural network (RNN) is a general and efficient network for sequence learning [17]. RNNs are dynamical systems with temporal state structures. They are relatively powerful and play a vital role in various temporal processing scenarios and digital applications. Popular RNN models include the Hopfield model, which stores information in a dynamical standard feature, and the Cohen–Grossberg model, used as an associative memory for storing information and solving optimization problems. RNNs can be categorized into global and local: a global RNN allows feedback connections between any neurons, while a local RNN restricts feedback to within dynamic neuron models arranged in a feedforward structure. Two static time models exist for RNNs, the time-delayed RNN and the simultaneous RNN; the first is trained to reduce prediction error, and the second is trained for general function approximation [4]. The deep RNN comprises multiple hidden layers, as shown in Fig. 3, unlike the shallow RNN.

Fig. 3 Deep recurrent neural network

The RNN is a dynamic system that has input and output; the dynamic system can be expressed as:

$$ h_{t} = f_{h} \left( {x_{t} ,h_{t - 1} } \right) $$
$$ y_{t} = f_{o} \left( {h_{t} } \right) $$

where \( x_t \) is the input, \( y_t \) is the output and \( h_t \) is the hidden state. \( f_h \) is the state transition function, parameterized by \( \theta_h \), and \( f_o \) is the output function, parameterized by \( \theta_o \).

The training set of \( N \) sequences is \( D = \left\{ {\left( {X_{1}^{(n)} ,\,y_{1}^{(n)} } \right), \ldots ,\left( {X_{T_n}^{(n)} ,\,y_{T_n}^{(n)} } \right)} \right\}_{n = 1}^{N}. \)

Optimizing the RNN parameters can be performed by minimizing the objective function expressed as:

$$ J(\theta ) = \frac{1}{N}\sum\limits_{n = 1}^{N} {\sum\limits_{t = 1}^{T_n} {d\left( {y_{t}^{(n)} ,f_{o} \left( {h_{t}^{(n)} } \right)} \right)} } $$

where \( h_{t}^{(n)} = f_{h} \left( {x_{t}^{(n)} ,h_{t - 1}^{(n)} } \right) \), \( h_{0}^{(n)} = 0 \), and \( d\left( {a,b} \right) \) is a predefined divergence between \( a \) and \( b \).

The construction of the RNN is accomplished by the following transition and output functions:

$$ h_{t} = \, f_{h} \left( {x_{t} ,h_{t - 1} } \right) = \phi_{h} \left( {W^{T} h_{t - 1} + \, U^{T} x_{t} } \right) $$
$$ y_{t} = f_{o} \left( {h_{t} ,x_{t} } \right) = \phi_{o} \left( {V^{T} h_{t} } \right) $$

where \( W \) is the transition matrix, \( U \) the input matrix and \( V \) the output matrix. \( \phi_{h} \) and \( \phi_{o} \) are element-wise nonlinear functions; the hidden layer function is typically the sigmoid [18].
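
The following minimal NumPy sketch implements the transition and output functions above for a single sequence; the dimensions, the tanh choice for \( \phi_h \) and the softmax choice for \( \phi_o \) are illustrative assumptions.

```python
# A minimal sketch of the RNN forward pass:
# h_t = phi_h(W^T h_{t-1} + U^T x_t), y_t = phi_o(V^T h_t).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
U = rng.standard_normal((n_in, n_hidden)) * 0.1      # input matrix U
W = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # transition matrix W
V = rng.standard_normal((n_hidden, n_out)) * 0.1     # output matrix V

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    h = np.zeros(n_hidden)  # h_0 = 0, as in the training-set definition
    ys = []
    for x_t in xs:
        h = np.tanh(W.T @ h + U.T @ x_t)  # element-wise nonlinearity phi_h
        ys.append(softmax(V.T @ h))       # output function phi_o
    return ys

sequence = [rng.standard_normal(n_in) for _ in range(5)]
print(rnn_forward(sequence)[-1])  # output distribution at the last step
```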

3.4 Deep belief network

A deep belief network (DBN), shown in Fig. 4, can be built as a stack of restricted Boltzmann machines (RBMs), in which the hidden layer of one RBM in the stack feeds the visible layer of the subsequent RBM. The RBMs are trained one after the other, beginning with the first RBM in the stack. When training of the first RBM is done, features of the input are produced at its hidden layer; these features are passed on to the next RBM in the stack, which is then trained. This process continues until the last RBM in the stack is reached. The training of each RBM proceeds through positive and negative phases, and the DBN picks up the features of the original data in the layers of the network. A DBN is a directed acyclic graph composed of stochastic variables. Features of the DBN include its simplicity and biologically plausible form, and it can find solutions to various problems [19].

Fig. 4 Structure of deep belief network

The RBM is applied as a pre-training technique for training each layer of the DBN. The units of the visible and hidden layers of an RBM carry only two values, 0 and 1, so the visible vector is \( vis \in \{0,1\}^{D} \) and the hidden vector is \( hid \in \{0,1\}^{F} \). The energy of a joint configuration of the visible and hidden nodes is expressed as follows:

$$ E\left( {vis,hid;\theta } \right) = - \sum\limits_{i = 1}^{D} {b_{i} vis_{i} } - \sum\limits_{j = 1}^{F} {a_{j} hid_{j} } - \sum\limits_{i = 1}^{D} {\sum\limits_{j = 1}^{F} {w_{ij} vis_{i} hid_{j} } } $$

From the equation, \( w_{ij} \) is the weight connecting visible node \( vis_{i} \) to hidden node \( hid_{j} \), and the biases of the visible and hidden nodes are \( b_{i} \) and \( a_{j} \), respectively. The joint distribution of the network is expressed as:

$$ p\left( {vis,hid;\theta } \right) = \frac{1}{Z(\theta )}\exp \left( { - E\left( {vis,hid;\theta } \right)} \right) $$
$$ Z(\theta ) = \sum\limits_{vis} {\sum\limits_{hid} {\exp \left( { - E\left( {vis,hid;\theta } \right)} \right)} } $$
$$ p\left( {vis,hid;\theta } \right) = \frac{{\exp \left( { - E\left( {vis,hid;\theta } \right)} \right)}}{{\sum\nolimits_{vis} {\sum\nolimits_{hid} {\exp \left( { - E\left( {vis,hid;\theta } \right)} \right)} } }} $$

where \( Z(\theta ) \) is the normalizing constant. The energy function can be used to obtain the probability of an input vector; the energy is minimized by improving the probability with respect to \( \theta \). The conditional distributions of the hidden and visible nodes can be expressed through the logistic function as follows:

$$ p\left( {hid_{j} = 1|vis} \right) = g\left( {\sum\limits_{i = 1}^{D} {W_{ij} vis_{i} + a_{j} } } \right) $$
$$ p\left( {vis_{i} = 1|hid} \right) = g\left( {\sum\limits_{j = 1}^{F} {W_{ij} hid_{j} + b_{i} } } \right) $$
$$ g(x) = \frac{1}{1 + \exp ( - x)} $$

Once the hidden states have been sampled, the input vector is reconstructed by setting each \( vis_{i} \) to 1 with its conditional probability. Subsequently, the hidden states are updated again, which represents the reconstruction of the network features. The network is trained using the contrastive divergence technique, which updates the weights accordingly [21].
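
The following minimal NumPy sketch shows one contrastive divergence (CD-1) update for a single binary RBM, following the conditional distributions above; the layer sizes, learning rate and random training vector are illustrative assumptions.

```python
# A minimal sketch of one CD-1 update for a binary RBM, using the
# conditionals p(hid_j = 1 | vis) and p(vis_i = 1 | hid) given above.
import numpy as np

rng = np.random.default_rng(0)
D, F, lr = 6, 4, 0.1                      # visible units, hidden units, step
W = rng.standard_normal((D, F)) * 0.01    # weights w_ij
b = np.zeros(D)                           # visible biases b_i
a = np.zeros(F)                           # hidden biases a_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

vis0 = (rng.random(D) < 0.5).astype(float)   # a binary training vector

# Positive phase: sample hidden states from p(hid | vis).
p_h0 = sigmoid(vis0 @ W + a)
hid0 = (rng.random(F) < p_h0).astype(float)

# Negative phase: reconstruct the visible vector, then resample the hiddens.
p_v1 = sigmoid(W @ hid0 + b)
vis1 = (rng.random(D) < p_v1).astype(float)
p_h1 = sigmoid(vis1 @ W + a)

# CD-1 update: difference of positive and negative correlations.
W += lr * (np.outer(vis0, p_h0) - np.outer(vis1, p_h1))
b += lr * (vis0 - vis1)
a += lr * (p_h0 - p_h1)
```

In a DBN, this update would be repeated over the training data for the first RBM, whose hidden activations then become the visible data for the next RBM in the stack.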

3.5 Deep reinforcement learning

Reinforcement learning involves an agent that operates in an environment and takes actions with the goal of maximizing rewards, enhancing the efficiency of the learning algorithm [22]. Figure 5 shows the block diagram of reinforcement learning.

Fig. 5 Reinforcement learning

In the reinforcement learning setting, an agent interacts with the environment \( \varepsilon \) over a number of discrete time steps. At each time step \( t \), the agent receives a state \( s_t \) and selects an action \( a_t \) from the set of possible actions \( A \) according to its policy \( \pi \), where \( \pi \) is a mapping from states \( s_t \) to actions \( a_t \). In return, the agent receives the next state \( s_{t+1} \) and a scalar reward \( r_t \). The process continues until the agent reaches a terminal state, after which the process restarts. The accumulated return from time step \( t \) with discount factor \( \gamma \in (0,\,1] \) is \( R_{t} = \sum\nolimits_{k = 0}^{\infty } {\gamma^{k} r_{t + k} } \). The goal of the agent is to maximize the expected return from each state \( s_t \).

The action value \( Q^{\pi } (s,\,a) = {E}[R_{t} |s_{t} = s,\,a] \) is the expected return for selecting action \( a \) in state \( s \) and following policy \( \pi \). The optimal action value function \( Q^{*} (s,\,a) = \max_{\pi } Q^{\pi } (s,\,a) \) gives the maximum action value for \( s \) and \( a \) achievable by any policy. The value of state \( s \) under \( \pi \) is defined as \( V^{\pi } (s) = {E}[R_{t} |s_{t} = s] \), the expected return for following \( \pi \) from state \( s \).

In the value-based, model-free reinforcement learning approach, the action value function is represented by a function approximator such as an ANN.

The approximate action value function with parameters \( \theta \) is written \( Q(s,\,a;\,\theta ) \). A variety of reinforcement learning algorithms, such as Q-learning, can be used to update \( \theta \); Q-learning aims to directly approximate the optimal action value function:

$$ Q^{*} \left( {s,a} \right) \approx Q\left( {s,a;\theta } \right) $$

In single-step Q-learning, the parameters \( \theta \) of \( Q(s,\,a;\,\theta ) \) are learned by iteratively minimizing a sequence of loss functions, expressed as \( L_{i} (\theta_{i} ) = {E}\left( r + \gamma \mathop {\hbox{max} }\nolimits_{{a^{\prime}}} Q(s^{\prime},\,a^{\prime};\,\theta_{i - 1} ) - Q(s,\,a;\,\theta_{i} )\right)^{2} \), where \( s^{\prime} \) is the state encountered after state \( s \) [23]. The combination of the principles of reinforcement learning and deep neural networks produces a powerful algorithm referred to as deep reinforcement learning (DeepRL) [24].
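
The following minimal Python sketch shows the one-step Q-learning update in its tabular form; a DeepRL agent would replace the table with a network approximator \( Q(s,a;\theta) \). The tiny chain environment and the learning constants are illustrative assumptions.

```python
# A minimal sketch of tabular one-step Q-learning on a hypothetical
# 5-state chain: action 1 moves right, action 0 moves left; reaching
# the last state yields reward 1 and terminates the episode.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.9, 0.5, 0.3     # discount, step size, exploration
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy policy pi derived from the current Q estimates
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])  # minimize (target - Q)^2
        s = s_next

print(Q)  # learned action values; action 1 should dominate along the chain
```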

3.6 Artificial neural network

In an ANN, the processing elements are organized in layers. Neurons in one layer accept signals from neurons in the previous layer and send signals to neurons in the next layer; there are no connections between neurons in the same layer. The total input to each node is the weighted sum of the outputs of the nodes in the previous layer, and each node is activated based on its input and its activation function [4]. Backpropagation (BP) [25] is an optimization technique commonly used to train the ANN parameters for better performance. BP is a supervised learning technique that uses gradient descent on an error variable; the error is computed by comparing the output values with target values [26]. BP explains the training of the multilayer perceptron by gradient descent, evaluating the derivatives of the error function with respect to the weights in the network using an efficient computational technique [19]. BP is the most popular gradient-descent-based algorithm; variants derived from entropy criteria minimize the training time and reduce the problem of getting trapped in local minima by reducing the density of local minima [27]. Figure 6 depicts the typical structure of an ANN.

Fig. 6 Structure of artificial neural network

BP is a gradient descent algorithm that minimizes an objective function, namely the error function. Let the training data be \( (I_{1} ,T_{1} ), \ldots ,(I_{n} ,T_{n} ), \) where \( I_{s} \) is the input and \( T_{s} \) is the target, \( 1 \le s \le n \). The objective function, the least-squares error over the ANN weights, is expressed as:

$$ E = \frac{1}{{nZ_{M} }}\sum\limits_{s = 1}^{n} {\left[ {T_{s} - O_{s}^{M} } \right]^{\tau } \left[ {T_{s} - O_{s}^{M} } \right]} $$

where \( Z_{M} \) is the number of neurons in the output layer and \( O_{s}^{M} \) is the output vector of the ANN for input \( I_{s} \). Let \( w \) be the vector formed by all the weights in the ANN, and let \( \nabla E(w(k)) \) be the gradient of \( E \) at \( w = w(k) \), where \( k = 1,2,3, \ldots \) indexes the iterations of the weight update.

The weight update with a momentum term can be expressed as:

$$ \Delta w(k) = \alpha \left( { - \nabla E\left( {w(k)} \right)} \right) + \beta \Delta w\left( {k - 1} \right) $$

where the learning rate and the momentum factor are \( \alpha \) and \( \beta \), respectively, and \( \Delta w(k) = w\left( {k + 1} \right) - w(k) \).

The computation of the ANN output, with the input vector fed to the input layer, is expressed as:

$$ o_{s,i}^{m} = f\left( {\left[ {w_{i}^{m} \left( {k + 1} \right)} \right]^{\tau } O_{s}^{m - 1} } \right) $$

where \( o_{s,i}^{m} \) is the ith output of ANN layer \( m \), \( 1 \le m \le M \). The activation function is \( f( \cdot ) \), and \( w_{i}^{m} (k + 1) \) is the sub-vector of \( w(k + 1) \) comprising all weights from the neurons of layer \( m-1 \) to \( o_{s,i}^{m} \). The vector formed by all the outputs of layer \( m-1 \) is \( O_{s}^{m - 1} \), which includes a unity output that serves as the reference for the bias of the subsequent layer and is expressed as [28]:

$$ O_{s}^{m - 1} = \left\{ {\begin{array}{*{20}l} {\left[ {1\;O_{s,1}^{m - 1} \ldots O_{{s,z_{m} }}^{m - 1} } \right]^{\tau } } \hfill & {{\text{for}}\;m > 1,} \hfill \\ {\left[ {1\;I_{s}^{\tau } } \right]^{\tau } } \hfill & {{\text{for}}\;m = 1.} \hfill \\ \end{array} } \right. $$

The deep ANN (D-ANN) architecture differs from the ANN architecture of Fig. 6, which has a single hidden layer: a deep neural network has multiple hidden layers, typically two or more.
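
The following minimal NumPy sketch shows backpropagation with the momentum update \( \Delta w(k) = \alpha ( - \nabla E(w(k))) + \beta \Delta w(k - 1) \) for a single-hidden-layer ANN; the XOR data, network size and learning constants are illustrative assumptions, and the unity bias inputs mirror the reference output described above.

```python
# A minimal sketch of batch backpropagation with momentum on a toy XOR task.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
Xb = np.hstack([X, np.ones((4, 1))])       # append unity input for the bias

W1 = rng.standard_normal((3, 3)) * 0.5     # input (+bias) -> hidden weights
W2 = rng.standard_normal((4, 1)) * 0.5     # hidden (+bias) -> output weights
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
alpha, beta = 0.5, 0.9                     # learning rate and momentum factor

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    # Forward pass through the hidden and output layers
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([H, np.ones((4, 1))])   # unity reference for output bias
    O = sigmoid(Hb @ W2)
    # Backward pass: gradient of the squared-error function E
    delta_o = (O - T) * O * (1 - O)
    delta_h = (delta_o @ W2[:3].T) * H * (1 - H)
    gW2 = Hb.T @ delta_o
    gW1 = Xb.T @ delta_h
    # Momentum updates: delta_w(k) = alpha * (-grad) + beta * delta_w(k-1)
    dW2 = alpha * (-gW2) + beta * dW2; W2 += dW2
    dW1 = alpha * (-gW1) + beta * dW1; W1 += dW1

print(np.round(O.ravel(), 2))  # should approach the XOR targets 0, 1, 1, 0
```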

3.7 Support vector machine

The support vector machine (SVM), shown in Fig. 7, implements the structural risk minimization principle, which reduces the upper bound on the generalization error. The learning process for building an SVM is fast; its parameters can be handled in different cases with a unique solution that reduces the empirical risk, and it is immune to local minima. The central concept of the SVM is to map the training data from the input space into a higher-dimensional feature space by means of a set of nonlinear kernel functions, and to construct the separating surface in that feature space [4]. The SVM is fundamentally a two-class classifier; for multi-class problems, it classifies the data of one class as positive and groups the data of all the other classes as negative, which is called the one-versus-the-rest approach [29].

Fig. 7 A sample of support vector machine

The SVM is a decision machine that solves classification, clustering and regression problems, and can also address unsupervised learning problems related to probability density estimation. The number of basis functions in the model is relatively small compared with the number of training points, and growing the training set can yield probabilistic results [30]. The description of the SVM starts with two linearly separable classes. To illustrate its operation, assume a data set \( D = \{ (X_{i} ,y_{i} )\}_{i = 1}^{\ell } \) where \( y_{i} \in \{ - 1, + 1\} \). Among the classifiers that separate the data set, we want to determine the one that will produce the lowest generalization error. Intuitively, a good choice is the hyperplane that leaves the maximum margin between the two classes. When the two classes are not separable, one considers the hyperplane that maximizes the margin while minimizing the misclassification error. A positive constant \( C \) controls the balance between the margin and the misclassification error; its value is chosen before running the SVM. The solution to this problem is a linear classifier expressed as:

$$ f(x) = {\text{sign}}\left( {\sum\limits_{i = 1}^{\ell } {\lambda_{i} y_{i} X^{T} X_{i} + b} } \right) $$

where the coefficients \( \lambda_{i} \) are the solutions of the following quadratic programming problem:

$$ {\text{Minimize}}\;\;W(\Lambda ) = -\Lambda ^{T} 1 + \frac{1}{2}\Lambda ^{T} D\Lambda $$
$$ {\text{Subject}}\;{\text{to}}\quad \begin{array}{*{20}c} {\Lambda ^{T} y\, = \,0} \\ {\Lambda - C1\, \le 0} \\ { -\Lambda \le 0} \\ \end{array} $$

where \( (\Lambda )_{i} = \lambda_{i} ,\;(1)_{i} = 1 \) and \( D_{ij} = y_{i} y_{j} X_{i}^{T} X_{j} \).

Only a small fraction of the coefficients \( \lambda_{i} \) are different from 0. Since every coefficient corresponds to a particular data point, the solution is determined by the data points associated with the nonzero coefficients; these data points are referred to as the support vectors. The support vectors are the only points relevant to the solution, and the other data points can be discarded from the data set. The number of support vectors is typically small, and they normally lie on the border between the classes. The kernel function plays an important role in the performance of the SVM, and its choice is not straightforward; popular kernel functions include the Gaussian RBF, polynomials, and the multi-layer perceptron kernel [32].
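
As a brief illustration, the following sketch trains two-class SVMs with the kernels mentioned above using the scikit-learn library; the toy Gaussian data and the value of C are illustrative assumptions.

```python
# A minimal sketch of a two-class SVM with different kernel functions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)     # labels y_i in {-1, +1}

for kernel in ("linear", "rbf", "poly"):  # linear, Gaussian RBF, polynomial
    clf = SVC(kernel=kernel, C=1.0)       # C trades margin vs. misclassification
    clf.fit(X, y)
    # Only the support vectors determine the decision function f(x).
    print(kernel, "support vectors:", len(clf.support_vectors_))
```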

3.8 Estimation of deep learning architecture hyperparameters

The hyperparameters of a deep learning architecture, together with its regularization and optimization settings, are critical in determining the performance of deep learning algorithms; in other words, the performance of a deep learning algorithm is sensitive to its hyperparameter settings, and the best settings vary across datasets. In practice, hyperparameter optimization is a challenging task for the following reasons [33]:

  1. Large models such as deep learning make function evaluation extremely expensive.

  2. The configuration space is frequently complex.

  3. With a limited amount of data, direct optimization for generalization performance cannot be achieved.

  4. Properties of the target function assumed in classical optimization, such as convexity and smoothness, usually do not apply.

Bergstra et al. [34] reported that hyperparameters can take different data types: for example, the learning rate is a real number, the number of layers is an integer, and the training stoppage criterion or the choice of optimizer is binary or categorical.

A systematic approach for determining the exact hyperparameters of deep learning algorithms remains an open research problem; therefore, a number of estimation approaches exist. The first is the manual method: many human experts still depend on a manual procedure for determining model hyperparameters [35]. In the manual approach, preliminary experiments are conducted with different hyperparameter settings of the deep learning algorithm on a small subset of the dataset to estimate the best settings, because increasing the data size increases the complexity of the model, which makes hyperparameter optimization very difficult on large datasets [33]. When setting hyperparameter values manually, a run of the model is terminated prematurely if it is expected to produce poor output, so as to save time [35]. For example, a ConvNet requires the learning rate, momentum, activation function, pooling size, convolution settings, dropout, number of dense layers, and so on, to be set before it can be run on a machine learning problem. During the preliminary experiment, different hyperparameter settings are tried on a small subset of the dataset, and the settings that produce the best result are adopted for the main experiment. The manual method is tedious, expensive and time-consuming. In some cases, the hyperparameters are estimated during the main experiment without preliminary experiments, but this is highly challenging.

The second approach is the application of global optimization algorithms, such as the cuckoo search algorithm, artificial bee colony, GA and firefly algorithms, to automatically estimate the hyperparameters of deep learning algorithms. Any suitable global optimization algorithm can be applied to automatically determine the best estimated hyperparameters, with minimal human intervention. For example, Baldominos et al. [36] applied the GA to automatically estimate the complete architecture of a ConvNet: number of layers, connectivity, batch size, learning rate, learning rule, dropout, weight regularization, activation function, number of neurons in each layer, number of convolutional layers, number of kernels in each convolution, pooling size, etc. Huang et al. [37] applied PSO to estimate the hyperparameters of a DBN by optimizing its number of input neurons, number of hidden neurons and learning rate. Liu et al. [38] used the GA for the optimization of DBN hyperparameters. Similarly, Papa et al. [39] applied the Harmony Search algorithm to estimate the hyperparameters of a Discriminative Restricted Boltzmann Machine.

Thirdly, deep learning hyperparameters can be adopted from a study that solved a relatively similar problem, or from values suggested in the literature. However, there is no guarantee that the adopted hyperparameter settings will produce the desired output on another, similar problem. The default hyperparameter settings provided by a deep learning library can also be adopted for training; however, tuned hyperparameters are mostly found to be better than the defaults provided by machine learning libraries [33].

Fourthly, in grid search (full factorial hyperparameter optimization), the modeler supplies a predefined set of values for each hyperparameter, and grid search evaluates the Cartesian product of these sets. This approach suffers from the curse of dimensionality, since the number of required function evaluations grows exponentially with the number of hyperparameters. Fifthly, in random search, hyperparameter configurations are sampled at random until the search budget is exhausted; it works well when some hyperparameters are more important than others [33]. A sketch of this approach is given below.
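
The following minimal Python sketch shows random search over a mixed hyperparameter space (real-valued learning rate, integer layer count, categorical optimizer); the train_and_score function is a hypothetical stand-in for actually training and validating a deep model.

```python
# A minimal sketch of random search over a mixed hyperparameter space.
import random

random.seed(0)

def train_and_score(lr, n_layers, optimizer):
    # Placeholder objective: in practice this would train the deep model
    # and return a validation score. This made-up function ignores the
    # optimizer choice and simply rewards configurations near lr = 0.01
    # with 4 layers.
    return 1.0 - abs(lr - 0.01) * 10 - abs(n_layers - 4) * 0.05

budget, best = 50, (None, -float("inf"))
for _ in range(budget):                        # search until budget exhausted
    config = {
        "lr": 10 ** random.uniform(-5, -1),    # sample on a log scale
        "n_layers": random.randint(1, 8),
        "optimizer": random.choice(["sgd", "adam", "rmsprop"]),
    }
    score = train_and_score(**config)
    if score > best[1]:
        best = (config, score)

print("best configuration:", best[0])
```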

Sixthly, Bayesian optimization is an optimization framework used for tuning deep learning hyperparameters. It has gained attention in recent times in the tuning of deep learning architectures for image processing, speech recognition and natural language processing [33]. Types of Bayesian optimization include sequential model-based Bayesian optimization [40], surrogate models, configuration-space methods and constrained Bayesian optimization [33]. Ilievski et al. [41] proposed a deterministic surrogate with radial basis functions for deep learning hyperparameter optimization, which requires only a few function evaluations.
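
As an illustration of sequential model-based optimization in practice, the following sketch uses the Optuna library (assumed to be installed; its default sampler is a tree-structured Parzen estimator); the objective function is a hypothetical stand-in for model training.

```python
# A minimal sketch of sequential model-based hyperparameter optimization
# with Optuna; replace the made-up objective with real model training.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 8)
    # In practice: build and train the deep model, return validation score.
    return -((lr - 0.01) ** 2) - 0.01 * (n_layers - 4) ** 2

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)   # each trial is informed by the past
print(study.best_params)
```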

3.9 Motivation and advantages of deep learning in smart city over traditional techniques

The motivation for and benefits of deep learning over traditional approaches were presented in Sect. 3.1. In the context of the SC, deep learning has recently gained popularity because of its ability to handle large-scale datasets with higher accuracy than traditional techniques. The data generated in the SC are complex and very large, because they come from many different devices and can include images, videos, text and speech, in structured, semi-structured and unstructured forms. Such data can suitably be analyzed with deep learning, depending on the data analytics problem to be solved in the SC. Deep learning can be applied to classification, clustering and regression, and tasks such as object recognition and speech recognition can be accomplished in the SC using deep learning architectures. As discussed in Sect. 3.1, these categories of data can be processed better with deep learning than with traditional techniques. For example, Amato et al. [42] used a ConvNet for the classification of parking-space images for smart cameras and found that the ConvNet outperforms the traditional technique. A similar study by Valipour et al. [43] shows that the ConvNet performs better than the traditional ANN. Mynhoff et al. [44] present a comparative analysis between deep learning methods and traditional techniques (Hidden Markov models, SVM and Factored Hidden Markov models) for predicting electrical energy consumption in an SC environment; the analysis of day-ahead and week-ahead energy prediction demonstrates that the prediction methods achieve significantly different levels of accuracy, with the DBN offering the most consistent performance over the compared traditional techniques. Similarly, recurrent neural networks have shown promising performance in natural language processing and language modeling tasks compared with traditional machine learning algorithms [45]. Van der Veen et al. [46] used a 3D ConvNet to semi-automate the delineation of organs at risk (OARs) and validated it with respect to delineation accuracy; the results show that automated ConvNet OAR delineation in head and neck cancer is more efficient and consistent than manual delineation.

Deep learning has the capability to effectively recognize objects and speech within the SC environment. The data acquired from the SC mean nothing if they are not processed to uncover the important new information generated by the different devices in the SC; that is where deep learning plays a critical role, by using the data to solve classification, regression and clustering problems. Since deep learning is a machine learning approach, it has to learn from a data set in order to make inferences. So, if deep learning is to be used in the SC for any of the problems listed above, the data have to be acquired in line with the problem to be solved.

The domain of application within the SC also matters, because a deep learning model needs data from the domain in which it is intended to solve a problem. For example, deep learning can perform passenger prediction only if it is trained with data from the transportation domain of the SC. Similarly, applying deep learning to problems in energy, rail systems, air quality, traffic flow, sewage, accidents, health or agriculture will require a different data set for each domain of application. Data analytics in the SC will help decision makers to fully understand effective and efficient ways of planning, sharing resources, formulating regulatory frameworks, taking critical decisions and monitoring the SC.

4 The smart city initiative and architecture

The SC was initially referred to as a digital city, a term that was gradually replaced by SC [1]. A digital city is a connected community that combines broadband communications with a flexible, service-oriented computing infrastructure based on open industry standards; it also encompasses innovative services to meet the needs of governments and their employees, citizens and businesses [47]. Figure 8 shows the layers of the SC, in which several technologies are connected to illustrate the relationships among sensor devices, wireless sensor networks and the city infrastructure. The first layer presents the interactions between the various city infrastructures and sensor devices, demonstrating how sensors are embedded within the city. The second layer illustrates how sensor networks, the Internet, and cloud services are used for smart applications.

Fig. 8 The basic architecture of the smart city

With the rapid development of science and technology, the world is becoming "smart". Living in such a smart world [3], people will automatically work together with smart devices (e.g., watches and mobile phones), computers, smart transportation (e.g., cars, buses and trains) and smart environments (e.g., homes, offices and factories). The key notion of the SC is to deliver the right information at the right place on the right device, making the city connected with simplicity [48]. A SC uses ICT to lift its livability, workability and sustainability [1]. The city looks after the condition of its basic facilities and services, such as roads, bridges, subways, tunnels, communications, airports, seaports and buildings, to improve its resources and security. Giffinger and Pichler-Milanović [49] identified four areas in realizing the SC: industry, education, participation, and technical infrastructure. Sullivan [50] explored six dimensions of the SC and their related aspects of urban life: smart economy (industry), smart people (education), smart government (e-democracy), smart mobility (logistics and infrastructure), smart environment (efficiency and sustainability) and smart living (security and quality).

From a technological perspective, Eremia et al. [1] defined the SC as a city with a high level of ICT achievement, spread through the commercial application of intelligent products and services, artificial intelligence, and thinking machines. Smart homes and smart buildings are examples of systems equipped with numerous mobile terminals and embedded devices, including connected sensors and actuators. The SC becomes smarter depending on the nature of the digital technology required by the different electronic services used by the various applications, such as street cameras for surveillance systems and sensors for transportation systems. Several initiatives also utilize objects to provide value-added services, such as Google Street View and the global positioning system (GPS) [48].

The common characteristics of SC as reported in Nam and Pardo [51] include:

  • The utilization of networked infrastructure to improve economic and political efficiency and enable social, cultural, and urban development.

  • An underlying emphasis on business-led urban development.

  • A strong focus on the aim of achieving the social inclusion of various urban residents in public services.

  • A stress on the crucial role of high-tech and creative industries in long-run urban growth.

  • Profound attention to the role of social and relational capital in urban development.

  • Social and environmental sustainability as a major strategic component of the SC.

Zygiaris [52] identified six layers of SC as presented in Table 2.

Table 2 The six layers of Smart cities with their corresponding concept

Sullivan [50] identified the key parameters that will shape the concept of the SC in 2020: smart energy, smart buildings, smart mobility, smart technology, smart healthcare, smart infrastructure, smart governance and education, smart security and smart citizens. Bergh and Viaene [53] highlight the key challenges for the SC: ecosystem positioning, the locus of SC leadership, internal coordination mechanisms, business–IT alignment, shaping organizational culture, and going beyond experiments. Therefore, cities should found their SC models on three main pillars: infrastructure, human capital and information [54].

At layer 0 of the SC, the operations and processes of the conventional city are tailored toward achieving the SC initiative by responding to the city's challenges. These challenges include embedding urban planning with smart operations, infrastructure intervention, and identifying smart priorities compatible with city planning and innovation. In layer 1, the challenges outlined in layer 0 are channeled into the green city priority, which requires innovation in green governance and the integration of policies and financial resources to develop an appropriate and sustainable green urban ecosystem; the challenge for policy makers at this layer is the reduction of CO2 emissions and the adoption of alternative energy sources. Applications in any SC layer should be able to intercommunicate through layer 4 to share data, services, content, etc. The SC monitoring processes have the ability to moderate, integrate and ensure the availability of digital resources in the SC. For instance, layer 4 provides the enabling environment for platforms to reside on the fiber-optic network of layer 2, and water pipelines embedded with sensors at layer 3 supply real-time data to optimal water-usage applications in layer 5. The effective and efficient integration of different SC applications for openness of resources is a challenge. The real-time devices in the SC are connected to the fiber optics through the layer 2 interconnection and monitored through the layer 4 integration layer; subsequently, layer 5 nourishes the intelligent applications with real-time data. Lastly, layer 6 of the SC creates an enabling environment for new business opportunities and makes the SC fertile for new innovations [52].

The interaction of the SC layers from layer 0 to layer 6 clearly shows that data can be generated, acquired and analyzed. Deep learning algorithms can be applied for data analytics on the data acquired from the SC as a result of the interaction of the various layers. Analyzing these data with deep learning can provide insight and uncover new knowledge, which can enhance decision making and in turn improve the efficiency of the SC in general.

For example, the results of applying deep learning for data analytics in layer 1 can be used to plan the reduction of CO2 emissions, based on deep learning predictions of CO2 emission and energy consumption, and thus improve energy efficiency. Deep learning analysis of data from layer 2 has the potential to indicate where the deployment of smart systems is needed. In layer 4, deep learning analytics can provide information on where resources are concentrated, which can help policy makers devise ways to diversify them.

Real-time data can be analyzed with deep learning to adapt the intelligent applications in layer 5. Deep learning analytics results can help decision makers formulate policy and regulatory frameworks for digital inclusion in layer 0, thereby improving community involvement in the SC initiative. Deep learning analytics output in layer 3 can give decision makers insight into the usage of communication technologies, such as, but not limited to, sensors, RFID and actuators, and thus help them take decisions on the effective sharing of digital resources. When applied to data from business operations in the SC, deep learning can reveal new business opportunities and areas that require innovation in layer 6.

4.1 Data generation from the smart city

As a result of the interconnected devices in the SC, a large amount of data is generated by their activities. Mazhar Rathore et al. [55] pointed out that thousands of interconnected IoT devices communicating with each other over the Internet to establish a smart system result in the generation of a large amount of data; the results of data analytics on these data can serve many purposes and benefit society, business and government authorities. Kök et al. [56] note that with the increasing number of IoT-based SC applications, the amount of data produced by these applications is increasing tremendously; governments and city stakeholders must take early precautions to process these data and predict future effects to ensure sustainable development. Smart meters have become one of the main components of SC strategies: they generate large amounts of fine-grained data that provide useful information to consumers and utility companies for decision making [57]. At the core of all this is the collection, management, analysis and visualization of the large amount of data generated every minute in an urban environment by socioeconomic activities [4].

4.1.1 Big data in smart city

With the adoption of IoT in urban cities, the amount of data generated by smart city applications is rising, but companies and governments utilize only a part of it, leaving room to improve the efficiency of city services. AI-based analytic applications are rapidly offering a wide range of opportunities to individuals and industrial organizations. Such applications can be classified into smart homes, smart transportation, smart healthcare, smart waste management, etc.; they are a key requirement nowadays and can be considered a primary source of big data generation. Kitchin [58] provided a review of the use of digital devices in the SC to produce big data; the data collected through these devices enable real-time analytics, new modes of urban governance, and more efficient sustainability. Moreover, the author critically reflected on the implications of big data and smart urbanism, examining five emerging concerns. Al Nuaimi et al. [20] presented the applications of big data in support of the SC, comparing various definitions of the SC and big data to identify some of the opportunities and challenges of big data applications for the SC. Sood et al. [59] used big data and high-performance computing (HPC) to propose a smart flood monitoring and forecasting architecture based on the social collaborative IoT. The study categorizes geographical areas into a web of hexagons for the effective installation of energy-efficient IoT devices, and the authors use a clustering algorithm such as K-means to predict the current state of flooding and the flood rating at any location. Studies by Krieg et al. [60] have shown that gathered urban data can be used to develop a parking system that reduces traffic congestion in cities; the authors developed and implemented SmartPark in the city of San Francisco. The system depends on pervasive Wi-Fi and cellular infrastructure and is capable of providing drivers with real-time parking availability information. Applying big data technologies in the SC allows the storage and processing of the data to produce new information, which can subsequently be utilized to improve the distribution of services in the SC [61].

5 Deep learning applications in smart cities

This section presents deep learning solutions in the SC from various projects, classified by deep learning architecture and shallow ANN. The deep learning architectures include the ConvNet, DBN, DRL, DRNN and hybrids; the shallow ANN architectures include the SVM, BPNN and others. The section comprises Tables 3, 4, 5, 6, 7, 8, 9, 10 and 11, which summarize the projects discussed in each subsection. Each table has 4 columns: the first column gives the references of the project; the second, the algorithm proposed in the project to solve a problem in the SC; the third, the algorithms used to evaluate the effectiveness and efficiency of the proposed algorithm, as is typical practice in machine learning. Whenever an algorithm is proposed to solve a problem, its performance must be compared with classical algorithms to prove its suitability; the comparison shows the advantage, or otherwise, of the proposed algorithm for that particular problem. This is necessary because the performance of machine learning algorithms is not uniform across domains and problems; many factors determine the suitability of an algorithm, such as the nature of the data, the size of the data, the objective function, the computing time, the problem to be tackled, the deadline to deliver results, and hardware availability. The last column gives the contribution of the project, showing the advantage of the proposed algorithm.

Table 3 The summary of the ConvNet application in smart city
Table 4 The summary of the DBN application in smart city
Table 5 The summary of the application of RNN in smart city
Table 6 The summary of the ANN application in smart city
Table 7 The summary of the SVM applications in smart city
Table 8 The summary of the hybrid neural network as applied in smart city
Table 9 The summary of the EANN application in smart city
Table 10 The summary of the DRL application in smart city
Table 11 The summary of the applications of other deep and shallow neural networks in smart city

5.1 Convolutional neural network

The ConvNet has been found to be effective in the SC. For example, Valipour et al. [43] propose a deep ConvNet (D-ConvNet) to detect vacant and occupied slots in smart parking stalls. The D-ConvNet is used in classifying vacant and occupied parking slots. The result shows that the D-ConvNet is more accurate and robust than the ANN. Amato et al. [42] propose a D-ConvNet to classify images of parking space occupancy for smart cameras. The study exploited a large number of ground-truth labeled data to discover discriminative features for each class of objects to be recognized, and built a combined feature extraction and classification model based on the D-ConvNet. The D-ConvNet is applied in classifying images of parking space occupancy. It was found that the D-ConvNet outperforms the SVM.

Amato et al. [62] propose a ConvNet to detect car parking occupancy. The ConvNet is used in classifying the car parking occupancy and shows better generalization capabilities than the SVM in the experimental results. Giyenko et al. [63] propose a ConvNet to predict atmospheric visibility from the weather monitoring system of a SC. The ConvNet has three convolutional layers with max pooling. The result shows that the ConvNet has good accuracy for predicting the atmospheric visibility. Rajput et al. [64] propose a deep ANN (DNN) with convolutional layers (DNN-C) to model human driver capabilities for autonomous taxis in SC. The DNN-C is responsible for navigating the vehicles autonomously and determining whether the road is straight, turning or an intersection; it classifies the road to determine the movement of the autonomous vehicle. The result shows that the proposed model is more robust and efficient than the Canny edge detector. Liu et al. [65] propose a ConvNet to recognize plate numbers and characters without segmentation. The ConvNet derives effective representations of the original image, which enables it to identify the visual rules directly from the original pixels. The ConvNet is applied to train on number plate samples and recognize characters without segmentation. The result shows that this ConvNet has better accuracy than a ConvNet with a dropout layer.
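To make the architecture concrete, the following is a minimal sketch (in Python with tf.keras, not the code of any cited study) of a small ConvNet of the kind described above: three convolutional blocks with max pooling feeding a dense classifier head. The input shape, layer widths and binary vacant/occupied target are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

# Three convolutional blocks with max pooling, then a dense head.
# 64x64 RGB patches and a binary occupied/vacant label are assumptions.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # occupied vs. vacant
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # hypothetical arrays

The convolution and pooling layers perform the feature extraction that would otherwise require manual data engineering, which is the property the studies above exploit.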

Kuang et al. [66] propose a ConvNet to detect pedestrians in images. The ConvNet analyzes different regions of the images to extract feature information and is used in detecting the pedestrians. The result shows that the ConvNet provides a superior detection rate to the recurrent ConvNet (R-ConvNet). Jiang et al. [67] propose a D-ConvNet to detect vehicles in satellite images. The D-ConvNet is applied in classifying vehicles in satellite images and outperforms the superpixel segmentation algorithm. Li et al. [68] propose a ConvNet to predict the road region. The ConvNet captures the local structures of the road and is utilized in predicting the road region. It has been found that the ConvNet has better detection rates than the multiscale structural features and SVM. Aqib et al. [73] propose a D-ConvNet to predict urban traffic behavior in a SC. Traffic data were collected, and flow, speed, occupancy and density were extracted. The D-ConvNet is applied to predict traffic flow. The result indicated that the D-ConvNet outperforms the disaster management methods. Huh and Seo [74] propose a ConvNet to classify shoes, provide automatic storage and recommend shoes in a shoe cabinet for a smart home. The ConvNet was trained on shoe image data and was found to be a good algorithm for the task.

Mauro et al. [123] propose a D-ConvNet to detect parking stall occupancy. The D-ConvNet extracts the parking space and checks whether it is occupied or empty. It was found that the D-ConvNet has higher accuracy than VGG16 (a ConvNet with 16 weight layers known for its strong performance in the 2014 ImageNet competition) and AlexNet. Lwowski et al. [69] propose a D-ConvNet to detect pedestrians in smart communities. The D-ConvNet is used to notify the driver whether a pedestrian is on the road. The result shows that the D-ConvNet outperforms the fast feature pyramids algorithm. Guo et al. [70] propose a ConvNet to detect traffic signals and lights in an area. The ConvNet uses small filter kernels to increase the computation rate and decrease the depth of the feature maps and the number of parameters. The ConvNet is applied in classifying traffic signals and lights in an area. The result shows that the ConvNet outperforms the SVM. Liu et al. [71] propose a ConvNet to detect pedestrian heads and compute their number. The ConvNet performs feature extraction and selection and applies an SVM to count the number of people in the pedestrian crowd. The result shows that the ConvNet outperforms traditional hand-crafted feature methods.


Yeshwanth et al. [72] propose a D-ConvNet to analyze the traffic network to decrease congestion and improve the overall travel experience of passengers. The D-ConvNet classifies the vehicles into the best-resolution region and estimates the traffic density. The D-ConvNet has better accuracy than the background subtraction (BS) method. Chen et al. [96] propose a D-ConvNet to detect vehicles from satellite images. Dividing the maps of the last convolutional layer and the max pooling layer into multiple blocks of variable receptive-field size enables the method to extract features at variable scales. The D-ConvNet is used to detect vehicles from satellite images and significantly outperforms the traditional deep neural network (DNN) on vehicle detection. Zhang et al. [10, 11] propose a D-ConvNet to detect vehicle types in traffic videos. The D-ConvNet extracts features and separates the various types of vehicles into different classes. The result shows that the D-ConvNet has better accuracy than the sparse coding method. The summary of the application of ConvNet in analyzing data extracted from SC is presented in Table 3.


Use case


Recent years have seen an increasing trend in image processing due to the convolutional neural network, which has become a significant approach for handling the large number of images produced by smart cities. An example use case is presented in Lwowski et al. [69], which introduced a pedestrian detection system for smart communities using a deep convolutional neural network. The idea is to overcome the challenges of autonomous driving, search and rescue, surveillance and robotics by recognizing pedestrians. The study is based on fast regional detection cascaded with deep convolutional networks, which enables real-time pedestrian detection. The result shows that, using the convolutional neural network, the system is able to detect pedestrians with 95.7% accuracy at a processing rate of about 15 frames per second on a low-performance system without a graphical processing unit (GPU). Moreover, the convolutional neural network has been consistently used in medical analysis, as reported in Anwar et al. [124]. The study concluded that convolutional neural networks based on deep learning are being utilized in multiple fields of medical image analysis, including segmentation, classification and detection.

5.2 Deep belief networks

The DBN is among the deep learning architectures applied in the SC for data analytics. The DBN is not as heavily used in the SC as the ConvNet. However, it has been applied successfully and found to be effective and efficient. For example, Huang et al. [37] propose a DBN to predict traffic flow. The DBN performs unsupervised feature learning, in which the unit activations learned by one layer act as the data for the next. The DBN is used to predict the traffic flow. It was found that the DBN improved the accuracy of traffic flow prediction over support vector regression (SVR).
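As an illustration of this layer-by-layer idea, the sketch below (scikit-learn, with placeholder data, not the implementation of Huang et al. [37]) stacks two restricted Boltzmann machines, the building blocks of a DBN, as unsupervised feature learners in front of a simple classifier.

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder data: rows of traffic measurements scaled to [0, 1]
# and binary congestion labels; both are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.random((500, 32))
y = rng.integers(0, 2, size=500)

# Each RBM learns features unsupervised; its activations become the
# input ("data") for the next layer, as in DBN pre-training.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20)),
    ("clf", LogisticRegression(max_iter=500)),
])
model.fit(X, y)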


Han and Sohn [75] propose a DBN to cluster zones of the Seoul metropolitan area based on the travel patterns of transit passengers. The DBN separates an input space into a predefined number of clusters, in such a way that distances within clusters are minimized while distances between clusters are maximized in a feature space. The DBN is applied in clustering zones of the Seoul metropolitan area by travel patterns of transit passengers. The result shows that the DBN yields more realistic information than the k-means algorithm.


Use case


Great success has been observed in many studies that utilized the deep belief network approach for applications such as speech recognition and image classification (Dedinec et al. [125]). For instance, Dedinec et al. [125] applied a deep belief network to short-term electricity load forecasting based on Macedonian hourly electricity consumption data for the period 2008–2014. The study shows that the deep belief network provides better results compared to traditional methods used in forecasting electricity. In addition, the deep belief network has been used for monitoring floods in smart cities. Mishra et al. [126] used IoT-enabled cameras and deep learning to monitor gullies and drainages based on image classification. The purpose of the study is to classify blockage images into different class labels based on severity. The results obtained show an improvement in classification accuracy.

5.3 Deep recurrent neural network

The deep structure of the RNN has been used in the smart city for analyzing data to improve the concept of the SC. For instance, Song et al. [76] propose a deep long short-term memory (D-LSTM) network to predict human mobility and transportation mode. The D-LSTM has multiple hidden layers and is utilized in predicting human mobility in the transportation system. The result shows that the D-LSTM is more accurate than the shallow LSTM. Qolomany et al. [79] propose a D-LSTM to predict the number of occupants at a given location and time in a smart building. The D-LSTM is trained using Wi-Fi time-series data and is used in predicting the number of occupants at a given time and location. Results of the experiment indicated that the D-LSTM outperforms the autoregressive integrated moving average (ARIMA). Other forms of the RNN such as the LSTM are also applied in the SC; for example, Kök et al. [56] propose LSTM networks to predict air quality in a SC. The LSTM is a special type of RNN for supervised temporal sequence learning. The proposed algorithm is applied in predicting future values of air quality in the SC. The result shows that the LSTM-based model performs better than the SVR.
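A minimal sketch of such a stacked ("deep") LSTM predictor is shown below (tf.keras; the window length, layer sizes and univariate series are assumptions, not details from the cited papers): past observations in a fixed window are mapped to a one-step-ahead prediction.

import tensorflow as tf
from tensorflow.keras import layers, models

window = 12  # assumed: 12 past observations per input window
model = models.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(64, return_sequences=True),  # first recurrent layer
    layers.LSTM(32),                         # second recurrent layer
    layers.Dense(1),                         # next-step prediction
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_windows, y_next, epochs=20)  # hypothetical arrays of shape (N, window, 1) and (N,)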

Pan [78] proposes an LSTM to predict short-term traffic flow in a SC. The LSTM memorizes long historical input data and automatically determines the optimal time lags desirable in short-term traffic flow prediction. The LSTM is used in predicting the short-term traffic flow. The result shows that the LSTM achieves higher accuracy than the random walk (RW), SVM, single-layer feed-forward NN (FFNN) and stacked autoencoder. Fu et al. [80] propose an LSTM gated RNN (LSTM-G-RNN) to predict short-term traffic flow. The LSTM-G-RNN, an architecture originally developed for language modeling, has the ability to memorize long-term dependencies and is applied in predicting short-term traffic flow. The result shows that the LSTM-G-RNN performs better than the ARIMA. Salvador et al. [82] propose a D-RNN to automatically map a GPS trace onto a transport network. Samples from the space of all possible journeys within the network were acquired to obtain a large amount of data to train the D-RNN, which treats the mapping as a classification problem. The performance of the D-RNN is compared with a heuristic algorithm (HA); the result shows that the D-RNN performs better than the HA. Camero et al. [83] propose a hybrid of RNN and deep neuroevolution (RNN-DNE) to predict the filling level of waste containers in a real city. The RNN is trained with deep neuroevolution to create the RNN-DNE for predicting the waste container filling levels. The performance of the RNN-DNE is compared with linear regression (LR), Gaussian processes (GP) and SVM. The result shows that the RNN-DNE has higher predictive accuracy than the LR, GP and SVM.


Alahi et al. [81] propose an LSTM to predict people and their future trajectories in crowded areas. The LSTM learns the position of each person and predicts their future positions to incorporate within the trajectories of the pedestrians. The LSTM is applied in predicting human mobility and future trajectories in crowded areas. The result shows that the LSTM performs better than the hand-crafted function method. Struye et al. [77] propose a gated RNN (GRNN) to predict interference data from the city environment of a SC. The RNN recognizes temporal patterns by remembering aspects of past inputs in its internal memory; the gates are classified into reset gates, which decide which aspects of the state to forget, and update gates, which decide how to incorporate the new data into the state. The GRNN is applied in predicting the interference data of a city lab testbed. It was found that the GRNN is more accurate than the naïve predictor (NP). The summary of the application of RNN in SC is presented in Table 5.


Use case


With the increasing progress of analytics for smart cities, it has become necessary to develop applications based on the deep recurrent neural network. Lin et al. (2020) presented a study on the implementation of a deep recurrent neural network to forecast the opening price of a financial index. The authors compared the results with previous studies that used machine learning; it was found that deploying a deep recurrent neural network on cases such as the S&P 500 (Standard & Poor's 500) and the Dow Jones stock indices can show better performance. Motivated by climate conditions in Taiwan, Wei and Cheng (2020) proposed a two-step wind-wave prediction approach. The Guishandao Buoy Station located off the northeastern shore of Taiwan was used as a case study. The results achieved show that the proposed approach is more accurate compared with other approaches such as the MLP network.

5.4 Backpropagation neural network and deep neural networks

The BP algorithm is still actively used in the literature for training the ANN, and BP-trained ANNs are applied in the SC. For example, Mehmood et al. [88] propose a deep ANN (D-ANN) to detect and control teaching and learning in SC. The ANN is trained using BP and is responsible for identifying user components of ubiquitous teaching and learning. The result shows that the D-ANN has higher accuracy than the Naïve Bayesian (NB), k-nearest neighbor (KNN), decision tree (DT), SVM and ANN. Balchandani et al. [89] propose a D-ANN to detect litter objects for smart street cleaning. The D-ANN checks the street images and determines whether the streets are dirty or not. It was found that the D-ANN has better accuracy than conventional ML algorithms. Jain and Shah [84] propose an ANN to detect anomalous locations of air pollution based on the air quality index in a SC. The ANN is used to classify the data as either normal or anomalous. The result shows that the ANN has better accuracy than the SVM. Yuan et al. [127] propose a BPNN based on double-fed induction generators (BPNN-DFIG) integrated with a supercapacitor energy storage system (SCESS) to reduce power instability. In the BPNN-DFIG architecture, sensor nodes transmit data to the aggregation node, the aggregation node passes it to the gateway, and the gateway uploads the data to the cloud environment through the internet. The BPNN-DFIG is applied for clustering the architecture to enhance the stability of wind turbines. The BPNN-DFIG was evaluated and found to deliver higher stability improvements.


Gupta et al. [85] propose an ANN to detect the possibility and future intensity of waterlogging in areas of a SC. The ANN is used in predicting the future intensity and possibility of waterlogged areas. It was found that the ANN has good accuracy in determining the intensity of the waterlogging. Vlahogianni et al. [86] propose an ANN to predict available parking space over short- and long-term periods. The ANN is used in predicting available parking space over short- and long-term periods and performs with good accuracy. Sharad et al. [87] propose an ANN to estimate bus arrival times for commuters using a smart bus. The ANN is trained using BP to estimate the arrival time of the bus. The result shows that the ANN outperformed standard multiple linear regression (MLR). Table 6 presents the summary of the BP-ANN in processing SC data.
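For reference, a BP-trained feed-forward network of the kind surveyed here can be sketched in a few lines with scikit-learn; the tabular features and binary normal/anomalous labels are assumptions, and this is not the implementation of any cited study.

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# MLPClassifier trains a feed-forward ANN with backpropagation.
# Two hidden layers of 32 and 16 units are illustrative choices.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                  solver="adam", max_iter=500),
)
# model.fit(X_train, y_train)   # hypothetical tabular SC data
# model.score(X_test, y_test)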


Use case


Audio data classification based on deep neural networks and multilayer perceptron neural networks has been implemented in Zamil et al. [128]. The idea is to propose an effective network topology using hidden layers, neurons and fitness-function approaches. The results produced a high-performance classifier that improves the accuracy on large audio datasets collected from different modalities in a smart city. Moreover, network traffic prediction has been designed for smart cities using a backpropagation neural network (Pan et al. [129]). The concept is based on network impact factors as the input layer and the network traffic as the output. The results show that the proposed approach accurately predicted network traffic in a smart city environment.

5.5 Support vector machine

The SVM is used for data analytics in different domains within the SC community. For example, Gu et al. [92] applied the SVM to detect road safety and driving patterns. The SVM differentiates between the bad nodes and the good nodes in a vehicular network and is applied to classify road safety and driving patterns. The result shows that the SVM has a better detection rate than the secure key-based, resource testing-based, reputation system and position verification methods. Similarly, Dambhare and Karale [91] propose an SVM for classification in mapping the world. The SVM is responsible for classifying whether an object is positive or negative. The result shows that the SVM performs better than sentiment analysis.


Yan and Yu [93] propose an improved SVM to predict short-term traffic congestion. The improved SVM computes the total speed of each region in the urban road network and divides the road congestion into different traffic congestion levels. The improved SVM is used in predicting the short-term traffic congestion. The result shows that the improved SVM outperforms the conventional SVM and the twin bounded SVM (TBSVM). Hanifah et al. [94] propose an SVM to detect traffic information on Twitter. The SVM extracts traffic congestion information and classifies each tweet as either positive or negative. It was found that the SVM has better accuracy than the feature selection and weighting methods. Table 7 presents the summary of the SVM in SC.
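As a concrete illustration of the tweet-classification pattern above, the following sketch (scikit-learn; the toy corpus and labels are invented for illustration) combines TF-IDF features with a linear SVM.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: label 1 marks a tweet reporting congestion, 0 otherwise.
tweets = ["heavy congestion on the main road this morning",
          "traffic is clear downtown",
          "big jam near the bridge",
          "roads moving freely tonight"]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(tweets, labels)
print(model.predict(["terrible jam near the station"]))  # expected to flag congestion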


Use case


The support vector machine algorithm is considered one of the important algorithms in many applications related to image processing. Hossain et al. [130] designed a framework to deal with media-related healthcare data for handling healthcare in the smart city. The authors introduced a cloud-oriented smart healthcare monitoring framework that interacts with surrounding smart devices, environments, and smart city stakeholders for affordable and accessible healthcare. Moreover, remote sensing extraction techniques based on support vector machines have been introduced to improve soil salinization maps. The results show that the proposed method outperformed existing methods in extracting soil salinization information.

5.6 Hybrid neural networks

In this section, hybrids of deep learning architectures, shallow ANN and other classes of ANN are presented in relation to their applications in SC. Artificial intelligence algorithms are hybridized to offset their individual weaknesses and strengthen performance. In the SC, many hybrid systems have been applied for processing SC data. For example, Niu et al. [98] propose a hybrid of RBM and SVM (RBM-SVM) to predict traffic flow. The RBM is trained to reduce the feature dimensionality, whereas the SVM performs the prediction. The RBM-SVM is applied in predicting road traffic flow. The result shows that the RBM-SVM outperforms temporal supplementary and deep learning approaches. Aryal and Dutta [99] propose a hybrid of a DNN based on the restricted Boltzmann machine and a conventional multi-layer perceptron network (DNN-RBM-CMPN) to detect objects from images. The DNN-RBM-CMPN extracts features and classifies the objects in the image. The result shows that the DNN-RBM-CMPN classifies much better than the ARIMA. Zhu et al. [100] propose a hybrid of ConvNet and RNN (ConvNet-RNN) to predict the number of people and count the pedestrians passing through a place in a specific time. The ConvNet learns the body features and distinguishes between the people close to the camera and the people far away, while the RNN counts the number of people passing through a place in a specific time. It was found that the ConvNet-RNN outperforms the CrowdNet. Dinh et al. [101] propose a hybrid of ConvNet and recursive NN (ConvNet-ReNN) to detect pedestrians. The ConvNet learns low-level features and recognizes image color, whereas the ReNN learns high-level features and classifies the presence of pedestrians. The ConvNet-ReNN outperforms the sliding window method.

Jindal et al. [102] propose a hybrid of ConvNet and SVR (ConvNet-SVR) to predict the total load consumption of all smart homes in a SC. The ConvNet learns the hidden patterns in the data and outputs different load curves, while the SVR is trained on the load curves. The hybridized model is compared with a single ConvNet and a single SVR. The result shows that the ConvNet-SVR does much better than the single ConvNet and single SVR. Chackravarthy et al. [103] propose a combination of ANN with a hybrid of DL (NN-HDL) algorithms to detect crime activities in a SC. The NN-HDL algorithms analyze video stream data in order to curb criminal activity, which in turn helps to minimize the workload on the supervising officials. The model was compared with crime forecasting and mapping methods and found to be more accurate. Camero et al. [104] propose a D-RNN with EA (D-RNN-EA) to predict the car park occupancy rate. The D-RNN-EA aggregates parking lots and predicts car park availability. The model was evaluated and found to be more accurate than the compared predictors. Atef and Eltawil [105] propose a hybrid of support vector regression (SVR) and DL (SVR-DL) to predict the hourly price for electricity price forecasting in smart grids. The SVR-DL was trained with data representing one-hour observations, indicating the changing price range based on the power consumption. The hybridized model is compared with the SVR. The result indicates that the SVR-DL is more accurate than the SVR.
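The ConvNet-plus-SVR pattern just described can be sketched as follows (tf.keras and scikit-learn; the shapes, placeholder data and untrained extractor are illustrative assumptions rather than Jindal et al.'s implementation): a convolutional stack encodes the load curves into features, and an SVR regresses the target on them.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.svm import SVR

# Convolutional feature extractor over assumed 24-hour load curves.
# In practice the ConvNet would be trained first; here it only
# illustrates the two-stage ConvNet -> SVR pipeline.
extractor = models.Sequential([
    layers.Input(shape=(24, 1)),
    layers.Conv1D(16, 3, activation="relu"),
    layers.MaxPooling1D(),
    layers.Flatten(),
])

rng = np.random.default_rng(0)
X = rng.random((200, 24, 1))   # placeholder load curves
y = rng.random(200)            # placeholder consumption target

features = extractor.predict(X, verbose=0)  # ConvNet encodes the patterns
svr = SVR(kernel="rbf").fit(features, y)    # SVR regresses on the features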


Liang et al. [95] propose a hybrid of a vanilla RNN and a hyper-RNN (VH-RNN) to predict the number of metro passengers entering stations. The VH-RNN captures key information on previous states from different input sources and applies a weight-sharing scheme among the models. The VH-RNN is applied to predict the number of metro passengers. The results show that the performance of the VH-RNN is better compared with the nonlinear autoregressive ANN with external input (NARNNX). Zheng et al. [97] propose a hybrid of regression tree (RT), ANN and SVR (RT-ANN-SVR) to predict the parking occupancy rate. The RT-ANN-SVR forecasts the occupancy rate of parking lots given a specific date and time. It was found that the RT-ANN-SVR performs better than the ANN and SVR. Table 8 presents the summary of the application of hybrid algorithms in SC.


Use case


Lee et al. [131] used a deep hybrid neural network for energy-efficient event-based optical flow estimation. The authors developed the Spike-FlowNet system to overcome the performance degradation caused by the spike vanishing phenomenon. Spike-FlowNet outperforms its corresponding ANN-based method in terms of optical flow prediction capability while providing significant computational efficiency. Moreover, Sokolov et al. [132] introduced simulation results that include an assessment of the reliability of calculating the Earth's magnetic field parameters in the magnetic control system for spacecraft momentum.

5.7 Evolutionary artificial neural network

The evolutionary artificial neural network (EANN) is the application of evolutionary algorithms such as the genetic algorithm (GA) to the optimization of ANN parameters. The EANN has started penetrating the SC; a few studies were found to apply it. For example, Feng [106] proposes the optimization of an ANN through a GA (GANN) to predict and model network traffic in SC. The GA is applied to optimize the ANN weights and thresholds, and the GANN is used in predicting network traffic. Its performance is compared with that of a wavelet NN (WNN); the result shows that the GANN performs better than the WNN. Lei and Shangzheng [107] propose a GA-optimized radial basis function NN (GA-RBF) to predict network traffic flow. The GA is used to optimize the parameters of the RBF NN, and the GA-RBF is applied in predicting the network traffic flow. The result shows that the GA-RBF is better than the standard GA.
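The core GANN idea, searching the network's weight vector with an evolutionary loop instead of gradient descent, can be sketched in plain numpy as below. The data, network size, and the simplified selection-plus-mutation scheme (crossover omitted for brevity) are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 4))               # placeholder features
y = rng.integers(0, 2, 100)            # placeholder binary labels
n_in, n_hid = 4, 8
n_w = n_in * n_hid + n_hid             # input->hidden plus hidden->output weights

def forward(w, X):
    W1 = w[:n_in * n_hid].reshape(n_in, n_hid)
    w2 = w[n_in * n_hid:]
    h = np.tanh(X @ W1)                # hidden layer
    return 1 / (1 + np.exp(-(h @ w2))) # sigmoid output

def fitness(w):                        # higher is better: negative log-loss
    p = np.clip(forward(w, X), 1e-7, 1 - 1e-7)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

pop = rng.normal(size=(50, n_w))       # initial population of weight vectors
for _ in range(100):                   # generations
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the 10 fittest
    children = (parents[rng.integers(0, 10, 40)]
                + 0.1 * rng.normal(size=(40, n_w)))   # mutate copies of parents
    pop = np.vstack([parents, children])              # elitism + offspring
best = pop[np.argmax([fitness(w) for w in pop])]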

Ikram Belhajem proposes a hybrid of the extended Kalman filter (EKF) and an ANN based on a GA (EKF-ANN-GA) to predict real-time car positioning in a SC. The EKF estimates the current state of a system. The ANN acts as a massively parallel distributed processor that uses flexible computing paradigms to solve complex and uncertain problems, and the GA optimizes the connection weights in the training phase of the ANN. The EKF-ANN-GA is applied in predicting real-time car positioning in a SC. The result shows that the EKF-ANN-GA outperforms the simple EKF. The summary of the evolutionary ANN is presented in Table 9.

5.8 Deep reinforcement learning

DeepRL combines reinforcement learning with deep learning architectures. It has been applied in SC to solve problems. For example, Heo et al. [109] propose DeepRL to monitor the ventilation system for air quality in a smart building. The DeepRL monitors the dynamic properties of the control variables and reduces the energy usage. The proposed algorithm is compared with Q-learning (QL), and the DeepRL was found to do better than the QL. Jang et al. [110] propose DeepRL to reduce traffic-jammed areas in a SC. The number of existing vehicles and the average speed of vehicles passing through the intersection were gathered. The DeepRL is applied to predict the traffic. The result shows that the DeepRL achieves good performance.

Wu et al. [111] propose DeepRL to improve the fuel economy of hybrid electric vehicles. The algorithm was trained with a large number of driving cycles generated from traffic simulation. The result shows that the DeepRL performs much better than the conventional reinforcement learning (ConRL) approach. Zhao et al. [112] propose DeepRL to reduce network congestion and balance the network load for supporting SC services. The DeepRL improves the crowd management of various sections in the SC by learning the request distribution pattern of the crowd and selecting the right decision for it. The DeepRL is found to be better than open shortest path first (OSPF) and enhanced-OSPF (EOSPF).
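At the heart of many of these DeepRL studies is the deep Q-learning update, sketched below in tf.keras; the state and action sizes are assumptions, and components such as replay buffers, target networks and exploration policies are omitted for brevity.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

n_state, n_action, gamma = 8, 4, 0.99  # assumed problem sizes
q_net = models.Sequential([
    layers.Input(shape=(n_state,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_action),            # one Q-value per action
])
q_net.compile(optimizer="adam", loss="mse")

def q_update(s, a, r, s_next, done):
    """One Bellman backup: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    target = q_net.predict(s[None], verbose=0)[0]
    bootstrap = 0.0 if done else gamma * q_net.predict(s_next[None], verbose=0)[0].max()
    target[a] = r + bootstrap
    q_net.fit(s[None], target[None], verbose=0)

# Example transition (placeholder values): state, action, reward, next state.
s = np.zeros(n_state, dtype="float32")
q_update(s, a=0, r=1.0, s_next=s, done=False)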

Ali et al. [31] propose DeepRL Q-learning (DeepRL-QL) resource allocation for channel access in a dense wireless local area network in a SC. The DeepRL uses intelligent Q-learning to learn the unknown wireless network platform and to allocate resources in small heterogeneous networks. The performance of the DeepRL-QL in terms of throughput, access delay and fairness is better than the media access control (MAC) protocol and the binary exponential backoff (BEB) mechanism. Mohammadi et al. [113] propose a semi-supervised deep reinforcement learning (SSDRL) model to improve the accuracy of the learning agent for indoor localization in smart buildings of a SC. The SSDRL augments the supervised learning task with unlabeled data. The SSDRL is applied in classifying the learning agent of indoor localization. The result shows that the SSDRL has higher performance than the supervised DRL (SDRL) model. Table 10 presents the summary of DeepRL applications in SC.

5.9 Other types of deep neural networks

This section presents the ANNs that could not be classified in the previous sections because of their uniqueness; it comprises different ANNs, both deep and shallow. For example, Xu and Gade [115] propose a structured deep NN (SD-NN) for the prediction of house prices in SC. The SD-NN is a predictive analytics model that contains only the needed connections among neurons and matches the layered knowledge graph. The SD-NN is applied in predicting property values. The result shows that the SD-NN outperforms the conventional multivariate linear regression (CMLR) model. Ghoneim and Manjunatha [117] propose a DNN to predict ozone pollution. The DNN is tuned with the help of a grid-search technique and is applied in predicting the future level of ozone pollution. The result shows that the DNN outperforms the SVM, ANN and GLM models. Bura et al. [118] propose a DNN to detect the time and place of parking a car. The DNN checks the availability of the parking space and is applied to classify the time and place of parking a car. The result shows that the DNN has better accuracy than AlexNet and mAlexNet.

Elsaeidy et al. [119] propose restricted Boltzmann machines (RBM) to detect distributed denial of service (DDoS) attacks in SC applications. The RBMs are applied to detect the DDoS attacks. The results obtained indicated that the RBM is more accurate in detecting denial of service than the SVM. Li et al. [120] propose an intrusion detection system (IDS) based on a deep migration learning model (IDS-DMLM) to detect anomalies in the network. The IDS aggregates sensor-node data collected over wireless transmission, while the DMLM unifies the data definition domain and reduces the high-dimensional data. The IDS-DMLM performance was compared with an evolutionary learning model (ELM) and BPNN; it was found that the IDS-DMLM is more accurate than the ELM and BPNN. Serrano and Bajo [121] propose a DNN to predict drug response in cancer and provide automatic notifications of social exclusion in a SC. The DNN is applied for the prediction of the drug response. The result shows that the DNN outperforms random forest, K-NN, Logistic, C4.5, AdaBoost and RIPPER. Zhang et al. [122] propose a DNN based on a fine-grained architecture to improve building energy efficiency. The DNN was trained with the comfort factor of temperature and combined the factor values of fine-grained models to evaluate the comfort score. The fine-grained DNN outperforms the coarse-grained DNN in terms of energy efficiency.

Sun et al. [114] propose a sliding window ANN (SWANN) to predict future energy consumption at both the household level and the community level in SC. The SWANN is applied in predicting the future trend of energy consumption. The result shows that the SWANN is more accurate than the conventional ANN. Gupta et al. [116] propose a Hopfield NN (HNN) for intelligent control of traffic lights. The HNN provides the optimal sequence for the traffic lights and is applied in intelligent traffic-light control for traffic management. The HNN outperforms the GA, particle swarm optimization (PSO) and differential evolution (DE). Table 11 presents the summary of the other deep learning and shallow ANN applied for data analytics in SC.

6 Deep learning platforms

The survey of deep learning for the ANN in SC shows that different software and hardware platforms were used. Table 12 presents the summary of the platforms found during the survey. The list of platforms in Table 12 is not exhaustive because we only present those used in the studies surveyed, and only for projects that clearly state the software or hardware used for implementation. It was found that TensorFlow, Keras, MATLAB and Caffe are the most frequently used platforms for implementing deep learning in SC. Nevertheless, many projects did not disclose the platforms used in their implementation. The major features of the four main platforms are briefly discussed as follows:

Table 12 Summary of the implementation tools found in the study

6.1 MATLAB for deep learning

The MATLAB for deep learning platform has the ability to build deep learning models with few lines of code, without requiring the user to be an expert. MATLAB for deep learning can create, modify and analyze deep learning models through visual tools and applications. MATLAB has tools for pre-processing data and for automatic labeling of images, video and audio. It can also accelerate algorithms on NVIDIA GPUs and in datacenters through applications that need no programming specialization. The collaborative framework of MATLAB for deep learning allows collaboration with TensorFlow, PyTorch and MXNet. MATLAB includes reinforcement learning for simulating and training dynamic systems. Datasets can be generated from MATLAB and Simulink for training and testing deep learning models that model physical systems. In addition, MATLAB for deep learning supports interoperation with open-source deep learning frameworks through the ONNX import and export features. Deep learning models developed in MATLAB can be deployed to many targets, including CUDA, C code, cuDNN, TensorRT and the ARM Compute Library. MATLAB for deep learning covers speech processing, computer vision and reinforcement learning [133]. However, MATLAB for deep learning requires a license for each module; it is not open-source software. Cross-compiling, or converting MATLAB code to another language, is difficult, and fixing errors requires a deep level of MATLAB coding skill. Application deployment support is also lacking in MATLAB [134].

6.2 TensorFlow

TensorFlow is a platform that offers a paradigm for deep learning technologies. TensorFlow is a product of Google released in 2015. It provides APIs for Python, Java, C++ and Go. The developers of TensorFlow designed it for computation on a data-flow graph: the nodes of the graph represent the operations, and the graph edges carry the multidimensional arrays commonly referred to as tensors. TensorFlow has strong compatibility because it supports computation on multi-processor systems with either CPUs or GPUs and has a CUDA option [135]. However, TensorFlow has the following disadvantages: it does not support GPUs other than Nvidia's; it lags in computational speed compared to Torch, CNTK, Caffe and Theano; it supports the Windows environment only through installation via Conda or Python; and symbolic loops lack support in TensorFlow [136].
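A small illustration of this graph model (TensorFlow 2.x style, illustrative only): tensors flow along the edges between operation nodes, and tf.function traces the Python function into a computation graph.

import tensorflow as tf

@tf.function                      # traces the function into a graph
def affine(x, w, b):
    return tf.matmul(x, w) + b    # nodes: matmul, add; edges: tensors

x = tf.random.normal([2, 3])
w = tf.random.normal([3, 4])
b = tf.zeros([4])
print(affine(x, w, b).shape)      # (2, 4)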

6.3 Keras

Keras is a deep learning library written in Python and is a high-level ANN API. Keras has the capability to run on TensorFlow, Theano and the Microsoft Cognitive Toolkit. Friendliness: it allows fast and easy prototyping because user experience is at the core of Keras; its API is simple and consistent, and it gives unambiguous, actionable feedback on user error. Modularity: it has different modules that can be combined to create new deep learning models; for instance, Keras provides cost functions, optimizers, activation functions, regularizers and neural layers, each as a module, and these modules can be combined to create a new deep learning model. It supports combinations of ConvNet and RNN as well as running the constituent algorithms. Extensibility: it allows the addition of new modules in a simple way, and new modules can be created easily, which makes Keras suitable for conducting high-level research. As regards hardware platforms, Keras runs on both CPU and GPU systems [137]. However, it lacks flexibility and is not optimal for studying new deep learning models. It is also not fully effective and efficient in multi-GPU environments [138].
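The modularity described above can be seen in a few lines: layers, an optimizer and a cost function are independent modules combined into one model (an illustrative sketch, not tied to any cited study).

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),                 # neural layer module
    layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3), # optimizer module
    loss=keras.losses.BinaryCrossentropy(),              # cost-function module
    metrics=["accuracy"],
)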

6.4 Caffe

Caffe is a deep learning framework based on C++ and CUDA, with the capability to operate through command-line, Python and MATLAB interfaces. Caffe runs on mobile platforms and can be extended to work on the Apache Hadoop ecosystem with Spark. Caffe2 is based on Caffe and is part of the Facebook open-source project; it adds a Python API and runs across different OS platforms, such as Mac, Windows, Linux, iOS and Android [139]. Caffe2 is a modular framework for deep learning that comes with the capability of training deep learning architectures and deploying them. The mobile support of Caffe2 makes it fit for the new generation of hardware; it powers one of the largest mobile deep learning deployments, running on more than a billion devices. Facebook supports Caffe2 and PyTorch for a variety of artificial intelligence applications. The major focus of Caffe2 is performance and multi-platform compatibility, whereas PyTorch mainly focuses on flexibility for fast prototyping and research. Caffe2 is now part of PyTorch [140]. However, development activity on Caffe has declined. It only supports C++ for custom layers. The static model graph of Caffe does not support many of the RNN applications that require variable-size inputs, and it is cumbersome to define very large and modular DNN models in Caffe [138].

Different projects used different hardware and software for implementation; the last column of Table 12 presents the hardware configuration used in each project. The computational time of an algorithm is affected by the hardware because of differences in hardware configuration. For example, the convergence time of an algorithm that runs on a CPU system cannot be compared with the convergence time of the same algorithm running on a GPU system, even though the algorithms are identical, because the hardware configurations of the two systems differ.

7 The smart city datasets

The purpose of this section is to provide readers with the sources and nature of the datasets used in the different projects. Readers interested in similar research can easily identify the sources and nature of different SC datasets when proposing a novel deep learning model for further development of the SC concept.

The main factor in data analytics is the dataset. Some projects used real-world datasets while others applied benchmark datasets. Real-world datasets are generated by performing experiments, while benchmark datasets are extracted from public repositories and are mostly freely available for public use. Table 13 presents the different types of datasets used in the projects; only projects that revealed their datasets and sources are reported in the table. Most of the projects used real-world datasets collected from the SC.

Table 13 The summary of the datasets found in different project

Benchmark datasets are typically clean, very easy to obtain, inexpensive to gather and ready for algorithms to be applied. On the other hand, real-world datasets are expensive and difficult to gather, require data engineering before an algorithm can be applied, and contain missing points. An algorithm can work perfectly on benchmark datasets, yet its performance can be otherwise on real-world datasets because the two categories of datasets have different behavior and scale. As such, an algorithm intended for application in a real-world scenario should be tested on real-world datasets, not judged only on its performance on benchmark datasets. A good algorithm is expected to perform very well on both benchmark and real-world datasets.
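As an illustration of the data engineering that real-world SC datasets typically require before modeling, the sketch below (pandas/scikit-learn; the file name and column names are assumptions) parses timestamps, fills missing sensor readings and rescales the series.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# "traffic_counts.csv" with "timestamp" and "vehicle_count" columns
# is a hypothetical real-world sensor export.
df = pd.read_csv("traffic_counts.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp").set_index("timestamp")
df = df.resample("5min").mean()                          # align to 5-minute bins
df["vehicle_count"] = df["vehicle_count"].interpolate()  # fill missing points
scaled = MinMaxScaler().fit_transform(df[["vehicle_count"]])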

It is observed in Table 13 that the main datasets used in the projects include traffic, video, image and text datasets. Brief comments on the different datasets presented in Table 13 are as follows:

Traffic data: Inductive loops identify and count the number of vehicles flowing in the SC every 5 min. Similarly, pneumatic tubes count the number of vehicles every 15 min. Every second, GPS in the SC provides information about traffic density, vehicle speed and direction of movement. Video vehicle detection equipment classifies and counts vehicles at thirty frames per second. Magnetometers provide estimates of vehicle speed and traffic density [142].

Image data: In SC, images can be collected from different sources, for instance from surveillance satellites and from unmanned aerial vehicles equipped with cameras as autonomous vision systems.

Video data: Different sources of video data are available in SC because of the technologies embedded in the SC for different purposes. The SC is equipped with video-related technology such as CCTV, surveillance cameras, video vehicle detection, unmanned aerial vehicles and drones. This equipment collects video data within the SC for various purposes such as security and tracking traffic flow.

Textual data: Communication by citizens in SC is in natural language through different emerging technologies. Textual data in the SC can be collected from different sources such as mobile phone short message services, social networks such as Twitter, online forums, and online customer conversations. These platforms generate a lot of unstructured text data, which need to be structured before processing to provide new information.

8 General discussion

This section presents general discussion about the survey, the strength of the deep learning in SC, domain of application including taxonomy, publication trend and graphical representation of the algorithms used in the SC.

8.1 Strengths

The survey revealed that deep learning has been successfully applied in SC for data analytics in different domains of application. Remarkable achievements were recorded by the deep learning models (see Tables 3, 4, 5, 6, 7, 8, 9, 10, 11). Different deep learning architectures were applied to solve machine learning problems such as classification, prediction and clustering. It was found from the survey that the deep learning architectures are powerful, robust and effective in dealing with the large volume of data generated from activities within the SC, compared to the shallow ANN. The deep learning models appear to be the analytic tools best suited to handling data generated from the SC. They have been applied to image processing, network traffic management, pollution, parking space and vehicle detection, among many others. The performance of the deep learning models was compared with the conventional ANN and other machine learning algorithms, and the deep learning models were consistently found to perform better; this makes deep learning a very good candidate for data analytics in SC. However, a few studies do not present a performance comparison, so it is difficult to ascertain the effectiveness of the algorithms they propose. The survey also indicated that the shallow ANNs, including evolutionary ANNs, are still active in SC data analytics. The ConvNet has built-in data engineering mechanisms (convolution and pooling layers), so it does not require manual data engineering or a separate feature selection technique, unlike the shallow ANN; these capabilities make it suitable for image processing in the SC. There are problems in the SC where the output of the system depends on computations previously performed; the DRNN is a perfect match for such problems because it can memorize events sequentially. The DBN is a generative model with the capacity to solve both supervised and unsupervised learning problems, which can make it fit for developing cybersecurity solutions in the SC.

8.2 Domain of application

Figure 9 presents the taxonomy of deep learning application domains in SC. The deep learning architectures were applied across different domains within the SC. The taxonomy was created based on the survey conducted: in each study, the domain of application in the SC was extracted and compiled, and the taxonomy was built to give readers an accessible representation of the domains within the SC where deep learning models have been applied for data analytics. The taxonomy also allows the reader to identify the domains that have not yet been exploited.

Fig. 9
figure 9

Taxonomy for the domain of application in SC

Figure 9 shows the diversity of the domains of deep learning applications in the SC. As shown in Fig. 9, transportation in general has received more attention in the application of deep learning than other domains in the SC. This is likely because of the significance of transportation systems in supporting innovation and economic development in a SC. The transportation system eases the movement of humans, goods and services, and effective transportation systems encourage the location of industries, thereby opening job opportunities. The domains that require more investigation into the application of deep learning models to enhance their effectiveness and efficiency are shown visually in the taxonomy. This can help researchers easily track the domains that have received little or no attention regarding the application of deep learning.

8.3 Publication trend

Over the last 6 years, many projects on the application of deep learning and shallow ANN in SC across different domains have been published. The publication trend from 2014 to 2019 is depicted in Fig. 10. The survey was conducted up to 2019, the active year at the time of writing; however, we included early 2020 publications to show the activeness of the research area. The data used to depict the publication trend in Fig. 10 are the numbers of publications in each year starting from 2014. For example, the number of papers published in 2014 was found to be 4; the summaries of the published works are presented in Sects. 5.1–5.9, and going through the summaries on a yearly basis reveals that 4 papers were published in 2014. The same procedure was applied to papers published from 2015 up to 2020. The apparent decline in the last 2 years should not be read as saturation of the research area, because publications for those years may still appear. The lowest number of publications was recorded in 2014, and subsequent years recorded more publications than 2014, clearly indicating acceptability and popularity in the research community. The pattern is expected to continue into the future as the SC keeps evolving and more studies are required to fully understand the SC and increase its robustness and efficiency. Publications for 2020 have already started springing up. The bars representing the years 2019 and 2020 are marked with a different color in view of the fact that publications for these years may still appear.

Fig. 10
figure 10

Publications trend

Figure 11 shows the frequency of the different deep learning architectures and shallow ANNs applied in the SC. The data used to depict Fig. 11 are the frequencies of each algorithm in the survey, extracted from Tables 3, 4, 5, 6, 7, 8, 9, 10 and 11 in Sects. 5.1–5.9. For example, the frequency of applying DeepRL in SC is 5, extracted from Table 10; the same procedure was applied to the ConvNet from Table 3, the DBN from Table 4, and so on. The presence of shallow ANN in the SC indicates that shallow ANNs are still relevant for solving machine learning problems in the SC, as some of the deep learning algorithms were hybridized with shallow ANNs to improve the performance of the deep learning algorithms.

Fig. 11
figure 11

The deep learning architectures and shallow ANN

Figure 11 shows that the deep learning architectures gained tremendous attention from the research community, possibly because of the advantages of deep learning over the shallow ANN in data analytics. For example, the shallow ANN requires a separate technique for dimension reduction, whereas deep learning performs dimension reduction automatically without extra data engineering effort. The ConvNet is the most popular deep learning algorithm for solving problems in the SC, as clearly shown by the longest bar, likely because of its robustness and suitability for image processing problems. DeepRL is an emerging deep learning architecture quickly gaining popularity in the SC environment, with its initial appearance in 2018 (see Table 10); the trend is likely to continue, with a 2020 publication already appearing. Hybrid algorithms are on the rise and may exceed the ConvNet in the future, because the literature shows that hybrid deep learning algorithms perform better than their constituent algorithms in solving SC problems. Figure 11 can help researchers easily identify the deep learning algorithms unexploited in the SC and the architectures that require more investigation.

The visualization of the application of different deep learning architectures per year in SC over the period 2014–2020 is shown in Fig. 12. The area started gaining attention in 2014, with DBN and hybrids as the only deep learning architectures. An increase in researchers' interest in the area was observed from 2015 onwards as more deep learning architectures were adopted. The ConvNet received the most attention in 2017 and 2018. The DBN received the least attention and was only adopted in 2014 and 2016. All the deep learning architectures were adopted in 2019, with the exception of the DBN. The DRNN maintained consistent adoption for five consecutive years (2015–2019). The hybrid deep learning architectures started attracting attention especially in 2018, probably due to their better performance compared to single constituent algorithms. The first algorithm to be applied in 2020 was the DRL, and more publications are expected in the future. The visualization in Fig. 12 gives an overview of the rate at which different deep learning algorithms have been applied in SC from 2014 to 2020.

Fig. 12
figure 12

Frequency of deep learning algorithms in SC per year

9 Challenges and future research opportunities

The survey clearly indicated some challenges in the application of deep learning for ANN in SC. This section discusses the major challenges found during the survey and proposes possible directions for future research.

It was found from the survey that the application of nature-inspired algorithms in deep learning for data analytics in SC is an untapped research area. Although the use of nature-inspired algorithms such as the cuckoo search algorithm, firefly algorithm, harmony search and particle swarm optimization in optimizing the parameters of deep learning architectures improves their efficiency and effectiveness, their application is yet to be witnessed in the SC. Evidence in [12] indicates that deep learning models optimized through nature-inspired algorithms outperform conventional deep learning models in many cases. In the future, we suggest that researchers apply nature-inspired algorithms to the optimization of deep learning for ANN in solving problems in SC, e.g., classification, clustering and prediction.

As shown in Sect. 8.2, deep learning for ANN has been applied in different domains of the SC. However, the aspect of the internet of vehicles within the SC remains untapped. The internet of vehicles comprises fully autonomous vehicles, semi-autonomous vehicles and conventional vehicles equipped with autonomous vehicle facilities. The internet of vehicles generates a large volume of data as a result of communications. Such data need to be analyzed to unravel new insights for informed decisions and for improving the effectiveness of internet-of-vehicles technology in the SC. In the future, we propose that researchers apply deep learning for ANN in processing the large-scale data generated from the internet of vehicles within the SC.

Another major issue with the SC is the dynamic nature of technological advancement; as technology evolves, the concept of the SC also evolves. The continuous evolution of the SC can pose a challenge to deep learning models in handling data analytics for future generations of the SC, especially if the deep learning technology is not modified to adapt to the constant changes in the concept of the SC. In the future, we suggest that researchers continue to modify the deep learning models as the concept of the SC changes so as to adapt to its dynamic nature. By so doing, deep learning can remain relevant in the face of technological advancement.

Despite the penetration of deep learning in SC and its remarkable achievements, other aspects of deep learning remain unexploited in the SC. Deep learning concepts not yet exploited in SC include, but are not limited to, the generative adversarial network, neural abstract machine, memory-augmented neural network, interaction-based deep network, attentive network and deep extreme learning machine. It would be interesting to exploit these deep learning concepts in the SC to unravel their effectiveness in solving SC problems. In the future, we enjoin researchers to exploit these unexploited deep learning concepts in the SC.

The tuning of the many parameters required by deep learning remains an open research problem, as there is no systematic way of obtaining optimal parameter settings, yet the performance of a deep learning model depends on them. Another open problem is the computational time required by deep learning architectures: training typically takes a long time before convergence. However, there are settings where time is critical for decision making. For example, in the SC, when notifying security agencies of crime, alerting the relevant authority to an accident, or responding to a fire outbreak or health-related issue, time is highly critical, as any delay can lead to loss of life. In the future, researchers should work on increasing the convergence speed of deep learning models for solving problems in the SC.

Performance analysis: the deep learning models are inconsistent across different domains of application in the SC (see Sect. 8.2); the performance of deep learning differs depending on the domain. Researchers need to know the best deep learning architecture for a particular domain to avoid trial and error, which consumes a lot of resources and is tedious. The best deep learning architecture for each domain in the SC remains an open problem. Researchers should perform a rigorous performance analysis of the different deep learning architectures in different domains of application to unravel the best deep learning model for each SC domain.

10 Conclusions

This survey presented recent progress, taxonomies, challenges and future research opportunities in the application of deep learning in SC. In the paper, a dedicated, rigorous survey of deep learning in SC was conducted, and recent progress in the area was clearly outlined. Two new taxonomies, based on deep learning and on the domains of application in the SC, were created. The publication trend indicated that this research area is rapidly gaining popularity, and the trend is expected to continue into the future, with 2020 publications already appearing in the literature. Generally, the ConvNet is highly popular in SC; hybrid algorithms are gaining acceptance over single constituent algorithms, and DeepRL is fast gaining popularity in SC over the other single deep learning algorithms, with the exception of the ConvNet. The challenges currently militating against the smooth application of deep learning in SC were pointed out, and suggestions on how to overcome them were outlined as opportunities for future research. The survey is intended to serve novice researchers as an entry point into the research area, while expert researchers can use it as a benchmark for further development of the area.