1 Introduction

In recent years, information technology and electronic components have developed tremendously, and 5G IoT systems have grown exponentially. These technologies have helped society improve the quality and effectiveness of urban and suburban services such as healthcare, transport, energy, and traffic management. At the same time, cybercrime poses a serious threat in day-to-day life, since such crimes are not confined to a finite set of local crime scenes. Traces of evidence are spread across more systems, victims, and jurisdictions than ever before [1]. Digital forensics is an essential part of the cybercrime investigation process. It can be defined as the scientific acquisition, investigation, and archival of data contained in electronic media so that the information can be used as evidence in a court of law. Forensic image processing includes the computer restoration and improvement of surveillance imaging [2]. It aims to maximize the information extracted from surveillance imagery, particularly from images that are noisy, incomplete, or over/underexposed, by enhancing the digital image with different computer techniques. In computer vision, Face Sketch Synthesis (FSS) is a significant area with a wide range of applications such as photo recovery, virtual social networks, and face recognition.

Face recognition is a non-negligible issue that can become a real-time bottleneck; it can be addressed to some extent by generating training data with different variations. Even though technological growth is visible in image synthesis, generating facial images with different modifications while preserving the primary identity remains complex. Besides, it is still difficult to map from a variation factor to a high-dimensional face image, and the result can be misleading because humans are highly sensitive to slight details in the facial area. Two significant difficulties are present in FSS. First, semantic face variations such as lighting changes, pose, facial disguises, and expressions are difficult to synthesize within the image space; how to efficiently learn a composite transformation from the actual image space to an appropriate latent semantic space is still under discussion. Second, even when face recognition has been modeled, preserving the subject identity is difficult. FSS has gained attention among researchers in recent years as one research direction within cross-modal image generation. A Bayesian model of face sketch synthesis was proposed in an earlier study [3], in which the task was divided into two parts, namely neighbor selection and weight computation.

To speed up the synthesis procedure, a model-driven FSS approach was proposed in a previous study, and numerous efficient techniques were introduced [4] to enhance the process of neighbor selection. The purpose of such a process is to repair the occluded parts of face images and to approximate the appearance of the remaining facial portion by employing a morphable model [5]. To enhance the functioning of morphable models, Mo et al. [6] primarily employed a dictionary of frontal face images. A method based on Principal Component Analysis (PCA) was proposed in [7]; it represents a target image through a linear combination of Eigenfaces. Sparse coding [9] and cascaded pose regression [8] were introduced for generating various types of expressions and poses. Due to artifacts, low resolution, and a lack of detail in personal facial features, the quality of the images produced by these traditional methods is reduced. On the whole, it is difficult for traditional methods to produce recognizable face images.

On the other hand, Convolutional Neural Networks (CNNs) have been applied extensively in recent years. They occupy a central place in numerous computer vision applications because of their capability for massive data processing and hierarchical feature representation [10,11,12,13,14]. In a CNN for FSS, the first convolution layer is employed to derive the photo features. The sketch and the photo within a training pair are often misaligned. In order to ease this misalignment, an overlapping max pooling layer [10] is attached to the first layer. Subsequently, a nonlinear layer known as the multilayer perceptron convolution (mlpconv) layer [15] converts the feature maps from the photo modality to the sketch modality. A final convolution layer is employed to synthesize the output sketch. All the parameters and filters are learnt in the training phase through backpropagation [16]. In the testing phase, the images are passed through the layers until the final outcome is attained. The mlpconv layer is built from multiple Fully Connected (FC) layers with nonlinear activation functions (ReLU), i.e., it instantiates a nonlinear micro-network instead of the traditional linear convolution layer in a CNN. The mlpconv layer reduces to an ordinary convolution layer when the number of hidden layers is 0, and it is superior to the plain convolution function in learning rich content. It also improves the abstraction level of the local model over a convolution filter. Here, abstraction means that a feature is invariant to variants of the same concept.

Lu et al. [17] proposed a technique comprising two phases, namely preprocessing and sketch synthesis. Extensive experiments on public face sketch databases confirmed that the presented strategy improves the sketch synthesis quality of exemplar-based strategies. Ye et al. [18] proposed a Triple Translation GAN (TTGAN) with a multi-layer sparse representation model. In this model, an L1-norm representation constraint was incorporated into the image generation so as to improve identity preservation and the robustness of the generated facial images to reconstruction error. The face synthesis experiments on benchmark face databases demonstrated better performance than other competing techniques. In [19], a modified CNN (M-CNN) model was developed, comprising two convolutional layers, a pooling layer, and a multilayer perceptron (MLP) convolutional layer, to learn the mapping from face photos to sketches. However, the performance of this model can be improved by using parameter optimization techniques.

At present, several approaches have been presented that recognize persons from sketches drawn as instructed by eyewitnesses; however, their performance degrades too much for use with real-world forensic sketches and extended galleries based on law enforcement mug-shot galleries. Though various FSS models are available, none of them has incorporated hyperparameter optimization and IoT-enabled applications. In this view, this research work introduces a novel IoT-enabled Optimal Deep Learning-based Convolutional Neural Network (ODL-CNN) for FSS to assist in the suspect identification process. The parameters of the DL-CNN, such as the number of hidden layers, number of hidden layer nodes, batch size, and learning rate, were optimized using the Improved Elephant Herd Optimization (IEHO) algorithm. In the proposed method, sketches are generated from images captured by surveillance cameras. The sketches are then compared with the professional sketch drawn with the help of an eyewitness. The most closely resembling sketches are used to identify the suspects. An extensive set of experiments was carried out to determine the effectiveness of the ODL-CNN model.

The remaining portions of the paper are organized as follows. Section 2 introduces the presented FSS model. Section 3 validates the introduced FSS model, and Sect. 4 concludes the study.

2 The proposed ODL-CNN model

The general working process of the ODL-CNN model is depicted in Fig. 1. First, the proposed method captures surveillance videos using IoT-based cameras, which are then fed into the ODL-CNN model. The method begins with preprocessing, in which gamma correction is applied to improve the contrast level of the input image before the digital image is transformed into a face sketch. Then, the ODL-CNN model draws the sketches of the input images, which undergo similarity assessment against the professional sketch drawn as per the directions given by eyewitnesses. When the similarity between the two sketches is high, the suspect is identified. Hence, it is a rapid process compared to exemplar-based techniques, which are required to resolve complex optimization problems.

Fig. 1 The block diagram of ODL-CNN model

2.1 Contrast enhancement process

The process involved in gamma correction is illustrated in Fig. 2. For an input image \(I(x, y), x = 1, 2, \dots , M, y = 1, 2, \dots , N\), the brightness type is first identified by thresholding the statistic

$$t=\frac{{m}_{I}-{T}_{t}}{{T}_{t}}$$
(1)

where \({m}_{I} = {\sum }_{x}{\sum }_{y}I(x,y)/MN\) is the mean intensity of the image. For normal natural images, the constant \({T}_{t}\) is the expected global average brightness. Statistics simulated on different benchmark image databases indicate that \({T}_{t}\) is appropriately set to half of the maximum pixel intensity, i.e., 128 for 8-bit images. An input image is classified as bright if \(t>{\tau }_{t}\) and as dimmed if \(t<-{\tau }_{t}\), where the threshold \({\tau }_{t}\) distinguishes brightness-distorted images from normal ones. \({\tau }_{t}\) is configured experimentally, considering the trade-off between technical applicability and enhancement quality. To restore the brightness and to improve the contrast, the ACG-dependent Cumulative Distribution Function (CDF) truncation and the negative image are employed for the dimmed and bright image types, respectively. The results of the contrast enhancement process on some of the test images are given in Fig. 3.
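The brightness test of Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the value `tau_t=0.3` is a placeholder (the text says only that the threshold is set experimentally), and `gamma_correct` is a plain power-law correction, not the CDF-truncation variant referred to above.

```python
import numpy as np

def classify_brightness(image, T_t=128.0, tau_t=0.3):
    """Return (label, t) where t follows Eq. (1).

    tau_t=0.3 is an assumed placeholder, not a value from the paper.
    """
    m_I = float(np.mean(image))   # global mean intensity m_I
    t = (m_I - T_t) / T_t         # Eq. (1)
    if t > tau_t:
        return "bright", t
    if t < -tau_t:
        return "dimmed", t
    return "normal", t

def gamma_correct(image, gamma):
    """Plain power-law gamma correction for an 8-bit image (illustrative only)."""
    normalized = image.astype(np.float64) / 255.0
    corrected = np.rint(255.0 * normalized ** gamma)
    return np.clip(corrected, 0, 255).astype(np.uint8)
```

For example, an 8-bit image with mean intensity 200 gives \(t = (200-128)/128 = 0.5625\), which this sketch labels as bright for the assumed threshold.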

Fig. 2 Contrast enhancement process

Fig. 3 Gamma corrected images. a Input image, b contrast-improved image

2.2 ODL-CNN model

2.2.1 Architecture of CNN

The CNN is a deep learning technique motivated by the cognitive mechanism of biological vision. The original CNN idea is built on three elements: the receptive field, shared weights, and pooling. In the convolutional layer, a neuron corresponds to a small area of the visual field, from which it derives elementary visual features such as edges of a particular direction and corners. All the neurons of a filter employ the same weight values and together form a feature map. This means that the shared weights detect the same feature across the whole visual field, which is advantageous for preserving translation invariance and feature continuity. A convolutional layer contains several filters and therefore creates several feature maps. A pooling layer follows the convolutional layer to reduce the sensitivity to feature locations. In max pooling, every feature map is divided into non-overlapping rectangles and the maximum value of each rectangle is output. Hence, the down-sampling procedure is nonlinear, and it reduces the number of parameters.

The architecture of a CNN comprises a cascade of convolutional and pooling layers. The final convolutional layer creates highly abstract features that can be employed for recognition, as shown in Fig. 4. With the development of GPU acceleration and the introduction of ReLU, CNNs are capable of handling huge amounts of data and attain high accuracy in image classification. By designing different loss layers, CNNs can be used in numerous computer vision applications such as object identification, semantic segmentation, and human action recognition. Most CNN-based techniques are designed to make decisions for high-level vision tasks. Only a few CNN-based research works address image restoration.

Fig. 4 Structure of CNN

Sparse coding (SC)-based techniques can be related to the CNN model. Indeed, every filter in a convolutional layer can be considered as an atom in the dictionary. As the filter scans an image, it acts on the local patch in its receptive field and creates a response accordingly. Therefore, for every patch, a set of filters creates a response vector that corresponds to the sparse coefficient vector in an SC-based technique. Likewise, the coefficient mapping between modalities and the aggregation of patches can be realized through convolutional layers. Hence, the traditional SC-based techniques might be described as a three-layer fully-connected CNN. Nevertheless, the two differ considerably in operating mechanism and in algorithm design.

The following notation is employed here. Image matrices are denoted by uppercase letters such as X and Y. Vectors are denoted by lowercase letters such as b, w, and f. The letters F and W denote the mapping function and the filter groups, respectively. Scalars are denoted by lowercase letters, for instance \(n, b\). The main aim is to derive the corresponding sketch Y for a given face photo X. The main challenge is to find the mapping \(F(\cdot): Y=F\left(X\right).\) The CNN framework offers a highly compact form of the SC-based technique and brings various benefits. For approximating the mapping, a four-layer network is built.

The first layer is a convolutional layer that scans the input photo with a set of filters and returns a group of feature maps. This is expressed as

$${F}_{1}\left(X\right)={W}_{1}*X+{b}_{1}$$
(2)

in which \({W}_{1}\) and \({b}_{1}\) denote the filters and biases, respectively; the convolution operation is denoted by ‘∗’ and the output of the first layer by \({F}_{1}\left(X\right)\). \({W}_{1}\) comprises \({n}_{1}\) filters with support \({c}_{in}\times {f}_{1}\times {f}_{1}\), where \({f}_{1}\times {f}_{1}\) is the receptive field and \({c}_{in}\) is the number of input channels. Every filter is convolved with the input to derive a feature map, and the respective bias is then added to that feature map.
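A minimal NumPy sketch of Eq. (2) is given below to make the shapes concrete. It assumes stride 1 and no padding, and it computes the sliding-window product without flipping the kernel, as most deep learning frameworks do; none of these choices is stated in the text.

```python
import numpy as np

def conv_layer(X, W, b):
    """Eq. (2): F1(X) = W1 * X + b1, stride 1, no padding (assumptions).

    X: (c_in, H, W) input photo
    W: (n1, c_in, f1, f1) filters
    b: (n1,) biases
    """
    n1, c_in, f1, _ = W.shape
    height, width = X.shape[1], X.shape[2]
    out = np.empty((n1, height - f1 + 1, width - f1 + 1))
    for k in range(n1):
        for i in range(height - f1 + 1):
            for j in range(width - f1 + 1):
                patch = X[:, i:i + f1, j:j + f1]          # receptive field
                out[k, i, j] = np.sum(W[k] * patch) + b[k]  # filter response plus bias
    return out
```

With a \(1\times 5\times 5\) input and two \(3\times 3\) filters, the output holds two \(3\times 3\) feature maps, one per filter.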

In general, the sketch and the photo within a training pair are not completely registered. To ease this misalignment, a max-pooling layer is employed, since it is invariant to small transformations [20]. In traditional max-pooling, every feature map is segmented into non-overlapping rectangles. Here, an overlapping max-pooling layer [21] is used to preserve the resolution of the sketch: a sliding window outputs the maximum value within its receptive field. The second layer is expressed as

$${F}_{2}\left(X\right)=\text{maxPooling}\left({F}_{1}\left(X\right)\right)$$
(3)

The input of this layer comprises the \({n}_{1}\) feature maps of the first layer. The maxPooling(·) operator performs overlapping max-pooling over every feature map with an \({f}_{2}\times {f}_{2}\) receptive field. As every feature map is processed independently in this layer, \({F}_{2}\left(X\right)\) contains \({n}_{1}\) pooled feature maps.
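The overlapping pooling of Eq. (3) can be sketched as follows. The default window size `f2=3` and `stride=1` are assumptions chosen for illustration; the only property taken from the text is that the windows overlap (stride smaller than the window) and that each feature map is pooled independently.

```python
import numpy as np

def overlapping_max_pool(F1, f2=3, stride=1):
    """Eq. (3): slide an f2 x f2 window over each feature map and keep the maximum.

    F1: (n1, H, W) feature maps. Windows overlap whenever stride < f2.
    """
    n1, height, width = F1.shape
    out_h = (height - f2) // stride + 1
    out_w = (width - f2) // stride + 1
    out = np.empty((n1, out_h, out_w), dtype=F1.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = F1[:, i * stride:i * stride + f2, j * stride:j * stride + f2]
            out[:, i, j] = window.max(axis=(1, 2))  # one maximum per feature map
    return out
```

Because the stride is 1 in this sketch, the output loses only \(f_2 - 1\) pixels per dimension instead of being down-sampled by a factor of \(f_2\).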

In the SC-based pipeline, after sparse coding of the face photos, the representation coefficients are converted to the sketch modality. Because sketches are generally drawn by professional artists, this conversion is nonlinear. An improved convolutional structure known as the mlpconv layer [22] is therefore employed as the nonlinear mapping between the two modalities. An MLP is inserted within the filter, which acts like a convolution kernel that scans the input feature maps and produces a set of new feature maps. The computation of the mlpconv layer is expressed as follows

$${f}_{i,j,k}^{p}=\max\left({{w}_{k}^{p}}^{T}{f}_{i,j}^{p-1}+{b}_{k}^{p},\,0\right),\quad 1\le k\le {k}_{p},\; 1\le p\le l$$
(4)

Here, \(l\) denotes the number of perceptron layers, \({k}_{p}\) the number of nodes in the pth layer, and \((i,j)\) the center of the currently processed patch. Furthermore, \({f}_{3}\times {f}_{3}\) denotes the receptive field. The vector \({w}_{k}^{p}\) holds the weights connected to node \({f}_{i,j,k}^{p}\), the vector \({f}_{i,j}^{p-1}\) is the input from the previous layer, and \({b}_{k}^{p}\) is the bias. ReLU is used in every node. One advantage of ReLU is that its gradient does not saturate. Since the vanishing gradient problem is alleviated, the parameters of the early network layers can be updated rapidly during backpropagation. At the same time, ReLU is faster to compute than the tanh and sigmoid functions, since forward propagation only requires a threshold operation [19]. The ReLU activation function is

$$f\left(x\right)=max\left(0,x\right)$$
(5)
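Equations (4) and (5) can be sketched together in NumPy. The sketch assumes that the input vector \(f_{i,j}^{p-1}\) is the channel vector at pixel \((i,j)\), i.e., a \(1\times 1\) receptive field, which is a simplification of the \(f_3\times f_3\) field mentioned in the text; the function and argument names are illustrative.

```python
import numpy as np

def relu(x):
    """Eq. (5): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def mlpconv(F2, weights, biases):
    """Eq. (4) applied at every pixel, assuming a 1 x 1 receptive field.

    F2:      (k_0, H, W) input feature maps
    weights: list of l arrays, weights[p] has shape (k_p, k_{p-1})
    biases:  list of l arrays, biases[p] has shape (k_p,)
    """
    f = F2
    for W_p, b_p in zip(weights, biases):
        # w_k^{pT} f_{i,j}^{p-1} + b_k^p for all nodes k and all pixels (i, j)
        pre_activation = np.einsum("kc,chw->khw", W_p, f) + b_p[:, None, None]
        f = relu(pre_activation)
    return f
```

Each perceptron layer mixes the channels at a pixel and then clips negative responses to zero, so stacking several layers gives a nonlinear per-pixel mapping.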

The node \({f}_{i,j,k}^{l}\) of the final perceptron layer gives the response of patch \(\left(i,j\right)\) in the kth feature map of \({F}_{3}\left(X\right)\), i.e.,

$${F}_{3}{(X)}_{i,j,k}={f}_{i,j,k}^{l}$$
(6)

By evaluating all possible \(\left(i,j,k\right)\) combinations, the output \({F}_{3}\left(X\right)\) is obtained. For the final sketch synthesis, a convolutional layer is employed. The mapping from the digital photo to the sketch can be represented as follows

$$F\left(X\right)={W}_{4}*{F}_{3}\left(X\right)+{b}_{4}$$
(7)

in which \({W}_{4}\) contains \({c}_{out}\) filters with support \({n}_{3}\times {f}_{4}\times {f}_{4}\). Advanced IT technologies and IoT devices can be used to ease the investigation process of identifying the suspects. The IoT devices gather the digital data from the crime scene, examine the social media accounts, and transmit the data to the forensic department for further investigation.

2.2.2 Optimization of DL-CNN model using IEHO algorithm

Though the DL-CNN model yields good results, the choice of the optimal architecture for a particular task remains an open research problem. Generally, the effectiveness of a CNN depends on different hyperparameters such as the depth of the CNN, number of layers, filter count, filter size, batch size, step size, and learning rate. Developers have typically designed CNN structures manually and validated them to confirm their performance. To automate this process, the IEHO algorithm is employed in this paper to optimize the hyperparameters of the DL-CNN model.

In the IEHO algorithm, the herding behavior of elephants is used to solve global optimization problems. It is idealized in the following rules [23].

  1. The elephant population lives in clans, and each clan has a fixed, predefined number of elephants.

  2. A fixed number of male elephants leave their families and live alone at a distance at every iteration.

  3. The elephants in every clan live under the leadership of a matriarch. The oldest elephant in a clan is referred to as the matriarch; for the optimization problem, the matriarch is the fittest elephant individual in the clan.

2.2.2.1 Clan updating operator

As mentioned before, the elephants of every clan live together under the leadership of a matriarch. Hence, for an elephant in clan \(\text{c}\text{i}\), the next position is influenced by the matriarch. The elephant \(\text{j}\) in clan \(\text{c}\text{i}\) is updated as

$${\text{x}}_{\text{n},\text{c}\text{i},\text{j}}={\text{x}}_{\text{c}\text{i},\text{j}}+{\upalpha }\times \left({\text{x}}_{\text{b},\text{c}\text{i}}-{\text{x}}_{\text{c}\text{i},\text{j}}\right)\times \text{r}$$
(8)

where \({\text{x}}_{\text{n},\text{ci},\text{j}}\) and \({\text{x}}_{\text{ci},\text{j}}\) are the newly updated and current positions of elephant \(\text{j}\) in clan \(\text{ci}\). \({\upalpha }\in \left[\text{0,1}\right]\) is the scaling factor that determines the influence of the matriarch of clan \(\text{ci}\) on \({\text{x}}_{\text{ci},\text{j}}\). \({\text{x}}_{\text{b},\text{ci}}\) denotes the matriarch of clan \(\text{ci}\), i.e., the fittest elephant individual in the clan, and \(\text{r}\in \left[\text{0,1}\right]\) is a random number drawn from a stochastic distribution, which improves the population diversity in the later search phase. In this study, the uniform distribution was applied. The fittest elephant in every clan cannot be updated by Eq. (8), since \({\text{x}}_{\text{ci},\text{j}}={\text{x}}_{\text{b},\text{ci}}\) makes the update term vanish. To avoid this, the fittest elephant is updated as

$${\text{x}}_{\text{n},\text{c}\text{i},\text{j}}={\upbeta }\times {\text{x}}_{\text{c}\text{e}\text{n},\text{c}\text{i}}$$
(9)

where \({\upbeta }\in \left[\text{0,1}\right]\) is the factor that determines the influence of \({\text{x}}_{\text{cen},\text{ci}}\) on \({\text{x}}_{\text{n},\text{ci},\text{j}}\). The new individual \({\text{x}}_{\text{n},\text{ci},\text{j}}\) in Eq. (9) is produced from the information of all the elephants in clan \(\text{ci}\). \({\text{x}}_{\text{cen},\text{ci}}\) represents the centre of clan \(\text{ci}\), and its dth dimension is determined as

$${\text{x}}_{\text{cen},\text{ci},\text{d}}=\frac{1}{{\text{n}}_{\text{ci}}}\times {\sum }_{\text{j}=1}^{{\text{n}}_{\text{ci}}}{\text{x}}_{\text{ci},\text{j},\text{d}}$$
(10)

where \(1\le \text{d}\le \text{D}\) indexes the dth dimension and \(\text{D}\) is the total number of dimensions. \({\text{n}}_{\text{ci}}\) is the number of elephants in clan \(\text{ci}\), and \({\text{x}}_{\text{ci},\text{j},\text{d}}\) is the dth dimension of elephant individual \({\text{x}}_{\text{ci},\text{j}}\). The centre of clan \(\text{ci}\), \({\text{x}}_{\text{cen},\text{ci}}\), is thus determined by \(\text{D}\) calculations of Eq. (10).
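The clan updating operator of Eqs. (8) to (10) can be sketched for a single clan as follows. The sketch assumes a minimization problem, draws an independent \(r\) for every dimension, and uses placeholder defaults `alpha=0.5` and `beta=0.1`; none of these settings is given in the text.

```python
import numpy as np

def clan_update(clan, fitness, alpha=0.5, beta=0.1, rng=None):
    """Update one clan with Eqs. (8)-(10).

    clan:    (n_ci, D) float array of elephant positions
    fitness: (n_ci,) fitness values, lower is better (assumption)
    """
    rng = np.random.default_rng() if rng is None else rng
    best = int(np.argmin(fitness))        # index of the matriarch
    x_best = clan[best]
    x_cen = clan.mean(axis=0)             # Eq. (10): clan centre per dimension
    r = rng.uniform(0.0, 1.0, size=clan.shape)
    new_clan = clan + alpha * (x_best - clan) * r   # Eq. (8)
    new_clan[best] = beta * x_cen                   # Eq. (9) for the matriarch
    return new_clan
```

Every non-matriarch elephant moves part of the way toward the matriarch, while the matriarch itself is repositioned relative to the clan centre, which avoids the zero update that Eq. (8) would give it.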

2.2.2.2 Separating operator

Male elephants often leave their community and live separately once they mature. This separation can be modeled as a separating operator when solving the optimization problem. To enhance the search ability of the EHO model, the elephant individual with the worst fitness in each clan is replaced by the separating operator at every iteration, as given in Eq. (11).

$${\text{x}}_{\text{w},\text{ci}}={\text{x}}_{\text{min}}+\left({\text{x}}_{\text{max}}-{\text{x}}_{\text{min}}+1\right)\times \text{rand}$$
(11)

where \({\text{x}}_{\text{max}}\) and \({\text{x}}_{\text{min}}\) are the upper and lower limits of the position of an elephant, and \({\text{x}}_{\text{w},\text{ci}}\) is the worst elephant individual in clan \(\text{ci}\). \(\text{rand}\in \left[\text{0,1}\right]\) is a random number drawn from a stochastic distribution; the uniform distribution within the range \(\left[\text{0,1}\right]\) is applied in the present work.
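A short sketch of the separating operator in Eq. (11) is given below. It assumes minimization (the worst elephant has the largest fitness value) and one random number per dimension; the function name and signature are illustrative.

```python
import numpy as np

def separate_worst(clan, fitness, x_min, x_max, rng=None):
    """Replace the worst elephant of a clan with a random position, Eq. (11).

    clan:    (n_ci, D) float array of positions
    fitness: (n_ci,) fitness values, lower is better (assumption)
    """
    rng = np.random.default_rng() if rng is None else rng
    worst = int(np.argmax(fitness))
    new_clan = clan.copy()
    rand = rng.uniform(0.0, 1.0, size=clan.shape[1])
    new_clan[worst] = x_min + (x_max - x_min + 1) * rand   # Eq. (11)
    return new_clan
```

Note that, because of the \(+1\) term in Eq. (11), a sampled position can exceed \(x_{max}\); the text does not say whether such positions are clipped, so the sketch leaves them unchanged.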

2.2.2.3 Elitism strategy

As in other meta-heuristic approaches, an elitism strategy is employed here to prevent the best elephant individuals from being degraded by the clan updating and separating operators. The best elephant individuals are saved first, and after the operators have been applied, the worst individuals are replaced by the saved ones. This ensures that the new elephant population is never worse than the previous one at any point of the search.
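The elitism step can be sketched as two small helpers: one saves the best individuals before the operators run, the other writes them back over the worst individuals afterwards. The helper names, the `n_elite` parameter, and the minimization convention are assumptions made for illustration.

```python
import numpy as np

def save_elites(population, fitness, n_elite=1):
    """Copy the n_elite best individuals (lowest fitness, by assumption)."""
    best = np.argsort(fitness)[:n_elite]
    return population[best].copy(), fitness[best].copy()

def apply_elitism(population, fitness, elite_positions, elite_fitness):
    """Replace the worst individuals with the saved elites."""
    new_population = population.copy()
    new_fitness = fitness.copy()
    worst = np.argsort(fitness)[-len(elite_fitness):]
    new_population[worst] = elite_positions
    new_fitness[worst] = elite_fitness
    return new_population, new_fitness
```

In a full loop, `save_elites` would be called before the clan updating and separating operators, and `apply_elitism` after the updated population has been re-evaluated.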

2.2.2.4 Improved EHO algorithm

The classical EHO algorithm does not use the useful information contained in earlier elephant individuals to guide the present and future search. This might result in slow convergence on complex optimization problems. So, information from earlier elephant individuals is reused with the intention of enhancing the search capability of the EHO algorithm.

The EHO algorithm is improved using a new individual updating strategy. Theoretically, \(k\) \((k\ge 1)\) earlier elephant individuals can be chosen. However, selecting many individuals (\(k>3\)) makes the weight determination process difficult. So, the value of \(k\) is examined with \(k\in \{\text{1,2},3\}.\)

Let \({X}_{i}^{r}\) be the \(i\)th individual at round \(r\), where \({x}_{i}^{r}\) and \({f}_{i}^{r}\) denote its position and fitness value, respectively. Here, \(r\) is the present round, \(i\) \((1\le i\le {N}_{P})\) is an integer, and \({N}_{P}\) is the population size. \({y}_{i}^{r+1}\) is the individual produced by the traditional EHO algorithm and \({f}_{i}^{r+1}\) is its fitness [24]. The presented model draws on the individuals at the \((r-2)\)th, \((r-1)\)th, \(r\)th, and \((r+1)\)th rounds. In the case of \(k=1\), the \(i\)th individual \({x}_{i}^{r+1}\) is produced as

$${x}_{i}^{r+1}=\theta {y}_{i}^{r+1}+\omega {x}_{j}^{r}$$
(12)

where \({x}_{j}^{r}\) represents the position of individual \(j\) \((j\in \{\text{1,2},\ldots , {N}_{P}\})\) at round \(r\), and \(\theta\) and \(\omega\) are weighting factors satisfying \(\theta +\omega =1\). They are given in Eq. (13).

$$\theta =rand,\quad \omega =1-rand$$
(13)

where \(rand\) is a random number drawn from the uniform distribution in \(\left[\text{0,1}\right]\). The individual \(j\) can be chosen using one of two criteria:

  1. \(j=i\);

  2. \(j={rand}_{1}\), where \({rand}_{1}\) is a random integer with \(1\le {rand}_{1}\le {N}_{P}.\)

Individuals produced by the second criterion attain higher population diversity than those produced by the first.
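The \(k=1\) update of Eqs. (12) and (13), together with the two criteria for choosing \(j\), can be sketched as below. The sketch assumes that one \(rand\) value is drawn per individual and shared across its dimensions, which the text does not specify; names are illustrative.

```python
import numpy as np

def ieho_update_k1(y_next, x_prev, random_partner=True, rng=None):
    """Combine EHO candidates with earlier positions, Eqs. (12)-(13), k = 1.

    y_next: (N_P, D) candidates y_i^{r+1} produced by the classical EHO step
    x_prev: (N_P, D) positions x^r from the previous round
    random_partner: True uses criterion 2 (j = rand_1), False uses criterion 1 (j = i)
    """
    rng = np.random.default_rng() if rng is None else rng
    n_p = y_next.shape[0]
    theta = rng.uniform(0.0, 1.0, size=(n_p, 1))   # Eq. (13): theta = rand
    omega = 1.0 - theta                            # Eq. (13): omega = 1 - rand
    if random_partner:
        j = rng.integers(0, n_p, size=n_p)         # criterion 2, 0-based indices
    else:
        j = np.arange(n_p)                         # criterion 1
    return theta * y_next + omega * x_prev[j]      # Eq. (12)
```

Since \(\theta +\omega =1\), each new position is a convex combination of the EHO candidate and an earlier individual, so it always lies between the two.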

2.3 Validation phase

Once the ODL-CNN model has generated the face sketches of the images captured by the IoT devices, their similarity to the professional sketch image is measured. The measures used to determine the similarity are the Structural Similarity (SSIM) index and the Peak Signal to Noise Ratio (PSNR). The image with the maximum resemblance is taken to identify the suspect.

3 Performance validation

The proposed method was simulated in the Python programming language on a PC with an Intel i7-7500U CPU @ 2.70 GHz, 8 GB RAM, a 64-bit OS, and a 1 TB HDD. To evaluate the efficiency of the proposed technique, a detailed simulation analysis was performed on four benchmark datasets. Detailed quantitative and qualitative analyses were performed using the SSIM and PSNR measures. The parameter settings of the proposed method are as follows: batch size: 64, epochs: 100, learning rate: 0.001, and activation function: ReLU.

3.1 Dataset used

The analysis of the proposed technique was conducted on four datasets, namely the IIIT-D Sketch database [25], CUHK [26], AR [27], and CUFSF [28]. The IIIT-D dataset comprises three types of sketches: the Semi-forensic Sketch, Forensic Sketch, and Viewed Sketch databases. The database used here comprises 238 sketch-digital image pairs. The sketches were drawn by a specialized sketch artist from digital images taken from different sources: 67 sketch-digital image pairs were taken from the FG-NET aging database, 72 pairs from the IIIT-D staff and student database, and 99 pairs from the Labeled Faces in the Wild (LFW) database. The CUHK and AR datasets hold 188 and 123 color photo-sketch pairs, respectively, and the CUFSF dataset has a total of 1194 images with their respective sketches. Some sample images are shown in Fig. 5.

Fig. 5 Sample test images

3.2 Measures

To evaluate the visual resemblance between the viewed sketch and the output of the proposed technique, SSIM and PSNR were employed. SSIM is a common metric for quantitatively evaluating the quality of a visual image. It is a Human Visual System (HVS)-based metric that measures the structural resemblance between two input images. It models image distortion as a combination of losses in luminance, contrast, and correlation. The value of SSIM lies in the range of 0 to 1, and a higher value represents a greater resemblance. It is defined as

$${\text{SSIM}}=\frac{\left(2{\upmu}_{{\text{x}}}{\upmu}_{{\text{y}}}+{\text{C}}_{1}\right)\left(2{\upsigma}_{{\text{xy}}}+{\text{C}}_{2}\right)}{({\upmu}_{{\text{x}}}^{2}+{\upmu}_{{\text{y}}}^{2}+{\text{C}}_{1})({\upsigma}_{{\text{x}}}^{2}+{\upsigma}_{{\text{y}}}^{2}+{\text{C}}_{2})}$$
(14)

where \({\upmu}_{\text{x}}\) and \({\upmu}_{\text{y}}\) are the mean intensities of the two images, \({\upsigma}_{\text{x}}^{2}\) and \({\upsigma}_{\text{y}}^{2}\) are their variances, \({\upsigma}_{\text{xy}}\) is their covariance, and \({\text{C}}_{1}\) and \({\text{C}}_{2}\) are small constants that stabilize the division. PSNR measures the quality of a reconstructed image relative to the original image. A higher PSNR value indicates better performance.
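Both measures can be sketched in NumPy. The SSIM function below evaluates Eq. (14) once over the whole image, which is a simplification: common implementations average the index over local windows. The constants \(C_1=(0.01L)^2\) and \(C_2=(0.03L)^2\) with \(L=255\) are the usual defaults and are assumed here, as is the standard PSNR definition \(10\log_{10}(L^2/\text{MSE})\), which the text does not write out.

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak Signal to Noise Ratio in dB, 10 * log10(max_val^2 / MSE)."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """Eq. (14) computed over the whole image (single window, simplification)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * max_val) ** 2
    c2 = (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A generated sketch would be compared against the artist's sketch by calling both functions on the two grayscale arrays; identical inputs give an SSIM of 1 and an infinite PSNR.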

3.3 Results analysis

For a given sketch image database, Fig. 6 shows the comparative outcomes of the proposed technique. The first two rows show the actual input images and the viewed sketch images, respectively. The third and fourth rows show the gamma-corrected images and the sketches generated by the proposed technique, respectively. The figure shows that the sketch derived through the proposed method is visually clear when compared with the viewed sketch drawn by the artist.

Fig. 6 Sample visualization of different sketches. a Input image, b Viewed sketch, c forensic image, d proposed sketch

Figures 7 and 8 compare the results offered by ODL-CNN with those of existing models in terms of PSNR and SSIM. The methods used for comparison are Markov Random Field (MRF), Markov Weight Field (MWF), Sparse Representation-based Global Search (SRGS), Semi-Coupled Dictionary Learning (SCDL), CNN, and Modified CNN (M-CNN) [19].

Fig. 7 PSNR analysis of ODL-CNN with different models

Fig. 8 SSIM analysis of ODL-CNN with different models

On the AR dataset, the SCDL model yielded the lowest PSNR value of 17.18 dB. The MWF and CNN models yielded slightly better and closely spaced PSNR values of 18.74 dB and 18.23 dB, respectively. The MRF and SRGS models attained higher PSNR values of 19.84 dB and 19.13 dB, respectively. A competitive PSNR value of 20.10 dB was achieved by the M-CNN model, whereas the proposed ODL-CNN model offered the best performance with a PSNR value of 21.98 dB. In terms of SSIM on the AR dataset, the MRF and MWF models both reached the lowest value of 0.62. The CNN and SCDL methods exhibited slightly better SSIM values of 0.63 and 0.64, respectively, and the SRGS and M-CNN approaches both reached 0.65. The newly developed ODL-CNN model attained the maximum SSIM value of 0.69. On the CUHK dataset, the MWF method achieved the lowest PSNR value of 14.41 dB. The SRGS and MRF techniques arrived at moderate and nearby PSNR values of 14.79 dB and 15.07 dB, respectively. The SCDL and CNN models reached PSNR values of 15.14 dB and 15.64 dB, respectively. A competitive PSNR value of 17.15 dB was attained by the M-CNN model, while the presented ODL-CNN model performed best with a PSNR value of 18.64 dB.

In terms of SSIM on the CUHK dataset, the MRF and SRGS models obtained the lowest value of 0.58. The CNN, MWF, and SCDL frameworks attained an identical SSIM value of 0.59, and the M-CNN model reached a slightly better value of 0.60. The newly proposed ODL-CNN model yielded the best performance with an SSIM value of 0.63. On the CUFSF dataset, the SCDL model yielded the lowest PSNR value of 12.40 dB. The MWF and CNN models provided nearly identical PSNR values of 14.34 dB and 14.36 dB, respectively. The MRF and SRGS models achieved better PSNR values of 15.72 dB and 15.34 dB, respectively. A competitive PSNR value of 17.15 dB was produced by the M-CNN model, and the ODL-CNN model offered the maximum PSNR value of 18.07 dB.

By computing the results on CUFSF dataset, with respect to SSIM, the SCDL model is found to be inappropriate to produce better results since it reached a minimum SSIM value of 0.37 dB. Then, the CNN and MRF models ended with moderate and closer SSIM value of 0.38 dB. Similarly, the MWF and SRGS models accomplished even better SSIM values of 0.39 dB and 0.40 dB correspondingly. Likewise, a competitive SSIM value of 0.46 dB was achieved by M-CNN model while the proposed ODL-CNN model performed extremely well with the best SSIM value of 0.54 dB. When determining the results on IIIT dataset, by means of PSNR, the MWF model failed to produce the best results and reached a lower PSNR value of 17.20 dB. Following this, both SRGS and SCDL models concluded with manageable and closer PSNR values of 18.46 dB and 18.33 dB correspondingly. On the same way, the MRF and CNN models achieved even better PSNR values of 19.26 dB and 19.62 dB respectively. Similarly, a competitive PSNR value of 20.98 dB was yielded by M-CNN model while the projected ODL-CNN model depicted qualified performance with maximum PSNR value of 21.74 dB. When calculating the outcome on IIIT dataset in terms of SSIM, the MRF model failed to provide good results and achieved a lower SSIM value of 0.54 dB. Then, the MWF and SCDL methods yielded considerable and closer SSIM values of 0.57 dB and 0.58 dB correspondingly. In line with this, the SRGS and CNN models reached even greater SSIM values of 0.59 dB and 0.61 dB correspondingly. In line with this, the competitive SSIM value of 0.64 dB was attained by M-CNN model while the deployed ODL-CNN model yielded better performance with maximum SSIM value of 0.68 dB.
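For readers unfamiliar with the metric, the following is a minimal sketch of how PSNR between a reference sketch and a synthesized sketch is conventionally computed. It assumes 8-bit grayscale images flattened into equal-length sequences of pixel values; the function name `psnr` and the `max_value` parameter are illustrative choices and are not taken from the paper, whose exact implementation is not described here. PSNR is expressed in decibels because it is a logarithmic ratio, whereas SSIM is a dimensionless index.

```python
import math


def psnr(reference, synthesized, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: pixel intensities lie in [0, max_value] (255 for 8-bit images).
    """
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)

    # Identical images have zero error, so the ratio is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Hypothetical pixel values, used only to exercise the function.
    reference = [100, 100, 100, 100]
    synthesized = [110, 90, 110, 90]
    print(f"PSNR: {psnr(reference, synthesized):.2f} dB")
```

A higher PSNR indicates a smaller pixel-wise error between the two sketches, which is why the larger values in the comparisons above are read as better synthesis quality.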

Table 1 and Figs. 9 and 10 show the average PSNR and SSIM of the proposed ODL-CNN model in comparison with the existing models. The proposed ODL-CNN model achieved the highest average values, with a PSNR of 20.11 dB and an SSIM of 0.64.

Table 1 Average PSNR and SSIM analyses of existing models with ODL-CNN method
Fig. 9 Average PSNR analysis of ODL-CNN with different models

Fig. 10 Average SSIM analysis of ODL-CNN with different models
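The reported averages can be checked against the per-dataset values quoted in the text. The sketch below assumes the average is a plain arithmetic mean over the four datasets (AR, CUHK, CUFSF and IIIT); the dictionary layout and the helper name `average_scores` are illustrative only. Under that assumption, the ODL-CNN values give a mean PSNR of 20.1075 dB and a mean SSIM of 0.635, which is consistent with the reported 20.11 dB and 0.64 after rounding.

```python
from statistics import mean

# Order of the per-dataset values in each tuple below.
DATASETS = ("AR", "CUHK", "CUFSF", "IIIT")

# Per-dataset results as quoted in the text for the two strongest models.
PSNR_DB = {
    "M-CNN": (20.10, 17.15, 17.15, 20.98),
    "ODL-CNN": (21.98, 18.64, 18.07, 21.74),
}
SSIM = {
    "M-CNN": (0.65, 0.60, 0.46, 0.64),
    "ODL-CNN": (0.69, 0.63, 0.54, 0.68),
}


def average_scores(scores):
    """Return the arithmetic mean over datasets for every model.

    Assumption: all datasets are weighted equally.
    """
    return {model: mean(values) for model, values in scores.items()}


if __name__ == "__main__":
    avg_psnr = average_scores(PSNR_DB)
    avg_ssim = average_scores(SSIM)
    for model in PSNR_DB:
        print(f"{model}: PSNR {avg_psnr[model]:.4f} dB, SSIM {avg_ssim[model]:.4f}")
```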

A detailed accuracy analysis of the ODL-CNN model and the existing models is shown in Table 2 and Fig. 11, and Fig. 12 shows the average accuracy of the different models on the four datasets. The MWF and SCDL models attained the lowest and nearly equal average accuracy values of 69.14% and 69.08%, respectively. The MRF model produced a slightly higher average accuracy of 70.26%, and the SRGS model improved on these methods with an average accuracy of 71.14%. The CNN model produced moderate results with an average accuracy of 78.64%. The M-CNN model outperformed all the other methods except ODL-CNN, with an average accuracy of 85.61%. The proposed ODL-CNN model achieved the maximum average accuracy of 90.10%.

Table 2 Accuracy analysis of existing models with the proposed ODL-CNN method
Fig. 11 Accuracy analysis of ODL-CNN with different models

Fig. 12 Average accuracy analysis of ODL-CNN with different models

Based on the results presented above, the proposed ODL-CNN model outperformed all the existing methods on the four applied datasets. It can therefore be considered an appropriate tool for FSS that supports the suspect identification process.

4 Conclusion

In this paper, a new FSS approach is proposed using the ODL-CNN model, in which the parameters of DL-CNN are optimized using the IEHO algorithm. Before the digital images were transformed into face sketches, gamma correction was employed to improve the contrast of the input image. The ODL-CNN model drew sketches of the input images, which then underwent similarity assessment against the professional sketch for suspect identification. The proposed technique was assessed using four datasets. To evaluate the visual resemblance between the viewed sketch and the output of the proposed technique, SSIM and PSNR were employed. The detailed simulation analysis showed the effective performance of the ODL-CNN model, with a maximum average PSNR of 20.11 dB, an average SSIM of 0.64 and an average accuracy of 90.10%. The presented model exhibited better performance than the other methods on all four datasets. In future, the proposed model can be implemented in real-time scenarios to investigate surveillance videos in railway stations, airports and other public places.