1 Introduction

Applications of deep learning approaches in digital holography are highly promising as they adapt well to 2D and 3D information processing. In recent years, a variety of applications of deep neural networks in digital holography have been demonstrated to handle a wide range of problems, such as hologram segmentation, pixel super-resolution, three-dimensional (3D) digital holographic display, focus prediction, phase unwrapping, image super-resolution, and object classification [1,2,3,4,5,6,7,8]. The image reconstructed from a digital hologram is a 2D digital complex-valued image containing 3D information [9,10,11,12]. The amplitude and phase information contained in such a holographically reconstructed image dataset can be post-processed in a real-time deep learning framework to perform 2D/3D object classification and regression tasks.

Classification and regression are the basic supervised learning tasks in deep learning. The major difference between them is that a classification task produces discrete labels, whereas a regression task produces continuous labels [13]. The convolutional neural network (CNN) is one of the most important deep neural networks and is applied to both classification [14] and regression tasks [15]. A CNN consists of independent layers organized into feature extraction and classification stages [16]. The feature extraction stage has multiple convolutional and pooling layers, and the classification stage includes dense layers and an output layer. Convolutional layers comprise convolutional filters, whose number and size are varied according to the particular problem at hand. The dense layer contains several neurons, and the number of neurons in the output layer changes according to the problem under consideration.

Ren et al. [17] proposed a CNN to perform autofocusing as a regression task on hologram datasets of different amplitude and phase objects. Pitkaaho et al. proposed that a deep CNN can perform depth prediction as a classification task on digital holographic images [18]. Ren et al. also showed that a CNN can perform autofocusing as a classification task on digital holographic images [19]. Lee et al. showed that a CNN can be used to perform autofocusing in off-axis digital Fresnel holography [20]. Pitkaaho et al. demonstrated that the AlexNet architecture can perform deep-learning-based autofocusing on digital holograms of semi-transparent biological samples [21]. Reddy et al. proposed a deep CNN to perform 3D object classification on an augmented holographic phase-only image dataset [22]. Pitkaaho et al. [4] proposed that the AlexNet and Visual Geometry Group (VGG) networks can be employed to perform focus prediction for Madin–Darby canine kidney cell clusters encoded in digital holograms. Recently, Shimobaba et al. [15] proposed the application of a CNN to perform a binary regression task for depth prediction on hologram and power-spectrum datasets.

The present paper demonstrates CNN-based binary regression performed on the whole-information dataset of 3D objects retrieved using digital holography. The major difference between the proposed method and the previously mentioned methods is that here the CNN performs a binary regression task on the whole information retrieved from the digital holograms, i.e., on the combined intensity and phase of the object field retrieved from the digital holograms in the dataset. This is achieved by preparing an image dataset in which the intensity and phase are presented in concatenated form in a single image, representing the whole information of the 3D objects. The supervised regression task is then applied to this whole-information object dataset to accomplish the binary regression. This enables the intensity and phase (depth) features of a 3D object to be processed simultaneously for CNN-based binary regression in a deep learning network. Binary regression performed on the holographic information of 3D objects is equivalent to 3D object prediction performed on the whole-information object dataset, producing continuous labels as output, which is the intention of the present work. To perform the proposed binary regression, we considered a set of 18 different 3D objects divided into two subsets, ‘SET1’ and ‘SET2’, containing 4 and 14 objects respectively. The dataset for this experiment consisted of 2268 concatenated reconstructed intensity-phase images produced from these 3D objects at various locations and rotation angles. The whole information of the 3D objects in a scene was retrieved using an off-axis digital holographic recording geometry and a complex wave retrieval algorithm [23]. The results of the CNN-based binary regression task on the concatenated intensity-phase images of the 18 3D objects belonging to ‘SET1’ and ‘SET2’ were analyzed using evaluation metrics such as the mean absolute error (MAE), the \({R}^{2}\) score (coefficient of determination), and the explained variance (EV) regression score on the test/validation sets. Further, the CNN was compared with K-nearest neighbor (KNN), support vector machine (SVM), multi-layer perceptron (MLP), decision tree (DT), AdaBoost (ADB), random forest (RF), extra trees (ET), gradient boosting (GB), histogram gradient boosting (HGB), and stochastic gradient descent (SGD) regressors. The robustness and performance metrics are presented from a proof-of-concept experiment.

2 Regression analysis of 3D objects

The CNN-based binary regression of 3D objects is illustrated as follows. The set of 18 different 3D objects considered in this study comprises circle–pentagon (\({A}_{{d}_{i}}\)), circle–triangle (\({B}_{{d}_{i}}\)), circle–square (\({C}_{{d}_{i}}\)), circle–rectangle (\({D}_{{d}_{i}}\)), square–circle (\({E}_{{d}_{i}}\)), square–pentagon (\({F}_{{d}_{i}}\)), square–rectangle (\({G}_{{d}_{i}}\)), square–triangle (\({H}_{{d}_{i}}\)), triangle–circle (\({J}_{{d}_{i}}\)), triangle–pentagon (\({K}_{{d}_{i}}\)), triangle–rectangle (\({L}_{{d}_{i}}\)), triangle–square (\({M}_{{d}_{i}}\)), pentagon–circle (\({N}_{{d}_{i}}\)), pentagon–square (\({O}_{{d}_{i}}\)), pentagon–triangle (\({Q}_{{d}_{i}}\)), rectangle–circle (\({R}_{{d}_{i}}\)), rectangle–square (\({S}_{{d}_{i}}\)), and rectangle–triangle (\({T}_{{d}_{i}}\)), constituting the set \(\left\{k\right\}\):

$$k\in \left\{\begin{array}{c}{A}_{{d}_{i}}, {B}_{{d}_{i}},{C}_{{d}_{i}}, {D}_{{d}_{i}},{E}_{{d}_{i}},{F}_{{d}_{i}}, {G}_{{d}_{i}},{H}_{{d}_{i}},\\ {J}_{{d}_{i}},{K}_{{d}_{i}},{L}_{{d}_{i}}, {M}_{{d}_{i}},{N}_{{d}_{i}}, {O}_{{d}_{i}},{Q}_{{d}_{i}}{,R}_{{d}_{i}}, {S}_{{d}_{i}},{T}_{{d}_{i}}\end{array}\right\}, {d}_{i}\in \left\{{d}_{1},\,\,\dots \dots {d}_{i},\dots \dots {d}_{N}\right\}$$
(1)

Here \({d}_{i}\) denotes the distance between the recording plane and the object plane, and \(i\) is the index of the recording distance. The objects are located at different distances (locations) in a 3D scene. The objects of \(\left\{k\right\}\) are categorized into two subsets, ‘SET1’ and ‘SET2’, which are represented as follows:

$${\mathrm{SET}1: k}_{1}\in \left\{{A}_{{d}_{i}}, {B}_{{d}_{i}},{C}_{{d}_{i}}, {D}_{{d}_{i}}\right\}$$
(2)
$$\mathrm{SET}2:{k}_{2}\in \left\{{E}_{{d}_{i}},{F}_{{d}_{i}}, {G}_{{d}_{i}},{H}_{{d}_{i}}, {J}_{{d}_{i}},{K}_{{d}_{i}},{L}_{{d}_{i}}, {M}_{{d}_{i}},{N}_{{d}_{i}}, {O}_{{d}_{i}},{Q}_{{d}_{i}}{,R}_{{d}_{i}}, {S}_{{d}_{i}},{T}_{{d}_{i}}\right\}$$
(3)

Here the circle–pentagon (\({A}_{{d}_{i}}\)), circle–triangle (\({B}_{{d}_{i}}\)), circle–square (\({C}_{{d}_{i}}\)), and circle–rectangle (\({D}_{{d}_{i}}\)) combinations are considered for the ‘SET1’ and the remaining objects constitute the ‘SET2’.

The construction of the chosen 18 3D objects used in this study for experimental hologram recording can be understood from the schematic illustration of four of the objects presented in Fig. 1. Each 3D object consists of two parallel planes, namely a front plane and a back plane separated by a distance z = 8 mm. The features present on the two planes, together with the depth \(z\), impart both intensity and phase information to the object wave, as depicted in Fig. 1. The intensity features on the planes are formed by the shapes circle, triangle, square, pentagon, and rectangle, and hence the names of the 3D objects.

Fig. 1
figure 1

3D objects used for the recording of off-axis digital Fresnel hologram: a pentagon–circle, b rectangle–circle, c square–circle and d triangle–circle. Circle: 2 mm in diameter, triangle: 2 mm in x and 2 mm in y directions, square: 2 mm in x and 2 mm in y directions, pentagon: 2 mm in x and 2 mm in y directions, rectangle: 2 mm in x and 1 mm in y directions, the distance between the front and the back plane is 8 mm in the z-direction

Let the complex field distribution of the front plane be represented by \({a}_{1}\left(x,y\right)\mathrm{exp}[i{\phi }_{1}\left(x,y\right)]\) and, similarly, the complex field distribution of the back plane by \({a}_{2}\left(x,y\right)\mathrm{exp}[i{\phi }_{2}\left(x,y\right)]\). When light propagates through the front plane, it acquires the amplitude and phase information of the features on that plane. After free-space propagation over the distance \(z\), it also acquires the amplitude and phase information of the features of the back plane. The total complex field distribution of the 3D object wave field immediately after the back plane is then represented by Eq. (4).

$${a}_{u}\left(x,y\right)=\left\{{a}_{1}\left(x,y\right)\mathrm{exp}\left[i{\phi }_{1}\left(x,y\right)\right] \otimes \mathrm{exp}\left[\frac{i\pi }{\lambda z}\left({x}^{2}+{y}^{2}\right)\right]\right\}{a}_{2}\left(x,y\right)\mathrm{exp}[i{\phi }_{2}\left(x,y\right)]$$
(4)

Here ⊗ represents the convolution operation, and \({a}_{1}\left(x,y\right)\), \({\phi }_{1}(x,y)\) and \({a}_{2}(x,y)\), \({\phi }_{2}(x,y)\) represent the amplitude and phase components of the 3D object in the front and back planes respectively. To capture the 3D information of the object dataset, the complex field distribution of each 3D object is recorded into a single digital Fresnel hologram in an off-axis geometry. The whole complex object wave-field information is then retrieved by computational means and fed to a deep neural network to perform the binary regression task. An off-axis digital Fresnel hologram is formed by the interference between the object wave \({a}_{u}\left(.\right)\) and a plane reference wave \(R\left(.\right)=\mathrm{exp}\left(i\theta \right)\) incident at an angle θ. The complex Fresnel field distribution of the 3D object function \({a}_{u}(x,y)\), located at a distance \(d\) from the recording plane, is given by Eq. (5).

$$U\left(x^{\prime},y^{\prime},d\right)= \frac{{\mathrm{e}}^{i{k}_{1}d}}{i\lambda d}\iint {a}_{u}\left(x,y,0\right)\,\mathrm{exp}\frac{i\pi }{\lambda d}\left[{\left(x^{\prime}-x\right)}^{2}+{\left(y^{\prime}-y\right)}^{2}\right] \mathrm{d}x \mathrm{d}y$$
(5)

Here \(U(.)\) is the complex Fresnel field distribution of the 3D object at the recording plane, \((x,y)\) and (x′, y′) are the coordinates in the object plane and the recording plane respectively, and λ is the wavelength of the laser used to record the hologram. The recorded off-axis digital Fresnel hologram H(x′, y′) is described by Eq. (6).
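For illustration, Eq. (5) can be evaluated numerically with the standard single-FFT Fresnel transform. The sketch below is not taken from the paper; the function name, the assumption of a square \(N\times N\) grid with uniform pixel pitch, and the FFT-shift conventions are ours.

```python
import numpy as np

def fresnel_propagate(field, wavelength, distance, pitch):
    """Single-FFT Fresnel transform approximating Eq. (5).

    field      : 2D complex array a_u(x, y) sampled on an N x N grid
    wavelength : laser wavelength in metres (e.g. 632.8e-9)
    distance   : propagation distance d in metres
    pitch      : sampling pitch in the input plane in metres
    """
    n = field.shape[0]
    k = 2 * np.pi / wavelength
    # Input-plane coordinates and quadratic phase (chirp) inside the integral
    x = (np.arange(n) - n // 2) * pitch
    X, Y = np.meshgrid(x, x)
    chirp_in = np.exp(1j * np.pi / (wavelength * distance) * (X ** 2 + Y ** 2))
    # Output-plane sampling of the single-FFT Fresnel transform
    pitch_out = wavelength * distance / (n * pitch)
    xo = (np.arange(n) - n // 2) * pitch_out
    Xo, Yo = np.meshgrid(xo, xo)
    chirp_out = np.exp(1j * np.pi / (wavelength * distance) * (Xo ** 2 + Yo ** 2))
    prefactor = np.exp(1j * k * distance) / (1j * wavelength * distance)
    spectrum = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(field * chirp_in)))
    return prefactor * chirp_out * spectrum * pitch ** 2
```

Calling the function with a negative distance (equivalently, conjugated chirps) gives the corresponding inverse Fresnel transform used later for back-propagation to the object plane.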

$$H\left(x^{\prime},y^{\prime}\right)={\left|U\left(x^{\prime},y^{\prime}\right)+R\left(x^{\prime},y^{\prime}\right)\right|}^{2}={|{O}_{u}\,\mathrm{exp}(i{\overrightarrow{k}}_{1}.\overrightarrow{r})+\mathrm{exp}(i{\overrightarrow{k}}_{2}.\overrightarrow{r})|}^{2}$$
(6)

In Eq. (6), \({O}_{u}\) denotes the complex amplitude of the object wave at the recording plane, and \({\overrightarrow{k}}_{1}\) and \({\overrightarrow{k}}_{2}\) represent the propagation wave vectors of the object and reference waves respectively. The magnitude of the resultant wave vector \(\overrightarrow{K}\) of the fringe pattern is given by Eq. (7).

$$\left|\overrightarrow{K}\right|=\left|{\overrightarrow{k}}_{1}-{\overrightarrow{k}}_{2}\right|=\frac{4\pi }{\lambda }\mathrm{sin}\left(\frac{\theta }{2}\right)$$
(7)

where θ is the angle between the object and reference waves during the interference.

Figure 2 shows the setup for recording an off-axis digital Fresnel hologram of the 3D objects. A He–Ne laser emitting light of wavelength \(\uplambda =632.8\) nm is used as the source. The off-axis digital Fresnel hologram is recorded using a CMOS sensor at an angle \(\theta =1.4^\circ\). The CMOS sensor, with a square pixel pitch of \(6\times 6\,\mu \mathrm{m}\), records holograms of size \(1600\times 1600\) pixels. The complex wave retrieval algorithm [23] is used to retrieve the complex object wave field of the 3D object at the recording plane, which is then focused back to the object plane by an inverse Fresnel transform. This method is advantageous in eliminating the zero-order and twin-image terms from the off-axis digital Fresnel hologram. The resulting 2D digital complex-valued images at the object plane contain the 3D information of the objects.
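The paper uses the complex wave retrieval algorithm of Ref. [23]; as an illustrative substitute only, the sketch below shows the more common Fourier-domain filtering of the +1 diffraction order followed by back-propagation with the fresnel_propagate function from the earlier sketch. The filter half-width and the carrier-frequency location are assumptions that depend on the recording geometry.

```python
def reconstruct_off_axis(hologram, wavelength, distance, pitch, carrier, win=200):
    """Illustrative off-axis reconstruction (not the Ref. [23] algorithm).

    hologram : real-valued 2D array H(x', y'), assumed square
    carrier  : (row, col) location of the +1 order in the shifted Fourier plane
    win      : half-width of the rectangular filter around the +1 order
    """
    n = hologram.shape[0]
    spectrum = np.fft.fftshift(np.fft.fft2(hologram))
    # Keep only the +1 order and re-centre it to remove the carrier fringes
    filtered = np.zeros_like(spectrum)
    r, c = carrier
    filtered[n // 2 - win:n // 2 + win, n // 2 - win:n // 2 + win] = \
        spectrum[r - win:r + win, c - win:c + win]
    field_at_sensor = np.fft.ifft2(np.fft.ifftshift(filtered))
    # Back-propagate to the object plane (inverse Fresnel transform)
    return fresnel_propagate(field_at_sensor, wavelength, -distance, pitch)

# Intensity and phase of the reconstructed object field (hypothetical carrier position)
# obj = reconstruct_off_axis(H, 632.8e-9, 0.200, 6e-6, carrier=(900, 1100))
# intensity, phase = np.abs(obj) ** 2, np.angle(obj)
```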

Fig. 2
figure 2

Off-axis digital holography system configured for the recording of hologram of 3D objects. SF spatial filter assembly; CL collimation lens; BS beam splitter; M mirror; CMOS camera sensor

The elements of ‘\(\left\{k\right\}\)’ in Eq. (1) denote the recorded real-valued digital holograms of the 3D objects, whereas ‘\(\left\{Rk\right\}\)’ in Eq. (8) represents the reconstructed 2D digital complex-valued images of the 3D objects, which contain intensity and phase information. Thus, for every 3D object in ‘\(\left\{k\right\}\)’ there is a corresponding 2D digital complex-valued image in ‘\(\left\{Rk\right\}\)’.

$$Rk\in \left\{\begin{array}{c}{RA}_{{d}_{i}}, R{B}_{{d}_{i}},{RC}_{{d}_{i}}, {RD}_{{d}_{i}},{RE}_{{d}_{i}},{RF}_{{d}_{i}},\\ {RG}_{{d}_{i}},{RH}_{{d}_{i}}, {RJ}_{{d}_{i}},{RK}_{{d}_{i}},{RL}_{{d}_{i}}, R{M}_{{d}_{i}},\\ {RN}_{{d}_{i}}, {RO}_{{d}_{i}},{RQ}_{{d}_{i}}{,RR}_{{d}_{i}}, {RS}_{{d}_{i}},{RT}_{{d}_{i}}\end{array}\right\}, \quad {d}_{i}\in \left\{{d}_{1},\dots \dots {d}_{i},\dots \dots {d}_{N}\right\}$$
(8)

The reconstructed intensity images of the 3D objects are obtained from the 2D digital complex-valued images of \(\left\{Rk\right\}\) to form intensity image dataset ‘\(\{R{k}_{I}\}\)’.

$$R{k}_{I}\in \left\{\begin{array}{c}{RA}_{{d}_{i,I}}, R{B}_{{d}_{i,I}},{RC}_{{d}_{i,I}}, {RD}_{{d}_{i,I}},{RE}_{{d}_{i,I}},\\ {RF}_{{d}_{i,I}}, R{G}_{{d}_{i,I}},{RH}_{{d}_{i,I}}, {RJ}_{{d}_{i,I}},{RK}_{{d}_{i,I}},\\ {RL}_{{d}_{i,I}}, R{M}_{{d}_{i,I}},{RN}_{{d}_{i,I}}, {RO}_{{d}_{i,I}},{RQ}_{{d}_{i,I}},\\ {RR}_{{d}_{i,I}}, R{S}_{{d}_{i,I}},{RT}_{{d}_{i,I}}\end{array}\right\}, \quad {d}_{i}\in \left\{{d}_{1},\dots \dots {d}_{i},\dots \dots {d}_{N}\right\}$$
(9)

Here index \(I\) represents the set containing only intensity images. Similarly, the phase images of 3D objects are also separated from the 2D digital complex-valued images of ‘\(\{Rk\}\)’ to form phase image dataset ‘\(\{R{k}_{P}\}\)’.

$$R{k}_{P}\in \left\{\begin{array}{c}{RA}_{{d}_{i,P}}, R{B}_{{d}_{i,P}},{RC}_{{d}_{i,P}}, {RD}_{{d}_{i,P}},{RE}_{{d}_{i,P}},\\ {RF}_{{d}_{i,P}}, R{G}_{{d}_{i,P}},{RH}_{{d}_{i,P}}, {RJ}_{{d}_{i,P}},{RK}_{{d}_{i,P}},\\ {RL}_{{d}_{i,P}}, R{M}_{{d}_{i,P}},{RN}_{{d}_{i,P}}, {RO}_{{d}_{i,P}},{RQ}_{{d}_{i,P}},\\ {RR}_{{d}_{i,P}}, R{S}_{{d}_{i,P}},{RT}_{{d}_{i,P}}\end{array}\right\}, {d}_{i}\in \left\{{d}_{1},\dots \dots {d}_{i},\dots \dots {d}_{N}\right\}$$
(10)

Here the index \(P\) represents the set containing only phase images. The intensity image dataset ‘\(\{R{k}_{I}\}\)’ and the phase image dataset ‘\(\left\{R{k}_{P}\right\}\)’ are then combined to form the concatenated intensity-phase image dataset ‘\(\{R{k}_{IP}\}\)’.

$$R{k}_{IP}\in \left\{\begin{array}{c}{RA}_{{d}_{i,IP}}, R{B}_{{d}_{i,IP}},{RC}_{{d}_{i,IP}}, {RD}_{{d}_{i,IP}},{RE}_{{d}_{i,IP}},\\ {RF}_{{d}_{i,IP}}, R{G}_{{d}_{i,IP}},{RH}_{{d}_{i,IP}}, {RJ}_{{d}_{i,IP}},{RK}_{{d}_{i,IP}},\\ {RL}_{{d}_{i,IP}}, R{M}_{{d}_{i,IP}},{RN}_{{d}_{i,IP}}, {RO}_{{d}_{i,IP}},{RQ}_{{d}_{i,IP}},\\ {RR}_{{d}_{i,IP}}, R{S}_{{d}_{i,IP}},{RT}_{{d}_{i,IP}}\end{array}\right\}, {d}_{i}\in \left\{{d}_{1},\dots \dots {d}_{i},\dots \dots {d}_{N}\right\}$$
(11)
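As a minimal sketch of how one such whole-information image can be assembled from a reconstructed complex object field (the normalization to a common range and the horizontal concatenation axis are our assumptions, not details stated in the paper):

```python
import numpy as np

def concatenate_intensity_phase(obj_field):
    """Assemble one whole-information image from a reconstructed complex object field."""
    intensity = np.abs(obj_field) ** 2
    phase = np.angle(obj_field)                      # wrapped phase in [-pi, pi]
    # Normalize each part to [0, 1] so neither dominates the input scale (our choice)
    intensity = (intensity - intensity.min()) / (intensity.max() - intensity.min() + 1e-12)
    phase = (phase - phase.min()) / (phase.max() - phase.min() + 1e-12)
    # Side-by-side concatenation: a 1600 x 1600 field yields a 1600 x 3200 image
    return np.concatenate([intensity, phase], axis=1)
```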

The \(\left\{R{k}_{IP}\right\}\) is further subdivided into two different sets ‘SET1’ and ‘SET2’ as given by Eqs. (12) and (13) respectively.

$${\mathrm{SET}1:Rk}_{IP1}\in \left\{{RA}_{{d}_{i,IP}}, R{B}_{{d}_{i,IP}},{RC}_{{d}_{i,IP}}, {RD}_{{d}_{i,IP}}\right\}$$
(12)
$${\mathrm{SET}2: Rk}_{IP2}\in \left\{\begin{array}{c}{RE}_{{d}_{i,IP}},{RF}_{{d}_{i,IP}}, R{G}_{{d}_{i,IP}},{RH}_{{d}_{i,IP}}, {RJ}_{{d}_{i,IP}},{RK}_{{d}_{i,IP}},\\ {RL}_{{d}_{i,IP}}, R{M}_{{d}_{i,IP}},{RN}_{{d}_{i,IP}}, {RO}_{{d}_{i,IP}},{RQ}_{{d}_{i,IP}},\\ {RR}_{{d}_{i,IP}}, R{S}_{{d}_{i,IP}},{RT}_{{d}_{i,IP}}\end{array}\right\}$$
(13)

A CNN is designed to perform the binary regression task on the concatenated intensity-phase image dataset ‘\(\{R{k}_{IP}\}\)’, which contains the whole information of the object set. Binary regression performed on the holographic information of the 3D objects is equivalent to 3D object prediction performed on the whole-information object dataset and produces continuous labels as output, which is the intention of the present work. The CNN is then compared with K-nearest neighbor (KNN), support vector machine (SVM), multi-layer perceptron (MLP), decision tree (DT), AdaBoost (ADB), random forest (RF), extra trees (ET), gradient boosting (GB), histogram gradient boosting (HGB), and stochastic gradient descent (SGD) regressors.
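The paper does not state the numerical regression targets explicitly. A natural choice, assumed here purely for illustration, is to assign the target value 0 to every image from ‘SET1’ and 1 to every image from ‘SET2’, so that the single regression output can be read as a continuous score between the two sets; the helper name and the channel axis are ours.

```python
import numpy as np

def build_targets(set1_images, set2_images):
    """Stack whole-information images and assign hypothetical 0/1 regression targets.

    set1_images, set2_images : lists of 2D arrays (concatenated intensity-phase images)
    """
    X = np.stack(set1_images + set2_images).astype("float32")[..., np.newaxis]
    y = np.concatenate([np.zeros(len(set1_images), dtype="float32"),   # SET1 -> 0.0
                        np.ones(len(set2_images), dtype="float32")])   # SET2 -> 1.0
    return X, y
```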

3 Design of convolutional neural network for binary regression task

Figure 3 shows the architecture of the CNN used to perform the 3D object binary regression task on the concatenated intensity-phase image (whole-information) dataset ‘\(\{R{k}_{IP}\}\)’.

Fig. 3
figure 3

Proposed convolutional neural network (CNN) for regression task

The CNN has four convolutional stages and four pooling stages combined in the feature extraction stage. The classification layer of a standard CNN was replaced by a regression layer to perform the binary regression task. The CNN takes as input the concatenated intensity-phase image obtained from the digital hologram; the \(1600\times 3200\) concatenated image was resized to \(128\times 128\) before being fed to the network. The first convolutional layer applied 32 filters of size \(3\times 3\) to the \(128\times 128\) input to produce a first-stage output of size \(126\times 126\times 32\). Each convolutional layer generated a feature map that was reduced in the subsequent pooling stage. The first maxpooling stage compressed the output of the first convolutional stage from \(126\times 126\times 32\) to \(63\times 63\times 32\). The second convolutional layer, with 64 filters, received the \(63\times 63\times 32\) output of the first pooling stage and produced an output of size \(61\times 61\times 64\), which the second maxpooling stage reduced to \(30\times 30\times 64\). The same convolution–pooling process was repeated for two more stages, with 128 and 256 filters in the third and fourth convolutional layers respectively. The filter size in all four convolutional layers was the same as in the first convolutional layer. The pooling stages reduce the number of parameters and thereby the complexity of the network. The rectified linear unit (ReLU) activation function was used in the convolutional layers and in the dense layer. The output of the fourth pooling stage, of size \(6\times 6\times 256\), was flattened into 9216 values and given to a dense layer of 2048 neurons. Finally, the output layer received the output of the dense layer and performed the binary regression task with a single neuron and a linear activation function. The equation for the linear activation function is given below:

$$y=x$$
(14)

Here y and x are the output and input values respectively.
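As an illustration of the architecture described above, the following is a minimal Keras sketch (not the authors' code): the single grayscale input channel, the ‘valid’ padding, and the \(2\times 2\) pooling are inferred from the stated feature-map sizes, and the function name is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_regression_cnn(input_shape=(128, 128, 1)):
    """CNN for binary regression as described in Sect. 3 (a sketch, not the exact code)."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),    # 128 -> 126
        layers.MaxPooling2D((2, 2)),                     # 126 -> 63
        layers.Conv2D(64, (3, 3), activation="relu"),    # 63 -> 61
        layers.MaxPooling2D((2, 2)),                     # 61 -> 30
        layers.Conv2D(128, (3, 3), activation="relu"),   # 30 -> 28
        layers.MaxPooling2D((2, 2)),                     # 28 -> 14
        layers.Conv2D(256, (3, 3), activation="relu"),   # 14 -> 12
        layers.MaxPooling2D((2, 2)),                     # 12 -> 6, i.e. 6 x 6 x 256
        layers.Flatten(),                                # 9216 values
        layers.Dense(2048, activation="relu"),
        layers.Dense(1, activation="linear"),            # single neuron, y = x
    ])
```

The commented sizes reproduce the feature-map dimensions quoted in the text, from \(126\times 126\times 32\) down to \(6\times 6\times 256\) and the 9216-value flattened vector.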

3.1 Evaluation metrics

The evaluation metrics considered for the binary regression task are the mean absolute error (MAE), the R2 score (coefficient of determination), and the explained variance (EV) regression score. The MAE is the mean of the absolute differences between the target and predicted values.

$${\text{MAE}}\left( {t,p} \right) = \frac{1}{N}\sum\limits_{{i = 1}}^{N} {|t_{i} - p_{i} |}$$
(15)

where \({t}_{i}\) is the target value and \({p}_{i}\) is the predicted value. The range of the MAE is not fixed; however, it should be as low as possible, ideally zero. The R2 score and the EV regression score measure how well the network predicts the target values. An R2 score of 1.0 indicates the best possible performance, whereas a score of 0.0 corresponds to a model that always predicts a constant value. The EV regression score and the R2 score are given by

$$\mathrm{EV}\left(t,p\right)=1-\frac{\mathrm{Var}\{t-p\}}{\mathrm{Var}\{t\}}$$
(16)

Here \(\mathrm{Var}\{.\}\) represents the variance.

$${R}^{2}\left(t,p\right)=1-\frac{{\sum }_{i=1}^{N}({{t}_{i}-{p}_{i})}^{2}}{{\sum }_{i=1}^{N}{\left({t}_{i}-{t}_{\mathrm{bar}}\right)}^{2}}$$
(17)

Here \(t_{{{\text{bar}}}} = \frac{1}{N}\sum\limits_{{i = 1}}^{N} {(t_{i} )}\) represents the mean of the target values.
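All three metrics are available in scikit-learn; the following is a minimal sketch of how they can be computed from a vector of targets and predictions (the helper name is ours).

```python
from sklearn.metrics import (mean_absolute_error, r2_score,
                             explained_variance_score)

def regression_report(t, p):
    """Compute the evaluation metrics of Eqs. (15)-(17) for targets t and predictions p."""
    return {"MAE": mean_absolute_error(t, p),
            "R2": r2_score(t, p),
            "EV": explained_variance_score(t, p)}

# Example: regression_report(y_test, model.predict(X_test).ravel())
```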

4 Dataset preparation

Eighteen 3D objects, as shown in Eq. (1), were used for the preparation of the concatenated intensity-phase image (whole-information) dataset. Holograms of these eighteen 3D objects were recorded using an off-axis digital holographic recording setup configured in a Mach–Zehnder interferometer geometry, as shown in Fig. 2. The 18 3D objects in \(\left\{k\right\}\) were used to record 63 holograms of size \(1600\times 1600\) pixels with the CMOS sensor at fifteen different distances: \({d}_{1}=180\) mm, \({d}_{2}=185\) mm, \({d}_{3}=200\) mm, \({d}_{4}=201\) mm, \({d}_{5}=205\) mm, \({d}_{6}=210\) mm, \({d}_{7}=220\) mm, \({d}_{8}=250\) mm, \({d}_{9}=251\) mm, \({d}_{10}=255\) mm, \({d}_{11}=300\) mm, \({d}_{12}=305\) mm, \({d}_{13}=310\) mm, \({d}_{14}=311\) mm, and \({d}_{15}=315\) mm.

Recorded holograms of some of the 3D objects at a distance of \({d}_{3}=200\) mm are shown in Fig. 4. Equations (2) and (3) represent the categorization of the holograms into the two sets ‘SET1’ and ‘SET2’ respectively. The holograms of \(\left\{k\right\}\) were used to obtain the intensity and phase information of the 3D objects in the form of 2D digital complex-valued images using the complex wave retrieval method [23], forming ‘\(\left\{Rk\right\}\)’. The datasets ‘\(\left\{R{k}_{I}\right\}\)’ and ‘\(\left\{R{k}_{P}\right\}\)’ consist of the intensity and phase images of the 3D objects obtained from ‘\(\left\{Rk\right\}\)’. The corresponding elements of ‘\(\left\{R{k}_{I}\right\}\)’ and ‘\(\left\{R{k}_{P}\right\}\)’ were then combined to form the concatenated intensity-phase image dataset ‘\(\left\{R{k}_{IP}\right\}\)’. The concatenated intensity-phase image of the object circle–triangle belonging to ‘SET1’ (\({RB}_{{d}_{3,IP}}\)) at a distance of \({d}_{3}=200\) mm is shown in Fig. 5. The concatenated intensity-phase image of the object pentagon–square belonging to ‘SET2’ (\({RO}_{{d}_{3,IP}}\)) at a distance of \({d}_{3}=200\) mm is shown in Fig. 6. Equations (9) and (10) represent the reconstructed intensity and phase image datasets of the 3D objects respectively, and Eq. (11) represents the concatenated intensity-phase image dataset.

Fig. 4
figure 4

Experimentally recorded holograms of five 3-D objects at distance of \({d}_{3}=200\) mm for the preparation of the concatenated intensity-phase image dataset. (i) circle–triangle (\({B}_{{d}_{3}}\)), (ii) square–rectangle (\({G}_{{d}_{3}}\)), (iii) triangle–rectangle (\({L}_{{d}_{3}}\)), (iv) triangle–pentagon (\({K}_{{d}_{3}}\)), (v) pentagon–triangle (\({Q}_{{d}_{3}}\))

Fig. 5
figure 5

Concatenated intensity-phase image of the object circle-triangle (\({RB}_{{d}_{3,IP}}\)) at a distance of \({d}_{3}=200\) mm

Fig. 6
figure 6

Concatenated intensity-phase image of the object pentagon-square (\({RO}_{{d}_{3,IP}}\)) at a distance of \({d}_{3}=200\) mm

Each concatenated intensity-phase image present in ‘\(\left\{R{k}_{IP}\right\}\)’ was rotated in steps of five degrees to form 2268 images in the concatenated intensity-phase image dataset ‘\(\left\{R{k}_{IP}\right\}\)’. Using the proposed CNN shown in Fig. 3, the binary regression task of the 3D objects was performed on ‘\(\left\{R{k}_{IP}\right\}\)’. For the implementation of the CNN, the size of the concatenated intensity-phase image considered was \(1600\times 3200\). The binary regression task using the CNN was performed on ‘\(\left\{R{k}_{IP}\right\}\)’ as per Eqs. (12) and (13). Next, the CNN was compared with the KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD regressors.

For the KNN regressor, the number of nearest neighbors considered was \(k=1\), i.e., a 1-nearest neighbor regressor. For the SVM regressor, a linear kernel was used. For the MLP regressor, a single-hidden-layer neural network with the ReLU activation function was used; it was trained with the Adam optimizer using a learning rate of \(0.001\) and a regularization parameter (\(\alpha\)) of \(0.0001\). For the DT regressor, the squared error was used as the criterion, the best method was used as the splitter, min_samples_split was set to 2, and min_samples_leaf was set to 1. For the ADB regressor, the number of estimators was 50 (n_estimators = 50), the learning rate was 1.0, and a linear loss function was used. For the RF regressor, the number of estimators was 100, the squared error was used as the criterion, min_samples_split was set to 2, and min_samples_leaf was set to 1. The ET regressor used the same settings: 100 estimators, the squared-error criterion, min_samples_split of 2, and min_samples_leaf of 1. For the GB regressor, the number of estimators was 100, the maximum depth was 3, and the learning rate was 0.1. For the HGB regressor, the number of estimators was 100 and the learning rate was 0.1. For the SGD regressor, the maximum number of iterations was 1000 (max_iter = 1000) and the regularization parameter (α) was set to 0.0001.

The CNN was trained using a train-validation-test split in the ratio 75:15:10. The training set consisted of 1701 (75%) images, the validation set of 341 (15%) images, and the test set of 226 (10%) images, all drawn from ‘SET1’ and ‘SET2’. The remaining regressors (KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD) were trained in a similar manner. The CNN was trained using the Adam optimizer for 50 epochs with a learning rate of 0.0005, kept constant throughout training (a minimal sketch of this training configuration is given after Eq. (18)). The loss function used for training the CNN was the mean square error (MSE), and the monitored metrics were the MSE and the mean absolute error (MAE). The equation for the loss function is shown below:

$${\text{MSE}}\left( {t,p} \right) = \frac{1}{N}\sum\limits_{{i = 1}}^{N} {\left| {t_{i} - p_{i} } \right|^{2}}$$
(18)

The CNN was implemented using TensorFlow version 2.0 with Python 3.5. The KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD regressors were implemented using scikit-learn.
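The following is a minimal sketch of this training and comparison setup, assuming the hypothetical build_regression_cnn, build_targets, and regression_report helpers from the earlier sketches (and the `import tensorflow as tf` from the model sketch). The split calls, random seed, prior resizing of the images to the \(128\times 128\) network input, and the flattening of images for the scikit-learn baselines are our assumptions, and only three of the ten baseline regressors are shown.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# X: images resized to 128 x 128 with a channel axis, y: 0/1 targets (from build_targets)
# 75:15:10 split of the 2268 whole-information images
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.25, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# CNN training: Adam, learning rate 0.0005, MSE loss, 50 epochs, batch size 21
model = build_regression_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss="mse", metrics=["mse", "mae"])
history = model.fit(X_train, y_train, batch_size=21, epochs=50,
                    validation_data=(X_val, y_val))

# A few of the scikit-learn baselines, fitted on flattened images
baselines = {"KNN": KNeighborsRegressor(n_neighbors=1),
             "SVM": SVR(kernel="linear"),
             "RF": RandomForestRegressor(n_estimators=100)}
X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)
for name, reg in baselines.items():
    reg.fit(X_train_flat, y_train)
    print(name, regression_report(y_test, reg.predict(X_test_flat)))
```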

4.1 Results and discussion

A batch of 21 images from the training set and 20 images from the validation set was considered in each step during the training of the CNN. Each epoch therefore comprised 81 steps on the training set and 17 steps on the validation set.

The loss and MSE curves on the training and validation sets are shown in Fig. 7a, b respectively. From Fig. 7a, it can be observed that the loss on the training set decreases while the validation loss fluctuates around 0.2. After training, the training loss is 0.0017 and the validation loss is 0.2049. Further, from Fig. 7b, it can be seen that the MSE also decreases on the training set while the validation MSE fluctuates around 0.3. After training, the MSE values on the training and validation sets are 0.0017 and 0.2049 respectively. Since the training loss is lower than the validation loss, the CNN model converges well and fits correctly.

Fig. 7
figure 7

a Loss b mean square error (MSE) curves on the training set and validation set

Figure 8 shows the MAE curves on the training and validation sets. From Fig. 8, it can be observed that the MAE on the training set decreases while the validation MAE fluctuates around 0.4. After training, the MAE values on the training and validation sets are 0.0295 and 0.3395 respectively. The CNN was tested separately on the test set using batches of 23 images. The test loss, MSE, and MAE values are 0.0273, 0.0275, and 0.1142 respectively. The evaluation metrics computed on the test and validation sets are listed in Tables 1 and 2 respectively.

Fig. 8
figure 8

Mean absolute error (MAE) curve on the training set and validation set

Table 1 Evaluation metrics for the concatenated intensity-phase image dataset on the test set
Table 2 Evaluation metrics for the concatenated intensity-phase image dataset on the validation set

From Table 1, it can be observed that the CNN results in an MAE of 0.17, an R2 score of 0.73, and an EV regression score of 0.83. The evaluation metrics for the binary regression task on the test/validation sets are calculated as described in Sect. 3.1. Since the CNN model results in a good R2 score of 0.73 and a high EV regression score of 0.83, it shows good regression performance for the concatenated intensity-phase image dataset on the test set. Further, from Table 1 it can be observed that the CNN has higher R2 and EV regression scores than the KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD regressors.

The evaluation metrics on the validation set, computed using batches of 20 images, are shown in Table 2. The MAE, R2 score, and EV regression score obtained from the CNN are 0.30, 0.25, and 0.30 respectively. Since the CNN model results in an R2 score of only 0.25 and an EV regression score of only 0.30, it is concluded to have a near-constant regression performance for the concatenated intensity-phase image dataset on the validation set. Further, from Table 2 it can be observed that the CNN still has higher R2 and EV regression scores than the KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD regressors.

5 Conclusion

In this paper, a CNN is applied to perform binary regression on concatenated intensity-phase images of 3D objects. The images of the dataset were prepared from the intensity and phase (depth) information computationally retrieved from the digital holograms and presented as a single image in concatenated form, which accommodates the whole information of the 3D object. A dataset comprising 2268 such concatenated intensity-phase images of the chosen 18 3D objects at different recording distances and rotation angles was used for this study. The binary regression task performed on this dataset of holographic information of 3D objects is equivalent to 3D object prediction performed on the whole-information object dataset; it produces continuous labels as output, which is the intention of the reported work. The loss, mean square error (MSE), and mean absolute error (MAE) curves are shown for the training/validation sets, and the evaluation metrics, namely the MAE, R2 score, and explained variance (EV) regression score, are reported for the test/validation sets to confirm the results. The CNN resulted in low loss, MAE, and MSE on the training set. The evaluation metrics show good performance on the test set and near-constant performance on the validation set, with R2 scores of 0.73 and 0.25 and EV regression scores of 0.83 and 0.30 respectively; the R2 and EV regression scores are therefore better on the test set than on the validation set. Further, the CNN yielded higher R2 and EV regression scores than the KNN, SVM, MLP, DT, ADB, RF, ET, GB, HGB, and SGD regressors on the test/validation sets. Therefore, it can be concluded that the CNN model performs well for the binary regression task on the concatenated intensity-phase image dataset of 3D objects, i.e., on the whole-information dataset produced from the digital holograms.