1 Introduction

With the recent advancements in information and communication technology, information displays have become pervasive in our daily lives. By combining these information displays with gesture recognition technology, it becomes possible to create interactive information interfaces that can switch images on the display based on the user’s gestures. Examples of gesture recognition applications include patient monitoring, anomaly detection using surveillance cameras, master–slave operations for robots, and sign language recognition [1]. To perform gesture recognition, various devices are used, such as stereo high-speed cameras [2], stereo infrared cameras (Leap Motion) [3], and Time of Flight (ToF) 3D cameras (Kinect) [4]. However, using cameras capable of capturing detailed images for gesture recognition is not feasible in many places due to concerns regarding privacy and information leakage. Examples of such places include personal spaces like toilets and bathrooms, as well as public spaces. Particularly in bathrooms, it is not possible to use electrostatic sensors, and voice recognition is difficult due to water sounds. To address this problem, research has been conducted on methods such as reducing the resolution of captured images [5] and performing masking operations outside the required areas [6]. We have proposed a method of capturing shadow pictures using single-pixel-imaging to realize privacy-conscious gesture recognition [7].

Single-pixel-imaging is a technique that utilizes spatially modulated illumination and a single light detector to capture images [8]. It allows imaging under low-light conditions and with light sources other than visible light, making it applicable in a wide range of scenes. To perform single-pixel-imaging, a modulable light source is required, and various displays already present in public spaces can serve as suitable light sources. We have previously proposed single-pixel-imaging using a high-speed modulable LED display for banner advertisements and news display [9]. In this case, the content of the banner display can be directly utilized as the spatial light intensity distribution of the light source [10]. Alternatively, by embedding random patterns while maintaining the apparent image recognizable to observers [11], it becomes possible to achieve a balance between digital signage display and imaging without constraints on the content. However, this approach presents a challenge where the reconstructed images through single-pixel-imaging are influenced by the apparent image, making gesture recognition difficult [12].

To solve this problem, we propose using deep learning to restore the original image by removing the influence of the apparent image from the image reconstructed by single-pixel-imaging. Although deep learning has been proposed to reduce the number of illuminations required for single-pixel-imaging [7], this study aims to achieve both a reduction in the number of illuminations and the removal of apparent images. Preliminary results of this study were presented at LDC2023 [13]. The purpose of this paper is to investigate the classification accuracy of single-pixel-imaging images reconstructed with latent random patterns in the illumination after the influence of the apparent images has been removed through deep learning. To achieve this, a neural network, U-Net, is trained on pairs of reconstructed and original images, and the reconstructed image is then restored by the network. LeNet is then used to determine the classification accuracy of the restored images.

2 Principle

2.1 Single-pixel-imaging with random-dot-embedded apparent images

The principle of the single-pixel-imaging with random-dot-embedded apparent images is shown in Fig. 1. The encoded images are displayed on an LED display at a sufficiently high frame rate, so the observer perceives an apparent image that integrates the encoded images. The light transmitted through the subject is measured by a single detector and reconstructed using the principle of single-pixel-imaging with 2D encoding images and 1D temporal signals. The reconstruction of single-pixel-imaging is expressed by

$$\begin{aligned} G\left(x,y,n\right) & =\langle \Delta I\left(x,y,n\right)\Delta A\left(n\right)\rangle\\ &=\langle \left[I\left(x,y,n\right)-\langle I\left(x,y,n\right)\rangle \right]\left[A\left(n\right)-\langle A\left(n\right)\rangle \right]\rangle\\ & =\langle I\left(x,y,n\right)A\left(n\right)\rangle -\langle I\left(x,y,n\right)\rangle \langle A\left(n\right)\rangle\end{aligned}$$
(1)

where \(\Delta I\left(x,y,n\right)\) is the deviation of the light intensity \(I\left(x,y,n\right)\) of the n-th 2D encoding image at coordinates \(\left(x,y\right)\) from its mean \(\langle I\left(x,y,n\right)\rangle\), and \(\Delta A\left(n\right)\) is the deviation of the n-th 1D temporal signal \(A\left(n\right)\) from its mean \(\langle A\left(n\right)\rangle\). \(A\left(n\right)\) is given by

$$A\left(n\right)=\iint T\left(x,y\right)I\left(x,y,n\right)\,\mathrm{d}x\mathrm{d}y$$
(2)

where \(T\left(x,y\right)\) denotes the transmission function [14]. Thus, the reconstructed image after n measurements can be obtained from the 2D encoded images displayed on the LED display and the 1D temporal signal measured by a single detector. The reconstructed images are influenced by noise and by the apparent image, which makes gesture recognition difficult.
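Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the 8 × 8 transmission mask, the uniform random patterns, the number of illuminations, and the function names are all assumptions chosen for illustration, and the integral in Eq. (2) is replaced by a sum over pixels.

```python
import numpy as np

def simulate_signal(patterns, transmission):
    """Eq. (2) in discrete form: A(n) = sum_xy T(x, y) * I(x, y, n)."""
    # patterns: (n, H, W), transmission: (H, W) -> signal: (n,)
    return np.tensordot(patterns, transmission, axes=([1, 2], [0, 1]))

def reconstruct(patterns, signal):
    """Eq. (1), last line: G = <I A> - <I><A>, averaged over illuminations."""
    mean_ia = np.tensordot(signal, patterns, axes=(0, 0)) / len(signal)
    return mean_ia - patterns.mean(axis=0) * signal.mean()

rng = np.random.default_rng(0)

# Hypothetical subject: a small rectangular transmitting region.
T = np.zeros((8, 8))
T[2:6, 3:5] = 1.0

# Hypothetical illumination: 5000 uniform random patterns.
I = rng.random((5000, 8, 8))

A = simulate_signal(I, T)   # 1D temporal signal of the single detector
G = reconstruct(I, A)       # reconstructed image, brighter where T = 1
```

With enough illuminations, `G` is clearly larger inside the transmitting region than outside it, and the noise outside that region grows as the number of patterns is reduced.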

Fig. 1 Principle of single-pixel-imaging with an apparent image containing latent random patterns

2.2 Encoding of apparent images

The LED display is updated at a sufficiently high frame rate so that the observer perceives an integrated image of the latent random patterns. This principle has been confirmed with LED displays at 960 fps [11]. The apparent image is encoded into m frames that contain latent random patterns. The encoded frames satisfy:

$$V\left(x,y\right)\equiv \sum_{n=1}^{m}E\left(x,y,n\right)$$
(3)

where \(V\left(x,y\right)\) is the pixel value of the apparent image at coordinate \(\left(x,y\right)\) and \(E\left(x,y,n\right)\) is the pixel value of the n-th encoded image [15].

In this study, the apparent image was also encoded to satisfy Eq. (3). The apparent image used in the experiment was a binary image with pixel values (190,255) as shown in Fig. 2. When \(m=2\) is used as an example of encoding, Fig. 3 shows two coded images of Fig. 2. Table 1 shows the composition of pixel values by encoding two images. By displaying these two images at high speed on an LED display, the observer perceives the apparent image shown in Fig. 2.
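As a rough illustration of the constraint in Eq. (3), the following sketch splits an apparent image into m random frames whose pixel-wise sum equals the apparent image. The splitting rule (random weights followed by rounding) and the function name `encode_apparent_image` are assumptions made for this example; the pixel-value composition actually used in this study is the one given in Table 1, which this sketch does not reproduce.

```python
import numpy as np

def encode_apparent_image(apparent, m, rng):
    """Split `apparent` (integer array) into m non-negative integer frames
    whose pixel-wise sum equals `apparent`, as required by Eq. (3)."""
    # Random positive weights per pixel and frame, normalised to sum to 1.
    weights = rng.random((m,) + apparent.shape)
    weights /= weights.sum(axis=0)
    # Rounding down guarantees the partial sum never exceeds the target.
    frames = np.floor(weights * apparent).astype(np.int64)
    # Put the rounding remainder into the first frame so the sum is exact.
    frames[0] += apparent - frames.sum(axis=0)
    return frames

rng = np.random.default_rng(0)

# Hypothetical 4 x 4 binary apparent image with pixel values 190 and 255.
apparent = np.full((4, 4), 190, dtype=np.int64)
apparent[1:3, 1:3] = 255

frames = encode_apparent_image(apparent, m=2, rng=rng)
```

Each frame looks random on its own, while the sum over the m frames reproduces the apparent image at every pixel.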

Fig. 2 Apparent image

Fig. 3 Two encoded images

Table 1 Composition of pixel values by two encoded images

2.3 U-Net

The structure of U-Net is shown in Fig. 4. U-Net is a convolutional neural network (CNN) that is good at capturing and restoring features of input images [16]. In the convolutional process, a filter-based convolution is performed on the input to output a feature map. Max-pooling reduces the resolution of the input by extracting the maximum value within the filter window and aggregating it into a single value. Unpooling then brings the resolution back to the original. These processes make it possible to capture the features of an object. However, since the positional information of the object is lost in these processes, the feature maps obtained before pooling are concatenated to the upsampled feature maps to complement the positional information. This is called a skip connection.

Fig. 4 Structure of U-Net

U-Net was developed for medical image segmentation. It was used in this study because it is suited to restoring images reconstructed by single-pixel-imaging, which contain a lot of noise.
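The pooling, upsampling, and skip-connection steps described above can be sketched in NumPy as follows. This is only an illustration of the data flow: the convolution layers are omitted, nearest-neighbour upsampling stands in for the unpooling step, and the function names are hypothetical. The network actually used in this study is the one implemented on NNC.

```python
import numpy as np

def max_pool_2x2(x):
    """Halve the resolution of a (C, H, W) feature map by 2 x 2 max-pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_2x(x):
    """Double the resolution of a (C, H, W) feature map (nearest neighbour)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_connection(encoder_map, decoder_map):
    """Concatenate encoder and decoder feature maps along the channel axis."""
    return np.concatenate([encoder_map, decoder_map], axis=0)

# Hypothetical feature map with 2 channels of 4 x 4 pixels.
features = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)

pooled = max_pool_2x2(features)                # (2, 2, 2): coarse features
restored = upsample_2x(pooled)                 # (2, 4, 4): positions blurred
merged = skip_connection(features, restored)   # (4, 4, 4): positions recovered
```

After pooling and upsampling, every 2 × 2 block of `restored` holds a single value, so fine positional detail is lost; concatenating the original `features` gives the following layers access to that detail again.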

2.4 LeNet

The structure of LeNet is shown in Fig. 5. LeNet is a CNN-based network model suitable for image classification [17]. This network performs classification by repeating convolutional and max-pooling layers, followed by repeated affine layers. In this paper, we added layers for image augmentation to compensate for the lack of training data.

Fig. 5 Structure of LeNet

3 Experiments

3.1 Reconstruction of single-pixel-imaging

Hand gesture images were reconstructed using single-pixel-imaging with random patterns and single-pixel-imaging with apparent images. Reconstruction quality was evaluated with the SSIM value, a measure of structural similarity; the closer the value is to 1, the higher the similarity. The apparent image was encoded into 20 images. The pixel value composition of the 20 encoded images is shown in Table 2, and some of the 20 encoded images are shown in Fig. 6. The order in which these are displayed is random for each pixel. The hand gesture images are 18,000 images of 40 × 40 pixels simulated on a computer. The composition of the hand gesture images is shown in Table 3, and the hand gesture images are shown in Fig. 7.

Table 2 Composition of pixel values by 20 encoded images
Fig. 6 Some of the 20 encoded images

Table 3 Composition of gesture images
Fig. 7 Images of hand gestures
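For reference, the SSIM value between an original and a reconstructed image can be sketched as below. This is a simplified version that computes the SSIM formula once over the whole image, whereas standard implementations average it over local windows; the study does not state which implementation was used, so the function name `global_ssim`, the 8-bit data range, and the test images are assumptions. The constants 0.01 and 0.03 are the commonly used defaults.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """SSIM formula evaluated over the whole image (single global window)."""
    c1 = (0.01 * data_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilises the contrast/structure term
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)

# Hypothetical 40 x 40 binary gesture-like image and a noisy copy of it.
original = np.zeros((40, 40))
original[10:30, 15:25] = 255.0
noisy = np.clip(original + rng.normal(0.0, 60.0, original.shape), 0.0, 255.0)

ssim_same = global_ssim(original, original)
ssim_noisy = global_ssim(original, noisy)
```

An image compared with itself gives a value of 1, and the value falls below 1 as noise is added.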

3.2 Elimination of apparent image with U-Net

To remove the influence of the apparent image, the reconstructed image was restored using a trained U-Net. U-Nets were trained on pairs of original gesture images and images reconstructed by single-pixel-imaging. To obtain the change in the SSIM value with the number of illuminations used for the reconstructed image, the network settings were kept the same and training was performed separately for each number of illuminations. Training was performed using the Neural Network Console (NNC) provided by Sony. The dataset structure of U-Net is shown in Table 4, the network settings of U-Net are shown in Table 5, and the U-Net implemented on NNC is shown in Fig. 8.

Table 4 Dataset structure of U-Net
Table 5 Network setting of U-Net
Fig. 8 U-Net implemented on NNC

3.3 Classification of hand gesture

We trained LeNet to classify the restored images. The restored images were given labels corresponding to the gestures, and training was performed using the labeled restored images. The network settings were kept the same, and training was performed separately for each number of illuminations. Training was performed using NNC. The dataset structure of LeNet is shown in Table 6, the network settings of LeNet are shown in Table 7, and the LeNet implemented on NNC is shown in Fig. 9. In the “ImageAugmentation” layer, input images are rotated, and in the “RandomShift” layer, the variety of patterns is increased by shifting the images left and right.

Table 6 Dataset structure of LeNet
Table 7 Network setting of LeNet
Fig. 9 LeNet implemented on NNC

4 Result

4.1 Reconstruction of single-pixel-imaging

Figure 10 shows reconstruction results of single-pixel-imaging with random patterns and single-pixel-imaging with apparent images, and Fig. 11 shows the SSIM values of single-pixel-imaging with random patterns and single-pixel-imaging with apparent images.

Fig. 10 Reconstruction results of single-pixel-imaging with random patterns and single-pixel-imaging with apparent images

Fig. 11 SSIM values of images reconstructed with (a) random patterns and (b) apparent images

Figure 10 shows that the reconstruction results for both the random pattern and the apparent image become clearer as the number of illuminations increases and noisier as it decreases. The images reconstructed by single-pixel-imaging with the apparent image also show the influence of the apparent image.

Figure 11 shows that when the number of illuminations is 1000 or less, the SSIM values of the random pattern and the apparent image are comparable. When the number of illuminations exceeds 1000, the random pattern has a higher SSIM value.

4.2 Elimination of apparent image with U-Net

U-Net was trained to restore the reconstructed image. The learning curves of U-Net for 10,000 illuminations and 100 illuminations in single-pixel-imaging using apparent images are shown in Fig. 12, and the restoration results for the random pattern and the apparent image are shown in Fig. 13. The SSIM values of the restored images from single-pixel-imaging with random patterns and from single-pixel-imaging with apparent images are shown in Fig. 14.

Fig. 12 Learning curves of U-Net for 10,000 illuminations and 100 illuminations in single-pixel-imaging using apparent images

Fig. 13 Restoration results for (a) random pattern and (b) apparent image

Fig. 14 SSIM values of images restored from reconstructions with (a) random patterns and (b) apparent images

Figure 12 shows that the error value converges to a small value when the number of illuminations is set to 10,000. As the number of illuminations decreases, the error value gradually increases, and the error value for 100 illuminations is about ten times larger than that for 10,000 illuminations.

Figure 13 shows that the effect of the apparent image was removed in the images restored by U-Net. In addition, it was confirmed that the gestures in the reconstructed image could be restored when the number of illuminations was 500 or more, but the reconstructed image could not be completely restored when the number of illuminations was 100.

Figure 14 shows that there is no difference in SSIM values between the restored image of single-pixel-imaging with random patterns and the restored image of single-pixel-imaging with apparent images.

4.3 Classification of hand gesture

LeNet was trained to classify the restored images. The learning curves of LeNet for 10,000 illuminations and 100 illuminations in single-pixel-imaging using apparent images are shown in Fig. 15, and the relationship between the number of illuminations and the classification accuracy for the random pattern and the apparent image is shown in Fig. 16.

Fig. 15 Learning curves of LeNet for (a) 10,000 illuminations and (b) 100 illuminations in single-pixel-imaging using apparent images

Fig. 16 Relationship between the number of illuminations and the classification accuracy for (a) random pattern and (b) apparent image

Figure 15 shows that the error value converges to a small value when the number of illuminations is set to 10,000. As the number of illuminations decreases, the error value gradually increases. For 100 illuminations, the error value does not decrease steadily as the number of epochs increases but rises in some places.

Figure 16 shows that the classification accuracy depends on the number of illuminations. When the number of illuminations was 300 or more, all restored images could be classified, and when the number of illuminations was less than 200, the classification accuracy began to decrease. The classification accuracy was similar for both random patterns and apparent images.

5 Discussion

Figures 12 and 15 show that the error values for 100 illuminations are much larger than those for 10,000 illuminations, and there are clear signs of overfitting in the case of 100 illuminations. To solve this problem, we consider it necessary to improve the network and adjust its parameters.

Figure 11 shows a difference in SSIM values between the reconstructed images of single-pixel-imaging with random patterns and those of single-pixel-imaging with apparent images. However, Fig. 14 shows that the SSIM values of the restored images of single-pixel-imaging with random patterns and those of single-pixel-imaging with apparent images are similar. Likewise, Fig. 16 shows that the classification accuracy obtained by LeNet for the restored images of single-pixel-imaging using random patterns is similar to that for the restored images of single-pixel-imaging using apparent images. Therefore, by using U-Net and LeNet in single-pixel-imaging with apparent images, it is possible to classify more than 80% of the restored images with more than 200 illuminations. We expect that measurement with 200 illuminations and a 3000 Hz LED display can realize gesture classification at a sampling rate of 15 fps.
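The expected sampling rate follows from dividing the display refresh rate by the number of illuminations needed per reconstructed image. A short sketch of this arithmetic is given below; the function name is hypothetical, and the calculation assumes that one illumination pattern is shown per display refresh with no additional processing delay.

```python
def gesture_sampling_rate_fps(display_rate_hz, illuminations_per_image):
    """Reconstructed images per second when one pattern is shown per refresh."""
    return display_rate_hz / illuminations_per_image

# 3000 Hz LED display and 200 illuminations per reconstructed image.
rate = gesture_sampling_rate_fps(3000, 200)
print(rate)  # 15.0
```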

6 Conclusion

Images reconstructed by single-pixel-imaging using apparent images are influenced by the apparent images, which makes it difficult to classify gestures. By using U-Net for restoration and LeNet for classification, it is possible to classify all of the restored images with more than 200 illuminations.