1 Introduction

Persons with disabilities face numerous daily obstacles and challenges, including moving about and communicating with other persons, which limit their freedom to interact and engage with the world around them independently. Modern electronic technologies can be developed to address such challenges and improve the lives of individuals with physical and sensory disabilities. The use of appropriate assistive equipment can greatly affect an individual’s quality of life and promote and maintain independent living. At present, advances in high-speed microprocessors have led to various assistive technology systems [3, 9, 11, 15, 20, 26, 32, 36, 37]. These systems allow people with disabilities to use limited voluntary motions for communication, computer manipulation, and control of household appliances. Each system has its own considerations, applicability, and limitations. The white cane is a common travel aid for blind or visually impaired persons, but it only verifies that a small ground area in front of the person is clear or locates nearby obstacles on the ground through contact with the cane. Although the white cane has many advantages, such as being lightweight, small when folded, and low cost, its main limitation is its short sensing range (approximately 1–2 m). Some blind persons get around with the help of a guide dog; however, only a few blind or visually impaired persons have access to guide dogs because fully trained guide dogs are costly and their breeding fees are usually high. Therefore, easy-to-use and cost-efficient electronic travel aids are needed to help visually impaired persons by expanding their ability to perceive unfamiliar environments. Toyota’s robot “BLAID” is a wearable device that helps blind persons detect signs such as traffic lights and toilets, but it does not detect obstacles, so its users still rely on a guide dog when walking. Moreover, the implementation of the BLAID device has not been released; only its usage scenarios have been described, and no experiments or technical reports have been published. Our proposed method adopts CNN-based semantic segmentation to describe environmental information and environmental depth information on TX2 hardware, and our system achieves a real, working implementation.

At present, the white cane is commonly used as a guide aid. Objects and surfaces, such as metal, plastic, wood, or guide bricks, can be recognized by the different sounds made when such objects are tapped with a cane. However, many problems exist with this guide method. For example, if the cane does not strike an object or surface, then the user will be unaware of its presence. In particular, white canes and guide dogs are not ideal for navigating unfamiliar environments.

Various types of guide robots have been developed to address these shortcomings [32, 36], but most of the early designs used sensing components to avoid obstacles [11]. However, such an approach cannot cope with unexpected situations. At present, machine learning technology is becoming increasingly mature, and high-performance GPU hardware is progressing considerably, with the ability to analyze images and run high-dimensional operations in real time. By combining these advances with current machine learning methods, a depth image can be generated to allow distance measurements. Combining such measuring components can improve the obstacle avoidance and guidance of guide robots and cope with most emergencies. However, the infrared light of a depth camera is susceptible to disturbances from the external environment [9, 37]. In recent years, scholars have proposed the use of a single camera [20, 21] or the addition of a laser [1, 7] to calculate image depth values and assist with ranging. However, these calculation results do not generalize well to images of various scenes. Therefore, this study uses a single RGB camera, designs an algorithm for predicting depth values in RGB images, and establishes depth information through deep learning [22]. The deep learning component is a convolutional neural network (CNN)-based depth prediction algorithm, which converts RGB images into grayscale as the CNN input and is trained to predict the depth value of each pixel as the desired output. The obtained depth image is applied to find a flat path and provide autonomous obstacle avoidance in a manner similar to radar ranging [5, 48]. Compared with the white cane or a guide dog, the device can calculate and communicate a safer route for the wearer at a greater distance and in far more detail, with higher robustness to unexpected obstacles in an unknown environment. Therefore, the proposed device is safer, more reliable, and more convenient.

The contribution of our proposed system is the use of a low-cost device with an RGB camera to predict obstacles and the walking plane so as to guide blind persons in walking safely. The low-cost device is described in detail in Section 3. The hardware of our system is built on an Nvidia Jetson TX2 combined with the RGB camera. The TX2 acts as a client that streams video images, and the RTSP protocol is adopted to stream the images to the server. The deep learning algorithm runs on the server, and after processing, the resulting output is returned to the TX2 client. Our software algorithm consists of five steps. The first step is to predict the environmental depth with a four-stage CNN, but this predicted depth information is rough. Therefore, the second step is to fine-tune the depth information with a scale-invariant mean squared error. The walking plane must then be found to guide blind persons in walking safely; hence, floor and plane detection methods are adopted. Moreover, blind users must know the distance of the safe path to avoid colliding with obstacles. Finally, our proposed system reports the distance of the safe path to the user through earphones. The flowchart of our system is shown in Fig. 1.
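To make the five-step pipeline concrete, the following Python skeleton sketches the control loop under stated assumptions; every function is a stub standing in for the real CNN models, and all names (predict_coarse_depth, refine_depth, detect_walking_plane, estimate_safe_distance, announce) are hypothetical, not the authors' implementation.

import numpy as np

def predict_coarse_depth(rgb):                 # step 1: coarse four-stage CNN (stubbed)
    return np.full(rgb.shape[:2], 3.0)         # pretend every pixel is 3 m away

def refine_depth(rgb, coarse):                 # step 2: fine-scale refinement (stubbed)
    return coarse

def detect_walking_plane(rgb, depth):          # step 3: floor / plane segmentation (stubbed)
    mask = np.zeros(depth.shape, dtype=bool)
    mask[depth.shape[0] // 2:, :] = True       # assume the lower half of the frame is floor
    return mask

def estimate_safe_distance(plane, depth):      # step 4: farthest free point on the plane
    return float(depth[plane].max()) if plane.any() else 0.0

def announce(distance):                        # step 5: audio feedback (stubbed as print)
    print(f"Safe path ahead: {distance:.1f} m")

frame = np.zeros((228, 304, 3), dtype=np.uint8)   # one RGB frame streamed from the TX2
depth = refine_depth(frame, predict_coarse_depth(frame))
plane = detect_walking_plane(frame, depth)
announce(estimate_safe_distance(plane, depth))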

Fig. 1

The algorithm flowchart

The remainder of this paper is organized as follows. Section 2 briefly reviews the machine learning and scene segmentation. The proposed walking plane detection method is introduced in Section 3. The simulation results are discussed in Section 4. Finally, some concluding remarks are given in Section 5.

2 Related work

The advent of the deep learning era has made depth imaging a major research topic. Adopting learning mechanisms based on neuropsychology was the first step of machine learning, but machine learning algorithms require many iterations and calculations and are heavily dependent on hardware to produce training results. In the early days, the efficiency of machine learning development was therefore limited. However, the rapid development of computer hardware in recent years has remarkably improved the computing power of computers and considerably enhanced neural network technology. Machine learning has been promoted by many scholars, and the system architecture of computer vision has become increasingly sophisticated, with improved depth perception further enhancing recognition accuracy, which is the key technology in near-depth imaging. Many machine learning applications have been reported, as follows. Xu et al. [46] proposed a fall prediction method based on the human skeleton map. Xu et al. [45] proposed a novel multi-feature fusion (MFF) CNN framework for detecting Drosophila embryos of interest. Xu et al. [44] proposed a novel edge-oriented framework to improve the performance of existing saliency detection methods. Xu et al. [43] considered the cross-modal retrieval task from the perspective of optimizing the ranking model. Xu et al. [42] described the body’s adjustment process from a physical point of view. Xu et al. [41] proposed a new phase-consistency detection algorithm based on dimensionality reduction. Xu and Li [39] proposed an algorithm exploiting the transaction data behind social media stocks. Liu et al. [23] presented an in-depth exploration of the popular DCGAN. Liu et al. [24] discussed the relationship between image semantic segmentation and animal image research. Xu et al. [40] proposed a new recommendation method using collaborative filtering. Xu [38] proposed a two-dimensional numerical model for machine learning to simulate a major U.S. stock market index.

The accurate prediction of environmental depth information is important in predicting geometric relationships within an environment. Straub et al. [34] proposed a real-time inference algorithm to evaluate real environments. Several guide-dog-like systems have also been developed to help blind persons [2, 33, 47]. Knowing the geometric relationships of objects helps provide rich object features and environments, such as in 3D modeling [13, 29, 30], physical and support models [17, 27], and robotics [6, 31]. Saxena et al. [30] used a Markov random field (MRF) to infer a set of plane parameters and used supervised learning for training to obtain image depth values and segmented images after image color segmentation and to build a model.

Saxena et al. [29] developed a method of depth prediction from a single image. Supervised learning was used to meet the depth prediction requirement, and single images (including unstructured outdoor environments, such as forests, trees, and buildings) were collected as the training dataset, with the corresponding ground-truth depth maps as the desired output. Global depth information of the image was still needed because local features alone cannot predict the depth of a single point. An MRF was therefore used to combine multiscale local and global image features and to model the depth at each point and the relationships between depths at different points for predicting depth information.

Silberman et al. [31] used color-depth (RGB-D) images to predict the major surfaces and objects of an indoor scene and their support relations. In terms of depth prediction, the depth of the scene was established via multiscale deep learning, whereby shaded areas of the environment and depth information in complex scenes were inferred through fine-tuning.

Liu et al. [22] put forward a model for depth prediction from a single RGB image that combines the strengths of a deep CNN and a continuous CRF to predict the depth of new images efficiently. They also proposed a fully convolutional network and a novel superpixel pooling method, which accelerate overall training while using deeper networks to obtain enhanced prediction performance.

In the field of image labeling, Long et al. [25] used a convolutional network as a powerful visual model and proposed a fully convolutional network (FCN) for segmentation that combines feature hierarchies and improves the spatial precision of the output; the fully convolutional classifier can be fine-tuned for segmentation, as shown in Section 4.1. Although the scores on standard metrics are high, the output is unsatisfactory.

PSPNet [49] is a network based on the FCN [25] architecture. The traditional FCN takes the input image as the CNN input and deconvolves the features back to the size of the input image [4, 35]. In this process, global information is not added to the network. This condition leads to a lack of global semantic information in the FCN, which results in errors. Therefore, Zhao et al. [49] suggested adding global-scene-level context to multiscale feature ensembling on the basis of the FCN architecture; this addition allows the network to contain both local and global information. The optimization and loss function of PSPNet are based on ResNet [12].

He et al. [12] observed that, in theory, continuously deepening a network should yield better results; however, the experimental evidence does not necessarily show improvement. Deepening the network can be done easily, but doing so causes gradients to vanish and results in decreased accuracy. Deep networks have a degradation problem, which makes them difficult to train. Accordingly, He et al. [12] proposed residual learning to solve such problems. The network structure of ResNet is based on a modification of the VGG19 network, to which the residual learning method mentioned above is added. When the feature map size is decreased, the number of feature maps is increased to maintain the network complexity and solve the problem of network degradation.

3 Proposed system

The wearable guide device with deep learning for blind or visually impaired persons developed in this study aims to provide environmental information to the wearer while using a white cane and to allow them to easily and safely navigate unfamiliar environments. A CNN is used in model development to make preliminary predictions of the depth information in RGB images; a multiscale deep network [29] is then referenced and improved to strengthen the prediction of environmental depth information. This environmental depth information is then used to predict a safe route for the wearer, run a fast algorithm that determines flat routes and depth-marked areas on the basis of deep learning, inform the wearer of the distance of the safe path, and ensure that every single step taken by the wearer is safe. The main research includes (1) body device design, (2) indoor depth prediction, (3) depth detail adjustment, and (4) plane detection and establishment, as shown in Fig. 2.

Fig. 2

Architecture of the proposed system

3.1 Hardware design

This study adopted the lightweight computing power of an Nvidia Jetson TX2 as the wearable operating core to reduce the size of the hardware device worn by the user. The TX2 runs a complete operating system (Linux), consumes little power, is lightweight, and has excellent control performance, which makes it suitable as the central controller for video streaming in the proposed system. The streaming process uses the real-time streaming protocol (RTSP) and the H.264 format to stream images, which are fed into the trained model, enabling the system to determine the range of planes in front of the user that can be walked on and to guide the user along a safe path. The magnitude of the sound reminder tells the user whether they should continue walking. The communication protocol is shown in Fig. 3.
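As an illustration only, the following Python sketch shows how a server could consume the RTSP/H.264 stream published by the TX2 with OpenCV and hand each frame to the models; the stream URL and process_frame() are assumed placeholders, not the actual implementation.

import cv2

def process_frame(frame):
    # placeholder for depth prediction and plane detection on the server GPU
    return frame

cap = cv2.VideoCapture("rtsp://192.168.0.10:8554/camera")  # assumed address of the TX2 stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = process_frame(frame)
    # ...the guidance result would then be sent back to the TX2 client...
cap.release()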

Fig. 3

Communication protocol of the proposed system

3.2 Design of the wearable guide device

The proposed wearable device was 3D printed using stable ABS plastic to allow the user to hang the device near their chest and listen to the information provided through headphones. The lens in the device captures a video stream angled 20° downward from the horizontal. The device is shown in Fig. 4.

Fig. 4

Proposed device components

The middle of the device has an adjustable elastic buckle belt, which can be used by blind or visually impaired persons to strap the device around their waist. Figure 5 shows how the device is worn.

Fig. 5

Wearing the device

3.3 Indoor depth prediction

This study refers to the four-stage CNN described by Saxena et al. [29], as shown in Fig. 6, wherein the input layer is a 304 × 228 RGB image; the first and second stages use 9 × 9 convolutional filters with a stride of 2, followed by 2 × 2 max-pooling; and the third and fourth stages each use a 5 × 5 convolutional filter. The depth value is predicted by using the following equation:

$$ \hat{d}_{i,j,k}=w^{T}_{lr}F_{i,j,k}+b_{l}, $$
(1)

where \(\hat {d}_{i,j,k}\) is the predicted depth value, \(F_{i,j,k}\) is the feature vector, \(w_{lr}\) is the regression weight vector, and \(b_{l}\) is the bias.

Fig. 6

Depth image generation

The feature vector Fi,j,k and bias bl are used to compute the hidden layer output Hl as follows:

$$ F_{i,j,k}=f(I_{i,j,k},\theta_{f})=W_{l}H_{l-1}, $$
(2)
$$ H_{l}=pool(nonl(W_{l}H_{l-1}+b_{l})). $$
(3)

The loss function L(𝜃f,𝜃lr) is expressed in (4), and the stochastic gradient descent algorithm is adopted to update the CNN weights:

$$ L(\theta_{f},\theta_{lr})=\frac{1}{N}\underset{i,j,k}{\sum}(d_{i,j,k}-\hat{d}_{i,j,k})^{2}. $$
(4)
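As a minimal illustration of Eqs. (1) and (4), the following NumPy sketch computes a linear depth prediction from per-pixel feature vectors and the mean squared error over all pixels; the array shapes and values are toy assumptions, not the actual training setup.

import numpy as np

def predict_depth(F, w, b):
    """Eq. (1): d_hat = w^T F + b for each pixel feature vector F."""
    return F @ w + b

def depth_mse_loss(d, d_hat):
    """Eq. (4): L = (1/N) * sum over (i, j, k) of (d - d_hat)^2."""
    return np.mean((d - d_hat) ** 2)

F = np.random.rand(228 * 304, 64)                 # toy per-pixel feature vectors
w = np.random.rand(64)
b = 0.5
d_hat = predict_depth(F, w, b)
d = d_hat + 0.1 * np.random.randn(d_hat.size)     # toy ground-truth depths
print(depth_mse_loss(d, d_hat))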

3.4 Depth detail adjustment

Given that this study uses only a single RGB camera to achieve computer vision [19], determining how to add depth information to the RGB images is the core topic of this article for indoor computer vision. For outdoor prediction, [28] adopted the multiscale deep network method, which is expected here to enhance indoor depth estimation as well. This neural network is composed of the following subnetworks [8, 14, 16]: (1) a coarse-scale network roughly predicts the depth of the panorama, and (2) the result is then input to a fine-scale network for local adjustment to achieve accurate predictions.

3.4.1 Coarse-scale network

This subnetwork has seven layers, including five convolutional and two fully connected layers. The first and second convolutional layers involve downsampling with max-pooling to reduce the feature map dimensions, accelerate the operation, and reduce overfitting. The sixth and seventh layers are fully connected and use upsampling to increase the output feature map dimensions so that the final output resolution is 1/4 that of the input image. Although the conversion between down- and upsampling results in blurry predictions, the final output has a better predictive effect than the direct output of the fifth layer, as shown in Fig. 7.
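The following PyTorch sketch arranges the seven layers as just described (five convolutions with max-pooling after the first two, then two fully connected layers whose output is reshaped to roughly 1/4 of the 228 × 304 input resolution). The channel counts and kernel sizes are assumptions borrowed from typical multiscale depth networks and are not specified in this paper.

import torch
import torch.nn as nn

class CoarseScaleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(2),    # conv1 + pool
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # conv2 + pool
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),                  # conv3
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),                  # conv4
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),                  # conv5
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 13 * 18, 4096), nn.ReLU(), nn.Dropout(),       # fc6
            nn.Linear(4096, 55 * 74),                                      # fc7 -> coarse depth map
        )

    def forward(self, x):                           # x: (N, 3, 228, 304)
        return self.fc(self.features(x)).view(-1, 1, 55, 74)

net = CoarseScaleNet()
print(net(torch.zeros(1, 3, 228, 304)).shape)       # torch.Size([1, 1, 55, 74]), ~1/4 resolution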

Fig. 7

Architecture of coarse-scale network

3.4.2 Fine-scale network

The main task of this subnetwork is to fine-tune the output of the coarse-scale network and reduce the blurring generated during down- and upsampling. The fine-scale network consists of four convolutional layers, with the original image as the input to the first layer. The input image is downsampled by max-pooling, and the second layer merges the output of the first layer with the output of the coarse-scale network, as shown in Fig. 8.
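A matching PyTorch sketch of the four-layer fine-scale network is given below: the first convolution and pooling bring the RGB input to the resolution of the coarse prediction, which is then concatenated as an extra channel before the remaining convolutions. Filter counts are assumptions, not the values used in this paper.

import torch
import torch.nn as nn

class FineScaleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 63, 9, stride=2), nn.ReLU(), nn.MaxPool2d(2))
        self.conv2 = nn.Sequential(nn.Conv2d(64, 64, 5, padding=2), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(64, 64, 5, padding=2), nn.ReLU())
        self.conv4 = nn.Conv2d(64, 1, 5, padding=2)      # refined depth map

    def forward(self, rgb, coarse):                      # rgb: (N,3,228,304), coarse: (N,1,55,74)
        x = self.conv1(rgb)                              # -> (N,63,55,74)
        x = torch.cat([x, coarse], dim=1)                # merge with the coarse-scale output
        return self.conv4(self.conv3(self.conv2(x)))     # -> (N,1,55,74)

fine = FineScaleNet()
print(fine(torch.zeros(1, 3, 228, 304), torch.zeros(1, 1, 55, 74)).shape)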

Fig. 8

Architecture of the fine-scale network

The depth information is predicted by the above-mentioned CNN architectures, and the relationship between the panorama and individual points is estimated using a scale-invariant error. The scale-invariant mean squared error (in log space) is defined as

$$ D(y,y^{*})=\frac{1}{2n}\sum\limits^{n}_{i=1}(\log y_{i}- \log y^{*}_{i}+\alpha (y,y^{*}))^{2}, $$
(5)

where \(\alpha (y,y^{*})=\frac {1}{n}{\sum }_{i}(\log y^{*}_{i}-\log y_{i})\). For any prediction y, \(e^{\alpha}\) is the scale that best aligns it with the actual distances; consequently, multiplying all predictions y by the same scalar leaves the error unchanged, making the measure scale invariant.
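A short NumPy sketch of Eq. (5) follows; it verifies that multiplying all predictions by a constant scale leaves the error unchanged. The toy depth values are assumptions for illustration only.

import numpy as np

def scale_invariant_error(y_pred, y_true):
    """Eq. (5): scale-invariant mean squared error in log space."""
    d = np.log(y_pred) - np.log(y_true)
    alpha = np.mean(np.log(y_true) - np.log(y_pred))        # alpha(y, y*)
    return 0.5 * np.mean((d + alpha) ** 2)

y_true = np.random.rand(228 * 304) * 9.0 + 1.0              # toy depths between 1 m and 10 m
noisy = y_true * np.random.uniform(0.9, 1.1, y_true.size)
print(scale_invariant_error(noisy, y_true))                 # small error for noisy predictions
print(scale_invariant_error(y_true * 2.0, y_true))          # ~0: a global scale factor is ignored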

3.5 Plane detection and development

This study uses the MIT ADE20K scene parsing dataset as the indoor dataset. This dataset is used to separate the images into the objects of the environment and to distinguish floors, walls, and obstacles through color blocks; the trained parameters are then applied to the pictures taken by the device wearer while walking, as illustrated in Fig. 9.

Fig. 9

Architecture of plane detection

Scene segmentation is a basic task in computer vision, and its goal is to classify each pixel in the image, which is potentially used in areas such as autonomous driving and robot perception. The main advantages of the PSPNet [49] segmentation method are as follows.

  (1) Based on the FCN (Fully Convolutional Network) target segmentation framework, complex background features are embedded.

  (2) Based on a deeply supervised loss function, an effective optimization strategy is proposed for ResNet.

  (3) A state-of-the-art scene parsing and semantic segmentation system has been established, and it contains many practical implementation strategies.

  (4) One route is multiscale feature extraction, because the higher-level features in a deeper network contain more semantic information but less spatial location information.

  (5) Another route is based on structural prediction, for example, by using a CRF (Conditional Random Field) as a post-processing step to refine the segmentation results.

The pyramid pooling module proposed in [49], as shown in Fig. 9, is also utilized. In this network architecture, ResNet is adopted to extract features from the input images, and the resulting feature map is 1/8 the size of the input image. The feature map is then pooled into three blocks of 1 × 1, 2 × 2, and 3 × 3. This three-level pooling kernel fuses the small feature maps into global information. The original feature map is then concatenated with the output of the pyramid pooling module. An auxiliary loss is added to improve network training. The experiment in [49] used auxiliary and master branch loss values of 0.4 and 0.6, respectively. Figure 10 shows the auxiliary loss on ResNet101, wherein each brown block is a residual block, followed by the auxiliary loss.
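The following PyTorch sketch illustrates the pyramid pooling idea just described: the ResNet feature map is average-pooled to 1 × 1, 2 × 2, and 3 × 3 grids, each pooled map is reduced by a 1 × 1 convolution, upsampled back, and concatenated with the original feature map. Channel counts are assumptions for illustration, not the exact PSPNet configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins
        ])

    def forward(self, x):                           # x: ResNet feature map, ~1/8 of the input size
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear", align_corners=False)
                  for s in self.stages]
        return torch.cat([x] + pooled, dim=1)       # fuse global context with local features

ppm = PyramidPooling(in_ch=512)
print(ppm(torch.zeros(1, 512, 29, 38)).shape)       # torch.Size([1, 1022, 29, 38])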

Fig. 10

Architecture of Res101

The depth of the network is crucial to the performance of the model. When the number of network layers is increased, the network can extract more complex feature patterns, so deeper models should theoretically yield better results. However, deep networks suffer from a degradation problem: as the network depth increases, the accuracy becomes saturated or even decreases, such that a 56-layer network can perform worse than a 20-layer network. This is not an overfitting problem. Deep networks are known to suffer from vanishing or exploding gradients, which make deep learning models difficult to train; the degradation of deep networks is therefore very surprising.

The degradation problem at least shows that deep networks are not easy to train. Suppose a deep network is built by stacking new layers on a shallow one. In the extreme case, the added layers learn nothing and simply copy the features of the shallow network, that is, they perform an identity mapping. In this case, the deep network should be at least as effective as the shallow network, and no degradation should occur. ResNet [12] proposed residual learning to solve the degradation problem. For a stacked layer structure with input x, the desired feature is denoted H(x); ResNet instead learns the residual F(x) = H(x) − x, so the original feature becomes F(x) + x. Learning the residual is easier than learning the feature directly. When the residual is 0, the stacked layers only perform an identity mapping, and the network performance does not decrease. In practice, the residual is not 0, which allows the stacked layers to learn new features on top of the input features.
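The residual idea described above can be summarized in a few lines of PyTorch; this basic block (an assumed simplification of a ResNet building block) outputs F(x) + x, so when the learned residual F(x) is zero the block reduces to an identity mapping.

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                        # the stacked layers learn F(x)
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)                # identity shortcut: output is F(x) + x

block = BasicResidualBlock(64)
print(block(torch.zeros(1, 64, 56, 56)).shape)            # torch.Size([1, 64, 56, 56])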

The aforementioned model yields the color segmentation map. This study therefore extracts the floor area and detects and draws its boundary as the plane edge. The Canny algorithm is adopted to detect the edges, with the ratio of the upper to the lower threshold set between 2:1 and 3:1. The thresholds determine whether a pixel is an edge according to the following criteria (a minimal OpenCV sketch is given after Fig. 11):

  (1) If the pixel gradient intensity is greater than the upper threshold, then the pixel is an edge.

  (2) If the pixel gradient intensity is less than the lower threshold, then the pixel is not an edge.

  (3) If the pixel gradient intensity lies between the lower and upper thresholds, then the pixel is an edge only if it is connected to a pixel whose gradient intensity is greater than the upper threshold; otherwise, the pixel is not an edge. The flow is shown in Fig. 11.

    Fig. 11

    Plane edge detection
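The hysteresis criteria above can be applied with OpenCV's Canny implementation; the sketch below is illustrative only, with a synthetic floor mask and an assumed lower:upper threshold ratio of roughly 1:3.

import cv2
import numpy as np

# Synthetic stand-in for the brown floor label produced by the segmentation model.
floor_mask = np.zeros((228, 304), dtype=np.uint8)
cv2.rectangle(floor_mask, (60, 120), (240, 227), 255, -1)

edges = cv2.Canny(floor_mask, 50, 150)                    # lower and upper thresholds, ~1:3 ratio
overlay = cv2.cvtColor(floor_mask, cv2.COLOR_GRAY2BGR)
overlay[edges > 0] = (0, 0, 255)                          # draw the detected plane edge in red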

4 Experimental results and discussion

The performance of the proposed device is compared with those of other available methods and devices to provide users with real-time, safe, and reliable guidance. The main research results are divided into (1) depth images combined with label results, (2) a comparison of distance results, and (3) a comparison of plane information detection.

4.1 Depth image combined with label results

At the beginning of the experiment, the original RGB image was input into the coarse- and fine-scale network layers. The coarse-scale network layers were used to perform global depth prediction on the input image for outputting the prediction results of the coarse-scale network layers to the fine-scale network layers. The first-layer output of the fine-scale network layers was combined with the prediction results of the coarse-scale network layers and output to the second layer. A preliminary predicted depth map was subsequently derived. The fine-scale network layers were then used to fine-tune the prediction results of the coarse-scale network layers. The final output predicted image depth is shown in Fig. 12.

Fig. 12

Depth prediction image

Figure 12 shows the scene depth maps of an office and a corridor. This study used the color gradation method to differentiate between near and distant scenes, confirming that the system can determine the depth range of a scene from a single RGB image.

The same image was then labeled with the result of the depth prediction map, and the model was trained using the MIT ADE20K scene parsing dataset. The image was then subjected to semantic segmentation, and the objects in the environment were segmented by color, as shown in Fig. 13.

Fig. 13

Semantic segmentation image

The result of image labeling is shown in Fig. 14. Various colors are used to distinguish between items: the flat plane is depicted in brown, and edge detection is then applied to the brown area of the floor to remove unnecessary information.

Fig. 14

Plane semantic image

With the plane feature map, the label and depth information were combined pixel by pixel to derive the combination of plane and depth. This ground depth value is used by the system to measure the user’s path surface, its length, and other information, as shown in Fig. 15.

Fig. 15

Resulting image

This trained predictive model was then applied to the wearable device designed in this study. The wearer can hang the device on their chest, and the system indicates the direction in which they can safely walk; together with the assistance of a white cane, this increases the wearer’s mobility and safety.

4.2 Comparison of distance results

The results of the proposed method were similar to those achieved by laser-based range finding in the scene, with an accuracy of up to 98.52%. Compared with the 97.19% accuracy of the ultrasonic sensor, the proposed method improved the accuracy rate by 1.32%, and compared with Kinect’s accuracy of 94.76%, it improved the accuracy rate by 3.75%. Ultrasonic methods are generally suitable only for detection within 4 m, and obstacles more than 4 m away are difficult to detect. Kinect’s official documentation recommends measuring distances from 1.2 m to 3.6 m, and its detection angle is 57°. According to [18], the distance error at 5 m is 7 cm, and large distances introduce additional vanishing points because blind spots are likely to appear in plane detection. The proposed method is less sensitive to such distance restrictions. Table 1 presents the distance comparison of the target obstacles measured in different scenarios.

Table 1 Distance comparison

4.3 Flat information detection comparison

At present, common semantic segmentation models, such as FCN8s, lack sufficient global information, and relying on local information alone causes segmentation errors. In contrast, this study used PSPNet with global scene-level context and introduced the ResNet optimization method to detect the walkable plane. The comparison results are shown in Fig. 16.

Fig. 16

Comparison of floor label results. a Original map, b ground truth marked by hand, c result plot of the floor framed via FCN8s, and d result plot of the floor framed via PSPNet

Given that the purpose of this experiment is to provide additional information about accessible planes for blind or visually impaired persons, it is important that the region enclosed by the model overlaps the area marked in the ground truth. This study divided the two selected areas into points, which were categorized into boxes, and the overlap rate was calculated for 10 different scenarios. Over many experiments, the floor regions produced by the FCN8s model were too rugged, contained too many turns, or included too many obstacles. According to the statistical results shown in Fig. 17, PSPNet is better than FCN8s because the average overlap rate of the selected regions increased by 10.54%.
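One simple way to compute such an overlap rate is sketched below in Python; the exact definition used in this paper is not stated, so the fraction of hand-marked ground-truth floor pixels covered by the predicted region is an assumption made only for illustration.

import numpy as np

def overlap_rate(pred_mask, gt_mask):
    """Fraction of ground-truth floor pixels also inside the predicted floor region (assumed metric)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return inter / max(int(gt_mask.sum()), 1)

gt = np.zeros((228, 304), dtype=bool)
gt[120:, :] = True                                   # hand-marked floor (toy example)
pred = np.zeros((228, 304), dtype=bool)
pred[130:, 10:290] = True                            # floor region framed by the model
print(f"overlap rate: {overlap_rate(pred, gt):.2%}")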

Fig. 17

Correct ratio of plane detection (FCN8s vs. PSPNet)

This study also examines 15 real experimental environments to verify that the proposed method can identify the floor plane. As shown in Fig. 18, the detected walking plane is outlined by its edge boundary. Little noise influence is observed across the different environments, and the method can also avoid obstacles, such as people, chairs, and stairs.

Fig. 18

Plane detection result

The performance of our proposed plane detection method was compared with that of 3D-KHT [10]. Real environments were used in the experiments, and the precision and recall of our method and of 3D-KHT were measured. Our method outperformed 3D-KHT, achieving both higher precision and higher recall, as shown in Table 2.

Table 2 Compared with 3D-KHT [10]

5 Conclusions

This study successfully designs a plane detection system for indoor operation. The proposed system is applied to a wearable device for the accurate and safe determination of a walkable plane in the space in front of a visually impaired wearer by means of streaming images. The wearable device can also determine the length of the plane. The wearer can walk to a destination under the safe guidance of the device, which avoids the shortcoming of infrared-based cameras or laser-assisted ranging, namely their susceptibility to interference from the external environment. The use of the proposed wearable device with a white cane allows blind or visually impaired persons to achieve safe and independent movement similar to that provided by a guide dog.

Additional efficient algorithms will be required because of hardware limitations. Future studies will focus on streamlined hardware devices and efficient algorithms to develop devices that are smaller, more lightweight, and more accurate than the one proposed in this study; such studies will provide detailed information to users and offer increased independence and safety.