1 Introduction

A significant portion of the world's population is visually impaired. As per the World Health Organization (WHO) World Report on Vision published in 2019, at least 2.2 billion people have a vision impairment [1]. Everyday tasks become challenging for a person with a vision disability, and the condition often makes the person dependent on a caregiver, which is expensive and difficult to sustain in this fast-moving world.

According to a study, visually impaired people face falls, traffic-related injuries, and occupational injuries [2]. People with reduced visual acuity are 1.7 times more prone to a fall and 1.9 times more prone to multiple falls than those with full sight, and a hip fracture is between 1.3 and 1.9 times more likely for a person with visual impairment than for a sighted person. As per another study [3], 15% of people with a vision disability collide with obstacles every month on average, and 40% fall every year because they hit obstacles. Aerial obstacles in particular, such as awnings, tree branches, and similar objects, typically have no projection on the ground or floor [4]. Visually impaired people therefore generally face two types of danger: collision with aerial obstacles in front of them, and falls. Addressing these problems can prevent such mishaps.

Traditionally, the white cane and guide dogs are used to provide guidance when visually impaired people go out independently. However, aerial obstacles cannot be localized using a white cane or a guide dog. The solution to these problems is an assistance system that informs the visually impaired person about aerial or ground obstacles well in advance so that they can protect themselves. There is considerable scope for improvement of assistance systems for visually impaired people. Various smart assistance systems addressing ground obstacle avoidance have been proposed by researchers from different parts of the globe, and some solutions address both aerial and ground obstacle avoidance.

A general approach to vision-based assistive systems involves processing the camera input using image processing algorithms. The processed output, together with the outputs of other sensors, is used for decision-making, on the basis of which the VI user receives feedback. Figure 1 shows this generalized approach to a vision-based assistive system for VI users. The processing includes feature extraction from the frames captured by the camera. The decision-making ranges from basic thresholding techniques to sophisticated machine learning or deep learning-based approaches. The VI user then receives feedback on the decision through various means.

Fig. 1 A general approach to the vision-based assistive system: processing of the camera input, decision-making based on the processed and other sensory inputs, and feedback to the visually impaired user
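
The pipeline in Fig. 1 can be summarized in a few lines of code. The sketch below is illustrative only and assumes OpenCV; the feature extraction (Canny edges), the thresholding rule, and the console feedback are placeholders standing in for whatever processing, decision, and feedback mechanism a concrete system uses.

```python
# Minimal sketch of the generic pipeline in Fig. 1 (capture -> process ->
# decide -> feedback); function names and thresholds are illustrative
# placeholders, not taken from any of the surveyed systems.
import cv2

def extract_features(frame):
    """Placeholder processing step: an edge map stands in for feature extraction."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 100, 200)

def decide(features, threshold=0.05):
    """Placeholder decision step: flag an obstacle if enough edge pixels appear."""
    return (features > 0).mean() > threshold

def give_feedback(obstacle_ahead):
    """Placeholder feedback step: a real system would use audio or vibration."""
    print("Obstacle ahead" if obstacle_ahead else "Path clear")

cap = cv2.VideoCapture(0)  # camera worn or held by the VI user
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    give_feedback(decide(extract_features(frame)))
```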

As the camera mimics the task performed by the human eye, vision-based assistive systems are well suited to assisting VI people. Advances in algorithm development and the extensive use of deep learning in computer vision further make them a promising candidate for the solution.

The remainder of the paper is organized as follows: Sect. 2 discusses the literature review of the existing vision-based solutions for visually impaired people based on sensors, processing techniques, and wireless communication techniques. Section 3 concludes this paper with future directions for the said problem.

2 Investigation of AI-Based Vision Assistive System for VI People

A mobile camera-based solution for visually impaired people in indoor environments, Arianna, is reported in [5] (Fig. 2(a)). Pre-defined paths are marked with colour tapes, and the mobile camera is used to track the path; an Extended Kalman Filter (EKF) and a Weighted Moving Average (WMA) filter are used to overcome optical flow errors. Arianna determines a safe walking path in interior environments, and at the hardware level the solution is based on the video camera incorporated in a smartphone. Vibration patterns are used to convey information, and user feedback has been positive. The walking path is designed as a series of points of interest denoted by arrows; QR codes can be scanned, or a path on the floor can be followed.
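
As a small illustration of the smoothing step, the following sketch applies a weighted moving average to a short history of optical-flow displacement estimates. The window length and weights are assumptions for demonstration and are not the values used in [5]; the EKF stage is omitted.

```python
import numpy as np

def weighted_moving_average(samples, weights):
    """Smooth recent optical-flow displacement estimates with a weighted moving average.

    `samples` holds recent displacement estimates (newest last);
    `weights` gives more importance to newer samples.
    """
    samples = np.asarray(samples[-len(weights):], dtype=float)
    weights = np.asarray(weights[-len(samples):], dtype=float)
    return np.dot(weights, samples) / weights.sum()

# Example: noisy horizontal displacements (pixels/frame); 6.5 is an outlier spike.
recent_dx = [2.1, 1.8, 6.5, 2.0, 2.2]
print(weighted_moving_average(recent_dx, [1, 2, 3, 4, 5]))
```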

Fig. 2 Vision-based assistive systems reported for VI users [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21] (photographs of the surveyed solutions)

A new marker-based technique called Mobile Vision (MV) is introduced in [6]. The technology runs on a smartphone in an indoor context and uses special colour markers (Fig. 2(b)). The user is directed via red, green, and blue colour markers to locate sites of interest such as restrooms, elevators, or exits. Feedback messages are delivered via text-to-speech transcripts.
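
A colour-marker detector of this kind can be approximated with simple HSV thresholding, as in the hedged sketch below. The HSV ranges, minimum blob area, and use of OpenCV contours are assumptions for illustration, not details taken from [6].

```python
import cv2
import numpy as np

# Illustrative HSV ranges; the actual marker colours and thresholds in [6]
# are not specified here, so these values are assumptions.
MARKER_RANGES = {
    "red":   (np.array([0, 120, 80]),   np.array([10, 255, 255])),
    "green": (np.array([45, 120, 80]),  np.array([75, 255, 255])),
    "blue":  (np.array([100, 120, 80]), np.array([130, 255, 255])),
}

def detect_markers(frame_bgr, min_area=500):
    """Return the colours of markers with a large enough connected region."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    found = []
    for colour, (lo, hi) in MARKER_RANGES.items():
        mask = cv2.inRange(hsv, lo, hi)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if any(cv2.contourArea(c) > min_area for c in contours):
            found.append(colour)
    return found
```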

The Smart Vision navigation framework is presented in [7], which combines GPS, Wi-Fi localization with GIS (Geographic Information System) [8], passive RFID tags, and computer vision algorithms for outdoor scenarios. The system is not intended to replace the white cane but rather to supplement it by alerting the visually impaired (VI) user to impending dangers. A database of prospective objects of interest (e.g., elevator, welcome desk, plants, cash machine, and telephone booth) is created (Fig. 2(c)), and the reference images stored a priori are searched for in the video frames captured by the camera at test time. The approach, however, is extremely sensitive to camera movement and strongly reliant on the size of the training sample. Furthermore, it suffers from scalability issues, since for a larger dataset with many objects of interest the computational time increases significantly.

An obstacle detection and classification strategy fully integrated on a standard smartphone is presented in [9] and further extended in [10] (Fig. 2(e)). The framework is intended to support VI user navigation in both indoor and outdoor conditions. In [9], the authors propose locating obstacles by extracting interest points that are tracked between successive frames using the standard Lucas-Kanade algorithm (Fig. 2(d)). The object's motion is separated from the camera motion with the help of multiple homographic transformations estimated by applying the Random Sample Consensus (RANSAC) algorithm [11]. The detected objects are further classified by combining the Histogram of Oriented Gradients (HOG) descriptor with a Bag of Visual Words (BoVW) representation. Even though the framework generally returns good results, it cannot detect large, flat structures or accurately estimate the distance between the VI user and an obstacle. In [10], the authors proposed addressing the aforementioned limitations by integrating ultrasonic sensors within the framework. The approach shows promising results; however, it proves to be sensitive when multiple moving obstacles are present in the scene.
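
The interest-point tracking and egomotion-compensation steps described above can be sketched with standard OpenCV calls, as below. The parameter values and the simple rule that homography outliers correspond to moving objects are illustrative assumptions; the HOG/BoVW classification stage of [9] is not reproduced.

```python
import cv2
import numpy as np

def egomotion_outliers(prev_gray, curr_gray, max_corners=200):
    """Track interest points with Lucas-Kanade, fit a homography with RANSAC,
    and return points not explained by the camera motion (candidate moving objects).
    Parameter values are illustrative, not those used in [9, 10]."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return np.empty((0, 2))
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good0, good1 = p0[status.ravel() == 1], p1[status.ravel() == 1]
    if len(good0) < 4:
        return np.empty((0, 2))
    # Inliers of the homography follow the camera motion; outliers may be objects.
    _, inlier_mask = cv2.findHomography(good0, good1, cv2.RANSAC, 3.0)
    if inlier_mask is None:
        return np.empty((0, 2))
    return good1[inlier_mask.ravel() == 0].reshape(-1, 2)
```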

In [12], a computer vision-based way-finding technology is suggested that supports independent access to indoor but unfamiliar locations (Fig. 2(f)). At the hardware level, the system consists of a camera, microphone, computer, and Bluetooth earpiece. The framework uses geometric layout combined with a corner and edge detection method to detect doors, elevators, and cabinets. The system can then discriminate between foreground and background objects using an optical character recognition approach. A Canny edge detector is utilized for the detection of doors and Optical Character Recognition for text classification.
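
A minimal sketch of the edge-and-corner front end is given below, assuming OpenCV; the blur size and thresholds are placeholders, and the geometric door model and OCR stage of [12] are not reproduced.

```python
import cv2

def doors_and_corners(frame_bgr):
    """First stage of a geometric door/elevator detector in the spirit of [12]:
    edge and corner maps that later geometric rules would combine.
    Thresholds are illustrative assumptions."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 50, 150)
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=100,
                                      qualityLevel=0.01, minDistance=10)
    return edges, corners
```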

A system proposed in [13] includes a module for identifying textiles (Fig. 2(g)). Using a new Radon Signature descriptor, four clothing textures (plaid, striped, patternless, and irregular) and eleven clothing colours can be distinguished. Although both modules were created with individuals with disabilities in mind, no studies or tests with actual VI users have been conducted so far. Furthermore, the framework is incapable of handling object occlusion or operating in real time.

Developments in the Crosswatch system for providing guidance to visually impaired travellers at traffic intersections, along with new functionalities, are described in [14] (Fig. 2(h)). Panoramic image processing is used for analysing the crossroad view; the VI user captures the panoramic image of the viewpoint by rotating the camera through 360°. Another traffic light recognizer is proposed to detect traffic light signals for VI users [15] (Fig. 2(i)). The Active Optical Unit (AOU) is extracted from the captured image, and based on the AOU, the distance between the VI user and the traffic light is calculated.

ShopMobile II has been proposed for supermarket grocery shopping for VI users [16]. The navigation is based on scanning the barcodes on products in the supermarket. Barcode localization and decoding are done using computer vision algorithms; the barcode is localized based on the number of zero-to-one and one-to-zero transitions along two horizontal lines in the image.
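
The transition-counting cue can be sketched as follows. The scan-line positions, Otsu binarization, and transition threshold are assumptions for illustration rather than the exact procedure of [16].

```python
import cv2
import numpy as np

def likely_contains_barcode(frame_bgr, rows=(0.45, 0.55), min_transitions=30):
    """Rough barcode localization cue in the spirit of [16]: count black/white
    transitions along two horizontal scan lines; many alternations suggest the
    bar pattern of a barcode. Row positions and the threshold are assumptions."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    height = binary.shape[0]
    for r in rows:
        line = binary[int(r * height), :]
        transitions = int(np.count_nonzero(np.diff(line.astype(np.int8))))
        if transitions >= min_transitions:
            return True
    return False
```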

Molina et al. proposed the use of visual nouns for VI user navigation in both indoor and outdoor situations in [17]. The system generates mosaic images, which are then used to help the VI navigate around streets and corridors. Signage, visual text, and visual icons are considered visual nouns. However, a number of open conditions must be met for the system to be beneficial to VI people: (1) development of an appropriate human-machine interface; (2) integration into a wearable assistive device; and (3) development of an acoustic or haptic interface.

Another system for VI people is proposed that utilizes a smartphone camera to capture panoramic images and a Graphics Processing Unit (GPU) server to extract features from an image or a short video [18] (Fig. 2(j)). Images are modelled by converting them into the HSI colour model and projecting the H, S, and I channels and their gradients to compute the omni-projection; the Fast Fourier Transform (FFT) of the normalized projection curves is then taken. In the query stage, the frame is processed in the same way and compared with all the modelled images, and the closest matching frame is obtained using the phase curves of the omni-directional images. The use of multi-core CPUs or GPUs is proposed for enhancing computational speed.
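
The projection-plus-FFT modelling can be sketched as below. HSV is used as a readily available stand-in for the HSI model, and the normalization and phase-comparison metric are illustrative assumptions rather than the exact formulation of [18].

```python
import cv2
import numpy as np

def projection_signature(frame_bgr):
    """Sketch of the omni-projection idea in [18]: project each channel of a
    colour representation onto the horizontal axis, normalize, and take the FFT
    of the projection curves. HSV stands in for the HSI model of the paper."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    signatures = []
    for channel in cv2.split(hsv):
        proj = channel.sum(axis=0)                       # column-wise projection
        proj = (proj - proj.mean()) / (proj.std() + 1e-9)
        signatures.append(np.fft.fft(proj))
    return signatures

def match_score(sig_a, sig_b):
    """Compare two signatures via the phase of their FFTs (illustrative metric)."""
    return float(np.mean([np.abs(np.angle(a) - np.angle(b)).mean()
                          for a, b in zip(sig_a, sig_b)]))
```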

A robust computer vision-based banknote recognition system for blind people is proposed in [19] (Fig. 2(k)). The banknote dataset was collected under various conditions and labelled with note values, and Speeded Up Robust Features (SURF) are utilized for banknote matching. The authors claim to have achieved a 100% true recognition rate and a 0% false recognition rate. Similarly, another smartphone-based US currency note recognition system was proposed [20] (Fig. 2(l)). The system utilizes the Principal Component Analysis (PCA)-based image recognition method, Eigenfaces, to recognize currency notes, achieving a 99.8% accuracy rate at a processing speed of 7 frames per second; processing is performed on a grayscale image converted from the RGB input [20]. Another mobile application-based Indian currency note recognition system is proposed in [21] (Fig. 2(m)). A median filter and histogram equalization are utilized for noise removal and image enhancement, morphological operations are performed for feature extraction, and these features are used for currency note matching and recognition.
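
As an illustration of the Eigenfaces-style approach of [20], the sketch below flattens equally sized grayscale note images, learns a PCA subspace, and classifies a query note by its nearest neighbour in that subspace. The image size, number of components, and nearest-neighbour rule are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def train_note_recognizer(images, labels, n_components=20):
    """Eigenfaces-style recognizer: flatten same-sized grayscale note images,
    learn a PCA subspace, and keep the projected training vectors."""
    X = np.stack([img.ravel() for img in images]).astype(np.float32)
    pca = PCA(n_components=n_components).fit(X)
    return pca, pca.transform(X), np.asarray(labels)

def recognize_note(query_img, pca, train_proj, train_labels):
    """Classify a query note by its nearest neighbour in the PCA subspace."""
    q = pca.transform(query_img.ravel().astype(np.float32)[None, :])
    nearest = np.linalg.norm(train_proj - q, axis=1).argmin()
    return train_labels[nearest]
```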

A vision-based system is proposed for VI users during walking, jogging, and running [22] (Fig. 2(n)). The system utilizes image processing for line and lane detection on the road in outdoor environments, and uses a camera together with haptic gloves for feedback. The haptic gloves are fitted with vibration motors, and commands to the VI user are encoded as sequences of vibrations. Line extraction is done using the probabilistic Hough Line Transform.
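
Line extraction with the probabilistic Hough transform can be sketched in a few lines with OpenCV, as below; the Canny and Hough parameters are illustrative assumptions rather than the values tuned in [22].

```python
import cv2
import numpy as np

def detect_lane_lines(frame_bgr):
    """Line extraction for road/lane guidance using the probabilistic Hough
    transform; edge and Hough parameters are illustrative assumptions."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=60, maxLineGap=10)
    return [] if lines is None else [l[0] for l in lines]  # each as (x1, y1, x2, y2)
```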

Charge-Coupled Device (CCD) camera-based assistance gadgets have proved more convenient and comfortable to manage than purely sensor-based systems. However, these solutions have low accuracy when estimating the real distance between the VI user and a detected obstacle. Any monocular system has the drawback of being unable to determine the global object scale from a single frame. The concern is exacerbated in outdoor environments, since scale drifts between map sections and their projected motion vectors are more common [23].

The E-vision system is proposed for VI users for three distinct daily activities: supermarket visits, public administration building visits, and outdoor walks [24] (Fig. 2(o)). The system exploits classification and Optical Character Recognition (OCR) for supermarket visits; OCR, object detection, and face and emotion recognition for administrative building visits; and face recognition and text-to-speech conversion techniques for outdoor environments.

A Convolutional Neural Network (CNN)-based wearable travel system for VI users in indoor and outdoor environments has been proposed [25]. The system is capable of providing environment perception and navigation for VI users. It utilizes an Inertial Measurement Unit (IMU) to acquire the altitude angle of the camera, and a smartphone for position acquisition, navigation, object detection, and acoustic feedback to the VI user. A lightweight CNN-based PeleeNet [26] object detection model trained on the MS COCO dataset is used in the system. Another similar deep learning-based wearable assistive system for VI users to enhance environment perception has been proposed [27]. This CNN-based segmentation and obstacle avoidance system utilizes CPU and GPU computation power for real-time performance, and the smartphone provides a touch interface for conveying environmental information to the VI user. A CNN-based FuseNet [28] is utilized for the segmentation of captured image frames.
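
A hedged sketch of running a lightweight COCO-trained detector on a captured frame is given below. PeleeNet [26] is not bundled with torchvision, so SSDLite with a MobileNetV3 backbone is used here purely as a comparable lightweight stand-in; it is not the model used in [25].

```python
import torch
import torchvision

# SSDLite/MobileNetV3 as a stand-in for a lightweight COCO-trained detector
# such as PeleeNet; requires a recent torchvision with the weights API.
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")
model.eval()

def detect_objects(frame_rgb_float, score_threshold=0.5):
    """frame_rgb_float: HxWx3 array scaled to [0, 1]; returns labels and boxes."""
    tensor = torch.from_numpy(frame_rgb_float).permute(2, 0, 1).float()
    with torch.no_grad():
        output = model([tensor])[0]
    keep = output["scores"] > score_threshold
    return output["labels"][keep].tolist(), output["boxes"][keep].tolist()
```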

Table 1 summarises the literature based on sensors used, image processing techniques used for decision making, and wireless communication techniques used for feedback to the VI user.

Table 1 Summary of vision-based assistive systems for VI users

3 Conclusion

In this research, a literature review of computer vision-based solutions for visually challenged people is presented. Table 1 summarizes the survey, categorizing the studies based on the sensors, image processing algorithms, and communication techniques used by each study. The review suggests that standard digital image processing techniques were used in the early days of computer vision-based assistive solutions for VI users, whereas machine learning and deep learning techniques dominate more recent work. Wi-Fi and Bluetooth have been used in the majority of studies that employ wireless communication. Many assistive systems use only a camera, while others combine the camera with RFID, GPS, GSM, ultrasonic sensors, sound output, and other technologies. Researchers have begun to apply deep learning approaches to assistive solutions for VI users as machine learning and deep learning techniques have matured with the arrival of greater computational power. However, carrying computationally powerful devices for vision-based assistive solutions is inconvenient for VI users. Deep learning models may instead be optimized for edge inference using current optimization techniques, of which quantization and layer pruning are key parts. With such optimization of deep learning models for inference on edge devices, vision-based assistive solutions can be further improved.
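
As a closing illustration, the sketch below applies the two optimization steps mentioned above, pruning and dynamic quantization, to an off-the-shelf model using PyTorch utilities. The choice of model, pruned layers, sparsity level, and quantized layer types are assumptions for demonstration only, not a recipe from any of the surveyed works.

```python
import torch
import torch.nn.utils.prune as prune
import torchvision

# Arbitrary pretrained backbone used only to demonstrate the optimization steps.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()

# Layer pruning: remove 30% of the weights in every convolution, then make it permanent.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Quantization: dynamic 8-bit quantization of the linear (classifier) layers.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```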