
1 Introduction

Advancements in technology have led researchers to develop systems for medical applications, especially in the surgical environment [1]. Surgical procedures have become increasingly dependent on a variety of digital imaging systems for navigation, documentation, and diagnosis. The need to analyze and examine these images during surgical procedures creates barriers, given the requirement to maintain a sterile surgical environment, especially under the current Coronavirus (COVID-19) pandemic. Maintaining sterility in the operating room is of utmost importance both for the safety of the healthcare staff and for giving patients the right treatment. Hence, a controlled environment is crucial to minimize the risks that are likely to arise. Traditional input devices such as the keyboard, touchscreen, and mouse depend on physical contact. However, such contact-based interaction poses a danger of transferring contaminated material. To maintain a sterile environment in operating rooms and to avoid having to re-scrub, doctors rely on assistants who control the equipment and manipulate the images for them. This may, however, increase operating time, errors, and the chance of failure. Studies and surveys indicate that these practices can lead to confusion and put the patient's life at risk. Microsoft launched a touchless interaction device, the Kinect, in 2010. As shown in Fig. 1, the proposed system uses this device and tries to minimize the shortcomings mentioned above, providing an efficient alternative for doctors. In this direction, researchers have discussed ways and means of smart touchless healthcare [2].

Fig. 1. Kinect setup in healthcare [3].

Contributions:

The scenario in a modern operating room is very demanding these days. Hence, reliable and efficient technologies are in great demand, especially during the current COVID-19 pandemic, to look after the safety of patients and healthcare staff at the same time. Although new techniques have emerged in the medical domain to ease doctors' work, most tasks are still performed manually, and a system of the kind proposed here, operated via gestures and speech, is not currently in use. The proposed system is developed primarily to help doctors cope with conditions in the modern operating room. Instead of touching the system physically, doctors can control video playback by simply waving their hands. Furthermore, video commands (such as play, pause, stop, volume up, and volume down) can be issued by speech for smooth and efficient control. The advantages of this system are (i) minimizing the need for supporting staff in operating rooms, (ii) access to real-time video (assistance), and (iii) recording of important data for doctors' assistance, avoiding inaccuracies in the data; to support this, a speech-to-text conversion feature is also one of the requirements of the system [10]. The Kinect can also act as a 3D scanner, and scanned copies of organs can later be 3D printed and serve as organs for transplantation. We argue that this idea, if successfully implemented, can work more efficiently than the systems presently in use.

2 Related Work

This section presents existing work related to the healthcare operating room. Saxena et al. [4] developed a proof-of-concept prototype, the NHealthIoT testbed, and demonstrated its usability. Yadav et al. [5] discussed cursor movement by hand gesture and proposed a healthcare application that requires only a simple webcam for implementation. For hand-gesture recognition, there are generally two main approaches: hardware-based and vision-based. The hardware-based approach uses a data glove to achieve gesture recognition. The proposed system effectively eliminated the need for a mouse pad [5].

Pei Xu [6] described a Human-Computer Interaction (HCI) system and a hand-gesture recognition method that work in real time. Human-computer interaction, gesture recognition, and hand detection are the three components of the proposed system. Preprocessing of images is the first step, after which a hand detector filters the hand region from the image. A Convolutional Neural Network (CNN) classifier is integrated with a Kalman estimator to recognize gestures from the processed image. Finally, the results are submitted to a control center, which makes decisions based on a probabilistic model [6]. Other authors have used the Microsoft Kinect for hand detection and gesture recognition to simulate mouse events. Two methodologies, glove-based and vision-based, have been studied. The main disadvantage of the glove-based approach is the cost of the devices used, whereas a vision-based approach offers a more natural form of HCI since no physical contact is required; its major drawback is occlusion [21]. Sawai and Shandilya [14] reviewed gesture and speech recognition using the Kinect device and explored algorithms for the Kinect sensor. They studied Kinect's gesture recognition, microphone-based speech recognition, object tracking, and 3D mapping, and built a system to identify and understand gestures and speech using Kinect sensors. Voice control of a phone normally suffers from the inconvenience of the short distance required between the human operator and the device; voice sensing through the Kinect provides a way for operators to control the phone without carrying it. McKay and Clement [12] investigated the application of the Microsoft Kinect to visual-only automatic speech recognition and reviewed the Kinect's capability as an automatic speech recognizer. They implemented a program that identifies spoken words and reports the confidence with which the words are recognized. They stated that the Microsoft Kinect API gives 90% accuracy for their word recognition system.

Tiangang et al. [20] highlighted 3D surface reconstruction based on the Kinect sensor and described the reconstruction pipeline, which has four steps: preprocessing of the data, pose estimation of the sensor, fusion of the depth data, and finally extraction of a 3D surface. The authors in [7] presented results on fusing real-time observations in 4D and formed a method that applies incremental refinements to improve the surface estimate over time while parameterizing non-rigid scene motion. They merged the concept of volumetric fusion with the estimation of a smooth deformation field across RGBD views. This method is extremely robust to topology changes and large frame-to-frame motion, permitting reconstruction of exceptionally difficult scenes. Its main advantage is that it either deforms an online-generated template or continually fuses data non-rigidly into a single reference model. For slow frame rates and large frame-to-frame motion, however, the motion will be imprecisely estimated or the non-rigid alignment will fail to converge.

Casino et al. [13] described an approach that uses recommendation systems to offer healthcare services so that citizens can collaborate within the city to improve their quality of life. Sholla et al. [15] presented a new approach that incorporates ethics in IoT-based connected smart healthcare. Jangra et al. [16] proposed a multilayered framework to enhance the utility of biosensor-based data collection and aggregation. Alabdulatif et al. [17] demonstrated a framework through a case study on patient biosignal data. The authors in [18] proposed a system to enhance the capabilities of IoT-based healthcare systems with fast response times and low latency. Pathinarupothi et al. [19] introduced an IoT-based smart edge system for remote monitoring of patients, in which data from wearable vital-sign sensors is transmitted to software engines. Webster and Celik [31] presented a systematic review of Kinect applications for elderly care and stroke rehabilitation.

The Kinect sensor is also used in many other applications, such as interactive educational technology [23, 28], gesture recognition systems designed for people with severe intellectual disabilities [27, 29], and motor rehabilitation [24, 30]. Researchers have also put enormous effort into developing methods for the diagnosis of, and intervention for, children with autism [22, 25, 26].

3 Preliminaries: KINECT

Microsoft launched the Kinect, a line of motion-sensing input devices for the Xbox 360, as shown in Fig. 2. In the Xbox 360 Kinect, a small motor in the base is attached to the flat sensor box and allows the device to be tilted. The main components of the Kinect are an infrared (IR) emitter, a color camera, a tilt motor, an LED, a microphone array, and an IR depth sensor.

Fig. 2. Kinect device model.

The Kinect captures a stream of color pixels along with per-pixel depth data; in the depth stream, the value of each pixel represents the distance from the sensor to the object in that direction [21]. Thus, the Kinect gives developers a means to create touchless applications based on gesture, voice, and movement.
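As a minimal illustration of this depth representation (an assumed frame layout with depth values in millimeters, not the Kinect SDK API), the following sketch masks the pixels that lie within a given range of the sensor:

```python
import numpy as np

# Illustrative sketch: each depth pixel is assumed to hold the distance in
# millimeters from the sensor to the nearest object in that direction.

H, W = 480, 640  # Kinect v1 depth resolution

def nearest_object_mask(depth_mm, max_range_m=1.0):
    """Return a boolean mask of pixels closer than max_range_m
    (zero values, meaning no reading, are ignored)."""
    depth_m = depth_mm.astype(np.float32) / 1000.0
    return (depth_m > 0) & (depth_m < max_range_m)

# Simulated frame: background at 3 m with a hand-sized blob at 0.8 m.
frame = np.full((H, W), 3000, dtype=np.uint16)
frame[200:260, 300:360] = 800
mask = nearest_object_mask(frame)
print(f"{mask.sum()} pixels within 1 m of the sensor")
```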

4 Proposed Methodology

The proposed system uses the Kinect sensor. The Kinect sensor mainly consists of three parts:

  • An RGB camera for capturing color images, which stores three-channel data at a resolution of 1280 × 960.

  • An IR depth sensor and an infrared (IR) emitter.

  • A microphone array consisting of four microphones for capturing sound, which allows audio to be recorded and the direction and location of the sound source to be estimated.

4.1 Using Kinect to Control the System Cursor

The system cursor is controlled by gesture recognition. Gesture recognition is implemented in the following phases:

  • First, the skeletal joints are recognized by Kinect and are sent as input data to the system.

  • Next, this joint data is used to recognize certain gestures.

  • The recognized gesture is then interpreted to perform the task mapped to it.

In the proposed system, two mouse events, namely mouse click and mouse motion, are implemented. This is done by mapping the mouse coordinates for cursor motion to one hand and the click event to the other hand. To implement this, the depth image is sent to the host device by the Kinect sensor, and software on the host decodes the information present in the image. Before the cursor is assigned to the hands and moved, some messages are sent to the control inputs. The X and Y positions of the cursor are assigned once the user's hand is recognized. If the left hand is mapped to mouse movement and the right hand to the click event, the distance between the left wrist and the left shoulder is obtained and scaled to the mouse coordinates; if this distance falls below 0.2f (the threshold value), the left-click event is raised.
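A minimal sketch of this mapping is shown below. It assumes a hypothetical skeletal tracker that reports joints as (x, y, z) positions in meters; the scaling constants and screen size are illustrative choices, not values from the proposed system:

```python
import math

# Illustrative sketch (not the authors' code): map skeletal joints to a
# cursor position and a click event, as described above.

CLICK_THRESHOLD = 0.2   # meters, the 0.2f threshold described in the text
SCREEN_W, SCREEN_H = 1920, 1080
REACH_X, REACH_Y = 0.45, 0.35  # assumed reach of the hand around the shoulder

def distance(a, b):
    """Euclidean distance between two joints given as (x, y, z) in meters."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cursor_from_left_hand(left_wrist, left_shoulder):
    """Scale the wrist offset from the shoulder into screen coordinates."""
    dx = left_wrist[0] - left_shoulder[0]
    dy = left_shoulder[1] - left_wrist[1]  # screen y grows downward
    x = (dx / REACH_X + 0.5) * SCREEN_W
    y = (dy / REACH_Y + 0.5) * SCREEN_H
    # Clamp to the screen bounds.
    return min(max(x, 0), SCREEN_W - 1), min(max(y, 0), SCREEN_H - 1)

def is_click(wrist, shoulder):
    """Raise a click when the wrist-to-shoulder distance drops below the threshold."""
    return distance(wrist, shoulder) < CLICK_THRESHOLD

# Example frame: joints as a skeletal tracker might report them.
left_shoulder, left_wrist = (-0.20, 0.40, 2.0), (-0.05, 0.55, 1.8)
print(cursor_from_left_hand(left_wrist, left_shoulder))
print("click" if is_click(left_wrist, left_shoulder) else "no click")
```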

Gesture recognition using the Kinect can also be applied to posture recognition, fall detection, and security and surveillance in modern healthcare operating rooms. This can be done using one or more cameras, intruder detectors, and communication devices to raise alarms. A high-end PC can also be used, making the setup suitable for real-time security.

4.2 Speech Recognition

The Microsoft Kinect contains a microphone array that acts as the voice receiver for speech recognition, as shown in Fig. 3. The data acquired by these microphones are recombined into a single voice stream. The developer first needs to add the voice-command keywords to the grammar (XML) file before they can be recognized. Data are received continuously by the sensors until they contain a voice-command keyword stated in the grammar file of the code running behind the sensors. The Kinect's speech recognition is used in command mode, i.e., a command is spoken and recognized by the speech recognition engine (SRE). For example, one may play or pause a video by simply saying “start” or “wait”. To develop a speech-enabled application, one typically performs the following basic steps (a minimal sketch follows the list):

  • Enable the Kinect audio source.

  • Capture the audio data stream.

  • Identify and start the SRE.

  • Attach the speech audio source to the recognizer.

  • Register the event handler.

  • Finally, handle the different events invoked by the SRE.
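The following sketch illustrates the command-mode flow in simplified form. It is not the Kinect SDK API: the recognizer events are simulated, and the grammar, confidence threshold, and player interface are assumptions made for illustration:

```python
# Illustrative sketch of command-mode speech control: a recognized phrase from
# the grammar is mapped to a video-player action by a registered event handler.

# Accepted voice commands, analogous to the keywords in the grammar (XML) file.
GRAMMAR = {"start": "play", "wait": "pause", "stop": "stop",
           "volume up": "volume_up", "volume down": "volume_down"}
CONFIDENCE_THRESHOLD = 0.7  # assumed value; reject low-confidence hypotheses

def on_speech_recognized(text, confidence, player):
    """Event handler registered with the recognizer: map a recognized phrase
    to a player action."""
    if confidence < CONFIDENCE_THRESHOLD:
        return  # ignore uncertain recognitions
    action = GRAMMAR.get(text.lower())
    if action is not None:
        getattr(player, action)()

class DummyPlayer:
    """Stand-in video player used only to make the sketch runnable."""
    def play(self): print("video playing")
    def pause(self): print("video paused")
    def stop(self): print("video stopped")
    def volume_up(self): print("volume up")
    def volume_down(self): print("volume down")

# Simulated recognition events as the SRE might report them.
player = DummyPlayer()
for phrase, conf in [("start", 0.93), ("wait", 0.88), ("volume up", 0.42)]:
    on_speech_recognized(phrase, conf, player)
```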

Fig. 3. Speech recognizer grammar.

Fig. 4. Kinect system interface control for the experiment setup.

4.3 3D Scanning of Real Objects

For scanning real-world objects, the Kinect can act as a 3D scanner. The Microsoft Kinect is used in the proposed system to provide 3D object scanning and model creation. By integrating depth data taken from the Kinect sensor over time and from various viewpoints, dense surface models are reconstructed into smooth surfaces. By moving either the object or the Kinect sensor, the multiple viewpoints of the object are fused into a single reconstruction voxel volume. As the sensor is moved around, the various views are integrated to create the 3D model. The first stage of scanning 3D models of real objects is the depth map conversion: raw depth data from the Xbox 360 Kinect is taken and converted into floating-point depth in meters. The surface normals at these points are used with the AlignPointClouds functions.

A Kinect system interface control is shown in Fig. 4. In the second stage, the global camera pose is calculated and tracked frame by frame as the sensor moves, so the current sensor pose relative to the initial starting frame is always known by the system. Kinect Fusion provides two alignment algorithms:

  • NuiFusionAlignPointClouds

  • AlignDepthToReconstruction

The third stage is the fusion of depth data using a running average to reduce noise; the integration of depth data is performed per frame, and this stage also handles some dynamic changes in the scene. The result is a .obj file of the scanned 3D model. This .obj file of the scanned organ is imported into MeshLab, where enhancements can be performed on it. The scanned organ could then be 3D bioprinted and hence used for transplantation.
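As a simplified illustration of the per-frame fusion stage (not the Kinect Fusion implementation; it ignores camera pose tracking and assumes raw depth in millimeters), the sketch below converts raw depth to meters and folds each frame into a running average:

```python
import numpy as np

# Simplified sketch of the fusion stage: raw depth (assumed in millimeters)
# is converted to floating-point meters (stage 1) and integrated per frame
# into a running average to suppress sensor noise (stage 3).

H, W = 480, 640  # Kinect depth resolution

def depth_to_meters(raw_depth_mm):
    """Stage 1: depth map conversion to floating-point meters."""
    return raw_depth_mm.astype(np.float32) / 1000.0

class RunningAverageFusion:
    """Stage 3: fuse per-frame depth using an incremental running average."""
    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float32)
        self.count = np.zeros(shape, dtype=np.int32)

    def integrate(self, depth_m):
        valid = depth_m > 0  # zero means no depth reading
        self.count[valid] += 1
        # Incremental mean update only where a valid reading exists.
        self.mean[valid] += (depth_m[valid] - self.mean[valid]) / self.count[valid]
        return self.mean

# Simulated noisy frames of a flat surface about 1.5 m away.
fusion = RunningAverageFusion((H, W))
rng = np.random.default_rng(0)
for _ in range(10):
    raw = (1500 + rng.normal(0, 20, size=(H, W))).astype(np.int32)
    fused = fusion.integrate(depth_to_meters(raw))
print(f"fused depth at center: {fused[H // 2, W // 2]:.3f} m")
```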

5 Performance Analysis

This section compares the performance efficiency of the proposed approach with that of traditional manual work.

5.1 Comparison with Manual Work

There are two ways to scan through patients' reports and medical records: the traditional means, which involves physical touch, and the proposed system, which avoids physical touch.

Table 1. Average user time

As shown in Table 1, the average times for the three users were recorded as 1.78 s, 1.48 s, and 1.59 s using the proposed system. The time of 1.48 s is the minimal delay, compared with the removal of scrubs recorded at 16.07 s, and is considered an improvement in the time efficiency of the operation. The total average time for this analysis comes out to 1.61 s, which is a reasonable result.
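A quick arithmetic check of these figures (values taken directly from the text above):

```python
# Per-user averages with the proposed system and the scrub-removal time
# from the traditional workflow, as reported in the text.
per_user_times = [1.78, 1.48, 1.59]   # seconds
scrub_removal_time = 16.07            # seconds

overall_average = sum(per_user_times) / len(per_user_times)
print(f"overall average: {overall_average:.2f} s")                             # about 1.6 s
print(f"best-case saving: {scrub_removal_time - min(per_user_times):.2f} s")   # 14.59 s
```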

5.2 Comparison with Existing Schemes

For gesture and action recognition, many approaches have been proposed by researchers. Using MEMS inertial sensors, it is possible to:

  • Minimize delay, as the sensors are lightweight.

  • Reduce overall cost, as the cost of such sensors has been dropping.

  • Capture the movement of a human in real, unconstrained environments with accurate results.

  • Provide near real-time feedback.

However, the Kinect comes with an RGB color camera and a depth sensor, which provide gesture recognition along with full-body 3D motion capture in the form of skeleton points. Real-time feedback can also be obtained using the Microsoft Kinect. This technology is widely used for the following reasons:

  • There is no requirement for body sensors.

  • Skeleton data can be extracted using software such as OpenNI and Kinect SDK.

Thus, for the reasons mentioned above, using the Kinect or other such cameras proves better for implementing a gesture recognition system than any wearable technology. The proposed gesture recognition was compared with other state-of-the-art methods to evaluate its advantages and disadvantages. The system is compared with [8, 9] and [11]. In [8], the authors proposed a method based on the ZCam and an SVM-SMO classifier. In [9], the authors proposed a hand gesture recognition system using a range camera with satisfactory real-time capability. In [11], motion tracking was combined with the mean-shift algorithm to capture hand gestures.

5.3 Results and Discussion

The system benefits the medical field and the community by making the operating process more efficient and keeping it in an aseptic and sterile condition, with the added benefit of reduced time consumption. It increases efficiency as well as reliability, thereby saving much of the doctors' time and reducing the supporting staff required. The system was also examined by doctors from a renowned hospital, where it was appreciated and found beneficial by the surgeons themselves.

Table 2 presents comparative experimental results for the contact system and the touchless system. The traditional contact system includes attributes such as removal of scrubs (16.07 s), changing of plates (8.00 s), walking to the viewing room/area (18.8 s), and re-scrubbing (26.17 s), whereas the modern contactless system includes program startup (5.38 s), detection of gestures (1.52 s), and image flow view startup (16.7 s and 26.1 s). As observed from Table 2, the touchless system outperforms the traditional contact system.

This work also compares the proposed technique with the methods proposed in [8, 9], and [11]. The comparison is presented in Table 3, where the average recognition accuracy and miss rate are computed for each method. The average recognition accuracy values for the methods in [8, 9, 11] and the proposed method are 0.78, 0.74, 0.77, and 0.85, respectively, whereas the corresponding miss rates are 0.18, 0.20, 0.16, and 0.14.

Table 2. Experimental results: contact vs. touchless system
Table 3. Comparison of Kinect with other methods

6 Conclusion and Future Work

The current scenario of the COVID-19 pandemic demands highly efficient techniques and procedures to handle the large number of patients seen every day, for months on end, around the world. The proposed approach falls squarely within the scope of the current healthcare situation. The proposed system is developed primarily to help doctors deal with conditions in the modern operating room, and it minimizes the need for supporting staff there. Doctors also sometimes need real-time video assistance. The proposed system helps to replace traditional methods of interacting with the system by eliminating physical contact and wearable technology, which is one of the key requirements today. Ultimately, the system proves beneficial to the medical field and to the community by making the operating process more efficient while keeping it in an aseptic and sterile condition, with the added benefit of reduced time consumption. The proposed system can be further improved in the future to support other healthcare applications, including stroke rehabilitation to help stroke patients recover and fall prevention, i.e., predicting and preventing falls in the elderly.