
1 Introduction

The elderly represent the population group with the highest level of dependency and need for care, as they are prone to physical and mental disability or deterioration [7]. Moreover, they need adequate supervision in case of an accident or any other need. However, continuous supervision leads to work overload for carers, both in specialised centres and at home. In addition, due to the high costs of specialised care, family members often have to look after dependent persons themselves. All this has led to research into strategies to reduce the work overload of caregivers [8].

Several research projects have focused on supporting caregivers through the use of various technologies for monitoring dependent persons, among them the use of real-time computer vision. One such project aimed to detect whether a person was eating, observing or taking their medication, and to notify the caregiver of the events that occurred [21]. This proposal demonstrates the usefulness of image processing, where objects and actions are distinguished through the use of colour-based computer vision algorithms.

Other works have focused on human activity recognition using different devices, ranging from smartphones, wearables, video and electronic components to more innovative systems based on WiFi or assistance robots [19]. Interestingly, 60% of these technological monitoring solutions are based on computer vision with different camera types. Although visual monitoring has proven to be a viable and popular option, its implementation within a home would require a large number of cameras. Innovative approaches are therefore emerging in which unmanned aerial vehicles (UAVs), using robust trajectory planning to fly safely in indoor environments, monitor dependent people via an on-board camera [5].

Conducting these kinds of experiments indoors with drones can be dangerous in real environments. For this reason, research has initially focused on 3D environments through the development of a virtual reality (VR) platform [2, 3]. Thanks to this approach, the benefits of using drones as assistant robots for monitoring dependent people can be evaluated in a realistic and safe virtual environment. The platform is based on real-time communication, through the Message Queue Telemetry Transport (MQTT) protocol, among the various modules implemented to recreate the behaviour of an autonomous vision-based UAV for monitoring dependent people. One of the main modules is the computer vision module, in charge of processing the images captured by the UAV’s on-board camera to detect various states of the person [12]. This article complements that previous research by focusing on image processing to detect the postures of the monitored person. We also use the MQTT protocol to transfer the images from the virtual environment to the image processing module in Python.

2 Monitoring Dependent People at Home from UAVs

Our research revolves around the use of small vision-based UAVs to assist dependent people at home [2,3,4,5,6, 12]. The main objective is to monitor the person in order to determine their condition and possible required assistance. Another alternative for monitoring would be the use of static cameras. However, this would require deploying multiple cameras in the home to avoid dead spots and, in addition, the ability to detect the person would be reduced as the person moves away from the camera. Therefore, a moving aerial robot has the potential to cover a larger area and monitor more closely and efficiently than a number of static cameras.

Fig. 1. The three postures to be identified (standing, sitting and lying down).

UAVs are useful tools that have been used in conjunction with image processing in various research areas. For instance, UAVs have been combined with colour-detection-based image processing for human body detection in outdoor environments [18]. However, in indoor scenarios, monitoring people with only colour-based algorithms is problematic due to the complexity and the large number of objects and colours found in a house. On the other hand, through image processing it is possible to detect objects based on colour [21] or recognise moods from faces [12], but the detection of human posture is more complex.

In addition to cameras, some hardware devices incorporating different sensors have been used to monitor a person effectively. For example, the Microsoft Kinect has a depth sensor that allows more optimised tracking of the person’s skeleton [13, 20]. Some UAVs currently carry depth sensors to avoid obstacles and even track people, as is the case with DJI drones [22]. However, solutions relying solely on conventional cameras have also been proposed for the detection and estimation of human poses. This is the case of a model for detecting suspicious movements in people through posture estimation; that algorithm takes 3.4 s to detect a single person, without considering the extra time required to process and estimate the postures [15].

Efficient monitoring of dependent people requires shorter processing times. This is why our proposal focuses on a promising algorithm for human detection and pose estimation that emerged from very recent research [23]. The ultimate objective is to determine whether a person is standing, sitting or lying down by processing only the colour images captured by a UAV’s camera. Figure 1 illustrates the three postures to be identified during the monitoring process in the VR platform. It should be noted that dependent people are sitting or lying down most of the time. A system that can efficiently detect the person, differentiate them from other objects and recognise their current posture is an advance in the monitoring, supervision and alerting of elderly or dependent people through affordable devices.

3 Computer Vision Algorithms

This section describes the computer vision algorithms implemented to detect whether the person is standing, sitting or lying down. First, the MediaPipe framework used to obtain the required key points of the human skeleton is introduced. Then, the key points are used to detect the human avatar in the images obtained from the virtual scene in Unity. Finally, we describe how the three different postures are identified.

3.1 MediaPipe

MediaPipe is a framework for building multimodal applied machine learning pipelines. It provides solutions for different kinds of applications, such as face detection [11] and hand landmark detection and tracking [10]. In this paper, the Pose library of MediaPipe is used to map the 3D pose landmarks that estimate the joints of the human skeleton. The library generates up to 33 landmarks, each with a unique id, but in our solution only seven relevant joints are used to determine which of the three target postures the person is in (standing, sitting or lying down). The landmarks used are illustrated on the human avatar in Fig. 2.
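As an illustration, the seven landmarks could be obtained with the MediaPipe Pose Python API roughly as follows. This is a minimal sketch: the helper name extract_joints and the dictionary layout are ours, not part of the original implementation.

import cv2
import mediapipe as mp

# The seven relevant landmark ids (see Fig. 2).
RELEVANT_IDS = {
    "nose": 0,
    "left_shoulder": 11, "right_shoulder": 12,
    "left_hip": 23, "right_hip": 24,
    "left_knee": 25, "right_knee": 26,
}

# One detector instance reused for every received frame.
pose_detector = mp.solutions.pose.Pose(static_image_mode=False)

def extract_joints(image_bgr):
    """Return the (x, y) pixel coordinates of the seven relevant joints, or None."""
    height, width = image_bgr.shape[:2]
    results = pose_detector.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    landmarks = results.pose_landmarks.landmark
    # MediaPipe returns normalised coordinates, so scale them to the image size.
    return {name: (int(landmarks[i].x * width), int(landmarks[i].y * height))
            for name, i in RELEVANT_IDS.items()}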

Fig. 2. Landmarks highlighting the relevant joints: nose = id(0); left shoulder = id(11); right shoulder = id(12); left hip = id(23); right hip = id(24); left knee = id(25); right knee = id(26).

3.2 Human Detection

The VR visualiser module, developed in Unity as part of the VR simulation platform of the assistant UAV, transmits the images captured by the UAV’s on-board camera via MQTT to the new computer vision module. This module, programmed in Python, processes each image with MediaPipe and obtains the relevant points of the skeleton of the human avatar. The recognition of the avatar skeleton is shown in Fig. 3 for the three different postures to be estimated. It should be noted that the algorithm scans and generates the avatar skeleton correctly in each of the postures evaluated in this work, in a room with several background colours. Such backgrounds would be an obstacle for many algorithms, in which false positives could be generated [9], requiring additional processing to reduce these false alarms [16]. The efficiency of the visual computation in the MediaPipe algorithm is remarkable, as complex systems or multi-camera scanning are usually required to achieve this type of complete human mapping [17].
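As an illustration of this MQTT link on the Python side, a minimal receiving sketch with the paho-mqtt client is given below. The broker address, the topic name and the assumption that frames are published as JPEG byte buffers are ours, since the text does not specify them.

import cv2
import numpy as np
import paho.mqtt.client as mqtt

BROKER = "localhost"        # assumed broker address
IMAGE_TOPIC = "uav/camera"  # hypothetical topic name

def on_message(client, userdata, msg):
    # Assumes the VR visualiser publishes each camera frame as a JPEG byte buffer.
    frame = cv2.imdecode(np.frombuffer(msg.payload, np.uint8), cv2.IMREAD_COLOR)
    if frame is not None:
        joints = extract_joints(frame)  # MediaPipe sketch from Sect. 3.1
        print(joints)

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER)
client.subscribe(IMAGE_TOPIC)
client.loop_forever()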

Fig. 3. Skeleton landmarks in the three postures (sitting, standing, and lying down).

Figure 4 shows a block diagram of the solution implemented in Python to detect the posture of the avatar from an image received from the UAV camera monitoring the avatar in the virtual house. Once the image captured in Unity reaches Python, it is scanned by the MediaPipe algorithm, which produces the skeleton reference points. Of these, only the seven relevant points corresponding to the shoulders, hips, knees and nose are considered to estimate the posture of the avatar. From the appropriate identifiers, the (x, y) position of these points in the 2D image plane is calculated. These coordinates are used to determine the posture of the avatar, as detailed in the next section.

Fig. 4. Block diagram of the implemented solution.

3.3 Posture Detection

Once the coordinates of each point in the 2D plane are obtained, it is possible to estimate the avatar’s posture from the position of the selected points and the angles formed among them using simple trigonometry, which is far simpler than the use of more advanced algorithms or classifiers [1, 14].

Firstly, the heights of the shoulder points, id(11) and id(12), are used to determine the direction of the avatar, since with a slight rotation the two shoulders differ in height in the 2D plane. The direction of the avatar determines whether the points on the left or right side of the avatar are used in the remaining calculations: the left points (ids 11, 23, 25) when the avatar is turned to its right (the left of the image), or the right points (ids 12, 24, 26) when the avatar is turned to its left (the right of the image). A minimal sketch of this side selection is shown below.
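The exact mapping from the shoulder heights to the avatar’s direction is an assumption on our part, since the text only states that the two shoulders differ in height after a slight rotation.

# Joint names follow the extract_joints() sketch; image y grows downwards.
LEFT_SIDE = ("left_shoulder", "left_hip", "left_knee")      # ids 11, 23, 25
RIGHT_SIDE = ("right_shoulder", "right_hip", "right_knee")  # ids 12, 24, 26

def pick_side(joints):
    """Return the (shoulder, hip, knee) coordinates of the side used for estimation.

    Assumption: the side whose shoulder appears higher in the image (smaller y)
    is the one facing the camera.
    """
    left_higher = joints["left_shoulder"][1] < joints["right_shoulder"][1]
    side = LEFT_SIDE if left_higher else RIGHT_SIDE
    return [joints[name] for name in side]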

Secondly, the difference in height between the nose and the knee on the side of the skeleton determined by the avatar’s direction is analysed. After trial-and-error adjustment, it has been determined that if this distance is less than 20 pixels, the person is lying down (see Fig. 5a). However, if the distance is greater than or equal to 20 pixels, the person is in a standing or sitting posture. It should be noted that the tests have been carried out with the person lying down with the body in a horizontal direction (the imaginary line that would join the head with a foot). When this direction changes and approaches the vertical, the lying posture looks very similar to the standing posture in a 2D image (see Fig. 5b), leading to estimation errors that will need to be resolved in future work. Finally, between the standing and sitting postures, the angles formed by the shoulder-hip-knee points differ noticeably (see Fig. 5b and 5c). Therefore, these postures can be differentiated by simply measuring this angle: if the angle is greater than 140\(^\circ \), the avatar’s posture is standing, while if it is less than 140\(^\circ \), the avatar’s posture is sitting.
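Putting the two criteria together, a minimal classification sketch (reusing the pick_side() helper above and the thresholds stated in the text) could look like this:

import math

LYING_THRESHOLD_PX = 20  # nose-knee height difference found by trial and error
HIP_ANGLE_DEG = 140      # shoulder-hip-knee angle separating standing from sitting

def hip_angle(shoulder, hip, knee):
    """Angle in degrees at the hip formed by the shoulder-hip-knee points."""
    a1 = math.atan2(shoulder[1] - hip[1], shoulder[0] - hip[0])
    a2 = math.atan2(knee[1] - hip[1], knee[0] - hip[0])
    angle = abs(math.degrees(a1 - a2))
    return 360 - angle if angle > 180 else angle

def classify_posture(joints):
    """joints: joint name -> (x, y) pixel coordinates, as in the previous sketches."""
    shoulder, hip, knee = pick_side(joints)

    # Lying down: nose and knee are at nearly the same height in the image.
    if abs(joints["nose"][1] - knee[1]) < LYING_THRESHOLD_PX:
        return "lying down"

    # Otherwise the shoulder-hip-knee angle separates standing from sitting.
    return "standing" if hip_angle(shoulder, hip, knee) > HIP_ANGLE_DEG else "sitting"

Note that the thresholds of 20 pixels and 140\(^\circ \) are taken directly from the text; they would likely need re-tuning for different image resolutions or camera distances.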

Fig. 5. a) Height condition in the lying down posture. b) Angles in the standing posture. c) Angles in the sitting posture.

4 Preliminary Results

This section introduces some preliminary results of the solution implemented to detect the human pose during the monitoring process of the assistant UAV. The tests have been performed on the VR platform, where the images from the UAV camera in the virtual scenario in Unity are sent via MQTT to the computer vision module programmed in Python. Here, the objective is first to determine the direction of the avatar in the 2D image captured by the camera, that is, the side to which the avatar’s body is turned, either to the left or to the right. Then, considering this direction and the angles formed by the three points of the shoulder, hip and knee joints, as well as the height of the nose in relation to the knee, the aim is to determine and differentiate the avatar’s posture.

In order to evaluate the performance of the computer vision solution, different tests have been carried out considering the three possible postures of the avatar, and also placing the avatar turned around on opposite sides of the room, where its direction and skeletal points change. The results have been positive in all cases, as the direction and posture of the avatar were determined correctly, as shown in Fig. 6. This figure shows the windows generated by the OpenGL library from the Python program. The upper images show the avatar in the three postures while slightly turned to its left, and the lower images show the same result on the other side of the room, where the avatar is turned to its right.

Fig. 6. Posture detection results.

5 Conclusions

This article has introduced a computer vision solution for detecting the posture of a person monitored by a UAV for home care. The developed solution is based on the Pose library of MediaPipe and differentiates between three possible postures: standing, sitting and lying down. It performs a series of trigonometric calculations on relevant reference points of the human skeleton. The solution has been implemented in the computer vision module programmed in Python for the VR platform, which simulates the process of monitoring a dependent person with a small drone in a virtual home.

The first evaluation results of the programmed solution are satisfactory. Furthermore, it should be noted that the MediaPipe Pose library promises fast and optimised recognition of the human body and can therefore be used in real-time systems. In addition, the skeleton is generated progressively as the human body appears in the camera view, which prevents the algorithm from stalling due to incomplete visualisation of the body. The points provided by the library cover the entire human skeleton and can be used in future work for more extensive posture recognition. One of the main areas for improvement is to extend the recognition of the lying posture to situations where the direction of the body is not horizontal. Another line of future work is the estimation of the person’s distance from the UAV’s camera so that this information can be used by the drone’s trajectory planner.