1 Introduction

The global population is experiencing rapid growth, accompanied by a significant increase in life expectancy, particularly in developed countries. Bloom and Luca [1] note that life expectancy in China and India has surged by nearly 30 years since 1950. Consequently, a substantial portion of the population in developed nations, approximately 20%, is aged 60 and above, a figure projected to surpass 30% in the next four decades.

With this demographic shift comes a growing concern for elderly care, as the need for assistance and support rises proportionately. Among the myriad challenges faced by the elderly, falls represent a particularly prevalent and perilous occurrence. The World Health Organization highlights alarming statistics on falls, identifying them as the second leading cause of unintentional injury deaths worldwide. Each year, an estimated 684,000 individuals succumb to fall-related injuries globally, with an additional 37.3 million falls severe enough to necessitate medical attention [2]. Apart from the physical harm incurred by the elderly, the economic ramifications are substantial, with fall-related treatment costs comprising a significant portion of healthcare expenditures in various countries such as the USA, Australia, EU15 and the United Kingdom [3].

Automated fall detection for the elderly is feasible through data collected from wearable or environmental devices, such as accelerometers, gyroscopes, and cameras. Furthermore, Human Activity Recognition (HAR) holds promise for diverse applications, ranging from automatic life-logging to identifying patterns indicative of illness [4, 5]. Vision data from cameras is increasingly utilized for fall detection and HAR tasks due to its numerous advantages over wearable devices or other sensors. These advantages include the ability to detect multiple events simultaneously, suitability for various subjects, environments, and tasks, as well as ease of installation and visual verification of data [6].

From an algorithmic standpoint, Deep Learning (DL) has revolutionized digital image processing, emerging as the state-of-the-art approach in numerous domains [7]. Over recent years, a plethora of DL architectures have been developed and evaluated for computer vision tasks, prompting the exploration of suitable models for HAR and fall detection among older adults.

In this study, we conduct a Systematic Literature Review (SLR) focusing on DL-based HAR and fall detection using vision data for elderly care. Our review strictly adheres to the guidelines outlined for conducting SLRs in Software Engineering by Kitchenham and Charters [8], providing a structured methodology and rigorous analysis. The document is structured as follows: we first delve into the background of the study, encompassing previous reviews and defining key concepts; we then enumerate the review questions; next, we elaborate on the review methods, detailing data sources, search strategy, study selection, quality assessment, and data extraction; subsequently, we analyze the resulting studies comprehensively to address the review questions; the discussion section synthesizes our findings and addresses the review questions; finally, we present the conclusions derived from the SLR.

2 Background

In accordance with the guidelines provided by Kitchenham and Charters [8], it is imperative to summarize previous reviews prior to conducting the SLR, thereby substantiating its necessity. Hence, we briefly outline related reviews and surveys from the past three years, as older reviews cannot encompass the most recent studies. The full list can be found in Table 1, providing a visual comparison of the main disparities.

Guerra et al. [9] studied the current state-of-the-art of Ambient Assisted Living (AAL) for frail individuals, including the elderly and disabled, encompassing both wearable and non-wearable solutions. They explored common steps in the Human Activity Recognition (HAR) processing chain. Similarly, Kumar et al. [10] delved into various types of data used for HAR, elucidating common datasets, approaches, and challenges, but not including the elderly population or fall detection. In [11], Tay et al. investigated abnormal behavior detection, such as fall detection, repetition of activities, and accidents. They explored multiple solutions, including visual and wearable sensors, and both conventional and Deep Learning (DL) approaches. A review by Momin et al. [12] explores activity pattern monitoring using depth sensors, considering this visual data as a privacy-preserving alternative to RGB video or images for older adults. The studies are categorized based on the computing technique utilized and the datasets used are analyzed. Olugbade et al. [13] conducted a scoping review on datasets utilized for HAR and fall detection, resulting in an extensive compilation of over 700 datasets of various modalities. Multiple taxonomies were developed to categorize the datasets by population groups, data types, and creation purposes, among others, although only four datasets include elderly subjects. Alam et al. [14] conducted a review specifically on DL-based fall detection systems, analyzing different fall types, popular datasets, evaluation metrics, and architectural variations. Rastogi et al. [15] reviewed a broader range of tasks, including falls and other relevant information extracted from video sequences, such as body shape changes, posture, and gait. Another relevant review is by Gutiérrez et al. [16], also centered on fall detection, which describes common processing steps, ML models, datasets, metrics, and tracking techniques, with most studies utilizing RGB and depth data.

Table 1 Comparison of previous reviews with ours

While prior reviews have addressed various aspects of our research domain, notable differences underscore the necessity of our study. Table 1 sheds light upon this by displaying the pivotal aspects considered in the current review, along with whether they are addressed or not in the aforementioned reviews.

The sole review exclusively focusing on DL techniques was conducted by Alam et al. [14], which, however, omitted HAR from its scope, thus neglecting a significant portion of studies included in our analysis. In contrast, other reviews encompassed techniques employing handcrafted features or classical vision approaches, reflecting a broader scope than our exclusive focus on DL-based solutions. Furthermore, previous reviews often overlooked the importance of studying DL-related nuances, such as the significance of training datasets, architectural considerations, and feature extraction methods. In our review, we meticulously categorize and elucidate these nuances through a comprehensive taxonomy of identified techniques.

Another notable observation is the limited attention given to HAR in several previous reviews, with some omitting the task altogether. As a result, our review unveils a greater number of studies dedicated to fall detection and HAR in the elderly. Additionally, our analysis delves deeper into the intricacies of these tasks, providing a more comprehensive understanding.

Moreover, only a few reviews explored applications within AAL systems and the associated privacy implications. Hardware specifications, beyond the prevalent use of Kinect cameras, were rarely examined, and the effective deployment of fall detection or HAR systems was not thoroughly explored. In contrast, our review emphasizes these aspects, which are pivotal in facilitating the transference to society.

Finally, it is worth noting that, apart from [13], none of the previous reviews adhered to a systematic review process. By rigorously following the systematic review methodology outlined by Kitchenham and Charters [8], our study ensures a robust and unbiased selection and analysis of relevant studies. We conducted a comprehensive search across various databases, employing well-defined search strings aligned with our research questions. Each study underwent careful quality assessment, and strict exclusion criteria were applied to ensure the inclusion of only the most relevant and high-quality literature. This systematic approach minimizes potential biases and ensures that our review is based on a well-rounded selection of literature.

3 Review questions

As outlined in [8], specifying the research questions is a critical aspect of any systematic review, as they guide the entire methodology: from the search process identifying primary studies to address them, to the data extraction process extracting the required data items, and finally to the data analysis synthesizing the data to answer the questions. The review questions for this study are presented in Table 2.

Table 2 Primary and secondary research questions used for this SLR

The first research question, RQ1, aims to identify the methods used to recognize activities or detect falls among elderly individuals. The choice to specifically investigate HAR and fall detection stemmed from an exploratory initial search, where they emerged as the two most relevant recognition tasks in AAL for the elderly. Given that visual data offers numerous advantages over other sensor data types, such as visual verification and simultaneous subject recognition, and DL has become the state-of-the-art approach in computer vision, conducting an in-depth analysis of the most prevalent methods with these characteristics is crucial for informing future research in this domain. Furthermore, three research subquestions are included regarding common data types (e.g., RGB, depth, thermal, etc.), DL architectures (e.g., CNN, RNN, etc.), and datasets found in the reviewed literature. These subquestions aim to delve deeper into the solution choices at different design steps, which are closely related to various requirements such as privacy preservation, result stability, and inference speed.

The second research question, RQ2, emerges as a significantly unexplored area, as highlighted in Table 1 of the Background section. Many previous reviews have focused on the recognition phase of previous studies, enumerating common methods, processing steps, and datasets. However, the effective deployment in real-world scenarios is pivotal for the transfer of such methods to society, and this aspect remains largely unexplored. Works with implementations in real environments, whether through the use of assistive robots or camera-based setups, are expected to be found among the selected studies. Therefore, it is desirable to explore their design choices, setups, and encountered challenges in greater depth. Additionally, privacy is a particularly concerning aspect to consider when dealing with users, especially when utilizing visual data from cameras, and the approaches to addressing it are of interest for future research. For these reasons, RQ2.1 and RQ2.2 delve into common hardware choices and privacy preservation strategies.

4 Review methods

In this section, we provide a detailed description of the systematic review protocol followed, based on the guidelines outlined by Kitchenham and Charters [8]. Firstly, we list and analyze the primary data sources used, providing visualization of the distribution of studies among these sources. Next, we define the search strategy, which encompasses search terms, synonyms, and time restrictions. Following this, we establish criteria for inclusion and exclusion of studies, followed by the design of a quality assessment checklist to identify and remove low-quality studies. Finally, in the data extraction and synthesis stage, we define how information from each primary study is obtained and outline the specific attributes considered of interest.

4.1 Data sources

For this systematic review, we selected five primary data sources: SCOPUS, Web of Science (WOS), IEEE Xplore Digital Library, ACM Digital Library, and PubMed.

SCOPUS and WOS were chosen as comprehensive digital libraries covering a wide range of disciplines, while IEEE Xplore focuses on engineering and technology, ACM Digital Library specializes in computer science, and PubMed is centered on biomedical studies. This selection ensures the inclusion of relevant literature from diverse domains, maximizing the breadth of content considered in our review.

The distribution of studies retrieved from each source is illustrated in Fig. 1. As depicted, the majority of studies were sourced from ACM and SCOPUS, with only a small fraction (110 out of a total of 2,616) obtained from PubMed.

Fig. 1 Number of publications obtained from each database, before duplicate removal and study selection (2,616 in total)

4.2 Search strategy

We constructed a query string for each digital library, adapting it to that library's syntax while keeping the differences minimal and using the same synonyms for each concept. Within each query, the concepts were joined with logical AND, and the synonyms for each concept with logical OR. To account for inflected forms of certain keywords, we appended the “*” wildcard to the root word so that any word ending would match. In SCOPUS, the search was restricted to titles, abstracts, and keywords, because retrieving results from a full-text search was impractical and most of those results were of little relevance. In the remaining databases, the full text was searched. The primary concepts searched, along with their corresponding lists of synonyms, are as follows:

  • Task to perform (activity recognition or fall detection): “action recognition” OR “activit* recognition” OR “fall* detection” OR “behaviour recognition” OR “behaviour detection” OR “physical activity recognition”

  • Ambient Assisted Living: “monitoring” OR “assist* living” OR “AAL” OR “smart home” OR “activit* of daily life” OR “activit* of daily living” OR “ADL”

  • Target collective (elderly people): “elder*” OR “old* people” OR “senior”

  • Kind of data used (Computer Vision): “vision” OR “rgb” OR “video” OR “image” OR “skeleton” OR “depth” OR “camera” OR “gesture”
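The AND/OR structure described above can be illustrated with a short Python sketch. It only assembles a generic boolean string from the four synonym lists; the function name build_query and the dictionary keys are assumptions introduced for illustration, and the field restrictions and syntax details specific to each digital library are not reproduced.

```python
# Concept groups copied from the list above. Synonyms inside a group are
# joined with OR, and the groups themselves are joined with AND.
# The dictionary keys are illustrative labels, not part of any query syntax.
CONCEPTS = {
    "task": ["action recognition", "activit* recognition", "fall* detection",
             "behaviour recognition", "behaviour detection",
             "physical activity recognition"],
    "aal": ["monitoring", "assist* living", "AAL", "smart home",
            "activit* of daily life", "activit* of daily living", "ADL"],
    "target": ["elder*", "old* people", "senior"],
    "data": ["vision", "rgb", "video", "image", "skeleton", "depth",
             "camera", "gesture"],
}


def build_query(concepts):
    """Return a generic boolean query of the form (a OR b) AND (c OR d)."""
    groups = []
    for synonyms in concepts.values():
        quoted = " OR ".join(f'"{term}"' for term in synonyms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(CONCEPTS)
print(query)
```

In practice, the resulting string would still need to be adapted by hand to each library, for example to restrict the search to titles, abstracts, and keywords.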

Initially, we included studies published from 2013 onwards in the search. However, upon further examination, we observed that the majority of relevant studies were published recently. Consequently, we decided to limit the review to the last five years. Figure 2 displays the accumulated relevant studies from 2013 to 2023. As depicted, only 19 relevant articles were found during the first six years, while 151 were discovered in the last five. This trend underscores the increasing significance of DL-based strategies for HAR and fall detection. By focusing on studies published in the last five years, we aim to gain a deeper analysis of recent trends.

4.3 Study selection

Fig. 2 Accumulated publications from 2013 to 2023 (both included) after study selection and quality assessment

After collecting studies from various sources, limiting by year, and removing duplicates, exclusion criteria were applied to eliminate non-relevant studies. The exclusion criteria were as follows:

  • Deep Learning: Studies not utilizing DL were considered irrelevant for this review. Including this criterion in the exclusion criteria rather than in the query strings enabled the inclusion of more relevant studies, since many studies did not directly reference DL but instead used the name of a specific model.

  • Language: Studies not in English or Spanish were excluded.

  • Data Type: Studies using data types other than RGB, depth, or IR were excluded. This includes both videos and images. Skeleton data was also included, but only if computed from the other three types of data. Studies using sensory data along with visual data were also included, allowing for multimodal approaches.

  • Accessibility: Studies not accessible for various reasons, such as being part of paid content (e.g., book chapters), source website down, or retracted content, were excluded.

  • Redundancy: In cases where a journal article extended a work already presented in a conference, the conference proceedings publications were omitted, as the journal article represented an extension of the same work.

  • Task: Studies focused on tasks other than HAR or fall detection, such as velocity estimation, gait trend, level of tiredness, etc., were excluded. However, studies that did not directly perform HAR or fall detection but presented a new dataset for these tasks were included.

  • Target Collective: Studies not centered on elderly people were excluded. Merely mentioning the elderly as one of the beneficiaries of the work was insufficient; the study had to either use data from elderly people or have them in mind when designing the experiment.

  • Works in Progress: Conference proceedings about works in progress, containing only the initial stages of the study and lacking the experimentation phase, were excluded.

  • Quality: Publications with very poor quality (e.g., null reproducibility, highly biased decisions, too small datasets, etc.) were excluded. More information about quality assessment can be found in Section 4.4.

Fig. 3 Articles collected at each phase of the systematic search process, including acquisition from various sources, removal of duplicates, and study selection, which also involved quality assessment

The results of study collection and duplicate removal are illustrated in Fig. 3. A total of 2,616 studies were collected from the different sources using the aforementioned queries, of which 633 duplicates were detected and removed, leaving a total of 1,983 studies.

As depicted in Fig. 3, only 151 studies remained after applying the exclusion criteria, comprising 89 conference proceedings and 64 journal articles. Conference proceedings were retained among the relevant studies because including them is a standard search strategy for addressing publication bias, which can lead to systematic bias in systematic reviews unless special efforts are made to counter it [8].

4.4 Quality assessment

Given the absence of a universally agreed-upon definition of study “quality,” the proposed guidelines in [8] were adhered to, primarily focusing on bias and validity as measures of quality. Specifically, the following aspects were taken into account:

  • Reproducibility: Assessing whether the work can be replicated. This can be achieved by disclosing the dataset used, using external datasets, and either publishing the code used for the model or providing sufficient details to recreate the model.

  • Comparison with Other Works: Evaluating whether the performance of the model is compared with the state-of-the-art. It is essential that comparisons are made under fair conditions, meaning that the models should be trained and tested on the same data to avoid introducing bias.

  • Use of External Datasets: Considering whether the model is tested on external datasets to mitigate possible bias from the data and facilitate comparison with other models for the same task. Additionally, using external datasets allows other studies to utilize the results without the need to retrain the model on different data.

These aspects were included in the list of fields during the data extraction phase (last three fields), as discussed in Section 4.5. Moreover, these quality aspects were also used as exclusion criteria, as previously mentioned in Section 4.3.

In addition to these aspects, the type of study, either conference proceedings or journal articles, was also considered as a quality indicator, with journal articles typically being longer and more mature.

4.5 Data extraction and synthesis

From each study remaining after applying the exclusion criteria, various data points were extracted to summarize the content and establish taxonomies for various aspects of interest. All data were compiled into a table, with each entry containing the following fields:

  • Title

  • Author/s

  • Type (journal article or conference proceedings)

  • Publication year

  • Task (HAR or fall detection)

  • Data type (RGB, depth, IR or skeleton)

  • Auxiliary sensor data type (accelerometers, gyroscopes, etc.)

  • Camera used

  • Dataset (name of external dataset/s or “custom”)

  • DL model/s and task (skeleton joints estimation, feature extraction, classification, etc.)

  • Other ML models or computer vision techniques used

  • System integration in a robot (yes/no) and which one

  • System integration in a framework (yes/no)

  • How is privacy preserved? (depth or IR only, low resolution, etc.)

  • Reproducible (yes/no)

  • Test with external datasets

  • Comparison with other approaches
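The extraction table can be pictured as a collection of typed records, one per study. The following Python sketch is only one possible encoding of the fields listed above: the class name StudyRecord, the attribute names, and the quality_flags helper are assumptions made for illustration, and the example entry is entirely made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StudyRecord:
    """One row of the extraction table; attribute names are illustrative."""
    title: str
    authors: List[str]
    publication_type: str          # "journal article" or "conference proceedings"
    year: int
    tasks: List[str]               # "HAR", "fall detection", or both
    data_types: List[str]          # "RGB", "depth", "IR", "skeleton"
    datasets: List[str]            # external dataset names or "custom"
    dl_models: List[str]
    auxiliary_sensors: List[str] = field(default_factory=list)
    camera: Optional[str] = None
    other_techniques: List[str] = field(default_factory=list)
    robot: Optional[str] = None    # None means no robot integration
    framework_integration: bool = False
    privacy_strategy: Optional[str] = None
    reproducible: bool = False
    tested_on_external_datasets: bool = False
    compared_with_other_approaches: bool = False

    def quality_flags(self) -> int:
        """Count how many of the three quality-related fields are satisfied."""
        return sum([self.reproducible,
                    self.tested_on_external_datasets,
                    self.compared_with_other_approaches])


# Entirely made-up entry, used only to show the structure of a record.
example = StudyRecord(
    title="Hypothetical fall detector",
    authors=["A. Author"],
    publication_type="conference proceedings",
    year=2021,
    tasks=["fall detection"],
    data_types=["RGB", "skeleton"],
    datasets=["custom"],
    dl_models=["LSTM"],
    reproducible=True,
)
print(example.quality_flags())
```

Keeping the three quality-related fields as booleans makes it straightforward to tabulate them later, although the actual assessment in a review relies on reading each study rather than on a simple count.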

The complete list of relevant studies is provided in Table 3, which displays only basic information for each study. The remaining information will be synthesized in Section 5 through tables and plots, allowing for an overview of the distribution of works by used data types, DL model families, datasets, etc. Additionally, particularly relevant or interesting aspects of the works will be summarized, and important concepts will be addressed in more detail.

5 Results

This section provides an overview of the primary studies discovered through the systematic search process and presents the findings. Each study is thoroughly examined, and summaries are presented in the form of tables and graphs where applicable. Subsections are structured to address individual research questions, enhancing readability and organization.

Table 3 Full list of relevant studies examined in this systematic review

5.1 RQ1: fall detection and human activity recognition

The review primarily focuses on two main tasks: fall detection and Human Activity Recognition (HAR). It is worth noting that fall detection can be viewed as an especially important activity within HAR. As illustrated in Fig. 4, fall detection has received the most attention in the past five years, with a total of 72 studies, while HAR has been explored in 52 studies. This discrepancy highlights the significance of fall detection for the elderly population. Many works emphasize the importance of accurately and swiftly identifying falls among the elderly, given the potential for injuries and health implications if prompt action is not taken. Consequently, several studies mention integrating fall detection into systems or applications capable of alerting medical personnel [41, 49].

Only 27 out of 151 studies (approximately 18%) address both tasks simultaneously. This disparity arises from the emphasis placed on fall detection compared to other activities (such as walking or standing up), as well as the limited availability of data concerning fall scenarios, which often results in an imbalanced problem. However, some studies manage to address both tasks. For instance, in [24, 67, 104], both tasks are addressed using the UP-FALL dataset [168], which includes five types of falls and six common activities. This balanced dataset preserves the importance of accurately detecting falls amidst other activities. A similar approach is adopted in [106], where a custom dataset with egocentric videos is utilized. Nevertheless, there are studies that treat falls as just another activity to recognize [30, 47, 161].

5.2 RQ1.1: data type

Among the studies collected, three types of vision data were considered: RGB, depth, and infrared (IR). The distribution of these data types is illustrated in Fig. 5. RGB data were the most prevalent for fall detection and HAR among the elderly (132 studies), followed by depth data (30 studies), with IR data being the least utilized (6 studies). This discrepancy can primarily be attributed to the accessibility of common cameras compared to specialized ones equipped with depth or infrared sensors. Additionally, RGB cameras offer benefits such as lower costs and easier visual data inspection. Infrared cameras are the least frequently employed; they are typically positioned overhead (top-down perspective) and have very low resolutions, which allows for the use of simpler CNN models [45, 122], as well as non-convolutional models like LSTM [119, 159] and Transformer [119]. Depth cameras are more commonly used than infrared ones, although they are often employed to extract skeleton joints rather than to perform fall detection and HAR directly. Specifically, 67% of studies utilizing depth data computed skeleton joints before classification [38, 152], while the remaining 33% did not [20, 148, 155].

Fig. 4 Distribution of studies by target task: fall detection, HAR or both

Fig. 5 Distribution of studies by data type used. Note that more than one data type was used in some studies

Skeleton poses and sequences emerged as prevalent data types across the reviewed studies, with 67 studies incorporating skeleton data in some form. Given the human-centric nature of HAR and fall detection tasks, skeletal data represent logical features, offering efficient information compression while maintaining interpretability. Skeletons are typically represented as ordered sets of coordinates of body landmarks, either in 2D [24, 61, 107] or 3D [22, 38, 152] positions, depending on whether they were estimated from RGB or depth data, respectively. When skeleton estimation is performed on videos, the result is a sequence of skeleton poses with an added temporal dimension, enabling exploration of pose evolution over time intervals. 35 studies employed the evolution of one or more body landmarks for fall or HAR recognition [49, 107], while the remaining 32 studies performed recognition using static poses exclusively [42, 118].
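To make the representation concrete, the following Python sketch stores a 2D skeleton sequence as a list of poses, each pose being an ordered list of (x, y) landmark coordinates, and derives the vertical speed of one landmark over time. The type aliases, function names, landmark indices, and coordinate values are all assumptions chosen for illustration and do not correspond to any particular pose estimator or reviewed study.

```python
from typing import List, Tuple

Pose2D = List[Tuple[float, float]]   # one (x, y) pair per body landmark
PoseSequence = List[Pose2D]          # one pose per video frame


def landmark_trajectory(sequence, landmark):
    """Return the (x, y) positions of a single landmark across all frames."""
    return [pose[landmark] for pose in sequence]


def vertical_speed(sequence, landmark, fps):
    """Frame-to-frame change of the y coordinate, scaled to pixels per second.

    In image coordinates y grows downward, so positive values mean the
    landmark is moving toward the bottom of the frame.
    """
    ys = [y for _, y in landmark_trajectory(sequence, landmark)]
    return [(b - a) * fps for a, b in zip(ys, ys[1:])]


# Toy sequence with made-up values: 3 frames, 2 landmarks
# (index 0 = head, index 1 = hip in this hypothetical layout).
toy_sequence = [
    [(50.0, 100.0), (50.0, 200.0)],
    [(52.0, 110.0), (50.0, 205.0)],
    [(60.0, 150.0), (52.0, 230.0)],
]
head_speed = vertical_speed(toy_sequence, landmark=0, fps=10.0)
print(head_speed)
```

A method working on static poses would consume each inner list on its own, whereas a method exploiting temporal evolution would use trajectories or derived quantities such as the speed computed here.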

In addition to vision data, some studies utilized sensor data to enhance system performance, employing different models or strategies for classification and subsequently fusing the results. Fourteen studies, listed in Table 4, utilized at least one of the following types of sensor data: Inertial Measurement Unit (IMU), audio, barometer, luminosity, radar, electrocardiogram (ECG), GPS, and network traffic. IMU data was the most commonly used, featured in 10 of the 14 studies, particularly for fall detection (in 6 out of 10 studies using IMUs), owing to its effectiveness in identifying abrupt movements and subsequent immobility [64, 149]. Barometer, GPS, radar, luminosity, and ECG data were consistently employed in conjunction with IMU data. Barometer and luminosity data served to acquire auxiliary or redundant information to enhance recognition consistency [74, 149]. ECG data in [47] was utilized to identify inconsistencies in recognition and trigger specific further computations. In [85], four types of data (IMU, audio, radar, and GPS), along with visual data, were used for federated learning, where independent models were trained using different data modalities.

Table 4 Studies using multi-modal approaches and type of fusion with visual data

Regarding data fusion, no instances of early fusion were found. Instead, intermediate (7 studies) and late (5 studies) fusion methods were prevalent. Late fusion involved using a model for each data modality to produce a classification result, with the final classification determined using either voting [74, 89] or weight attribution methods [44, 69, 149]. In intermediate fusion, different models extracted features from various modalities, with a final model performing classification based on concatenated feature inputs. Various final models were utilized, including CNN [64], fully connected layers [54, 72], SVM [99], and stacked classifiers [77]. In two studies, no fusion was performed, with different options provided for classification using distinct data modalities [47, 127].
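The difference between the two fusion families can be sketched in a few lines of Python. The functions below are a simplified illustration under stated assumptions: the function names, class labels, probabilities, and weights are hypothetical, and the final classifier that would consume the concatenated features in intermediate fusion is omitted.

```python
from collections import Counter


def late_fusion_vote(labels):
    """Late fusion by majority voting over per-modality predicted labels."""
    return Counter(labels).most_common(1)[0][0]


def late_fusion_weighted(probabilities, weights):
    """Late fusion by weighted averaging of per-modality class probabilities.

    `probabilities` is a list of dicts mapping class name to probability,
    one dict per modality; `weights` holds one weight per modality.
    """
    total = sum(weights)
    classes = probabilities[0].keys()
    fused = {
        c: sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in classes
    }
    return max(fused, key=fused.get)


def intermediate_fusion(features_per_modality):
    """Intermediate fusion: concatenate feature vectors from each modality.

    The concatenated vector would then be passed to a final classifier,
    which is not shown here.
    """
    return [x for features in features_per_modality for x in features]


# Made-up outputs of a vision model and an IMU model.
vote = late_fusion_vote(["fall", "no_fall", "fall"])
weighted = late_fusion_weighted(
    [{"fall": 0.4, "no_fall": 0.6}, {"fall": 0.9, "no_fall": 0.1}],
    weights=[1.0, 2.0],
)
joint_features = intermediate_fusion([[0.1, 0.2], [0.3]])
print(vote, weighted, joint_features)
```

In the weighted example the IMU model is trusted twice as much as the vision model, which is why the fused decision follows the IMU prediction; how such weights are chosen differs between studies and is not specified here.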

Table 5 DL models utilized in the reviewed studies, tasks they are employed for, input data they process, and number of studies in which they are featured

5.3 RQ1.2: DL models

Table 5 provides a summary of all DL models utilized in the analyzed studies. These models are often employed for various specific tasks, including skeleton joints estimation, optical flow computation, and feature extraction. Moreover, the input data for these models encompasses not only images or videos but also features frequently computed by other DL models, such as 2D or 3D skeleton poses and optical flow. A taxonomy of the identified DL models, based on different characteristics, is presented in Fig. 6, offering a total count for each category. There is considerable diversity in how these models are used, in terms of the datasets used for evaluation, the data types, and the methodology. As shown in Table 6, a wide range of datasets was employed across the analyzed studies, with many utilizing custom datasets. Additionally, prominent datasets like URFD and UP-FALL offer various data types, including RGB recordings, depth, skeletons, and accelerometers, among others, so the input data may differ even when studies are evaluated on the same dataset. The methodology for training and testing DL methods also varies across studies, with some employing k-fold cross-validation, others leave-one-out cross-validation, and others no cross-validation at all. Consequently, due to the lack of standardized conditions for a fair comparison, quantitative metric results were not included in Table 5.
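To illustrate why the evaluation protocol matters for comparability, the following Python sketch generates the train/test index splits for k-fold and leave-one-out cross-validation. It is a minimal sketch using contiguous, unshuffled folds; the function names are assumptions, and the reviewed studies may split their data differently (for example, by subject rather than by sample).

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def leave_one_out_indices(n_samples):
    """Leave-one-out is the special case of k-fold with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(6, 3))
loo = list(leave_one_out_indices(4))
print(folds[0], loo[2])
```

Two studies reporting accuracy on the same dataset, one with 3 folds and one with leave-one-out, train on different fractions of the data in each round, which is one reason their figures are not directly comparable.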

As mentioned in Section 5.2, many studies utilize skeleton joints as features for fall detection and HAR. To estimate these joints, various DL models are employed, with OpenPose [169] and AlphaPose [170] being the most prevalent (appearing in 25 and 10 studies, respectively). OpenPose utilizes a non-parametric representation (referred to as Part Affinity Fields) to detect skeleton joints from all humans in the image simultaneously, while AlphaPose performs human detection first and then predicts the skeleton joints for each individual. Subsequently, multiple models are used for fall detection and HAR with these skeleton joints:

  • Recurrent Networks: Long Short-Term Memory (LSTM) [93, 95, 180] and Gated Recurrent Unit (GRU) [95, 114] are commonly used, with others grouped as RNN [40, 66].

  • Graph-Based Networks: The Graph Convolutional Network (GCN) was the only graph-based model found; it treats skeletons as graphs rather than sequences [44, 107, 112]. Additionally, graph-based networks have the potential to perform collective activity recognition by leveraging interactive relations [213].

  • Convolutional Networks: Various models like VGG architectures [127, 146, 165], MobileNet [42, 81, 141], ResNet family [81, 141], among others [118, 150, 156], are employed.

Fig. 6 Taxonomy of the DL techniques used in the found studies. The number of studies where each category was used is displayed in bold. Note that multiple models were used in many studies, and hence the same study can be counted in more than one category

Only one DL model, LiteFlowNet, was used for optical flow estimation across the studies [149]. However, 11 additional studies utilized optical flow at some stage of the recognition pipeline through non-DL-based methods [50, 55, 56, 58, 70, 83, 96, 109, 128, 146, 155].

Object detection was a prevalent task in the reviewed studies (found in 35 studies), with models from two families: R-CNN [190] and YOLO [191]. R-CNN involves a multi-step process including region proposal, feature extraction, object classification, bounding box regression, and non-maximum suppression. Conversely, YOLO focuses on real-time object detection with a single pass through the image. Both families received several improvements in later versions. These models were utilized for various purposes across the studies:

  • Obtaining a sequence of bounding boxes from scene objects, which can serve as features in next steps [46, 52, 99].

  • Triggering computation of fall detection or HAR upon detection of human presence, saving computation time [100, 105].

  • Reducing data complexity by putting the focus on the target person [27, 102, 161].

  • Getting features from the humans in the scene, like height-to-width ratio, used for fall detection or HAR in further steps [31, 78, 136].

  • Direct detection of falls or recognition of activities [25, 76, 139].
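As a rough illustration of the bounding-box features mentioned in the list above, the following Python sketch computes a height-to-width ratio per detection and flags boxes that are wider than they are tall. The `Box` type, its `(x_min, y_min, x_max, y_max)` layout, and the threshold of 1.0 are assumptions made for illustration; they are not taken from any of the reviewed studies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned person detection in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def height_to_width_ratio(box: Box) -> float:
    """Return height / width of the box; reject degenerate boxes."""
    width = box.x_max - box.x_min
    height = box.y_max - box.y_min
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    return height / width


def lying_candidates(boxes: List[Box], threshold: float = 1.0) -> List[bool]:
    """Flag detections whose ratio falls below the threshold (wider than tall).

    The default threshold is an illustrative assumption, not a published value.
    """
    return [height_to_width_ratio(b) < threshold for b in boxes]


# A tall box (standing person) followed by a wide box (person on the floor).
track = [Box(100, 50, 160, 230), Box(80, 200, 260, 250)]
ratios = [height_to_width_ratio(b) for b in track]
flags = lying_candidates(track)
```

In practice such a ratio would be one feature among several passed to a later classifier, rather than a fall detector on its own.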

Additionally, object segmentation plays a crucial role in several studies. The most commonly used model is Mask R-CNN [214], which extends the capabilities of the R-CNN family to object segmentation. Another notable model is PointRend [192], a neural network module that enhances the granularity of segmentation models by treating image segmentation as a rendering problem. In addition, a novel model named MSSkip is proposed in [67] to address object segmentation as part of the processing pipeline for fall detection and post-fall classification. MSSkip builds upon common ideas from other segmentation models but incorporates multi-scale skip connections and depth-wise separable convolutions in the decoder to minimize computation. Object segmentation serves various purposes in the reviewed studies: in [103], averaged output masks are used as spatio-temporal features for further recognition steps; [88] performs direct classification into fall or not fall based on the segmentation of fallen individuals; segmentation masks are fed to a convolutional LSTM in [67] and to a CNN followed by an LSTM in [153] to extract spatial and temporal features for fall detection; and in [108], segmentation masks are input to different machine learning models to identify falls. Finally, in [41], segmentation is used solely to anonymize images before feeding them to an autoencoder for fall detection.
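To make the computational saving of depth-wise separable convolutions concrete, the sketch below compares the number of weights in a standard convolution with that of its depth-wise separable counterpart (biases ignored). The layer sizes are arbitrary illustrative values and do not correspond to the MSSkip configuration, which is not detailed here.

```python
def standard_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise separable convolution.

    Depth-wise step: one k x k filter per input channel.
    Point-wise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_weights(3, 64, 128)
separable = separable_conv_weights(3, 64, 128)
reduction = standard / separable
```

For this example layer the standard convolution has 73,728 weights and the separable version 8,768, a reduction of roughly 8.4 times.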

Table 6 Comprehensive list of publicly available datasets used in the reviewed studies, along with their basic specifications

Beyond the DL-computed features described above, other features are predominantly computed using convolutional models such as VGG-16 and VGG-19 [27, 143, 153], ResNet [55, 145], or InceptionV3 [157]. Less frequently, non-DL-based features like Histograms of Oriented Gradients (HOG) [55, 134], Local Binary Patterns (LBP) [55, 86], and Bag of Words (BoW) [50, 138] are also used. Following feature extraction, various DL models are employed for classification, although at this stage it is also common to use non-deep machine learning models such as Support Vector Machine [55, 58, 106, 136], Random Forest [39, 55], Decision Tree [39, 106], and KNN [106].

Furthermore, fall detection is frequently approached as a normal/abnormal classification task in the reviewed studies, with normal activities modeled and falls treated as abnormal data. This involves performing feature extraction, either using pre-trained models to extract spatio-temporal features from video/images or utilizing estimated skeleton joints, followed by training a model to identify normal activities. Various approaches are employed for this task, such as utilizing an MPED-RNN network on skeletal data [94], employing DeepFall on multiple data modalities (RGB, depth, and IR) [20], using autoencoders after obtaining spatio-temporal features from other networks [41, 111, 144], and employing Generative Adversarial Networks (GANs) by utilizing the discriminator as the normal/abnormal classifier [86, 103].
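The normal/abnormal formulation can be sketched as follows. A model is fitted on normal activities only, and a sample is flagged when its score exceeds a threshold derived from the normal data. Note the swap: the reviewed studies use autoencoders, GAN discriminators, or recurrent networks as the model, whereas this sketch substitutes a trivial stand-in (distance to the mean normal feature vector) so that the thresholding logic stays visible. The feature values, the function names, and the factor `k = 3.0` are assumptions made for illustration.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def fit_center(normal_features: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension mean of normal samples (stand-in for a trained autoencoder)."""
    dims = len(normal_features[0])
    return [mean(v[d] for v in normal_features) for d in range(dims)]


def anomaly_score(center: Sequence[float], features: Sequence[float]) -> float:
    """Euclidean distance to the normal center (stand-in for reconstruction error)."""
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(features, center)))


def fit_threshold(center: Sequence[float],
                  normal_features: Sequence[Sequence[float]],
                  k: float = 3.0) -> float:
    """Threshold = mean + k * std of the scores seen on normal data (k is assumed)."""
    scores = [anomaly_score(center, v) for v in normal_features]
    return mean(scores) + k * stdev(scores)


# Made-up two-dimensional features for normal activities and one fall-like sample.
normal = [[1.0, 0.1], [1.1, 0.0], [0.9, 0.2], [1.0, -0.1]]
center = fit_center(normal)
threshold = fit_threshold(center, normal)
fall_like = [0.2, 2.5]
is_fall = anomaly_score(center, fall_like) > threshold
```

The appeal of this formulation is that no fall samples are needed for training, which matters given how scarce genuine fall recordings are.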

Finally, the choice of architecture in the analyzed studies often depends on the data dimensionality, with recurrent neural networks (RNNs) primarily used when considering the temporal dimension and feedforward neural networks (FFNNs) when not. RNNs are well-suited for problems involving sequential data because their internal state retains information from earlier inputs. As such, they are often employed for fall detection and activity recognition from skeleton sequences [49, 180] and feature sequences computed frame-wise by CNNs [24, 143, 148]. While CNNs are commonly used for extracting visual features from images, transformers have also been used within the FFNN category, particularly for tasks involving low-resolution images [119], 3D skeleton data [81], and video by adapting Vision Transformer (ViT) [215] to video formats [53, 79]. Additionally, multilayer perceptrons (MLPs) are employed for skeleton data [34, 38, 92, 140] or visual features [108].

5.4 RQ1.3: Datasets

Table 6 provides a comprehensive list of datasets used in the reviewed studies for activity recognition and fall detection. Emphasizing the importance of reproducibility and comparability, only publicly available datasets are included, aiming to facilitate future research in the field. Each dataset is categorized based on several common characteristics:

  • Elderly: Despite fall detection and activity recognition often targeting elderly individuals, only a small fraction of datasets (12%) include samples from this demographic. This scarcity highlights the challenge of collecting real-life data from the elderly population, especially genuine fall incidents.

  • Falls: The majority of datasets (58%) include falls as a class, with 23% specifically focusing on binary classification between fall and not fall activities, underscoring the significance of this task in eldercare.

  • Type: Video data is predominant (85% of datasets), aligning with the temporal nature of activities like falls, where temporal context is crucial for accurate recognition. Furthermore, video allows for the rapid acquisition of a large quantity of images in the form of frames, which can then be utilized by data-driven solutions, such as DL-based methods.

  • Data types: While RGB data is ubiquitous, depth frames, skeleton joints, and inertial data are found in 38%, 29%, and 13% of datasets, respectively. Other data types such as infrared data and motion history volumes (MHV) are less common. The presence of RGB data in all datasets allows for the discovery of the exact conditions of the recordings (environment, perspective, users, etc.) and serves as a visual check of the data, a feature not offered by other types of data.

  • Samples: Dataset sizes vary significantly, ranging from fewer than 50 samples (e.g., FDD-Chen) to over 500,000 samples (e.g., Kinetics 700-2020), reflecting the diversity in data availability.

  • Classes: The number of classes also varies widely, from binary classification to datasets with hundreds of classes, though the latter are typically not focused on AAL.

  • Studies: Half of the datasets are utilized in only one study, while only five are used in more than ten studies, indicating varying degrees of dataset popularity and usage.

The University of Rzeszow Fall Detection (URFD) dataset [216] stands out as the most extensively used, featuring in 40 studies [41, 89, 153]. Focused on fall detection, URFD offers 70 sequences capturing falls and activities of daily living (ADL) from two perspectives, along with various data modalities including RGB, depth, skeleton joints, and inertial data. The UP-FALL dataset [168], appearing in 17 studies [24, 39, 103], provides data from 17 subjects performing 11 activities, offering RGB video, infrared images, and inertial data for both fall detection and human activity recognition (HAR). In contrast, the Le2i dataset [217], used in 16 studies [47, 93, 137], focuses solely on fall detection, featuring 143 videos with falls and 48 with normal activities, with varying actors, scenery characteristics, and illumination conditions. Similarly, the MultiCam dataset [218], utilized in 16 studies [27, 30, 72], provides RGB video from 24 sequences captured from eight perspectives, facilitating the study of falls and confounding events. The NTU RGB+D dataset [219], used in 14 studies [112, 118, 131], offers a vast collection of samples from 40 subjects performing 60 activities, recorded using Kinect cameras, thus providing RGB video, depth images, and skeleton joints. An extended version of this dataset also exists: the NTU RGB+D 120 dataset [230], which expands upon it by adding 60 additional classes. However, it is only utilized in two of the reviewed studies [107, 135]. The remaining datasets were utilized fewer than 10 times, with approximately half of them being employed in only one study.

While most datasets are collected from real environments, two exceptions are noted: [101] and [135], offering synthetic images and videos, respectively. Despite the advantages of synthetic data, such as ease of acquisition and controlled conditions, models trained solely on synthetic data may lack adaptability to real-world scenarios.

Notably, some studies opted for custom datasets instead of utilizing existing ones. Figure 7 illustrates the proportion of studies using custom, external, or both types of datasets. Only 19 studies provided evaluations on both custom and external datasets, with a greater frequency of evaluations conducted solely on external datasets (86 studies) compared to those exclusively using custom datasets (46 studies).
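As a quick consistency check on the figures above, the short sketch below verifies that the three groups add up to the 151 reviewed studies and converts the counts into percentage shares. Only the counts come from the text; the dictionary keys are labels chosen here.

```python
# Counts of studies by evaluation data, as reported in the paragraph above.
counts = {"external only": 86, "custom only": 46, "custom and external": 19}

total = sum(counts.values())

# Share of each group as a percentage of all reviewed studies, to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The shares come out at roughly 57% (external only), 30% (custom only), and 13% (both).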

Fig. 7

Distribution of studies by dataset used

5.5 RQ2: Framework integration

In 18 of the reviewed articles, frameworks were proposed to integrate the tasks of HAR or fall detection into real environments, addressing various aspects such as security, utilization of cloud services, client-server configuration, network communications, IoT devices, etc. Below, we provide brief descriptions of the proposed frameworks.

In [42], a custom robot is suggested to integrate the HAR task into the environment, alongside other functionalities like language processing to enable chatbot interactions. In [161], a camera system is employed to capture visual data, which is then sent to a central server for computation. Subsequently, notifications, reports, and alerts are dispatched to a designated “guardian”.

In [74], a Docker-based system is proposed to manage the flow between various programs involved in fall detection, distributing resources, and regulating communications. Docker is also utilized in [78], where the NAO robot is suggested for data acquisition and user interaction to prevent falls. In [30, 32], an intermediary step between recording and DL computation is introduced to preprocess video data and reduce bandwidth consumption.

In [18, 33, 49, 52, 58, 93, 105], the proposed frameworks collect visual data through camera monitoring systems, perform fall detection or activity recognition on a centralized server, and trigger responses according to the severity of the situation, such as contacting health services. For instance, [33] uses the third-party service ‘Twilio’ to send phone messages in case of a fall, while in [105], the system transfers recordings to a computer for human inspection upon fall detection.

In [123, 127], activity recognition results, along with recorded video data, are transmitted to a mobile application used for monitoring system users. Similar capabilities are offered in [63], with the addition of face blurring anonymization. [77] conducts all experiments in a connected environment, exploring the use of network traffic from multiple smart appliances combined with visual data to recognize various activities. Additionally, to assess the transferability of their approach across environments, they experimented with a smart residential apartment.

In [85], federated learning is employed to ensure privacy preservation of users. The system incorporates three sensor modalities (depth, mmWave radar, and audio) and was tested in the homes of 16 elderly subjects.
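The idea behind federated learning can be sketched with federated averaging (FedAvg), in which each home trains locally and only model weights, never recordings, are sent for aggregation. The aggregation rule actually used in [85] is not described here, so the choice of FedAvg, the function name, and the toy weights below are all assumptions made for illustration.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client model weights, weighting each client by its number of samples."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must share the same model shape")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Two hypothetical homes with flattened two-parameter models and 1 and 3 samples.
home_models = [[0.0, 1.0], [1.0, 3.0]]
home_sizes = [1, 3]
global_model = federated_average(home_models, home_sizes)
```

Raw sensor data never leaves the home in this scheme, which is the privacy property the framework relies on.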

5.6 RQ2.1: Hardware

A list of the hardware used in the reviewed studies (when mentioned) is presented in Table 7. Specialized cameras such as thermal, depth, and wearable cameras, as well as social assistive robots, were included. Information regarding datasets not created in the reviewed studies was excluded. Hardware related to computation or common RGB cameras was omitted due to the wide range of possibilities available in these areas.

Table 7 Special cameras and social robots found in the reviewed studies

For depth video retrieval, the most commonly used camera is the Microsoft Kinect (7 studies), followed by the Orbbec Astra Pro (3 studies), and Intel RealSense (1 study). These cameras share similar specifications, offering RGB-D recording using an IR camera for the depth channel, which provides accurate depth estimation at short distances. Additionally, they enable reliable 3D skeleton joint estimation.

There is less consensus in the use of thermal cameras, with multiple camera models employed. Consequently, there is considerable variation in the retrieved data, including differences in resolution, sensitivity to temperature, maximum and minimum effective distances, etc.

Only five studies deployed HAR or fall detection in an AAL system using a social assistive robot. Among these, two studies utilized the Pepper robot, one employed the NAO robot, and the remaining studies used custom-made robots.

5.7 RQ2.2: Privacy protection

Figure 8 illustrates the various privacy protection methods identified in the reviewed studies. Among the 151 studies reviewed, 75 did not address privacy concerns, opting for the use of unmodified RGB video or images of elderly users. Among the remaining studies, the majority employed skeleton data computed from RGB images, while four offered specific methods to anonymize RGB data, and others chose to utilize thermal or depth data instead.

Fig. 8

Distribution of studies by method used to preserve privacy. The total does not add up to 151 studies because in some studies different options were given

The most effective privacy-preserving methods avoid the deployment of RGB cameras in AAL settings. This is typically achieved through the use of visual data types that do not allow for subject identification, such as thermal and depth imaging. Among the collected studies, five exclusively employed thermal data [20, 45, 119, 122, 159]. In all cases, DL-based methods utilized CNNs to extract visual features and perform classification. Additionally, 21 studies utilized solely depth data, with 17 of them using it to estimate 3D skeleton poses, as demonstrated in [38, 43, 81, 152]. Notably, Microsoft Kinect was utilized in all 17 studies to estimate skeletons from depth maps through randomized decision forests [251], leaving RGB data unused for this estimation. Four studies exploited depth data without skeleton estimation, instead relying on the extraction of human silhouettes [148] and visual features using CNNs [20, 113, 117].

A total of 51 studies used RGB data at some stage and applied anonymization techniques to it. In contrast to the aforementioned studies, the input data used by these studies can identify subjects, as conventional video recording is involved at the beginning of the processing pipeline. Among these, 47 studies relied on 2D skeleton estimation methods like OpenPose [169] and AlphaPose [170] to protect privacy by discarding the visual data that could identify users, as illustrated in [24, 44, 104, 107, 137]. The remaining four studies protected privacy through other methods. In [144], an IR camera is used to detect the face region, which is then removed from the RGB frames. In [86], the RGB frames are modified in such a way that individuals cannot be identified, while fall detection can still be applied effectively. In [55], a wearable camera providing a first-person perspective is used to avoid recording the user of the system. In [41], human silhouettes are computed and used in subsequent recognition steps.
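Region-based anonymization of the kind described for [144] can be sketched as blanking a rectangular area of each frame. How the face region is located (an IR camera in [144]) is outside the scope of this sketch: the box is simply given. The frame representation as nested lists of RGB tuples, the function name, and the coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]


def blank_region(frame: Frame, top: int, left: int, bottom: int, right: int,
                 fill: Pixel = (0, 0, 0)) -> Frame:
    """Return a copy of the frame with rows [top, bottom) and columns [left, right) filled.

    The box is clipped to the frame bounds; the input frame is left untouched.
    """
    height = len(frame)
    width = len(frame[0]) if frame else 0
    top, bottom = max(0, top), min(height, bottom)
    left, right = max(0, left), min(width, right)
    out = [list(row) for row in frame]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = fill
    return out


# A tiny 4 x 4 frame of uniform color with a hypothetical face box at rows 0-1, columns 1-2.
frame = [[(200, 180, 160) for _ in range(4)] for _ in range(4)]
anonymized = blank_region(frame, top=0, left=1, bottom=2, right=3)
```

Unlike skeleton extraction, this keeps the rest of the image available for recognition, at the cost of weaker anonymization (body shape and clothing remain visible).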

6 Discussion

This section draws on the results and on the answers to the research questions to highlight common strengths and weaknesses of the reviewed studies. It also compiles a list of recommendations for future work based on the findings of this systematic literature review. Figure 9 summarizes the search process and the key findings from the reviewed studies.

Fig. 9

Summary of the search process and found results

6.1 Strengths and weaknesses of the reviewed studies

Upon reviewing the 151 relevant studies and addressing the research questions, the main strengths and weaknesses observed are discussed in this subsection, which we believe can provide valuable insights for future studies in the field.

A notable benefit of utilizing skeleton joints is their ability to significantly reduce data size compared to raw image or video data, while also offering user anonymization, maintaining data interpretability, and achieving satisfactory results in fall detection and HAR. Furthermore, there is a growing number of methods to derive human skeletons from RGB or depth data, with 13 different skeleton estimation DL models identified in the reviewed studies (as shown in Table 5).

The primary strength of studies employing only depth or infrared data lies in the privacy protection they afford, as RGB footage is not recorded at any point in the system pipeline. However, these studies also face two major weaknesses: a reduced amount of data for detection or recognition tasks, particularly pronounced in the case of IR recordings where resolution tends to be much lower, and less interpretable data, which may pose challenges when manual intervention is required to address errors.

Among the reviewed studies, 27 perform both fall detection and HAR tasks (refer to Fig. 4). This integration is particularly significant, as it is often desirable to detect accidental falls while conducting HAR on elderly individuals. It is important to note that while fall detection can be integrated as another class during HAR, it should be computed separately due to its critical nature. Therefore, most studies including fall detection implement it differently than the recognition of other classes.

Numerous studies have overlooked the temporal dimension when conducting HAR, thereby constraining the task significantly. This omission poses a significant weakness, particularly when incorporating activities that are challenging to distinguish without temporal data or are more effectively recognized with it, such as sitting/getting up or putting on/off clothes. Nonetheless, confining the analysis to spatial data typically offers the advantage of being faster and more straightforward.
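The sketch below illustrates why such activity pairs need temporal context. Sitting down and getting up pass through the same intermediate postures, so any single frame is ambiguous, whereas the direction of change over the sequence separates them. The hip-height values (in metres), the function name, and the `min_change` threshold are assumptions made for illustration.

```python
from typing import Sequence


def classify_transition(hip_heights: Sequence[float], min_change: float = 0.1) -> str:
    """Label a sequence of hip heights by the direction of its overall change."""
    change = hip_heights[-1] - hip_heights[0]
    if change <= -min_change:
        return "sitting down"
    if change >= min_change:
        return "getting up"
    return "static"


# The same postures observed in opposite temporal order (made-up values).
sitting_down = [0.90, 0.75, 0.60, 0.50]
getting_up = list(reversed(sitting_down))

labels = (classify_transition(sitting_down), classify_transition(getting_up))

# Frame by frame, both sequences contain exactly the same set of postures.
same_frames = sorted(sitting_down) == sorted(getting_up)
```

A purely spatial model sees identical frame sets for both activities here, so it has no basis for telling them apart.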

Regarding the choice of model architecture, convolutional models were found to be predominant. Their primary strengths lie in their effectiveness in processing spatial data and their extensive history, which has led to numerous improvements and architectural refinements across various fields and tasks. Given their suitability for image-based tasks, convolutional models are widely preferred and even have 3D versions tailored for video processing. In contrast, recurrent models excel in handling sequential data, thus complementing the capabilities of CNNs by facilitating the tracking of computed features across different frames. Multi-layer perceptrons, however, do not yield favorable results with spatial or sequential data; they are typically employed for classification based on computed features, akin to fully connected layers in a convolutional neural network. Transformer-based architectures, being relatively newer, are not as ubiquitous. Despite their promise in handling sequential and vision data, their large parameter count presents challenges in training and deploying them on low-specification systems. Nonetheless, they have showcased significant potential across various domains.

Given that fall detection and HAR for the elderly aim to assist this population in AAL settings, studies offering frameworks for deploying systems in real environments are of particular interest. Eighteen studies, described in Section 5.5, fall into this category.

Utilizing only an external dataset may impact the applicability of the technique to specific situations or environments, but it allows for the comparison of different methods on the same data. Conversely, relying solely on a custom dataset yields the opposite effects. The primary drawback associated with using only custom-made datasets is the external validity of the findings, as it becomes challenging to compare results with other studies, especially if the custom data is not disclosed. Including an evaluation on external datasets not only distinguishes studies from previous ones but also enables future studies to build on the obtained performance. While the majority of the reviewed studies evaluate on existing datasets for fall detection and HAR, 46 exclusively perform evaluations on new custom datasets (as depicted in Fig. 7), limiting the reliability of the results without comparisons with existing techniques or models. Conversely, 19 studies utilize both custom and external datasets, leveraging the strengths of each approach: specialization on custom data and comparison with other methodologies.

From a data perspective, three common weaknesses are evident in the datasets used: the absence of elderly individuals, a limited number of samples, and the inclusion of numerous classes unrelated to activities of daily living (ADL), which may render them less suitable for fall detection and HAR among elderly populations. First, the majority of datasets (88%) lack elderly participants, which presents challenges during deployment: elderly people are the target users of the system but are not represented in the training data. In this regard, datasets such as ETRIActivity3D, ToyotaSmartHome, MUVIM, and FPDS-Elderly would be more suitable. Second, a limited number of samples may prove insufficient for DL models to generalize effectively. Three of the four most widely used datasets contain fewer than 200 samples, while the remaining one contains fewer than 600, and approximately half of the datasets used contain fewer than 1,000 samples. In contrast, datasets such as ETRIActivity3D, NTU RGB+D (or NTU RGB+D 120), or ToyotaSmartHome, all offering more than 10,000 samples, would yield better generalization results. Lastly, datasets should focus on ADL rather than general HAR, to avoid classes that are unnecessary for monitoring elderly individuals in their daily lives. For instance, Kinetics (400, 600, or 700) or UCF101 would not be suitable for the considered tasks, as they comprise videos collected from the internet, potentially containing irrelevant activities and cuts.

6.2 Recommendations for future works

Based on the results of this SLR, we consider the following points particularly important when conducting new studies on the topic.

First and foremost, it is crucial to assess user privacy. As observed, the approach to privacy protection will likely influence the type of data used, ranging from conventional RGB data to modified RGB, depth, IR, or skeleton data, which prevent user identification in the footage. Therefore, we recommend considering privacy protection as a fundamental aspect from the outset of the study.

Selecting an appropriate DL model for fall detection and HAR requires consideration of the deployment conditions. For embedded systems or edge deployments, such as social robots or mobile applications, compact models such as MobileNet or EfficientNet, well-known CNNs specifically tailored for such devices, are preferred. These models can be combined with recurrent models like LSTM to accommodate temporal data. Conversely, if model size is not a constraint, 3D CNNs like I3D, TPN, TANet, SlowFast, and C3D are suitable for video data, while GCN can be applied to skeleton data. Alternatively, Transformer-based architectures like TimeSformer or VST are also an option for processing video input data.
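The recommendation above can be restated as a small lookup, shown here as a sketch. It only encodes the model names given in the paragraph; the function name, the argument names, and the choice to return an empty list where the text gives no recommendation are assumptions made for illustration.

```python
from typing import List


def suggest_models(data_type: str, resource_constrained: bool) -> List[str]:
    """Map deployment conditions to the model families named in the text.

    data_type: "video" or "skeleton".
    resource_constrained: True for embedded or edge deployments.
    """
    if data_type not in {"video", "skeleton"}:
        raise ValueError("data_type must be 'video' or 'skeleton'")
    if resource_constrained:
        if data_type == "video":
            # Compact CNN backbones; an LSTM can be appended for temporal data.
            return ["MobileNet", "EfficientNet"]
        # The text gives no recommendation for constrained skeleton pipelines.
        return []
    if data_type == "skeleton":
        return ["GCN"]
    # Unconstrained video: 3D CNNs first, then Transformer-based options.
    return ["I3D", "TPN", "TANet", "SlowFast", "C3D", "TimeSformer", "VST"]


edge_video = suggest_models("video", resource_constrained=True)
server_skeleton = suggest_models("skeleton", resource_constrained=False)
```

Such a lookup is only a starting point: the final choice also depends on accuracy requirements and on the data modalities permitted by the privacy constraints.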

For model evaluation, utilizing a publicly available dataset is essential to enable comparison with existing models or techniques. Prominent datasets for fall detection include URFD, UP-FALL, MultiCam, and Le2i, while for HAR, UP-FALL and NTU RGB+D are commonly used. However, we encourage the adoption of ETRIActivity3D or ToyotaSmartHome, which offer a more extensive collection of video samples and include elderly participants. Both datasets support HAR, with ETRIActivity3D additionally containing falls and providing multiple perspectives from elderly users, diverse classes (at least 30), and various data modalities, including RGB, depth, and skeleton joints.

In cases where a custom dataset is provided, authors are encouraged to make it publicly available. This facilitates its use in future studies, either directly or by merging it with other datasets to form a larger dataset, enhances the reproducibility of experiments, and enables comparison with newer models or techniques. RGB-D cameras, such as Microsoft Kinect, Orbbec Astra Pro, and Intel RealSense, are recommended for collecting custom datasets as they facilitate experimentation with various types of data, with depth data offering privacy preservation capabilities.

When deploying the system in a real environment, the most common approach, as indicated by the reviewed studies, involves establishing a camera setup within the environment. This setup records data and transmits it to a central server for processing. It is also generally the most cost-effective option, although the cost depends on factors such as camera type, resolution, and processing requirements. Alternatively, for those preferring to use an assistive robot, both NAO and Pepper robots are viable solutions. These commercial robots come equipped with cameras, speakers, microphones, and other necessary components, and can be customized to adapt to different projects and environments.

7 Conclusions

In this systematic literature review, we have investigated fall detection and human activity recognition for the elderly, with a particular focus on deep learning techniques applied to computer vision data. Our study aimed to address two primary research questions related to the implementation of DL methods for these tasks and their deployment in real-world environments, considering hardware and privacy concerns.

Throughout the review process, we analyzed 151 relevant studies, providing a structured overview of the main findings to facilitate accessibility for practitioners and researchers. The findings offer valuable insights into the effective implementation of DL techniques for fall detection and HAR in elderly care, which are becoming increasingly important in the context of Ambient Assisted Living (AAL) systems.

Privacy emerged as a common concern, with 50% of the reviewed studies lacking any measures to address it. The most prevalent privacy protection method identified was skeleton joint estimation, employed in 45% of the studies.

Convolutional DL models were found to be predominant, owing to their effectiveness in processing spatial data and extensive history of refinement. However, we observed a lack of consideration for the temporal dimension in many studies, which limits the recognition of some activities.

Regarding datasets, we identified three common weaknesses: the absence of elderly individuals, a limited number of samples, and the inclusion of numerous classes irrelevant to AAL systems. We recommend datasets such as ETRIActivity3D and ToyotaSmartHome, which offer extensive samples and include elderly participants.

Moving forward, we emphasize the importance of privacy assessment from the outset of studies and recommend selecting appropriate DL models based on deployment conditions. Utilizing publicly available datasets for model evaluation is crucial, and authors are encouraged to make custom datasets publicly available to enhance reproducibility and facilitate future research.

In terms of deployment, camera setups within the environment were the most common approach identified, offering cost-effectiveness and flexibility. Alternatively, assistive robots like NAO and Pepper provide customizable options for deployment in various projects and environments.

Overall, this SLR provides a comprehensive overview of recent advancements in DL-based fall detection and HAR for the elderly, offering valuable insights for researchers, practitioners, and policymakers involved in developing and implementing AAL technologies.