1 Introduction

With advances in deep learning theory, artificial intelligence is flourishing in fields such as action recognition [1], object classification [2], and intelligent driving [3], and a growing number of researchers are working on neural networks. Convolutional neural networks (CNNs) [4] have achieved promising results in tasks such as visual recognition [5], voice recognition [6], and machine translation [7], but they fall short of expectations on data with irregular structure. For such data, researchers have turned to recurrent neural networks (RNNs) [8], which yield a significant improvement over CNNs. In practice, CNNs are mainly used for image-based data and RNNs for temporal data.

Although traditional neural networks perform well on data with Euclidean structure, they cannot directly handle irregular non-Euclidean structures such as graphs. Traditional algorithms tend to compress graph-structured data into a chain or tree structure that is then processed by a neural network. However, this preprocessing often discards topological information in the graph, and the loss of important information affects the final experimental results. Therefore, researchers have proposed graph neural networks (GNNs) [9], which are widely used because they can analyze graph-structured data directly. For information with pairwise relationships, more complete features can be obtained by constructing a graph and training a graph neural network on it. However, as artificial intelligence develops, the problems to be solved are becoming increasingly complex. In practice, in addition to pairwise relations, multi-modal and multi-type data contain a large number of non-pairwise relations, and such complex relations cannot be well characterized by graph structures. Therefore, researchers have introduced the concept of the hypergraph [10], whose structure can characterize complex non-pairwise, higher-order relationships in data; hypergraph neural networks then learn on this structure to obtain feature representations of complex data. Compared with traditional graph neural networks, hypergraph neural networks [11] are a more general representation learning framework that handles complex higher-order correlations through the hypergraph structure and can therefore deal effectively with multi-modal and multi-type data. Hypergraph learning has been widely used in tasks such as image retrieval [12], 3D object classification [13], video segmentation [14], person re-identification [15], hyperspectral image analysis [16], landmark retrieval [17], and visual tracking [18].

Action recognition is one of the representative tasks in computer vision, and accurate action recognition is an important prerequisite for intelligent interaction and human-computer collaboration. It has become a widely studied research field in recent years, with applications in action analysis, intelligent driving, and medical control [19], and research on body language interaction is of great significance. However, when extracting action features in complex environments, traditional GNN-based methods can no longer meet practical needs: much higher-order semantic information is ignored, which reduces the accuracy of action recognition. Therefore, how to use hypergraph neural networks to achieve action recognition in complex environments has received widespread attention in recent years.

The rest of this paper is organized as follows: Sect. 2 outlines the theory related to hypergraph neural networks, and analyzes in detail hypergraph neural networks proposed in recent years. Section 3 introduces some applications of hypergraph neural networks in action recognition. Section 4 concludes the whole paper and proposes possible directions for hypergraph neural networks in the future.

2 Hypergraph Neural Networks

The hypergraph structure removes the traditional restriction that an edge in a graph can only connect two vertices, and extends the concept of an edge to a hyperedge, which can connect multiple vertices. Complex relationships among objects are connected by hyperedges and can thus be represented by a hypergraph structure with higher-order semantics. In this section, we review the theory of hypergraphs in three aspects: the definition of a hypergraph, hypergraph generation methods, and hypergraph learning methods. In addition, we describe and compare recent hypergraph neural network algorithms.

2.1 The Definition of Hypergraph

The comparison between the graph and the hypergraph is shown in Fig. 1 [20]. Similar to the definition of a simple graph, a hypergraph is defined as \( \mathcal {G}=(\mathcal {V},\mathcal {E},{\textbf {W}}) \), where \( \mathcal {V} \) is the set of vertices in the hypergraph, whose elements are denoted as \( v\in \mathcal {V} \); \( \mathcal {E} \) is the set of hyperedges, whose elements are denoted as \( e\in \mathcal {E} \); and \( {\textbf {W}} \) is the diagonal hyperedge weight matrix, which records the weight \( \omega (e) \) of each hyperedge. The relationship between hyperedges and vertices is represented by the incidence matrix \( {\textbf {H}} \), which is a \( |\mathcal {V}|\times |\mathcal {E}| \) matrix. The elements of the incidence matrix \( {\textbf {H}} \) are defined as follows:

$$\begin{aligned} h(v, e)=\left\{ \begin{array}{ll} 1, &{} v \in e \\ 0, &{} v \notin e \end{array}\right. \end{aligned}$$
(1)

Specifically, if the vertex \( v \) belongs to the hyperedge \( e \), then \( h(v,e)=1 \), otherwise \( h(v,e)=0 \). In addition, the hyperedge degree is the number of vertices contained in the hyperedge \( e \), which can be defined as:

$$\begin{aligned} \delta (e)=\sum _{v \in \mathcal {V}} h(v, e) \end{aligned}$$
(2)

The vertex degree is the sum of the weights of the hyperedges associated with the vertex \( v \), which can be defined as:

$$\begin{aligned} d(v)=\sum _{e \in \mathcal {E}} \omega (e) h(v, e) \end{aligned}$$
(3)

We can also define \( {\textbf {D}}_e \) and \( {\textbf {D}}_v \) as the diagonal matrices of hyperedge degrees and vertex degrees, respectively. Then, the normalized Laplacian matrix can be defined as:

$$\begin{aligned} \varDelta =\textbf{I}-\textbf{D}_{v}^{-1/2} \textbf{H} \textbf{W} \textbf{D}_{e}^{-1} \textbf{H}^{T} \textbf{D}_{v}^{-1/2} \end{aligned}$$
(4)

where \( \textbf{I} \) is the identity matrix.
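The definitions in Eqs. (1)-(4) can be checked numerically on a small example. The following is a minimal NumPy sketch, not code from any of the cited works: the function name `hypergraph_laplacian` and the toy hypergraph are illustrative assumptions, and the sketch assumes that every vertex lies in at least one hyperedge and that no hyperedge is empty, so that the degree matrices are invertible.

```python
import numpy as np

def hypergraph_laplacian(H, w=None):
    """Return (Delta, d_v, d_e) for an incidence matrix H of shape |V| x |E|.

    w holds one weight per hyperedge (the diagonal of W); it defaults to all ones.
    Assumes no isolated vertices and no empty hyperedges.
    """
    H = np.asarray(H, dtype=float)
    n_vertices, n_edges = H.shape
    w = np.ones(n_edges) if w is None else np.asarray(w, dtype=float)
    d_e = H.sum(axis=0)                      # Eq. (2): hyperedge degrees
    d_v = H @ w                              # Eq. (3): vertex degrees
    Dv_inv_sqrt = np.diag(d_v ** -0.5)
    # Eq. (4): I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    theta = Dv_inv_sqrt @ H @ np.diag(w / d_e) @ H.T @ Dv_inv_sqrt
    return np.eye(n_vertices) - theta, d_v, d_e

# Toy hypergraph (hypothetical): e0 = {v0, v1, v2}, e1 = {v2, v3}
H_toy = np.array([[1, 0],
                  [1, 0],
                  [1, 1],
                  [0, 1]])
delta, d_v, d_e = hypergraph_laplacian(H_toy)
print(d_e)  # hyperedge degrees
print(d_v)  # vertex degrees
```

With unit weights, the vertex degree simply counts the hyperedges a vertex belongs to, and the resulting matrix is symmetric and positive semi-definite.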

Fig. 1. Comparison of graph and hypergraph.

2.2 The Generation Methods of Hypergraph

To connect a hypergraph with the data, the hypergraph must be constructed from the data. Since different data have different characteristics, it is crucial to choose a hypergraph generation method that suits them: how well the constructed hypergraph fits the data directly determines its ability to represent higher-order relationships among the data. Hypergraph generation methods can generally be classified into four categories: distance-based methods [13], representation-based methods [21], attribute-based methods [22], and network-based methods [23]. Distance-based methods are simple and effective in many applications, but are very sensitive to hyperparameter settings. Representation-based methods avoid the effect of noisy vertices through sparse representation, but computing the reconstruction coefficients increases the computational cost. Attribute-based methods apply to samples with specific attributes, but are limited to single-attribute features. Network-based methods apply to data that already has a graph representation, but require a task-specific hypergraph to be constructed.
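As an illustration of the distance-based category, the sketch below builds one hyperedge per vertex from its k nearest neighbours in feature space. This is one common variant, assumed here for illustration and not taken from a specific cited method; the function name `knn_hypergraph` and the toy data are hypothetical. The neighbourhood size `k` is the kind of hyperparameter to which such methods are sensitive.

```python
import numpy as np

def knn_hypergraph(X, k):
    """Build a |V| x |V| incidence matrix with one hyperedge per vertex.

    Hyperedge j contains vertex j (the centroid) and its k nearest
    neighbours under the Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances
    H = np.zeros((n, n))
    for centroid in range(n):
        # Stable sort so that ties are broken by vertex index.
        order = np.argsort(dist[centroid], kind="stable")
        neighbours = [j for j in order if j != centroid][:k]
        H[neighbours, centroid] = 1.0
        H[centroid, centroid] = 1.0
    return H

# Toy data (hypothetical): four points on a line, one of them far away.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
H_knn = knn_hypergraph(X_toy, k=1)
print(H_knn)
```

Even the distant point is attached to its nearest neighbour, which shows why a poorly chosen `k` can connect unrelated samples.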

2.3 The Learning Methods of Hypergraph

After the hypergraph is constructed, learning on it is needed to extract data features. Hypergraph learning was first introduced in [10]; it allows for transductive learning [24] and can be seen as a propagation process on the hypergraph structure. Transductive learning on the hypergraph aims to make the labels of strongly associated vertices differ as little as possible. In recent years, hypergraph learning has been widely developed and applied in many fields. Wang et al. [20] constructed a complex hypergraph containing global and local visual features and label information to learn the relevance of images in a label-based image retrieval task. To model the functional connectivity network (FCN) of the brain, Xiao et al. [25] proposed weighted hypergraph learning, which captures the relationships among brain regions better than traditional graph-based methods and existing unweighted hypergraph-based methods. Inspired by deep learning, some researchers have developed deep-learning-based hypergraph learning methods. For example, Feng et al. [11] proposed the hypergraph neural network (HGNN) to model and learn complex associations in non-pairwise data. Gao et al. [26] proposed a tensor-based dynamic hypergraph representation and learning framework that can effectively describe higher-order correlations in hypergraphs. In addition, they developed and published a toolbox called THU HyperG, which provides a collection of hypergraph generation and hypergraph learning algorithms.
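The propagation view of transductive learning can be made concrete with a small sketch. It assumes the commonly used regularized objective \( f^{T}\varDelta f+\mu \Vert f-y\Vert ^{2} \), which is not stated in this paper: the first term penalizes label differences among strongly associated vertices, and the second keeps predictions close to the known labels. Setting the gradient to zero gives \( f=\mu (\varDelta +\mu \textbf{I})^{-1}y \). The function names, the value of `mu`, and the toy hypergraph are illustrative assumptions.

```python
import numpy as np

def normalized_laplacian(H, w):
    """Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, as in Eq. (4)."""
    d_v = H @ w
    d_e = H.sum(axis=0)
    Dv_inv_sqrt = np.diag(d_v ** -0.5)
    theta = Dv_inv_sqrt @ H @ np.diag(w / d_e) @ H.T @ Dv_inv_sqrt
    return np.eye(H.shape[0]) - theta

def transductive_labels(H, Y, mu=0.1, w=None):
    """Propagate the one-hot labels in Y (zero rows = unlabeled) over the hypergraph."""
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    delta = normalized_laplacian(H, w)
    n = H.shape[0]
    # Closed-form minimizer of f^T Delta f + mu * ||f - y||^2, one column per class.
    F = mu * np.linalg.solve(delta + mu * np.eye(n), Y)
    return F.argmax(axis=1)

# Toy hypergraph (hypothetical): two groups {0,1,2} and {3,4,5} bridged by {2,3}.
H_toy = np.zeros((6, 3))
H_toy[[0, 1, 2], 0] = 1
H_toy[[3, 4, 5], 1] = 1
H_toy[[2, 3], 2] = 1

# Only vertex 0 (class 0) and vertex 5 (class 1) are labeled.
Y_toy = np.zeros((6, 2))
Y_toy[0, 0] = 1
Y_toy[5, 1] = 1

pred = transductive_labels(H_toy, Y_toy)
print(pred)  # each unlabeled vertex takes the label of its own group
```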

2.4 Hypergraph Learning Method Based on Hypergraph Neural Networks

To train on hypergraphs and obtain higher-order semantic features of nodes with deep-learning-based hypergraph learning methods, researchers have proposed a growing number of neural networks suited to extracting hypergraph features.

As pioneers of hypergraph neural networks, Feng et al. [11] proposed the hypergraph neural network framework (HGNN), which uses the hypergraph structure for feature learning to effectively extract higher-order correlations in data. HGNN extends spectral convolution from graph learning to hypergraph learning by using the hypergraph Laplacian to perform convolution in the spectral domain. Specifically, the HGNN convolutional layer performs a "vertex-hyperedge-vertex" transformation that iteratively updates the vertex features and thereby efficiently extracts higher-order correlations on the hypergraph. The experimental results show that HGNN extracts higher-order vertex features more effectively than traditional graph neural networks. Yadati et al. [27] proposed HyperGCN, which uses hypergraph spectral theory to train graph convolutional networks (GCNs) on hypergraphs so as to model complex relationships. HyperGCN is more effective than the hypergraph-based semi-supervised learning (SSL) method, and it has been applied to SSL and combinatorial optimization problems on hypergraphs. Numerous experiments have shown that HyperGCN is effective at extracting features from complex data and improves the results of SSL.

Jiang et al. [28] proposed the dynamic hypergraph neural network framework (DHGNN) to address the problem that the hypergraph structure in hypergraph neural networks cannot be updated automatically, which limits their ability to represent changing data. The framework consists of two important parts, dynamic hypergraph (DHG) and hypergraph convolution (HGC). Specifically, DHGNN uses the k-NN method to generate the basic hyperedges and a clustering algorithm to extend the set of adjacent hyperedges, and it extracts local and global relationships by constructing the dynamic hypergraph. The experimental results demonstrate that the model performs better, is more robust across different data, and is significantly better than some static construction methods. Bai et al. [29] proposed two end-to-end operators, Hypergraph Convolution (HC) and Hypergraph Attention (HCA). Both operators can be inserted into most graph neural networks for model training when non-pairwise relationships are present in the data. Notably, the network uses a dynamic transfer matrix instead of the incidence matrix for convolution. The dynamic transfer matrix represents the importance of a vertex to a given hyperedge and adaptively identifies the importance of different vertices in the same hyperedge, which describes the relationships among vertices more accurately and thus improves the performance of the network.

Graph embedding is a commonly used method for analyzing network data, but existing methods do not fully utilize and integrate both the topology and the attributes of nodes. Wu et al. [30] proposed a dual-view hypergraph neural network model for attributed graph learning, which addresses the inadequate modeling of nonlinear relationships among nodes in the semantic space and the heterogeneity of structure and attribute information. Specifically, they overcome the limitations of traditional graph embedding by sharing specific hypergraph convolutional layers to model and unify the representations of different information sources. Gao et al. [31] proposed HGNN+, a hypergraph neural network framework for hypergraph learning that mainly consists of two processes, hypergraph modeling and hypergraph convolution, where the convolution is performed in the spatial domain. In hypergraph modeling, different hyperedge generation strategies are used for different data to generate the hypergraph structure. In hypergraph convolution, a spatial-domain message propagation mechanism performs a two-stage hypergraph convolution, in which the convolution and aggregation operations of each stage can be defined flexibly and which extends naturally to directed hypergraphs. The experimental results show that HGNN+ obtains larger gains with fewer training samples, which indicates that the method works well when training samples are limited. Table 1 summarizes the classification results of different hypergraph neural networks on the citation datasets.

Table 1. Classification results of different hypergraph neural networks on citation datasets.

In addition, the commonly used citation datasets are summarized in this paper, as shown in Table 2.

Table 2. Overview of data statistics.
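To make the "vertex-hyperedge-vertex" transformation of HGNN described in this section more concrete, the following NumPy sketch implements a single forward pass of such a layer. It assumes the layer takes the form \( \sigma (\textbf{D}_{v}^{-1/2}\textbf{H}\textbf{W}\textbf{D}_{e}^{-1}\textbf{H}^{T}\textbf{D}_{v}^{-1/2}\textbf{X}\varTheta ) \), that is, the normalized operator from Eq. (4) applied to the vertex features \( \textbf{X} \) and followed by a learnable projection \( \varTheta \) and a ReLU. The function name `hgnn_conv`, the random weights, and the toy inputs are illustrative assumptions, and training is omitted.

```python
import numpy as np

def hgnn_conv(X, H, theta, w=None):
    """One forward pass: relu(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X theta)."""
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    d_v = H @ w                                   # vertex degrees
    d_e = H.sum(axis=0)                           # hyperedge degrees
    X_norm = X / np.sqrt(d_v)[:, None]            # D_v^{-1/2} X
    # Stage 1, vertex -> hyperedge: average the features of each hyperedge's vertices.
    edge_feat = (H.T @ X_norm) / d_e[:, None]
    # Stage 2, hyperedge -> vertex: gather weighted hyperedge features at each vertex.
    vert_feat = (H @ (w[:, None] * edge_feat)) / np.sqrt(d_v)[:, None]
    return np.maximum(vert_feat @ theta, 0.0)     # linear projection + ReLU

# Toy inputs (hypothetical): 4 vertices, 2 hyperedges, 3 input and 2 output features.
rng = np.random.default_rng(0)
H_toy = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
X_toy = rng.normal(size=(4, 3))
theta_toy = rng.normal(size=(3, 2))
out = hgnn_conv(X_toy, H_toy, theta_toy)
print(out.shape)
```

Writing the layer as two explicit stages avoids forming the dense \( |\mathcal {V}|\times |\mathcal {V}| \) operator and mirrors the message-passing reading of the same computation.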

3 Action Recognition Based on Hypergraph Neural Networks

With the continuous development of computer vision, human action recognition has shown broad application prospects and research value in fields such as video surveillance, video retrieval, and human-computer interaction. Deep-learning-based methods have achieved excellent results on RGB data, with performance far superior to traditional methods based on manually extracted features. In addition, depth skeleton sequences carry rich spatial and temporal information, so many researchers have combined deep learning with skeleton data for human behavior recognition. However, GCN-based approaches focus only on the local physical connections among joints and ignore the non-physical dependencies among them. Therefore, more and more researchers use hypergraphs to model the human skeleton and use hypergraph neural networks to obtain feature representations of human actions for better action recognition. This section analyzes and summarizes recently proposed action recognition methods based on hypergraph neural networks.

To capture the higher-order information of the skeleton and improve the accuracy of behavior recognition, Hao et al. [32] proposed a hypergraph neural network (Hyper-GNN) framework to obtain a higher-order feature representation of the skeleton. First, they divide the skeleton data into three input forms, namely joints, bones, and motion trends, and use them as vertices to construct different types of hypergraphs. Then, they feed the hypergraphs into Hyper-GNN for feature extraction to obtain higher-order feature representations of the skeleton information. Finally, to make full use of the complementarity and diversity among the three types of features, they fuse them and classify the action represented by the skeleton according to the fused features, which further improves the performance of action recognition. Notably, Hyper-GNN introduces a hypergraph attention mechanism and an improved residual module with temporal convolution to extract more accurate and richer skeleton features. The experimental results show that the accuracy of action recognition with this method is significantly higher than that of GCN-based methods.

He et al. [33] proposed a dual-stream hypergraph convolutional network (SD-HGCN) that adds a dual-skeleton stream to a single-skeleton stream so as to recognize interaction actions. The model mainly consists of a multi-branch input adaptive fusion module (MBAFM) and a skeleton perception module (SPM). MBAFM uses two GCNs and an attention module to make the input features easier to distinguish; SPM adaptively learns the incidence matrix of the hypergraph from the semantic information in the skeleton sequence, identifies the relationships between skeletons, and builds topological knowledge of the human skeleton. The experimental results show that SD-HGCN is less time-consuming, achieves high accuracy, and supports real-time interaction.

Wei et al. [34] proposed a dynamic hypergraph convolutional network (DHGCN) for action recognition, which effectively extracts motion information from skeleton data and thus significantly improves the accuracy of dynamic action recognition. The algorithm constructs both a static hypergraph and a dynamic hypergraph from the skeleton information, which captures more higher-order information than a single static construction method. Each joint in the dynamic hypergraph is assigned a weight according to its motion state, so as to better learn the dynamic features of the skeleton. The experimental results show that the method has better recognition ability for continuous dynamic actions.

Skeleton data alone cannot adequately represent the details of human actions, and it makes human-object interactions difficult to identify accurately. Chen et al. [35] proposed Informed Patch Enhanced HyperGCN to learn skeleton information and local visual information simultaneously, and to perform multi-modal feature learning through multi-modal data fusion to obtain better behavioral features, so as to effectively improve the accuracy of action recognition. Specifically, the network obtains visual information near five joints (the head, left hand, right hand, left foot, and right foot) so as to capture part of the semantic features related to the behavior. This visual information complements the skeleton information and further enriches the data required for action recognition. Experimental results show that the method reduces the computational and memory consumption of the network while improving the accuracy of action recognition.
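The idea shared by the methods above, treating joints as vertices and grouping them into hyperedges that go beyond physical bones, can be sketched as follows. The six-joint skeleton, the joint names, and the three hyperedges are hypothetical and chosen only for illustration; they do not reproduce the construction of any cited method, and real datasets such as NTU-RGB+D use more joints.

```python
import numpy as np

# Hypothetical, heavily simplified skeleton.
JOINTS = ["head", "torso", "l_hand", "r_hand", "l_foot", "r_foot"]

# Hypothetical hyperedges: two body parts plus one non-physical grouping
# that links all four extremities even though no bone connects them.
HYPEREDGES = {
    "upper_body": ["head", "torso", "l_hand", "r_hand"],
    "lower_body": ["torso", "l_foot", "r_foot"],
    "extremities": ["l_hand", "r_hand", "l_foot", "r_foot"],
}

def skeleton_incidence(joints, hyperedges):
    """Build the |V| x |E| incidence matrix of Eq. (1) from named joint groups."""
    index = {name: i for i, name in enumerate(joints)}
    H = np.zeros((len(joints), len(hyperedges)))
    for col, members in enumerate(hyperedges.values()):
        for name in members:
            H[index[name], col] = 1.0
    return H

H_skel = skeleton_incidence(JOINTS, HYPEREDGES)
# Entry (i, j) of H H^T counts the hyperedges shared by joints i and j.
shared = H_skel @ H_skel.T
print(shared[JOINTS.index("l_hand"), JOINTS.index("r_foot")])
```

In a plain skeleton graph the left hand and the right foot are several bones apart, whereas here they share a hyperedge and can exchange information in a single convolution step.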

Table 3 compares recent approaches to action recognition based on hypergraph neural networks on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets, together with common GCN-based approaches. Action recognition methods based on hypergraph neural networks achieve satisfactory results on both datasets, with most results better than those of the GCN-based methods. This is because hypergraphs can characterize more complex higher-order relationships and more fully exploit the complex behavioral information in real scenes. A comparison of the NTU-RGB+D 60 dataset and the NTU-RGB+D 120 dataset is shown in Table 4. With this information, hypergraph neural network models can obtain more complete action features. To improve human action recognition, it is very important to choose the right algorithm according to the features of the data and the range of applicability of each method.

Table 3. Comparison of action recognition methods based on hypergraph neural networks and graph neural networks (Accuracy).
Table 4. Comparison of NTU-RGB+D dataset and NTU-RGB+D 120 dataset.

4 Conclusion

This paper summarizes recent research on hypergraph neural networks and discusses the design ideas of various hypergraph neural networks in detail to help researchers better understand them. In addition, this paper aims to help researchers choose an appropriate hypergraph neural network structure for modeling according to their practical needs.

Although various hypergraph neural networks have made promising progress in extracting features from complex data, the extraction of node features in complex environments still needs further exploration. We believe that, in addition to improving a single hypergraph neural network, it is also possible to combine hypergraphs with other graph structures to model problems in complex environments and obtain higher-order and more complete feature information. For example, Jiang et al. [28] combined dynamic graphs and hypergraphs to encode higher-order information and improve the accuracy of Twitter sentiment prediction; Sun et al. [36] combined heterogeneous graphs and hypergraphs to achieve excellent results in tasks such as node classification in complex environments.

Action recognition based on hypergraph neural networks is a current hotspot of computer vision research. It has practical application needs and good application prospects, and related issues deserve further research. Notably, action recognition methods based on hypergraph neural networks can be applied to autonomous driving, turning self-driving vehicles into learnable and interactive wheeled robots [37, 38].