1 Introduction

Estimation of 3D hand pose is useful in many human-computer interaction applications, such as tracking recovery progress in hand rehabilitation systems [22], social robotics [33], user authentication [18] and virtual reality games [48]. With the availability of affordable depth sensors and high-quality hand pose datasets in recent years [25, 34, 35, 37, 45, 46], and with advances in convolutional neural networks (CNNs) [16, 19, 21], there has been significant progress in 3D hand pose estimation. However, the task remains challenging due to severe finger self-occlusion, poor quality of depth images, variations in viewpoint and complex hand shapes [44].

Discriminative methods for 3D hand pose estimation use 2D CNNs to regress 3D hand joint coordinates directly or to output heatmaps from a depth image [5, 15, 24, 28, 37]. However, these methods do not fully exploit the spatial information in the depth image, which is intrinsically 3D data [44]. To address this shortcoming, several works have studied 3D methods for hand pose estimation [7, 10, 11, 12, 14, 26, 32, 40]. In a recent work, Xiong et al. [42] proposed a novel anchor-based approach named the Anchor-to-Joint (A2J) regression network to regress 3D joint coordinates from depth images. In the anchor proposal procedure, anchor points that are densely set on the depth image are assigned weights to discover informative anchor points for a certain joint.

The introduction of the Transformer neural network [38], which replaces recurrence with self-attention to learn long-range dependencies, has led to the wide adoption of self-attention in natural language processing and, more recently, to its increasing use in computer vision. While some works employ other attention mechanisms for 3D hand pose estimation [14, 41], studies that model self-attention for the task are limited.

In this work we propose to extend A2J [42] by augmenting convolution with self-attention [2] for 3D hand pose estimation. Moreover, we extend anchor points to the depth dimension in an attempt to better model the 3D spatial geometric characteristics in the depth image. Using the proposed approach, we developed a prototype system for real-time estimation of hand joint angles. This system can be used to assess the range of hand motion in patients with certain disorders that lead to impairment of the hand, such as stroke or rheumatoid arthritis [4].

To summarize, the contributions in this work are as follows:

  1. Anchor points that are set in three-dimensional space are used to regress 3D hand joint locations,

  2. Self-attention is modelled for 3D hand pose estimation, and

  3. A novel user interface is developed to evaluate the range of motion of the hand in clinical practice.

The rest of the paper is organized as follows. Section 2 briefly summarizes related work in the area of hand pose estimation and attention. Section 3 details the proposed approaches which utilize self-attention and 3D anchor points for 3D hand pose estimation. Section 4 reports the details of the experimental results and Section 5 describes the prototype system for rehabilitation. Finally, Section 6 concludes this paper with a summary of the contributions and limitations of the work.

2 Related work

Deep neural networks are commonly used in 3D hand pose estimation to regress 3D joint locations or heatmaps encoding probability distributions of hand joints. One drawback is that the depth image is treated as 2D data, so the spatial information in the depth image is under-utilized. To address this problem, several works converted the depth image into 3D data structures such as points [10, 12] or voxels [7, 14, 26]. Ge et al. [12] processed point clouds directly to obtain point-wise estimations of hand joint locations. Moon et al. [26] used a 3D CNN to estimate the per-voxel likelihood for each hand joint and achieved performance that surpassed existing approaches by a large margin. However, 3D CNN methods incur high memory and computational costs. Other methods have been proposed to capture spatial representations with 2D CNNs [11, 32, 40]. Ge et al. [11] projected the depth image onto three orthogonal planes, with each projection fed into a 2D CNN to regress a heatmap. The heatmaps were then fused to produce 3D hand joint coordinates. Ren et al. [32] incorporated spatial-aware representations based on 3D offsets into a 2D CNN consisting of multiple stacked regression modules. In a recent work, Xiong et al. [42] proposed the A2J regression network, which uses anchor points that are densely set on a depth image to extract global-local spatial context information for 3D hand and body pose estimation.

The attention mechanism was first proposed by Bahdanau et al. [1] in a neural sequence-to-sequence model for neural machine translation. With the advent of the self-attentional Transformer by Vaswani et al. [38], self-attention has become an integral component of natural language processing models. In self-attention, attention is applied to a single context [33]. By attending to all input positions when computing the contextual representation of each output, self-attention captures the dependencies between different positions in the input in a single layer. In contrast, convolutional layers are limited by a restricted receptive field and impose translation invariance through weight sharing [2]. Capturing long-range interactions is therefore a challenge for convolution, and the global context of images is typically ignored.

The ability of self-attention to encode long-range dependencies, together with its parallelizability, has led to rapid advances in natural language processing tasks such as machine translation [23]. Although convolutional neural networks have been widely used in computer vision, self-attention models are gaining popularity in various visual tasks including action recognition [13], video object segmentation [36], semantic segmentation [17] and image generation [29, 47]. Bello et al. [2] combined convolution and self-attention in a visual discriminative task by concatenating convolutional feature maps with feature maps produced via self-attention, using multi-head attention to attend to distinct representations of the input. This method achieved competitive results on image classification, obtaining higher accuracy than the ResNet-50 baseline on ImageNet. Ramachandran et al. [31] proposed a fully attentional vision model for image classification, using self-attention layers entirely in place of convolution layers. We therefore hypothesize that the attention mechanism proposed by Bello et al. [2] could improve the accuracy of 3D hand pose estimation.

3 Methods and materials

In this section, we first discuss the self-attention mechanism. Next we introduce the proposed approaches which utilize self-attention and 3D anchor points for 3D hand pose estimation.

3.1 Self-attention

In self-attention, an input tensor of shape \(\left (H, W, F_{in} \right )\) is flattened to a matrix \(X\in \mathbb {R}^{HW\times F_{in}}\), where \(H\), \(W\) and \(F_{in}\) refer to the height, width and number of input filters respectively. Attention is performed on this matrix and the output of an attention head h is computed as follows [2]:

$$ O_{h} = Softmax\left (\frac{\left (XW_{q} \right )\left (XW_{k} \right )^{T}}{\sqrt {{d_{k}^{h}}}} \right )\left (XW_{v} \right ) $$
(1)

where \({d_{k}^{h}}\) refers to the depth of keys/queries of the attention head and \({d_{v}^{h}}\) refers to the depth of values of the attention head. \(W_{q}\), \(W_{k}\in \mathbb {R}^{F_{in}\times {d_{k}^{h}}}\) and \(W_{v}\in \mathbb {R}^{F_{in}\times {d_{v}^{h}}}\) are learned linear transformations that map X to queries \(Q = XW_{q}\), keys \(K = XW_{k}\) and values \(V = XW_{v}\). In multihead attention, the self-attention mechanism is replicated with multiple attention heads. Each attention head focuses on a different part of the input using different query, key and value matrices. The outputs from all heads are concatenated and projected as follows [2]:

$$ MHA\left (X \right )= Concat\left [ O_{1},...,O_{N_{h}} \right ]W^{o} $$
(2)

where \(N_{h}\) and \(d_{v}\) refer to the number of heads and the depth of values respectively in multihead attention and \(W^{o}\in \mathbb {R}^{d_{v}\times d_{v}}\) is a learned linear transformation. \(MHA\left (X \right )\) is reshaped to return a tensor with the original spatial dimensions \(\left (H,W,F_{in} \right )\). To enable translation equivariance, relative position encoding is implemented by independently adding relative height information and relative width information. The strength of attention between pixel \(i= \left (i_{x},i_{y} \right )\) and pixel \(j= \left (j_{x},j_{y} \right )\) is computed as [2]:

$$ l_{i,j}= \frac{{q_{i}^{T}}}{\sqrt{{d_{k}^{h}}}}\left (k_{j}+r_{j_{x}-i_{x}}^{W}+r_{j_{y}-i_{y}}^{H} \right ) $$
(3)

where \(q_{i}\) is the query vector for pixel i, \(k_{j}\) is the key vector for pixel j, \(r_{j_{x}-i_{x}}^{W}\) is the learned embedding for relative width \(j_{x}-i_{x}\), and \(r_{j_{y}-i_{y}}^{H}\) is the learned embedding for relative height \(j_{y}-i_{y}\). The attention head h with relative positional embeddings is [2]:

$$ O_{h}= Softmax\left (\frac{QK^{T}+S_{H}^{rel}+S_{W}^{rel}}{\sqrt{{d_{k}^{h}}}} \right )V $$
(4)

where \(S_{H}^{rel}, S_{W}^{rel}\in \mathbb {R}^{HW\times HW}\) are matrices of relative positional embeddings for each pixel pair that satisfy \(S_{H}^{rel}[i,j]={q_{i}^{T}}r_{j_{y}-i_{y}}^{H}\) and \(S_{W}^{rel}[i,j]={q_{i}^{T}}r_{j_{x}-i_{x}}^{W}\). Lastly, the convolutional operator and output from multihead attention are concatenated as follows [2]:

$$ AAConv\left (X \right )= Concat\left [ Conv\left (X \right ),MHA\left (X \right ) \right ]. $$
(5)

\(\upsilon = \frac {d_{v}}{F_{out}}\) is the ratio between the number of attentional channels and the number of output filters in the original convolution operator, while \(\kappa = \frac {d_{k}}{F_{out}}\) is the ratio between the key depth and the number of output filters in the original convolution operator. In this work, the hyperparameters υ and κ are set to 0.1 and 0.65 respectively.
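To make the operations in (1)–(5) concrete, a simplified PyTorch sketch of a single-head attention-augmented convolution is given below; it omits the relative position encodings of (3) and (4), and the `AAConv2d` class with its layer names is an illustrative placeholder rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

class AAConv2d(nn.Module):
    """Simplified single-head attention-augmented convolution, sketching (1)-(5)."""
    def __init__(self, f_in, f_out, kernel_size=3, kappa=0.65, upsilon=0.1):
        super().__init__()
        self.d_k = max(1, int(kappa * f_out))    # key/query depth
        self.d_v = max(1, int(upsilon * f_out))  # value depth (attentional channels)
        # The convolutional part outputs f_out - d_v channels so that the
        # concatenation in (5) yields f_out channels in total.
        self.conv = nn.Conv2d(f_in, f_out - self.d_v, kernel_size,
                              padding=kernel_size // 2)
        # Learned linear maps W_q, W_k, W_v realised as a single 1x1 convolution.
        self.qkv = nn.Conv2d(f_in, 2 * self.d_k + self.d_v, 1)
        self.proj = nn.Conv2d(self.d_v, self.d_v, 1)  # output projection W^o

    def forward(self, x):
        b, _, h, w = x.shape
        q, k, v = torch.split(self.qkv(x), [self.d_k, self.d_k, self.d_v], dim=1)
        # Flatten the spatial dimensions: (B, C, H, W) -> (B, HW, C).
        q = q.flatten(2).transpose(1, 2)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        # Scaled dot-product attention over all HW positions, as in (1).
        attn = torch.softmax(q @ k.transpose(1, 2) / self.d_k ** 0.5, dim=-1)
        o = (attn @ v).transpose(1, 2).reshape(b, self.d_v, h, w)
        # Concatenate the convolutional and attentional feature maps, as in (5).
        return torch.cat([self.conv(x), self.proj(o)], dim=1)


# Example: augment a 3x3 convolution with 256 input and output filters.
layer = AAConv2d(256, 256)
y = layer(torch.randn(2, 256, 12, 12))   # -> torch.Size([2, 256, 12, 12])
```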

3.2 Proposed approaches

The A2J regression network proposed by Xiong et al. [42] uses anchor points that are densely set on a depth image. It consists of a ResNet-50 backbone pretrained on ImageNet, with three branches extending from the backbone: an in-plane offset estimation branch, a depth estimation branch and an anchor proposal branch. The common trunk of the ResNet-50 backbone passes a feature map to the anchor proposal branch, while a feature map from the regression trunk of the backbone is forwarded through the in-plane offset estimation branch and the depth estimation branch.

The two proposed approaches in this work involve modifications to A2J. In the first approach, the self-attention mechanism of Bello et al. [2] is incorporated into A2J; this modified network is named AA-A2J. It has the same framework as A2J, except that its three branches are modified to augment convolution with self-attention (Fig. 1). Following A2J, the depth estimation branch regresses the depth positions of the hand keypoints.

Fig. 1
figure 1

Architecture of the in-plane offset estimation branch, depth estimation/depth offset estimation branch and anchor proposal branch in AA-A2J and AA-3DA2J

In the second approach, aside from incorporating self-attention into A2J, anchor points are extended to the depth dimension. The new network, referred to as AA-3DA2J, has a framework similar to AA-A2J except that it has a depth offset estimation branch instead of a depth estimation branch. Figure 2, adapted from Xiong et al. [42], shows the framework of AA-3DA2J. As 3D anchor points are now utilized, the depth offset estimation branch predicts the depth offset of each anchor point with respect to a certain joint. The branches in AA-3DA2J and AA-A2J share the same design (Fig. 1). In addition, the ResNet-50 backbones in both AA-A2J and AA-3DA2J are pretrained on ImageNet.

Fig. 2
figure 2

Framework of AA-3DA2J [42]. The regression network consists of a ResNet-50 backbone connected to three branches. The in-plane offset estimation and depth offset estimation branches predict the offsets between the anchor points and the ground truth, while the anchor proposal branch helps discover informative anchor points for a certain joint

To determine the individual contribution of self-attention and 3D anchor points to the performance of 3D hand pose estimation, a separate regression network named 3DA2J is also investigated in an ablation study. The 3DA2J network is produced by extending anchor points in A2J to the depth dimension.

The anchor proposal branch in AA-A2J, AA-3DA2J and 3DA2J discovers informative anchors for each joint by assigning weights to the anchor points. These weights represent the estimated contribution of each anchor point to a specific joint and are normalized using the softmax function [42]:

$$ \widetilde{P_{j}}\left (a \right )= \frac{e^{P_{j}\left (a \right )}}{{\varSigma}_{a\in A}e^{P_{j}\left (a \right )}} $$
(6)

where A is the anchor point set and \(P_{j}\left (a \right )\) is the response of anchor point aA towards joint j. The estimated in-plane position, estimated depth position and loss functions in AA-A2J are defined according to A2J in Xiong et al. [42].

Next, the estimated in-plane position, estimated depth position and loss functions of the networks which utilize 3D anchor points, AA-3DA2J and 3DA2J, are defined. The estimated in-plane position \(\widehat {S_{j}}\) is formulated as:

$$ \widehat{S_{j}}= {{\varSigma}}_{a\in A}\widetilde{P_{j}}\left (a \right )\left (S^{i}\left (a \right )+{O_{j}^{i}}\left (a \right ) \right) $$
(7)

where \(S^{i}\left (a \right )\) and \({O_{j}^{i}}\left (a \right )\) are the in-plane position of anchor point a and predicted in-plane offset towards joint j from anchor point a respectively. The estimated depth position \(\widehat {D_{j}}\) is as follows:

$$ \widehat{D_{j}}= {{\varSigma}}_{a\in A}\widetilde{P_{j}}\left (a \right )\left (S^{d}\left (a \right )+{O_{j}^{d}}\left (a \right ) \right) $$
(8)

where \(S^{d}\left (a \right )\) and \({O_{j}^{d}}\left (a \right )\) are the depth position of anchor point a and predicted depth offset towards joint j from anchor point a respectively.
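The softmax normalization in (6) and the weighted aggregations in (7) and (8) can be written compactly; the sketch below is illustrative and assumes the per-anchor responses and offsets have already been produced by the anchor proposal, in-plane offset estimation and depth offset estimation branches. The function name and tensor shapes are placeholders.

```python
import torch

def aggregate_joints(responses, anchor_xy, anchor_d, offsets_xy, offsets_d):
    """Weighted aggregation over 3D anchor points, following (6)-(8).

    responses : (A, J)    responses P_j(a) of anchor a towards joint j
    anchor_xy : (A, 2)    in-plane anchor positions S^i(a)
    anchor_d  : (A,)      anchor depth positions S^d(a)
    offsets_xy: (A, J, 2) predicted in-plane offsets O^i_j(a)
    offsets_d : (A, J)    predicted depth offsets O^d_j(a)
    Returns a (J, 3) tensor of estimated joint positions (x, y, d).
    """
    weights = torch.softmax(responses, dim=0)                         # (6)
    xy = (weights.unsqueeze(-1)
          * (anchor_xy.unsqueeze(1) + offsets_xy)).sum(dim=0)         # (7)
    d = (weights * (anchor_d.unsqueeze(1) + offsets_d)).sum(dim=0)    # (8)
    return torch.cat([xy, d.unsqueeze(-1)], dim=-1)
```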

The regression loss function for the in-plane and depth positions is as follows:

$$ \begin{array}{ll} loss_{1}=& \alpha {{\varSigma}}_{j\in J}L_{\tau_{1}}\left ({{\varSigma}}_{a\in A}\widetilde{P_{j}}\left (a \right )\left (S^{i}\left (a \right )+{O_{j}^{i}}\left (a \right ) \right )-{T_{j}^{i}} \right )\\ &+{{\varSigma}}_{j\in J}L_{\tau_{2}}\left ({{\varSigma}}_{a\in A}\widetilde{P_{j}}\left (a \right )\left (S^{d}\left (a \right )+{O_{j}^{d}}\left (a \right ) \right )-{T_{j}^{d}} \right ) \end{array} $$
(9)

where α is set to 0.5 according to [42]. In contrast to A2J and AA-A2J, both the in-plane position and the depth position contribute to the informative anchor point surrounding loss in AA-3DA2J and 3DA2J. The informative anchor point surrounding loss is defined as:

$$ \begin{array}{ll} loss_{2}=& {{\varSigma}}_{j\in J}L_{\tau_{1}}\left ({{\varSigma}}_{a\in A} \widetilde{P_{j}}\left (a \right )S^{i}\left (a \right )-{T_{j}^{i}} \right )\\ &+{{\varSigma}}_{j\in J}L_{\tau_{1}}\left ({{\varSigma}}_{a\in A} \widetilde{P_{j}}\left (a \right )S^{d}\left (a \right )-{T_{j}^{d}} \right ) \end{array} $$
(10)

where \({T_{j}^{i}}\) and \({T_{j}^{d}}\) are the ground-truth in-plane position and ground-truth depth position of joint j respectively. \(L_{\tau}\) is the smooth L1-like loss and is defined as follows [42]:

$$ L_{\tau}\left (x \right )= \left\{\begin{array}{ll} \frac{1}{2\tau}x^{2}, & for \left | x \right |< \tau\\ \left | x \right |-\frac{\tau}{2}, & otherwise \end{array}\right. $$
(11)

where \(\tau_{1}\) is set to 1 and \(\tau_{2}\) is set to 3 as in [42]. The two loss functions are combined in end-to-end training as follows:

$$ loss = \lambda loss_{1}+loss_{2} $$
(12)

where λ is set to 3 following [42].
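As a hedged illustration of (9)–(12), the sketch below builds the combined objective from PyTorch's smooth_l1_loss, whose beta argument corresponds to τ in (11); it averages rather than sums over joints, so it differs from the equations by a constant factor, and all argument names are placeholders.

```python
import torch.nn.functional as F

def a2j_loss(pred_xy, pred_d, surround_xy, surround_d, gt_xy, gt_d,
             alpha=0.5, lam=3.0, tau1=1.0, tau2=3.0):
    """Combined training objective of (9)-(12).

    pred_xy, pred_d         : aggregated joint estimates from (7) and (8)
    surround_xy, surround_d : weighted anchor positions without offsets, used
                              by the informative anchor point surrounding loss
    gt_xy, gt_d             : ground-truth in-plane and depth positions
    """
    # (9): joint position estimation loss (tau1 for in-plane, tau2 for depth).
    loss1 = (alpha * F.smooth_l1_loss(pred_xy, gt_xy, beta=tau1)
             + F.smooth_l1_loss(pred_d, gt_d, beta=tau2))
    # (10): informative anchor point surrounding loss (tau1 for both terms).
    loss2 = (F.smooth_l1_loss(surround_xy, gt_xy, beta=tau1)
             + F.smooth_l1_loss(surround_d, gt_d, beta=tau1))
    # (12): weighted combination used for end-to-end training.
    return lam * loss1 + loss2
```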

4 Experiments and results

Center points are used to crop the hand region from the depth image, following the approach of other works [26, 42]. The cropped image is resized to 176 × 176 and, after data augmentation, passed as input to the ResNet-50 backbone of the proposed networks. The networks are trained end-to-end under the supervision of two loss functions: the joint position estimation loss and the informative anchor point surrounding loss.
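As an illustration of this preprocessing step, the sketch below crops a square window around a given 2D center point and resizes it to the 176 × 176 network input; the window size and normalization constants are placeholders and do not reproduce the exact cropping procedure of [26, 42].

```python
import numpy as np
import cv2

def crop_hand(depth, center_uv, crop_size=176, window=220,
              depth_mean=0.0, depth_std=1.0):
    """Crop a square window around the hand center point and resize it to the
    network input size. `window`, `depth_mean` and `depth_std` are placeholders."""
    u, v = int(center_uv[0]), int(center_uv[1])
    half = window // 2
    # Clamp the crop window to the image boundaries.
    u0, u1 = max(0, u - half), min(depth.shape[1], u + half)
    v0, v1 = max(0, v - half), min(depth.shape[0], v + half)
    patch = depth[v0:v1, u0:u1].astype(np.float32)
    patch = cv2.resize(patch, (crop_size, crop_size),
                       interpolation=cv2.INTER_NEAREST)
    # Normalize the depth values before feeding the crop to the backbone.
    return (patch - depth_mean) / depth_std
```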

4.1 Datasets

Experiments are conducted on four public hand pose datasets: NYU dataset, ICVL dataset, MSRA dataset and HANDS 2017 dataset.

NYU Dataset [37]

The NYU dataset consists of 72K training and 8.2K testing depth images. In each image, 21 joints are annotated. The dataset has a diverse range of hand poses. In line with previous works [5, 15, 26, 42], 14 joints are used during training and testing.

ICVL Dataset [35]

The ICVL dataset has 330K training depth images with in-plane rotation augmented frames. There are 6.5K testing depth images and 16 joints are annotated.

MSRA Dataset [34]

The MSRA dataset consists of 76.5K images from nine different subjects and 21 joints are annotated. The leave-one-subject-out cross validation method is used for evaluation and the results are averaged over the nine subjects.

HANDS 2017 Dataset [45]

The dataset consists of 957K training and 295K testing depth images sampled from the BigHand2.2M dataset [46] and the First-Person Hand Action dataset (FHAD) [9]. It is the largest available hand pose dataset and provides annotations for 21 hand joints. There are five subjects in the training set and ten subjects in the test set, five of whom are seen in the training set.

4.2 Evaluation metrics

We evaluate the performance of our approaches with two standard metrics.

Average 3D distance error

This is the average Euclidean distance between the predicted 3D joint coordinates and the ground truth.

Percentage of successful frames

This metric measures the fraction of test samples that have all predicted joints below a given maximum Euclidean distance from the ground truth.
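Both metrics reduce to a few lines of NumPy; the sketch below assumes predicted and ground-truth joints of shape (N, J, 3) in millimetres and is purely illustrative.

```python
import numpy as np

def evaluate(pred, gt, threshold_mm=20.0):
    """pred, gt: arrays of shape (N, J, 3) holding 3D joint coordinates in mm."""
    dist = np.linalg.norm(pred - gt, axis=-1)      # (N, J) per-joint errors
    mean_error = dist.mean()                       # average 3D distance error
    # A frame counts as successful only if all of its joints are within the threshold.
    success_rate = (dist.max(axis=1) < threshold_mm).mean()
    return mean_error, success_rate
```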

4.3 Implementation

The networks are implemented in PyTorch. Data augmentation is performed according to Xiong et al. [42], including rotation, scaling and the addition of random Gaussian noise to depth values. The images in the NYU, ICVL and HANDS 2017 datasets are normalized using the mean and standard deviation values provided in the A2J GitHub repository at https://github.com/zhangboshen/A2J. A2J is not trained on the MSRA dataset, so we normalize images in this dataset separately for each subject by computing the mean and standard deviation of images from the same subject. Weights are updated by the Adam optimizer and the learning rate is set to 0.00035 with a weight decay of 0.0001 for all datasets. A batch size of 16 is used for the NYU dataset, with the learning rate decreased by a factor of 0.2 every 7 epochs for 35 epochs. A batch size of 64 is used and the learning rate is decreased by a factor of 0.2 every 5 epochs for the ICVL, MSRA and HANDS 2017 datasets. The networks are trained for 10, 50 and 16 epochs on the ICVL, MSRA and HANDS 2017 datasets respectively. All networks are trained and validated on a Tesla V100 GPU.
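The optimizer and learning rate schedule described above translate directly into PyTorch; the sketch below shows the NYU configuration and assumes a `model` that returns the combined loss of (12) and a `train_loader` defined elsewhere, both of which are placeholders.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=1e-4)
# NYU schedule: decay the learning rate by a factor of 0.2 every 7 epochs for 35 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.2)

for epoch in range(35):
    for depth, gt_xy, gt_d in train_loader:   # batch size 16 for the NYU dataset
        optimizer.zero_grad()
        loss = model(depth, gt_xy, gt_d)      # assumed to return the loss in (12)
        loss.backward()
        optimizer.step()
    scheduler.step()
```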

4.4 Comparison with state-of-the-art methods

We compare AA-A2J and AA-3DA2J with state-of-the-art 3D hand pose estimation methods [3, 5, 6, 7, 8, 10, 12, 15, 20, 24, 26, 27, 30, 32, 34, 35, 39, 40, 42, 43, 44, 49] on the four public datasets. Figure 3 shows the performances of various methods on the NYU, ICVL and MSRA datasets.

Fig. 3
figure 3

Evaluation on hand pose datasets. Left: 3D distance errors per hand joint. Right: Percentage of successful frames over different 3D distance error thresholds

On the NYU dataset, both approaches achieve better performance than the baseline A2J. Moreover, AA-3DA2J achieves a mean 3D distance error of 8.37 mm, lower than all other methods except SRN [32] (Table 1). The approaches also produce higher percentages of frames with a mean error under 20 mm compared to other methods (Fig. 3). On the ICVL dataset, AA-A2J obtains superior performance to all other methods except V2V [26]. Similarly, AA-3DA2J achieves better performance than all other methods except P2P [12], AA-A2J and V2V [26] (Table 2). On the MSRA dataset, AA-A2J and AA-3DA2J obtain mean errors of 8.08 mm and 8.16 mm respectively, comparable to other methods (Table 3). Both approaches outperform all other methods on the HANDS 2017 dataset, with AA-3DA2J achieving a mean 3D distance error of 8.13 mm (Table 4).

Table 1 Performance of different methods on NYU dataset
Table 2 Performance of different methods on ICVL dataset
Table 3 Performance of different methods on MSRA dataset
Table 4 Performance of different methods on HANDS 2017 dataset

The runtime speeds of different methods are shown in Table 5. AA-A2J is found to have a faster runtime speed than all other methods except SRN [32] while AA-3DA2J has a slower runtime speed compared to AA-A2J, SRN [32], CrossInfoNet [20] and CrossingNets [39].

Table 5 Runtime of different methods

4.5 Ablation study

To ascertain the individual contributions of self-attention and 3D anchor points to performance, we perform experiments on the NYU dataset, a challenging dataset with a diversity of hand poses.

Self-attention

AA-A2J, which utilizes self-attention, demonstrates better performance than the baseline A2J, as shown in Table 6. Compared to 3DA2J, which utilizes 3D anchor points but not self-attention, AA-3DA2J, which incorporates both self-attention and 3D anchor points, achieves a lower mean error. These results show that self-attention improves the performance of 3D hand pose estimation.

Table 6 Effect of self-attention and 3D anchor points on performance on NYU dataset

3D anchor points

A2J and 3DA2J have similar performances, as shown in Tables 1 and 6, which suggests that, in the absence of self-attention, 3D anchor points offer a negligible advantage over anchor points set on the depth image plane. In contrast, AA-3DA2J produces superior performance to AA-A2J (Table 6), indicating that 3D anchor points are beneficial when combined with self-attention.

4.6 Runtime analysis

AA-A2J and AA-3DA2J have runtime speeds of 151.06 fps and 79.62 fps respectively on a Tesla V100 GPU, whereas A2J has a higher runtime speed of 164.44 fps on the same GPU (Table 7).

Table 7 Number of trainable parameters in different methods

Incorporating self-attention into the network leads to a small increase in the number of trainable parameters and decreases runtime speed marginally (Table 7). For instance, AA-A2J has 1.05 times as many trainable parameters as A2J and a slightly slower runtime speed.

Using 3D anchor points increases the number of trainable parameters substantially, which in turn reduces runtime speed (Table 7). 3DA2J has 3.68 times as many trainable parameters as A2J and its runtime speed is 0.50 times that of A2J. Similarly, AA-3DA2J has 3.56 times as many trainable parameters as AA-A2J and its runtime speed is 0.53 times that of AA-A2J.

5 Real-time 3D hand pose estimation

Real-time hand pose estimation is useful for assessing the degree of hand impairment for rehabilitation purposes. A real-time 3D hand pose estimation system is implemented using two depth cameras, an Intel RealSense SR300 and an Intel RealSense D415 (Fig. 4). Owing to its faster inference time compared to AA-3DA2J, AA-A2J is used in the system for 3D hand pose estimation. Depth images from both cameras are retrieved and processed simultaneously and passed into AA-A2J, which has been trained on the HANDS 2017 dataset, to estimate the joint locations and the range of motion in terms of flexion. Predicted angles from both cameras are averaged to improve accuracy. Figure 5 shows the predicted joint locations in the real-time system. The HANDS 2017 dataset annotates the center of the wrist (W) and the metacarpophalangeal (MCP), proximal interphalangeal (PIP), distal interphalangeal (DIP) and tip (TIP) joints of each digit. These annotations are used to compute the flexion hand angles as follows:

$$ \widetilde{MCP_{x}}= \arccos\left (\overrightarrow{MCP_{x}-W} \cdot \overrightarrow{PIP_{x}-MCP_{x}} \right ) $$
(13)
$$ \widetilde{PIP_{x}}= \arccos\left (\overrightarrow{PIP_{x}-MCP_{x}} \cdot \overrightarrow{DIP_{x}-PIP_{x}} \right ) $$
(14)
$$ \widetilde{DIP_{x}}= \arccos\left (\overrightarrow{DIP_{x}-PIP_{x}} \cdot \overrightarrow{TIP_{x}-DIP_{x}} \right ) $$
(15)

where, for each digit x, \(\widetilde{MCP_{x}}\) is the angle between W, \(MCP_{x}\) and \(PIP_{x}\); \(\widetilde{PIP_{x}}\) is the angle between \(MCP_{x}\), \(PIP_{x}\) and \(DIP_{x}\); and \(\widetilde{DIP_{x}}\) is the angle between \(PIP_{x}\), \(DIP_{x}\) and \(TIP_{x}\).
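A short sketch of the angle computation in (13)–(15) is given below; the bone vectors are normalized before the dot product so that the arccosine yields an angle, and the function and variable names are illustrative.

```python
import numpy as np

def flexion_angles(w, mcp, pip, dip, tip):
    """Compute the MCP, PIP and DIP flexion angles (in degrees) of one digit
    from its 3D joint positions, following (13)-(15)."""
    def angle(a, b):
        # Angle between two bone vectors via the normalized dot product.
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    mcp_angle = angle(mcp - w, pip - mcp)      # (13)
    pip_angle = angle(pip - mcp, dip - pip)    # (14)
    dip_angle = angle(dip - pip, tip - dip)    # (15)
    return mcp_angle, pip_angle, dip_angle
```

In the prototype, the angles obtained from the two cameras are then averaged as described above.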

Fig. 4
figure 4

Setup of depth cameras Intel RealSense D415 and Intel RealSense SR300

Fig. 5
figure 5

Real-time hand pose estimation

The processing speed of the system is 12.2 fps on an NVIDIA RTX 2070 Super GPU.

6 Conclusion

In this work, two networks are proposed to recover 3D hand poses from a single depth image. The first network, AA-A2J, uses a self-attention mechanism for 3D hand pose estimation, whereas the second network, AA-3DA2J, utilizes 3D anchor points in addition to self-attention. Both approaches obtain performance comparable to other state-of-the-art methods and superior to the baseline A2J regression network, and AA-A2J retains a runtime speed similar to that of A2J. Modelling self-attention helps capture spatial context information from depth images and is beneficial for 3D hand pose estimation, as demonstrated by the performance of AA-A2J on the four hand pose datasets. The use of both self-attention and 3D anchor points in AA-3DA2J further boosts performance over AA-A2J on the NYU and HANDS 2017 datasets, although this comes at the expense of runtime speed.

There are several limitations in our work. First, the approach that uses 3D anchor points without self-attention, 3DA2J, is evaluated on the NYU dataset but not on the other hand pose datasets. In future work, this approach should be evaluated on the other datasets to determine whether extending anchor points to the depth dimension has only a marginal effect on performance across all datasets in the absence of self-attention. Second, a relatively small amount of data is used to train and evaluate the proposed approaches. Future work includes evaluation on a larger dataset comprising normal subjects and subjects with hand impairment due to stroke. This would enable further studies of the robustness of the proposed approaches in 3D hand pose estimation for stroke rehabilitation.