1 Introduction

Human activity recognition (HAR), or action recognition, is a challenging but popular field of research in signal and image processing. HAR covers the automatic detection, recognition and analysis of human actions from data acquired by different sensor types such as range sensors, RGB cameras, depth sensors or inertial sensors [1]. In recent years, the widespread use of mobile devices [2] has made wearable sensor-based HAR a new research area in artificial intelligence and pattern recognition [3]. The purpose of action recognition or analysis is to determine which action appears in the data. Sensors such as accelerometers, gyroscopes and magnetometers [4] built into mobile devices can generate time series data for HAR. Over the last few years, research on HAR has gained considerable popularity and is becoming increasingly vital in various disciplines. Information extraction using artificial intelligence is very important for HAR applications [1]. HAR has been successfully applied in many areas such as sports training, remote health monitoring, health self-management, military practice, gaming, home behavior analysis, gait analysis and gesture recognition [5, 6]. Various sensing methods have been presented to monitor people and their activities. HAR approaches can generally be divided into two main categories, depending on the type of sensors used: vision-based HAR and inertial sensor-based HAR. Vision-based HAR has developed rapidly in recent years. With cameras, activities are classified by monitoring and recognizing subjects in the recorded scene [7]. There are vision-based studies that use a Kinect camera to analyze indoor activities [8]. With the development of deep learning techniques, convolutional neural networks have been used in vision-based HAR studies and successful results have been obtained [9].
The major disadvantage of vision-based approaches is that good results cannot be achieved in dark environments, and cameras are not suitable for installation in areas where personal privacy must be protected. In addition, because a camera is installed in a fixed location, only the activities performed in the monitored area can be recognized [10]. With the development of electronic systems, small and light inertial sensing devices with low power consumption are widely used in various digital products such as mobile phones and computers. Therefore, sensor-based HAR systems are used in most of today’s studies [6, 11,12,13]. Sensor-based systems can be divided into single sensor-based systems and body sensor network-based systems. Body sensor network-based systems have been used for crowd analysis, gait analysis or training in sports [14, 15]. Even though a body sensor network-based activity recognition system improves generalization performance, it is impractical to use for a long time in real life: it is very difficult to keep the sensors in place on the body for a long time and to protect them against external influences. In addition, the high cost is another disadvantage. Therefore, single sensor-based systems have been widely studied. For fall detection, researchers used a single accelerometer and examined four positions on the human body for placing it [16]. In another study, a single accelerometer was placed on the wrist and the performance of a template matching method for activity recognition was analyzed [17]. In another study, the researchers compared the ability of the accelerometer and gyroscope of a waist-mounted smartphone to recognize human physical activity [18].

Deep learning methods are a prominent branch of machine learning. For big datasets, deep learning methods have achieved high success rates. Therefore, they have been used in many areas, and HAR is one of them. However, millions of parameters must be set in deep learning methods, and they need high-cost hardware, for instance a graphics processing unit or a tensor processing unit. One of the most important problems of deep networks is weight assignment. Deep learning methods need a big dataset to assign weights correctly, and this process has a long execution time. To overcome this problem, pre-trained networks have been used, and these networks have generally been trained on the ImageNet dataset. In the pre-trained networks, the final weights calculated during that training are reused. Therefore, high classification rates can be achieved in a short execution time by using a pre-trained network. The main motivation of this study is to propose an ensemble feature extractor by using pre-trained deep networks. HAR is a signal processing-based research area, but we converted all signals to images and used pre-trained CNNs. As we know from the literature, CNNs are very effective methods for computer vision. Our aims are to exploit this effectiveness of CNNs for HAR and to propose a novel hybrid feature extractor by using pre-trained deep networks. A brief explanation of this method is given below.

In this method, a novel ensemble ResNet is proposed. This method uses ResNet18, ResNet50 and ResNet101 together. Features are extracted from the fully connected 1000 (FC1000) layer of each network, and these features are concatenated. Then, the 1000 most distinctive features are selected. The selected feature set is forwarded to a cubic SVM classifier. The main motivation of the proposed study is to obtain high classification accuracy from sensor signals by using ResNets as feature extractors and to evaluate their performance. Therefore, we defined two cases, whose aims are to classify daily sport activities and genders. These cases show the high classification capability of the proposed ensemble ResNet-based sensor signal classification method. ResNets have high classification capabilities. We selected ResNet18, ResNet50 and ResNet101 because they are relatively lightweight networks (they have 18, 50 and 101 layers, respectively) that combine a small number of layers with high classification ability. Therefore, an ensemble feature extractor is presented by using these three ResNets. The proposed ensemble feature extractor was tested on signal datasets. The key contributions of this research work are as follows:

  • ResNets [19] have been used for computer vision and image classification. In signal classification, recurrent neural networks have been widely used. The main problem of recurrent neural networks is parameter setting. CNNs are very effective methods for computer vision, and no parameter selection is needed when pre-trained CNNs are used. Therefore, we used pre-trained CNNs (ResNet18, ResNet50 and ResNet101) as feature extractors, and the effectiveness of this feature extractor is shown by using two cases.

  • In CNN-based signal classification methods, spectrograms of the signals are generally utilized as the input of the CNNs. In this study, a vector-to-matrix transformation is used instead, and signals are converted to images by this transformation.

  • To demonstrate the success of the proposed ensemble deep feature extractor, two cases were defined: gender recognition and daily sport activity recognition. Both are signal classification problems. The proposed ensemble ResNet-based feature extractor achieved a 99.96% recognition rate for gender recognition and a 99.61% recognition rate for the recognition of 19 daily sport activities. These results demonstrate the success of the proposed method.

  • The effectiveness of the proposed ensemble deep feature extractor is shown by using conventional classifiers.

2 Residual network

Nowadays, information extraction from big data has become a very important research area. Therefore, many artificial intelligence and machine learning methods have been proposed in the literature. Deep learning is one of the machine learning methods developed for this purpose [20, 21]. Deep learning has become the most popular branch of artificial intelligence and machine learning. Deep learning allows us to train a system to predict meaningful outputs from a large dataset. Deep learning has been widely used in many research areas such as defense and security, medical research and industrial systems [20, 22,23,24].

In deep learning, different architectures have been proposed, such as the convolutional neural network (CNN), recurrent neural network (RNN) and restricted Boltzmann machine (RBM) [25,26,27,28,29]. The basic logic of these architectures comes from the idea of creating structures that mimic human beings. The most widely used of these architectures is the CNN, which combines ideas from biology and computer science. Like other architectures, the CNN is based on neural networks. When defining an object, a CNN tries to obtain the properties that make it unique. Curves and edges are detected first, and more abstract concepts are then built from them. A CNN uses convolutional, nonlinearity, pooling, flattening and fully connected layers to obtain features from an image. Commonly used CNN architectures are LeNet [30], ResNet [19], GoogLeNet [31], AlexNet [32], Visual Geometry Group Network (VGGNet) [33] and DenseNet. ResNet [19] has been widely used in image and signal processing. The main aim of ResNet is to solve the vanishing gradient problem, which makes networks with many layers hard to train. To this end, ResNets add shortcut connections so that each block outputs \(F\left( x \right) + x\), where \(x\) is the block input and \(F\left( x \right)\) is the mapping learned by the stacked layers of the block. The block diagram of ResNets is shown in Fig. 1.

Fig. 1 Residual networks schematic explanation
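The identity shortcut behind \(F\left( x \right) + x\) can be illustrated with a minimal sketch. The toy residual function below (element-wise scaling followed by ReLU) is a hypothetical stand-in for the stacked convolutional layers of a real ResNet block; only the addition of the block input to the block output reflects the description above.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, where F is the mapping learned inside the block."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def toy_residual_fn(x: Vector) -> Vector:
    """Hypothetical stand-in for the convolutional layers of a block."""
    return relu([0.5 * xi for xi in x])


if __name__ == "__main__":
    x = [1.0, -2.0, 4.0]
    print(residual_block(x, toy_residual_fn))
    # If F outputs zeros, the block reduces to the identity mapping.
    print(residual_block(x, lambda v: [0.0] * len(v)))
```

Because the input is added back unchanged, a block whose learned mapping is close to zero simply passes its input through, which is one common explanation of why gradients can still flow through many stacked blocks.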

The widely used ResNets are ResNet18, ResNet50, ResNet101 and ResNet152, and new ResNet-based deep learning methods are still being presented. The number in each name gives the depth of the network; for instance, ResNet18 has 18 layers. High success rates have been achieved with ResNet-based deep learning methods. ResNet18, ResNet50, ResNet101 and ResNet152 have approximately 11 M, 25.6 M, 44.5 M and 60.2 M parameters, respectively.

In this study, we used ResNet18, ResNet50 and ResNet101 together to build the proposed ensemble ResNet.

3 Material

In this study, the daily and sports activities dataset was used. Data were collected from four women and four men between the ages of 20 and 30, each performing 19 different activities for five minutes [6]. The data were obtained from units on the torso, right arm, left arm, right leg and left leg with the help of nine sensors in each unit, and comprise 60 segments per activity for each of the eight subjects. In the data, the speed and amplitude of some activities differ because the subjects performed the activities in their own way. Data were acquired from the sensors at a 25 Hz sampling frequency, and the five-minute signals were divided into five-second segments. The 19 activity types in the data are given in Fig. 2.

Fig. 2 The defined daily activities in the UCI HAR dataset

All the above data were obtained with Xsens MTx sensors. This sensor was developed for measuring the orientation of parts of the human body [34]. The sensor used is shown in Fig. 3. The data from the triaxial accelerometer, gyroscope and magnetometer within each sensor unit are recorded with the MT Manager interface.

Fig. 3 Xsens MTx sensor and coordinates

The sensors are placed on the human body at five different locations, as shown in Fig. 4. The sensors, attached around the knees, chest and wrists with duct tape, are connected by cables to a device called the Xbus Master worn on the belt. The Xbus Master sends the data over a Bluetooth™ connection to a receiver, which is connected to the computer via USB. The data, collected in accordance with ethics committee approval, are available in the UCI machine learning repository [35]. This dataset consists of 9120 observations.

Fig. 4 Connection type of the sensors on the body
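The dataset dimensions quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes nine sensor channels in each of the five units (three axes each for the accelerometer, gyroscope and magnetometer), which is consistent with the 45 columns per text file reported in Sect. 4.1; the constant names are illustrative only.

```python
SAMPLING_RATE_HZ = 25
SEGMENT_SECONDS = 5
RECORDING_MINUTES = 5
NUM_UNITS = 5            # torso, right arm, left arm, right leg, left leg
SENSORS_PER_UNIT = 9     # assumption: 3 axes x (accelerometer, gyroscope, magnetometer)
NUM_ACTIVITIES = 19
NUM_SUBJECTS = 8

# Samples (rows) in one five-second segment.
rows_per_segment = SAMPLING_RATE_HZ * SEGMENT_SECONDS
# Sensor channels (columns) in one segment.
columns_per_segment = NUM_UNITS * SENSORS_PER_UNIT
# Five-second segments in one five-minute recording.
segments_per_recording = RECORDING_MINUTES * 60 // SEGMENT_SECONDS
# One observation per activity, subject and segment.
total_observations = NUM_ACTIVITIES * NUM_SUBJECTS * segments_per_recording

if __name__ == "__main__":
    print(rows_per_segment, columns_per_segment)       # 125 45
    print(segments_per_recording, total_observations)  # 60 9120
```

Under these assumptions, the figures agree with the 60 segments per recording and the 9120 observations stated in the text.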

4 Ensemble ResNet-based signal recognition method

In this study, a novel ensemble signal recognition method is presented. We propose a simple and effective learning method for signal classification. The main aim of this method is to achieve high classification accuracy on both big and small datasets. We used pre-trained networks, so no training phase was needed for the networks. We used the weights of the ResNets that were obtained by training on ImageNet. The proposed ensemble ResNet method consists of signal-to-matrix conversion (preprocessing), feature extraction using the ensemble ResNet, feature selection by ReliefF [36] and classification phases. The graphical outline of the proposed method is shown in Fig. 5.

Fig. 5 Block diagram of the proposed ensemble ResNet-based signal recognition method

Brief explanation of the proposed ensemble ResNet is given in Algorithm 1.

figure a

4.1 Preprocessing

In this study, sensor signals are utilized as input. However, we used pre-trained deep convolutional networks as feature extractors. Therefore, the 1D sensor signals must either be converted to a 2D matrix or be represented by their spectrograms. Vector-to-matrix conversion is chosen as the preprocessing method in this study. Pseudo-code of the vector-to-matrix transformation is shown in Algorithm 2.

figure b

As seen from Algorithm 2, the 1D raw signal is transformed to an image of size 125 × 45. The dataset consists of text files, and each text file has 125 rows and 45 columns. Therefore, we selected 125 × 45 as the size of the matrix. The mathematical explanation of this phase is given below.

$$\text{Im} = vec2mat\left( {S, \left[ {125 \times 45} \right]} \right)$$
(1)
$$\text{Im} = {\text{round}}\left( {\frac{{\text{Im} - \text{Im}_{\text{min} } }}{{\text{Im}_{\text{max} } - \text{Im}_{\text{min} } }} \times 255} \right)$$
(2)

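where \(vec2mat\left( {.,.} \right)\) is the vector-to-matrix transformation, \(S\) is the 1D sensor signal, \(\text{Im}_{\text{min}}\) and \(\text{Im}_{\text{max}}\) are the minimum and maximum values of \(\text{Im}\), and Eq. 2 defines min–max normalization. We used min–max normalization to compute an 8-bit gray-scale image from the sensor signals.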
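Equations 1 and 2 can be sketched in a few lines of plain Python. This is an illustrative reading of the preprocessing step rather than the authors' implementation: the row-by-row fill order and the handling of a constant signal are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def vec2mat(signal: Sequence[float], rows: int, cols: int) -> Matrix:
    """Reshape a 1D signal into a rows x cols matrix, filling row by row (Eq. 1)."""
    if len(signal) != rows * cols:
        raise ValueError("signal length must equal rows * cols")
    return [list(signal[r * cols:(r + 1) * cols]) for r in range(rows)]


def to_gray_image(matrix: Matrix) -> List[List[int]]:
    """Min-max normalize a matrix to integers in [0, 255] (Eq. 2)."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    if hi == lo:
        # Constant signal: the ratio in Eq. 2 is undefined, so map everything to 0.
        return [[0 for _ in row] for row in matrix]
    return [[round((v - lo) / (hi - lo) * 255) for v in row] for row in matrix]


if __name__ == "__main__":
    # Synthetic ramp signal with the same length as one 125 x 45 segment.
    signal = [float(i) for i in range(125 * 45)]
    image = to_gray_image(vec2mat(signal, 125, 45))
    print(len(image), len(image[0]))    # 125 45
    print(image[0][0], image[-1][-1])   # 0 255
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ slightly from the rounding used in other environments. The transformation visits every sample once, so its cost grows linearly with the signal length.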

4.2 Feature extraction by using the proposed ensemble ResNet

Deep learning methods have been widely used in the literature. In particular, convolutional neural networks (CNNs) are a very active topic in artificial intelligence. CNNs are utilized both as learning methods and as feature extractors. One of the most widely used CNNs is ResNet. ResNet has many variations, and the widely used ones are ResNet18, ResNet50 and ResNet101.

In this study, pre-trained ResNets (ResNet18, ResNet50, ResNet101) are used. These networks were pre-trained on ImageNet. They have 18 (72 sublayers), 50 (177 sublayers) and 101 (347 sublayers) layers, respectively. These networks are utilized as feature extractors in this study. Therefore, the softmax and classification layers of these networks are not used. All of these networks have an FC1000 layer (a fully connected layer with 1000 outputs). The graphical explanation of the proposed ensemble ResNet is shown in Fig. 6.

Fig. 6 Graphical explanation of the proposed ensemble ResNet

As seen from Fig. 6, 1000 features are extracted from each network. Then, these features are concatenated and 3000 final features are obtained. The mathematical explanation of this section is:

$$F_{1} = {\text{ResNet}}18\left( {\text{Im} } \right)$$
(3)
$$F_{2} = {\text{ResNet}}50\left( {\text{Im}} \right)$$
(4)
$$F_{3} = {\text{ResNet}}101\left( {\text{Im} } \right)$$
(5)

where \({\text{ResNet}}18\), \({\text{ResNet}}50\) and \({\text{ResNet}}101\) are the feature extraction functions of the proposed method. We used the FC1000 layer of each network, so each feature extraction function generates 1000 features. \(F_{1}\), \(F_{2}\) and \(F_{3}\) are the feature vectors generated by \({\text{ResNet}}18\), \({\text{ResNet}}50\) and \({\text{ResNet}}101\), respectively. These features are concatenated using Eq. 6.

$${\text{feature}} = F_{1} \left| {F_{2} } \right|F_{3}$$
(6)

where feature is the final feature vector with a size of 3000 and \(|\) is the concatenation operator.
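The concatenation in Eq. 6 can be sketched as follows. The three extractors here are hypothetical placeholders that return deterministic dummy values; in the actual method each one would return the FC1000 activations of a pre-trained ResNet, which this sketch does not attempt to reproduce.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]
Extractor = Callable[[Image], List[float]]


def make_placeholder_extractor(seed: int, n_features: int = 1000) -> Extractor:
    """Return a hypothetical extractor standing in for a pre-trained network."""
    def extract(image: Image) -> List[float]:
        total = sum(sum(row) for row in image)
        # Dummy features only; a real extractor would return FC1000 activations.
        return [float((total + seed * (i + 1)) % 97) for i in range(n_features)]
    return extract


def ensemble_features(image: Image, extractors: Sequence[Extractor]) -> List[float]:
    """Concatenate the feature vectors of all extractors in order (Eq. 6)."""
    feature: List[float] = []
    for extract in extractors:
        feature.extend(extract(image))
    return feature


if __name__ == "__main__":
    image = [[0] * 45 for _ in range(125)]
    extractors = [make_placeholder_extractor(s) for s in (18, 50, 101)]
    feature = ensemble_features(image, extractors)
    print(len(feature))  # 3000
```

Keeping the extractors in a fixed order matters, because the feature indices chosen later by the selector refer to positions in this concatenated vector.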

4.3 Feature selection

In this section, the 3000 features obtained are used as input, and the 2000 redundant features are eliminated by using ReliefF. ReliefF is one of the most widely used feature selectors in the literature. It uses distance-based feature weighting and generates a weight for every feature. The original Relief method uses the Euclidean distance, whereas ReliefF uses the Manhattan distance. ReliefF generates both negative and positive weights. Large weights indicate distinctive features, and small weights indicate redundant features [36, 37]. Feature selection is performed by using the generated weights. The weight generation process of ReliefF is given in Eqs. 7–9.

$$W\left( {ft_{i} } \right) = W\left( {ft_{i} } \right) - \frac{{\mathop \sum \nolimits_{j = 1}^{k} {\text{dist}}\left( {A,R,H} \right)}}{n*k} + \frac{{\mathop \sum \nolimits_{C \ne class\left( R \right)}^{{}} \left[ {\frac{P\left( C \right)}{1 - P\left( R \right)}*\mathop \sum \nolimits_{l = 1}^{k} {\text{dist}}\left( {A,R,M} \right)} \right]}}{n*k}$$
(7)
$${\text{dist}}\left( {A,L_{1} ,L_{2} } \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {L_{1} = L_{2} } \hfill \\ {1,} \hfill & {L_{1} \ne L_{2} } \hfill \\ \end{array} } \right.$$
(8)
$${\text{dist}}\left( {A,L_{1} ,L_{2} } \right) = \frac{{\left| {L_{1} - L_{2} } \right|}}{{A_{\text{max} } - A_{\text{min} } }}$$
(9)

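Equations 7–9 mathematically define the weight generation process of ReliefF, where \(W\left( {ft_{i} } \right)\) is the weight of the ith feature, \(A\) is the feature under consideration, \(k\) is the number of nearest neighbors taken from each class, \(R\) is the instance selected in the current cycle, dist defines the distance, \(H\) represents a nearest hit (a nearest neighbor from the same class as \(R\)), \(M\) represents a nearest miss (a nearest neighbor from a different class \(C\)), \(n\) is the number of cycles and \(P\) is the prior probability of a class. Equation 8 applies to nominal features and Eq. 9 to numerical features.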
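To make the weight update concrete, the following is a simplified sketch of ReliefF for numerical features only (Eqs. 7 and 9). It is not the implementation used in the study: it assumes that every instance serves once as the reference \(R\) (so \(n\) equals the number of instances) instead of sampling, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def relieff_weights(X: Sequence[Sequence[float]], y: Sequence[int], k: int = 1) -> List[float]:
    """Simplified ReliefF weights for numerical features."""
    n, d = len(X), len(X[0])
    spans = []
    for a in range(d):
        column = [row[a] for row in X]
        span = max(column) - min(column)
        spans.append(span if span > 0 else 1.0)  # avoid division by zero

    def diff(a: int, p: Sequence[float], q: Sequence[float]) -> float:
        # Eq. 9: absolute difference scaled by the feature range.
        return abs(p[a] - q[a]) / spans[a]

    def distance(p: Sequence[float], q: Sequence[float]) -> float:
        # Manhattan distance over the range-scaled features.
        return sum(diff(a, p, q) for a in range(d))

    priors = {c: count / n for c, count in Counter(y).items()}
    weights = [0.0] * d
    for i in range(n):
        r, r_class = X[i], y[i]
        for c in priors:
            # Instances of class c, excluding the reference instance itself.
            candidates = [X[j] for j in range(n) if j != i and y[j] == c]
            if not candidates:
                continue  # e.g. R is the only member of its class
            neighbors = sorted(candidates, key=lambda q: distance(r, q))[:k]
            for a in range(d):
                total = sum(diff(a, r, q) for q in neighbors)
                if c == r_class:
                    weights[a] -= total / (n * k)          # nearest hits
                else:
                    scale = priors[c] / (1.0 - priors[r_class])
                    weights[a] += scale * total / (n * k)  # nearest misses
    return weights


if __name__ == "__main__":
    # Feature 0 separates the two classes; feature 1 does not.
    X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
    y = [0, 0, 1, 1]
    print(relieff_weights(X, y, k=1))
```

In this toy example the first feature receives a positive weight and the second a negative one, matching the interpretation that large weights mark distinctive features.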

After weight generation, the weights are sorted in descending order and the 1000 features with the largest weights are selected.

The steps of the ReliefF-based feature selection are given below.

Step 1: Calculate weights of the concatenated features by using ReliefF and target values.

$${\text{weight}} = {\text{ReliefF}}\left( {{\text{feature}},{\text{target}}} \right)$$
(10)

Step 2: Sort the generated weights in descending order.

$$\left[ {{\text{weight}}_{\text{sorted}} ,{\text{indices}}} \right] = {\text{sort}}\left( {\text{weight}} \right)$$
(11)

Step 3: Select the 1000 most discriminative features by using Algorithm 3.

figure c
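Steps 2 and 3 can be sketched as follows. Since the pseudo-code of Algorithm 3 is not reproduced in the text, this is only one plausible reading of the selection step; the function name and return values are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_top_features(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    n_selected: int = 1000,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the n_selected feature columns with the largest weights."""
    if any(len(row) != len(weights) for row in features):
        raise ValueError("each row needs exactly one value per weight")
    # Step 2: feature indices ordered by descending weight.
    indices = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    # Step 3: keep the first n_selected indices and gather those columns.
    kept = indices[:n_selected]
    selected = [[row[i] for i in kept] for row in features]
    return selected, kept


if __name__ == "__main__":
    weights = [0.1, -0.2, 0.7, 0.4]
    features = [[10.0, 11.0, 12.0, 13.0], [20.0, 21.0, 22.0, 23.0]]
    selected, kept = select_top_features(features, weights, n_selected=2)
    print(kept)      # [2, 3]
    print(selected)  # [[12.0, 13.0], [22.0, 23.0]]
```

Returning the kept indices alongside the reduced matrix makes it possible to report how many selected features come from each network, as is done in the discussion.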

4.4 Classification

Classification is the final phase of the proposed ensemble ResNet-based signal recognition method. To show the strength of the proposed ensemble ResNet-based feature extraction method, a conventional classifier, cubic SVM [38, 39], is used. SVM is an optimization-based classification method. It can use various kernels, for instance linear, quadratic, cubic and Gaussian, and these kernel functions allow SVM to be used in nonlinear classification tasks. Each \(n\)-dimensional input vector \(x_{i}\) \(\left( {i = 1,2,3, \ldots ,M} \right)\), in which \(M\) represents the number of samples, is mapped to an \(L\)-dimensional feature space \(\varPhi x = \left[ {\phi_{1}^{x} , \ldots , \phi_{L}^{x} } \right]\), and the kernel function \(K\left( {x_{i} , x_{j} } \right)\) computes the inner product of two mapped vectors in that space. Cubic SVM uses the third-degree polynomial kernel function. To obtain test results with the cubic SVM, stratified tenfold cross-validation is used. The attributes of the cubic SVM are as follows: the box constraint level is 1, and the multiclass method is one-vs-all. The cubic SVM captures high-dimensional relationships, and its kernel is given in Eq. 12.

$$\left( {a \times b + r} \right)^{3}$$
(12)

where a and b are the two observations whose similarity is calculated and r is the constant coefficient of the polynomial, a hyperparameter that can be chosen by cross-validation.
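As a small worked example of Eq. 12, the kernel value for two short vectors can be computed directly. The value \(r = 1\) used here is an assumption made for illustration, since the text does not report the coefficient, and the function name is hypothetical.

```python
from typing import Sequence


def polynomial_kernel(
    a: Sequence[float],
    b: Sequence[float],
    r: float = 1.0,
    degree: int = 3,
) -> float:
    """Return (a . b + r) ** degree; degree=3 gives the cubic kernel of Eq. 12."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    return (dot + r) ** degree


if __name__ == "__main__":
    a = [1.0, 2.0]
    b = [3.0, 0.5]
    # Dot product: 1*3 + 2*0.5 = 4; kernel: (4 + 1) ** 3 = 125.
    print(polynomial_kernel(a, b))  # 125.0
```

The kernel returns the same value that an inner product in the expanded cubic feature space would give, without ever constructing that space explicitly.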

5 Experimental results

To test the proposed method, we used the dataset explained in Sect. 3. By using this dataset, two cases are defined, and these are explained below.

Case 1

Daily sport activity recognition. In this case, 19 classes are defined.

Case 2

Gender recognition. In this case, two classes are defined.

To obtain numerical results for these cases, accuracy, F1-score and geometric mean are used. The mathematical definitions of these performance metrics are given below.

$${\text{Accuracy}} = \frac{{{\text{tp}} + {\text{tn}}}}{{{\text{tp}} + {\text{tn}} + {\text{fp}} + {\text{fn}}}}$$
(13)
$$F1{\text{-score}} = \frac{{2{\text{tp}}}}{{2{\text{tp}} + {\text{fp}} + {\text{fn}}}}$$
(14)
$${\text{Geometric}}\,{\text{mean}} = \sqrt {\frac{{{\text{tp}} \cdot {\text{tn}}}}{{\left( {{\text{tp}} + {\text{fn}}} \right) \cdot \left( {{\text{tn}} + {\text{fp}}} \right)}}}$$
(15)

where tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives, respectively.
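Equations 13–15 translate directly into code. The confusion counts in the usage example are hypothetical values chosen for illustration and are not results from this study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. 13: fraction of all observations that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Eq. 14: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def geometric_mean(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. 15: square root of sensitivity times specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Hypothetical confusion counts for a binary problem.
    tp, tn, fp, fn = 90, 85, 15, 10
    print(accuracy(tp, tn, fp, fn))                   # 0.875
    print(round(f1_score(tp, fp, fn), 4))             # 0.878
    print(round(geometric_mean(tp, tn, fp, fn), 4))   # 0.8746
```

For a multiclass problem such as Case 1, these quantities would typically be computed per class in a one-vs-rest manner and then averaged; whether that is the exact procedure used here is not stated in the text.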

The experiments with the proposed ensemble ResNet were repeated 1000 times to obtain comprehensive results. The results obtained for each case are given below.

As shown in Table 1, the maximum accuracy rates obtained are 99.61% and 99.96% for Case 1 and Case 2, respectively. The confusion matrices of the best results of Case 1 and Case 2 are shown in Figs. 7 and 8.

Table 1 The results of the proposed ensemble ResNet

Fig. 7 Confusion matrix of the best result of Case 1

Fig. 8 Confusion matrix of the best result of Case 2

As seen from Fig. 7, a 100.0% accuracy rate was achieved for 10 of the 19 classes (2nd, 3rd, 4th, 7th, 8th, 9th, 10th, 14th, 15th and 19th classes). The worst accuracy rate, 96.25%, was obtained for the 18th class (Jumping).

In Case 2, gender classification was performed. According to Fig. 8, the proposed ensemble ResNet achieved a 100.0% accuracy rate for male recognition and a 99.91% success rate for female recognition.

6 Discussion

In this study, sensor signals are classified using an ensemble CNN method (ResNet). As we know from the literature, ResNet is generally used for image recognition and classification. By using a basic vector-to-matrix transformation, the proposed ensemble ResNet is applied to sensor signals in two cases. These cases are defined to recognize daily sport activities and genders, and they demonstrate the success of the proposed ensemble ResNet. According to the results, the proposed method achieved high accuracy in both cases. As seen from Fig. 7, the proposed method achieved a 100.0% accuracy rate in 10 classes (2nd, 3rd, 4th, 7th, 8th, 9th, 10th, 14th, 15th, 19th). The worst accuracy rate, 96.25%, was obtained for the 18th class. In gender recognition, a 100.0% accuracy rate is achieved for male recognition, and the overall accuracy is approximately 99.96%. These results clearly demonstrate the success of the proposed ensemble ResNet. To show the effectiveness of the ensemble ResNet, the individual ResNets (ResNet18, ResNet50, ResNet101) are used for comparison, and the comparative results are shown in Table 2.

Table 2 Comparative results of the ResNets, VGGNets, GoogLeNet and the proposed ensemble ResNet

Table 2 clearly shows that the proposed ensemble method achieved the best results among the ResNets used. It achieved 0.98% and 0.35% higher accuracy rates than ResNet50, which is the best of the other networks. Pairwise combinations of the ResNets are also presented; they are called ResNet18-50, ResNet18-101 and ResNet50-101. In these paired ResNets, 2000 features are reduced to 1000 features by using ReliefF, and the selected features are used for the comparisons. According to Table 2, the proposed ternary ensemble ResNet achieved the best classification accuracy among them. The best of the paired ensembles is ResNet50-101, which achieved 99.25% and 99.82% success rates for Case 1 and Case 2, respectively. The success rates of the proposed ensemble ResNet are 0.36% and 0.14% higher than those of ResNet50-101 for Case 1 and Case 2, respectively. GoogLeNet, VGGNet16 and VGGNet19 were also used as feature extractors, and the proposed ensemble ResNet-based HAR method achieved higher results than these networks as well.

The proposed ensemble ResNet uses ResNet18, ResNet50 and ResNet101 features together. In Case 1, the proposed ensemble ResNet uses 96, 496 and 408 features from ResNet18, ResNet50 and ResNet101, respectively. In Case 2, 177, 406 and 417 features are used from ResNet18, ResNet50 and ResNet101, respectively.

To show the success of the proposed ensemble ResNet more clearly, the proposed method is compared to previously presented state-of-the-art methods.

Kuncan et al. [43] proposed local binary pattern-based gender recognition methods using sensor signals. The results of Case 2 are compared to Kuncan et al.’s methods, and the comparative results are given in Table 4.

Table 4 clearly demonstrates that the proposed ensemble ResNet achieved 3.92%, 3.24% and 2.68% higher classification rates than 1D-LBP, 1D-RLBP and weighted 1D-LBP, respectively. In Case 2, only 4 of the 9120 observations are misclassified (see Fig. 8). In Tables 2, 3 and 4, the best results are shown in bold.

Table 3 Comparison results of Case 1
Table 4 Comparative results of Case 2

Moreover, the features of the proposed ensemble ResNet-based feature extractor were also classified with a deep neural network (DNN). The DNN achieved 97.94% and 99.20% classification accuracies for Case 1 and Case 2, respectively.

We also applied the proposed ensemble ResNet method to the MobiAct [44] dataset to assess how well this deep ensemble feature extractor generalizes. The results obtained on the MobiAct [44] dataset are listed in Table 5.

Table 5 Mean accuracy rates of the proposed ensemble ResNet and other methods

Table 5 clearly shows that the proposed method is successful for HAR. We achieved a 1.20% higher success rate than Ferrari et al.’s method [48] (the best of the others). Table 5 shows that ResNet is a good solution for HAR because the ResNet-based method achieved a classification accuracy higher than 90%. This test was carried out for activity recognition.

According to these results, the proposed ensemble ResNet-based method achieved the best results.

The advantages of the proposed ensemble ResNet are given below.

  • A simple preprocessing algorithm is used in this work (see Algorithm 2) instead of spectrogram extraction. We used vector-to-matrix transformation as preprocessing. The time complexity of this method is \(O\left( n \right)\), where \(n\) is the length of the signal. Therefore, this method was selected for preprocessing.

  • The proposed ensemble ResNet has high success rates (see Tables 1–5).

  • The proposed ensemble ResNet uses ResNet18, ResNet50 and ResNet101 together, and it improves on the success rates of all of them (see Table 2).

  • In this work, a simple preprocessing method (vector-to-matrix transformation), three pre-trained networks, the ReliefF feature selector and a conventional classifier (SVM) are used together. These methods are well known and basic. By combining them, an effective learning method is proposed for signal classification (see Table 1).

  • Two cases were defined in this paper. The proposed method achieved high success rates for both cases, which indicates that the proposed method is a general signal recognition method (see Table 2).

  • The presented ensemble ResNet-based HAR method achieved higher performance than other HAR methods (see Table 3).

  • The extracted features are more suitable for a conventional classifier (SVM) than for a deep classifier (DNN). DNN achieved 97.94% and 99.20% classification accuracies for Case 1 and Case 2, respectively, while SVM achieved 99.61% and 99.96% success rates for Case 1 and Case 2.

  • The proposed deep feature extractor was also applied to the MobiAct [44] dataset, and its effectiveness is shown (see Table 5). Table 5 clearly shows the general success of the proposed ensemble ResNet method for HAR.

The disadvantage of this method is that it relies on deep networks, which are not lightweight because millions of parameters must be optimized. Therefore, the computational complexity of these methods is high. To mitigate this disadvantage, we used pre-trained networks and three effective ResNets with relatively few layers.

7 Conclusion

In this article, a novel ensemble network is proposed by using three pre-trained networks. The deep networks used are ResNet18, ResNet50 and ResNet101. Therefore, the proposed method is called ensemble ResNet. As we know from the literature and applications, ResNet-based networks are generally used for images. In this paper, a novel sensor signal recognition method is presented by using the proposed ensemble ResNet. The proposed signal recognition method consists of preprocessing, feature extraction, feature selection and classification. In the preprocessing, a basic vector-to-matrix transformation is used. Then, pre-trained ResNet18, ResNet50 and ResNet101 are used for feature extraction. The FC1000 layers of these networks are selected as output, and 1000 features are extracted from each network. These features are concatenated and 3000 features are obtained. In the feature selection phase, ReliefF is used and the 1000 most discriminative features are selected. The selected features are forwarded to a cubic SVM, and results are obtained by using tenfold cross-validation. To test the performance of this method, daily sport activity and gender recognition cases are defined. The proposed method achieved 99.61% and 99.96% accuracy rates for daily sport activity recognition and gender recognition, respectively. The proposed ensemble ResNet was also compared to ResNet-based networks and state-of-the-art methods. The results clearly demonstrated that the proposed ensemble ResNet increased the success rates of the individual ResNets, and it achieved the best results among the selected state-of-the-art methods. The proposed ensemble ResNet-based method was also tested on the MobiAct dataset, and it was shown that this method increased the success rate of ResNet for HAR.

In future work, the proposed ensemble ResNet can be used for images. Novel mobile health monitoring applications can be developed using the proposed method. In the literature, many deep learning methods have been proposed. By using these methods, novel ensemble networks can be proposed to solve real-world recognition problems, for instance image classification, gender identification, audio classification and facial expression recognition.