Introduction

Industry 4.0 presents new types of interactions between humans and machines. These interactions alter the industrial workforce and significantly improve the nature of work to accommodate the increasing variability and flexibility of production (Shin et al. 2017; Oztemel and Gursev 2020). One of the key issues in Industry 4.0 is enabling human-centricity, which leads to Operator 4.0 (Romero et al. 2016). Operator 4.0 is characterized by the use of automated systems to reduce the physical and mental stress on human workers. In addition, this initiative plays a significant role in enabling humans to exploit and advance their creativity and innovativeness and to improve their job skills without sacrificing production objectives.

However, a successful paradigm shift toward Operator 4.0 is not achieved solely by introducing new technologies and/or smart machines. Manufacturing companies must increase human productivity through enabling technologies, but this shift also changes hiring patterns and increases demand for high-skill, high-profit jobs (Jardim-Goncalves et al. 2016; Kiassat and Safaei 2019). Meanwhile, the supply of highly skilled workers cannot keep up with this demand. In one survey, 82% of CEOs and manufacturing executives in the United States reported that a lack of skilled manpower affects their ability to serve customers (Hill 2017). Junior operators must therefore be trained in a timely and efficient manner.

The lack of skilled workers prompts managers to refine their work processes and introduce new training/skill transfer approaches. Such a training approach should be efficient, flexible, and self-organized through machine learning (Liu et al. 2017). In addition, the possession of transferable skills provides flexibility and mobility (Lim et al. 2018). The three types of human–machine relations (Duan et al. 2012) are as follows: relieving human operators through automated devices (physical replacement), improving the work performance of human operators through machine support (physical support), and providing task information to advance the cognitive processes of human operators (informational support and skill transfer).

In comparison with traditional machine learning techniques, deep learning has a network structure with multiple hidden layers that extract the features embedded in data and build abstract concepts in a hierarchical manner (LeCun et al. 2015). Recent reports have shown that deep learning can outperform human experts in recognition or strategy-related tasks (Sun et al. 2014). In the domain of image recognition, deep learning provides a new approach for increasing the recognition accuracy of human motions. To facilitate skill transfer, however, human actions and work objects must be recognized efficiently. The convolutional neural network (CNN) architecture outperforms other deep learning models on most image recognition, classification, and detection tasks (Rawat and Wang 2017).

This study aims to develop a skill transfer support model for tasks in a manufacturing scenario. This skill transfer support model uses two types of deep learning as its backbone: a CNN for action recognition and a faster region-based CNN (faster R-CNN) for object detection. In this model, a human operator is guided while performing tasks based on a skill representation.

The remainder of this article is organized as follows. The “Literature review” section reviews previous studies related to human–machine collaboration and deep learning. The “Methods” section presents the research framework and methods. The “Experiment and discussion” section reports the experimental results. The “Conclusion” section concludes the study and discusses future research.

Literature review

Operator 4.0 and human–machine collaboration

The human-centricity concept motivates the development toward Operator 4.0 (Frank et al. 2019), which is aided by cyber-physical systems (Ruppert et al. 2018). In the Operator 4.0 framework, workers collaborate with and are empowered by physical and digital systems to perform complex tasks (Peruzzini et al. 2020).

Smart machines empower human abilities in three aspects: extending cognitive strengths, assisting in complex jobs, and embodying human skills to extend physical capabilities (Wilson and Daugherty 2018). For example, human–robot interaction (HRI) focuses on the physical, cognitive, and social interaction between people and robots to broaden and advance human capabilities and skills (Vasconez et al. 2019). Such studies focus on designing, recognizing, and evaluating the cooperation between humans and robots that communicate and/or share a physical space for job purposes. In industrial applications, collaborative robotics delivers several advantages, such as relieving humans from dangerous material handling, heavy tool handling, and high-precision tasks (Villani et al. 2018).

Over the past few decades, HRI has become a growing research area (Landi et al. 2018) in construction, healthcare and assistive robotics, aerospace, edutainment and entertainment, home service, and military and industrial applications (Levratti et al. 2016; Adamides et al. 2017; van Dael et al. 2017; Liu and Wang 2018; Vasconez et al. 2019). In the field of production, new methods and strategies in HRI for fast, affordable, and flexible automation have been constantly identified and developed (Koch et al. 2017; Backhaus and Reinhart 2017). An efficient HRI system can recognize the intention of human workers and provide assistance during assembly operations (Liu and Wang 2017). The integration of collaborative robots is one of the pillars of flexible automation in the Industry 4.0 era (Koch et al. 2017; Wang et al. 2018a, b). Therefore, the paradigm of robot usage has shifted over the years, from robots working with complete autonomy in a separate cell to scenarios in which robots and humans work and interact simultaneously.

Action recognition

Human activity recognition is one of the most actively investigated domains in computer vision. Action recognition is generally implemented in two stages: action representation and classification (Idrees et al. 2017). The core of video action recognition is action representation, which is also denoted as feature extraction. Yao et al. (2019) suggested that an effective action representation should be discriminative, straightforward, and low dimensional. Discriminative means that representations of actions from the same class provide similar information, whereas representations of actions from different classes provide different characteristics. Straightforward means that the action representation is easy to compute. In addition, the action representation should be low dimensional so that classification and feature storage remain low cost.

The taxonomy of action recognition is shown in Fig. 1. Action recognition methods are classified into handcrafted and deep learning-based representation methods. Both recognize action classes based on the appearance and motion patterns in videos (Shahroudy et al. 2018).

Fig. 1
figure 1

Taxonomy of action recognition (Yao et al. 2019)

Research on handcrafted representation methods started with the extraction of global features, such as silhouette- and optical flow-based features, and subsequently reached a milestone in the action recognition field (Yao et al. 2019). Important works on improved dense trajectories (Peng et al. 2016) encode the extracted dense trajectories, trajectory-aligned histograms of oriented gradients, histograms of optical flow, and motion boundary histograms with the Fisher vector or a hybrid super vector.

By contrast, deep learning representation methods differ from handcrafted ones in terms of design (Yao et al. 2019). Handcrafted methods manually designate features, whereas deep learning representation methods automatically learn trainable features from videos. The auto-encoder method enables a neural network to automatically learn a sparse, shift-invariant representation of the local 2D + t salient information (Baccouche et al. 2012). Deep belief networks learn invariant spatiotemporal features from videos (Chen et al. 2010), and the restricted Boltzmann machine captures various human motions based on action features. Veeriah et al. (2015) applied a recurrent neural network with long short-term memory units to learn and recognize the complex dynamics of various actions. Furthermore, an independent subspace analysis method learns invariant and robust spatial features from normalized video cubes (Pei et al. 2016).

Over the past few years, CNN-based methods have been the most researched approach in various fields of computer vision, including action recognition, and have shown considerable achievements (Ciocca et al. 2018; Yao et al. 2019). CNNs work effectively on image processing and understanding tasks because of the local connectivity of their layers and the rich information available in images, from which rich, correlated features can be extracted automatically (Zhang et al. 2018). Ciocca et al. (2018) also stated that features learned by CNNs are more powerful and expressive than handcrafted ones. Therefore, a CNN is applied in this study to perform action recognition. Further details on CNN-based methods are discussed in the “Convolutional neural network” section.

Object detection

Object detection is the task of estimating the categories and locations of the objects present in an image. The problem comprises determining the location of objects in a given image (object localization) and classifying each object (object classification). Based on this definition, traditional object detection models can be split into three phases (Zhao et al. 2019): informative region selection, feature extraction, and classification (Table 1). Manually constructing a robust feature descriptor that perfectly characterizes all types of objects is challenging because of the variety of appearances, illumination conditions, and backgrounds.

Table 1 The phases of traditional object detection models (revised from Zhao et al. 2019)

The introduction of regions with CNN features (R-CNN) has resulted in larger gains in this field than the traditional approach, which uses discriminant local feature descriptors and shallow learnable architectures (Zhao et al. 2019). A CNN has a deep architecture capable of learning more sophisticated features than shallow ones. In addition, its expressiveness and robust training algorithms facilitate the learning of informative object representations without manually designating the features.

After the introduction of R-CNN, several improved models have been proposed (Zhao et al. 2019). The first is fast R-CNN, which jointly optimizes the bounding-box regression and classification tasks. Faster R-CNN then adds a sub-network for generating region proposals. A more recent model, You Only Look Once (YOLO), performs object detection using a fixed-grid regression (Gu et al. 2018). These models not only improve detection performance over the original R-CNN but also move toward real-time, accurate object detection. Further details on faster R-CNN are presented in the “Faster regional-convolutional neural network” section.

Convolutional neural network

A CNN is a biologically inspired variant of the multilayer perceptron and is a feedforward artificial neural network (Yao et al. 2019). CNN architectures are multistage and trainable, where every stage contains multiple layers (Bhandare et al. 2016), including an input layer, an output layer, and multiple hidden layers. The hidden layers are convolutional, rectified linear unit (ReLU), pooling, or fully connected layers. A convolutional layer applies a convolution operation and an additive bias to the input data, passes the result through an activation function, and then delivers it to the next layer. The convolution operation at location (x, y) in the jth feature map of the ith layer is defined in Eq. (1) as follows:

$$ v_{i,j}^{xy} = \varphi \left( {b_{i,j} + \mathop \sum \limits_{m} \mathop \sum \limits_{p = 0}^{{P_{i} - 1}} \mathop \sum \limits_{q = 0}^{{Q_{i} - 1}} w_{i,j,m}^{p,q} v_{{\left( {i - 1} \right),m}}^{{\left( {x + p} \right),\left( {y + q} \right)}} } \right) $$
(1)

where φ is a non-linear activation function, b is an additive bias, m indexes the feature maps of the (i − 1)th layer, w is the weight matrix, P is the height of the kernel, and Q is the width of the kernel. The ReLU layer applies the non-saturating non-linearity (Traore et al. 2018):

$$ f\left( x \right) = \hbox{max} \left( {0,x} \right) $$
(2)

The pooling layer performs non-linear down-sampling. Max pooling is the most frequently used pooling function; it outputs the maximum activation within a rectangular neighborhood (Carrio et al. 2017). Finally, after the convolutional and pooling layers, the high-level reasoning in the CNN is performed by fully connected layers, in which each neuron is connected to all activations in the previous layer (Yao et al. 2019).
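To make these layer operations concrete, the following minimal NumPy sketch evaluates Eqs. (1) and (2) at a single location and applies 2 × 2 max pooling. The array shapes, kernel size, and bias value are illustrative assumptions, not the configuration used later in this study.

```python
import numpy as np

def conv_relu_at(prev_maps, weights, bias, x, y):
    """Eq. (1) at one location (x, y): sum over input maps m and kernel
    offsets (p, q), add the bias, then apply the ReLU of Eq. (2)."""
    M, P, Q = weights.shape              # input feature maps, kernel height, kernel width
    value = bias
    for m in range(M):
        for p in range(P):
            for q in range(Q):
                value += weights[m, p, q] * prev_maps[m, x + p, y + q]
    return max(0.0, value)               # ReLU: f(x) = max(0, x)

def max_pool_2x2(feature_map):
    """Max pooling: keep the maximum activation in each 2 x 2 neighborhood."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Illustrative usage with random data (3 input maps, 5 x 5 spatial size, 3 x 3 kernel).
prev_maps = np.random.rand(3, 5, 5)
weights = np.random.rand(3, 3, 3)
print(conv_relu_at(prev_maps, weights, bias=0.1, x=0, y=0))
print(max_pool_2x2(prev_maps[0]))
```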

Classification is the major function of the output layer in a CNN architecture. A logistic regression model is commonly used as the output layer of a CNN. For multiclass classification, the logistic regression model is generalized to the multinomial logistic function, commonly termed the softmax function. For j possible classes, a weight matrix W, and a bias b, the probability that vector x belongs to class i under the softmax function is defined as follows (Dewa and Afiahayati 2018):

$$ P\left( Y=i|x,W,b \right)= \frac{{{e}^{{{W}_{i}}x+{{b}_{i}}}}}{\mathop{\sum }_{j}{{e}^{{{W}_{j}}x+{{b}_{j}}}}} $$
(3)
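A direct NumPy transcription of Eq. (3) is given below as an illustrative sketch; the weight matrix and bias are placeholders. The maximum score is subtracted before exponentiation, a standard numerical trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probabilities(x, W, b):
    """Eq. (3): P(Y = i | x, W, b) = exp(W_i x + b_i) / sum_j exp(W_j x + b_j)."""
    scores = W @ x + b                 # one score per class
    scores -= scores.max()             # numerical stability; class ratios are unaffected
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```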

CNNs have been widely used in several research domains, including the medical field. For example, CNNs have been applied to detect and classify breast cancer in breast histopathology images (Bejnordi et al. 2017). In the manufacturing field, CNNs have been introduced to classify circuit board defects (Iwahori et al. 2018), categorizing the defects on an electronic circuit board into multiple types based on their shapes. Over the past few years, CNNs have achieved substantial improvements in image classification and object detection. Several CNN architectures, such as ZFNet (Zeiler and Fergus 2014), VGG (Zhao et al. 2019), GoogLeNet (Szegedy et al. 2015), BN-Inception (Jaderberg et al. 2015), and ResNets (He et al. 2016), have been constructed. These architectures provide models pre-trained (represented by weights) on large-scale datasets. An additional training step (transfer learning) is then executed to fine-tune the pre-trained network for a new small-scale dataset or a new modality (Yao et al. 2019). Image-based action recognition using a CNN was also conducted by Qi et al. (2017), who investigated the transfer of CNNs from object to action recognition and achieved 82.2% mAP. Their work used VGG-16, which improves over AlexNet (Simonyan and Zisserman 2014), as the base model and evaluated the proposed method on a dataset of people playing musical instruments.
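The transfer-learning step described above can be sketched as follows. This is not the configuration used in the present study; it is a generic illustration (assuming a recent torchvision version) that uses a pre-trained ResNet-18 as a stand-in for the architectures listed: the transferred convolutional layers are frozen and only a new classification head is trained on the small dataset.

```python
import torch.nn as nn
import torchvision

# Load a model pre-trained on a large-scale dataset (ImageNet weights).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze the transferred layers so that only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer for the new, smaller dataset (e.g., 9 action classes).
num_classes = 9
model.fc = nn.Linear(model.fc.in_features, num_classes)
```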

Inception v2

The AlexNet network has been successfully applied to various computer vision tasks, such as object detection, segmentation, human pose estimation, video classification, object tracking, and super-resolution (Szegedy et al. 2016). AlexNet contains eight layers: the first five are convolutional and the last three are fully connected. VGGNet and GoogLeNet achieved similarly high performance in the ILSVRC classification challenge, showing that network quality can be further improved by deeper and wider networks. Both architectures are widely utilized in many domains, including proposal generation for detection, in which AlexNet cannot compete.

Although both VGGNet and GoogLeNet demonstrate high performance, the Inception architecture of GoogLeNet has a much lower computational cost than VGGNet or its high-performing successors. Inception was designed to perform effectively even under strict memory and computational budgets. For example, GoogLeNet uses only 5 million parameters, whereas AlexNet uses 60 million. This low computational cost makes Inception feasible for big-data scenarios (Szegedy et al. 2016). The layout of the Inception v2 network is shown in Table 2.

Table 2 Inception v2 network architecture (Szegedy et al. 2016)

Faster regional-convolutional neural network

Faster R-CNN comprises two modules (Ren et al. 2017). The first module is a deep fully convolutional network that proposes regions. Instead of using a selective search algorithm on the feature map to identify region proposals, a separate network is used to predict them. The second module is the fast R-CNN detector that works with the proposed regions. The region proposal network (RPN) module tells the fast R-CNN module where to look.

Ren et al. (2017) stated that an RPN takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. The objectness score measures membership to a set of object classes versus the background. To generate region proposals, a small network is slid over the convolutional feature map output by the last shared convolutional layer. This network takes an n × n spatial window of the input convolutional feature map as input, and each sliding window is mapped to a low-dimensional feature. Finally, this low-dimensional feature is fed into two sibling fully connected layers: a box-regression (reg) layer and a box-classification (cls) layer.
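The sliding-window design described by Ren et al. (2017) can be summarized with the sketch below. The channel widths are assumptions for illustration, the anchor count k = 9 follows the three scales by three aspect ratios used in the original paper, and the sibling layers are written as 1 × 1 convolutions, which is equivalent to applying fully connected layers at every window position.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """n x n (here 3 x 3) convolution over the shared feature map, followed by
    two sibling layers: box-classification (cls) and box-regression (reg)."""
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # object vs. background scores
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # 4 box offsets per anchor

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# Illustrative usage on a dummy 512-channel feature map.
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```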

If the RPN and fast R-CNN are trained independently, their convolutional layers are modified in different ways. Rather than learning two separate networks, Ren et al. (2017) proposed a technique for sharing convolutional layers between them. A pragmatic four-step training algorithm is adopted to learn shared features via alternating optimization. In the first step, the RPN is initialized with an ImageNet pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, a separate detection network is trained with fast R-CNN using the proposals generated by the RPN; this detection network is also initialized with the ImageNet pre-trained model. At this stage, the two networks do not yet share convolutional layers. In the third step, the detector network is used to initialize RPN training, but the shared convolutional layers are fixed and only the layers unique to the RPN are fine-tuned; the two networks now share convolutional layers. Finally, keeping the shared convolutional layers fixed, the layers unique to fast R-CNN are fine-tuned. As a result, both networks share the same convolutional layers and form a unified network.

Summary

In the literature, CNN and faster R-CNN models are constructed independently, but combining the two as the backbone of a skill transfer support model is possible. Specifically, a CNN can be implemented to perform action recognition, while faster R-CNN performs object detection. Such a model can guide operators in adopting new skills for assembly operations.

Methods

Research framework

A skill transfer support model framework based on CNN and faster R-CNN is proposed in the present study. The framework of this research is presented in Fig. 2. Human expert operations are recorded using two cameras from different angles, and the videos are split into images. Each image comprises the motion of the operator and the parts/tools related to the operator’s task. The images are then used to train the CNN and faster R-CNN. The context of the actions is recognized when action recognition and object detection are performed, which helps identify the intention of the operator and provide assistance. In addition, this study applies a formal skill representation to define alternative job sequences. In the skill transfer phase, this model aids a junior operator by advising him/her on what should be performed next based on the skill representation.
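Splitting the recorded videos into images can be done with OpenCV, as in the sketch below; the output file name pattern and sampling step are assumptions for illustration.

```python
import cv2

def split_video_into_frames(video_path, out_pattern="frame_{:05d}.jpg", step=1):
    """Read a recorded demonstration video and save every `step`-th frame
    as a still image for training the CNN and faster R-CNN."""
    capture = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = capture.read()
        if not ok:                       # end of video
            break
        if index % step == 0:
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```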

Fig. 2
figure 2

The proposed framework of a skill transfer support model

The skill representation is developed as a precedence diagram that defines the standard operating procedure of the operator performing the assembly operations. In some circumstances, several operation procedures can produce the same product; therefore, all possible sequences are drawn to guide the operator as needed. Based on these sequences, the skill transfer model aids the operator by advising them on the next operation. This advice is displayed as text describing what should be done next given the operator’s current state of motion, along with the corresponding tool/object required, as sketched below.
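A minimal sketch of such a precedence-diagram representation follows. The task labels match the A–J sequences used later in this study, while the dictionary structure and the advice text are illustrative assumptions.

```python
# Each task maps to the tasks that may legitimately follow it
# (consistent with the sequences A-B-D-G-J, A-B-E-H-J, and A-C-F-I-J).
PRECEDENCE = {
    "A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
    "D": ["G"], "E": ["H"], "F": ["I"],
    "G": ["J"], "H": ["J"], "I": ["J"],
    "J": [],  # final task
}

def advise_next(current_task):
    """Return short text advice for the operator's next task(s)."""
    options = PRECEDENCE.get(current_task, [])
    if not options:
        return "Assembly complete."
    return "Next, perform task " + " or ".join(options) + "."
```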

Action recognition based on a deep learning approach is implemented in this study. The CNN is designed to distinguish between different action classes in an assembly operation. Video action recognition is divided into two tasks: classification and detection. Classification assigns a video to a set of predefined action classes, while detection temporally locates predefined actions within a video.

CNN architecture for action recognition

A CNN is developed specifically for action recognition in this study. As shown in Figs. 3 and 4, the inputs to this network are 100 × 100 × 3 images from two cameras at different angles, and the output is the action classification of the operator’s task. The CNN architecture comprises three convolutional layers, two max pooling layers, and two fully connected layers. Three dropout layers are also used to improve generalization performance and reduce overfitting of the training data.

Fig. 3
figure 3

The present CNN framework

Fig. 4
figure 4

The present CNN procedure

The CNN is constructed from an input layer, an output layer, and multiple hidden layers, where the hidden layers are convolutional, pooling, or fully connected. The convolutional layer applies a convolution operation and an additive bias to the input data and passes the result through an activation function before delivering it to the next layer. The convolution operation at location (x, y) in the jth feature map of the ith layer of this study is defined in Eq. (4), where φ is a non-linear activation function, b is an additive bias, m indexes the feature maps of the (i − 1)th layer, w is the weight matrix, and P and Q are the height and width of the kernel, respectively.

$$ v_{i,j}^{xy} = \varphi \left( {b_{i,j} + \mathop \sum \limits_{m} \mathop \sum \limits_{p = 0}^{{P_{i} - 1}} \mathop \sum \limits_{q = 0}^{{Q_{i} - 1}} w_{i,j,m}^{p,q} v_{{\left( {i - 1} \right),m}}^{{\left( {x + p} \right),\left( {y + q} \right)}} } \right)\quad \left\{ i = 1, 4, 7 \right\}. $$
(4)
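A compact sketch of the described architecture (three convolutional layers, two max pooling layers, two fully connected layers, and three dropout layers for 100 × 100 × 3 inputs) is given below in PyTorch. The filter counts, kernel sizes, and dropout rates are assumptions for illustration rather than the exact values used in this study.

```python
import torch.nn as nn

class ActionRecognitionCNN(nn.Module):
    """Three conv layers, two max-pooling layers, three dropout layers,
    and two fully connected layers for 100 x 100 x 3 input images."""
    def __init__(self, num_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.25),          # 100 -> 50
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.25),          # 50 -> 25
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 25 * 25, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, num_classes),                # softmax is applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```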

Faster R-CNN architecture for object detection

The proposed faster R-CNN applies a single, unified network for object detection, as shown in Fig. 5. Faster R-CNN has two networks: an RPN for generating region proposals and a network that uses these proposals for object detection. Images of size 299 × 299 × 3 from two cameras at different angles are first provided as input to a CNN that produces a convolutional feature map, and the output of the network is the image with the bounding boxes and classifications of the objects. The phases of object detection applied in this study follow a different order from those of a traditional object detection model: detection starts with feature extraction using the CNN, followed by the RPN, and is finalized with classification. The detailed procedure of faster R-CNN is shown in Figs. 6 and 7.

Fig. 5
figure 5

The proposed faster R-CNN framework (revised from Ren et al. 2017)

Fig. 6
figure 6

The present faster R-CNN framework

Fig. 7
figure 7

The present faster R-CNN procedure

The RPN takes the output feature maps from the first CNN as input. A sliding window with n = 3 is used in this study, indicating that 3 × 3 filters slide over the feature maps to create the region proposals. The detailed procedure of the RPN is illustrated in Fig. 8. The RPN outputs feed into two separate fully connected layers that predict a bounding box and two objectness scores. The objectness score measures whether a box contains an object, with two possible classes: object or background. The predicted region proposals are then reshaped using an RoI pooling layer, which is used to classify the image within the proposed regions and to predict the offset values of the bounding boxes.

Fig. 8
figure 8

The proposed region proposal network (RPN) framework (revised from Ren et al. 2017)

This study adopted a transfer learning strategy for training the network. The features generated by the earlier layers are more general than those generated later in the network, which become specific to the image classes in the training dataset; therefore, an Inception v2 model is adapted. Moreover, this strategy reduces network overfitting because the convolutional layers of the network have already been trained on the large Microsoft COCO dataset (Lin et al. 2014; Microsoft 2019).
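As an illustration of this transfer-learning strategy (not the exact pipeline used here, which adapts an Inception v2 backbone pre-trained on COCO), the sketch below fine-tunes torchvision's COCO-pretrained faster R-CNN for nine object classes plus background, with a lower learning rate on the transferred backbone than on the new heads.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN pre-trained on the COCO dataset.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor: 9 object classes + 1 background class.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=10)

# Lower learning rate for the transferred backbone than for the new heads.
optimizer = torch.optim.SGD(
    [
        {"params": model.backbone.parameters(), "lr": 1e-4},
        {"params": model.rpn.parameters(), "lr": 1e-3},
        {"params": model.roi_heads.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
```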

The parameters related to the training process of the proposed CNN and faster R-CNN are shown in Table 3. When training the faster R-CNN network, the initial learning rate was set to a relatively low value for the transferred convolutional and pooling layers, as these layers had previously been trained on the COCO dataset. The overall procedure of the proposed skill transfer support model is shown in Fig. 9.

Table 3 Parameter settings for the CNN and faster R-CNN
Fig. 9
figure 9

Procedure of the proposed skill transfer support model

Skill representation and skill transfer

A job is assumed to be completed through one of several task sequences. These options give the operator the flexibility to produce the desired product in several ways, which is defined in this research as the skill representation. The sequences of tasks (A through J) based on the skill representation diagram are illustrated in Fig. 10.

Fig. 10
figure 10

Skill representation diagram

Fig. 11
figure 11

A case study: components and assembled Lego

  • A–B–D–G–J

  • A–B–E–H–J

  • A–C–F–I–J

After the two neural network models are independently trained, they are combined as the backbone of the proposed skill transfer support model. The model guides a junior operator while he/she performs a sequence of assembly operations. The action recognition and object detection models run simultaneously, and the proposed model provides guidance based on the skill representation, as sketched below.
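A simplified view of how the two models and the skill representation interact at run time is sketched here; the model callables, label names, and the PRECEDENCE dictionary from the earlier sketch are hypothetical placeholders for the trained networks.

```python
def guide_operator(frame, action_model, object_model, precedence):
    """One guidance step: recognize the current action, detect the visible
    parts/tools, and look up the allowed next task(s) in the skill representation."""
    current_task = action_model(frame)            # e.g. "B" (CNN action class)
    visible_objects = object_model(frame)         # e.g. ["component_2"] (faster R-CNN)
    next_tasks = precedence.get(current_task, [])
    return {
        "current_task": current_task,
        "detected_objects": visible_objects,
        "advice": ("Next, perform task " + " or ".join(next_tasks)
                   if next_tasks else "Assembly complete."),
    }
```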

Experiment and discussion

Experimental setting

A case study on Lego assembly was conducted to evaluate the performance of the proposed model. In the experiment, four components must be assembled to produce a desired shape, as shown in Fig. 11. Action recognition is one of the main issues in this experiment, as shown in Fig. 12. Several sequences of operations can be performed to assemble the Lego. The assembly process was recorded by video cameras, and the video images were processed to recognize the human activity associated with each video frame and determine the components correlated with the action (Fig. 11).

Fig. 12
figure 12

Examples of human action images

Illustration on skill representation

Three sequence options can be performed to produce the assembled Lego. These options give the operator the flexibility to produce the desired Lego in several ways, which is defined in this research as the skill representation. The sequences based on the skill representation diagram are listed in Fig. 10.

For example, the operator can start with Component 1 and then assemble Component 2 onto it. Subsequently, this subassembly can be combined with Component 3 or Component 4. If the operator chooses Component 3, the final step is adding Component 4; otherwise, if the operator chooses Component 4, the final step is adding Component 3. Nine classes of actions and nine classes of objects are available for training the model in this scenario (see “Appendix”).

Model evaluation

To train the CNN and faster R-CNN networks for human action recognition and assembly object detection, images covering all motions involved in the assembly process were obtained from the videos recorded by the two cameras. The output of the CNN model is the classification of the motion of the operator’s task, whereas the output of the faster R-CNN model is the detection of the objects that appear in the video while the assembly tasks are performed. The operator’s performance differed slightly across recordings, reflecting the natural variability of a human operator performing the same task. This variability helps avoid overfitting during training.

The experiment was conducted using two cameras that work independently and are set at different angles, as shown in Fig. 13. The videos last 10 to 17 s, depending on the operator’s speed in performing the operations. The frame width is 540 pixels, the height is 960 pixels, and the frame rate is 30 frames/s. Each operator completed three trials, one per sequence of motion, to enrich the training dataset.

Fig. 13
figure 13

Setting of the two camera angles

Among all the images, 80% were used for training the networks and 20% were allotted for testing. The learning and loss curves of the nine-class CNN are shown in Fig. 14a, b, respectively. The training accuracy reached 80% after 10 epochs. The CNN shows a good fit because the training and validation loss curves decrease to a point of stability with only a slight gap. For faster R-CNN, the loss curve for the nine classes is shown in Fig. 14c; the loss values converged after 5,000 steps, and this model also shows a good fit because its training and validation loss curves decrease to a point of stability with only a slight gap. The training accuracies of the CNN and faster R-CNN over the nine classes are 96.16% and 98.46%, respectively. Moreover, the F1-score is employed as a performance indicator of the proposed model; it better reflects the confusion matrix and presents a weighted compromise between precision and recall. The F1-score is calculated using Eq. (5). The faster R-CNN implemented in the present model achieves an F1-score of 94.47%.

Fig. 14
figure 14

CNN and faster R-CNN training performances

$$ F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$
(5)

Once the training was completed, the two networks were used to process the video images. From the available frames, 200 frames were randomly selected and used for testing the networks. For each test frame, the action recognition and object detection models were applied to recognize the motion and the object associated with the frame, respectively. Among the 200 test frames, 11 were misclassified by the action recognition model, yielding a classification accuracy of 94.5%, while the object detection model misclassified two objects, yielding a classification accuracy of 99%. Most misclassifications occurred during transitions between human motions, which are difficult to assign to a predefined category.
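The reported test figures can be reproduced arithmetically, as in this small sketch; the precision and recall arguments of the F1 function are placeholders, since only the resulting F1-score is reported above.

```python
action_accuracy = (200 - 11) / 200   # 0.945 -> 94.5 % action recognition accuracy
object_accuracy = (200 - 2) / 200    # 0.990 -> 99 % object detection accuracy

def f1_score(precision, recall):
    """Eq. (5): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```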

Illustration on skill transfer support

In the proposed scenario, the three sequences are considered as options for constructing the same finished good. After detecting the operator’s motion, the model guides the operator by giving instructions for the subsequent task, as shown in Fig. 15. The model can sequentially detect the motion of each class; therefore, the operator is guided successively based on the currently selected operation.

Fig. 15
figure 15

Skill transfer support model: an illustration

Conclusion

This study developed a skill transfer support model for assembly tasks in a manufacturing scenario. The model uses two types of deep learning as its backbone: a CNN for action recognition and faster R-CNN for object detection. Within this model, the human operator is guided based on a skill representation while performing assembly tasks. The proposed CNN obtained 94.5% accuracy in action recognition, the object detection model achieved 99% accuracy, and the implemented faster R-CNN achieved an F1-score of 94.47%. These models are integrated and run simultaneously to advise the junior operator on the assembly tasks.

In terms of practical contribution, the proposed model enables the following functions:

  • To help junior operators in performing complex tasks.

  • To guide the operator on the subsequent task on the basis of a skill representation and recommend the tools or parts related to a particular task.

  • To propose a new training method for new jobs.

In terms of theoretical contribution, this study achieves the following goals:

  • To integrate two deep learning models, namely, CNN and faster R-CNN, to offer a new skill-transferring method from senior to junior operators.

  • To perform effectively in terms of accuracy and F1-score.

  • To simultaneously recognize the action of a worker and detect objects.

Some challenges in the future study include the following:

  • A grasp prediction model can be further studied to empower machines to recognize human intent and realize high performance in HRI. Such a model would inform robot action planning to assist the operator by handing over the parts or tools related to the tasks.

  • Additional learning modules, such as the single-shot detector (SSD) or YOLO, can be developed and used for object detection as comparisons with the currently used module.

  • Complicated operations that involve assembly and splitting motions and small parts or tools can be used in further experiments to test the robustness of the proposed model.