Introduction

Because the cutting force in metal cutting depends on many influencing factors, it is difficult to establish a cutting force prediction model that accounts for all of them. The development of neural network technology offers a good solution to this problem. Based on a data set, a neural network model can learn the implicit relationship between a set of input and output parameters and capture the non-linearity between them, which is essential for accurate cutting force prediction in machining. Given sufficient cutting data, a neural network can predict the cutting force without explicitly modelling the influence of each factor, which makes it an attractive approach.

The most common use of neural networks in machining is to build a comprehensive prediction model that maps processing information (such as cutting parameters and tool geometry) to processing results (such as cutting force and surface roughness) (Sharma et al. 2008; Tandon and El-Mounayri 2001). Radhakrishnan and Nandan (2005) developed a regression model to filter out abnormal samples in the experimental data set and showed that using the filtered samples significantly improves the prediction accuracy of the neural network model. Jurkovic et al. (2018) compared the performance of three machine learning methods (neural network, support vector machine and polynomial regression) on surface roughness, cutting force and tool life in high-speed turning, and found that each method has advantages and disadvantages, depending on the parameter and its range. Vaishnav et al. (2019) used a data set generated by a mechanistic force model to train a neural network that predicts the instantaneous cutting force in milling, and verified the effectiveness of the method. Research on neural networks in machining has led to many further achievements (Özel and Nadgir 2002; Rao et al. 2014; Asiltürk and Çunkaş 2011). Yeganefar et al. (2019) compared four methods for the prediction and optimization of cutting force and surface roughness on aluminum alloy: regression analysis, support vector regression, artificial neural network and multi-objective genetic algorithm. Their results show that, given sufficient samples, the neural network model performs better than the other methods. This review shows that the application of neural networks is becoming wider and the modes of use more diverse. However, it is worth noting that neural networks are not always advantageous.
The accuracy of the neural network model is mainly dependent on the quantity and quality of the training data set, which is an important disadvantage of this approach. Salimiasl and Özdemir (2016) described the comparative analysis of different methods of tool wear online monitoring, during turning of SAE4140 steel, which included artificial neural network, fuzzy logic and least squares method. According to the results, neural network predicted tool wear more precisely, excluding the case of limited amount of experimental data, where fuzzy logic was more precise. In previous research work, training datasets are mostly generated by conducting machining experiments within the entire range of input parameters. Due to the long processing cycle, high material cost and expensive machine tool maintenance, involved in the actual process, obtaining a large number of data samples often means higher financial and time cost, which limits the popularity of neural networks in the field of machining.

Transfer learning is a branch of machine learning. Pan and Yang (2009) define it as follows: given a source domain \( D_{S} \) and learning task \( T_{S} \), and a target domain \( D_{T} \) and learning task \( T_{T} \), transfer learning aims to help improve the learning of the target predictive function \( f_{T} ( \cdot ) \) in \( D_{T} \), using the knowledge in \( D_{S} \) and \( T_{S} \), where \( D_{S} \ne D_{T} \) or \( T_{S} \ne T_{T} \). In real-life scenarios, obtaining perfect data samples is expensive and time-consuming, and transfer learning is an important tool for solving the basic problem of insufficient training data. Many studies of transfer learning assume that the target data set is significantly smaller than the source domain data set. As deep learning methods have grown in popularity, more and more researchers use deep neural networks for transfer learning, and the related research field is called deep transfer learning (Tan et al. 2018). Yosinski et al. (2014) took the lead in studying the transferability of deep neural networks and proposed the fine-tuning method. Nowadays, fine-tuning is probably the most widely used method in neural network transfer. In this approach, a neural network is first trained on the source domain data set and the trained model is taken as the initial value. The model is then trained on the target domain data to obtain a model suitable for the target domain. However, fine-tuning is not suitable for all transfer tasks: when the similarity between the source domain and the target domain is very low, it is difficult to achieve satisfactory results. Subsequently, many researchers modified the structure and loss function of neural networks, based on the work of Yosinski et al. (2014) and on maximum mean discrepancy (MMD) (Gretton et al. 2012), and established a series of new transfer methods, such as those described in Ghifary et al. (2014), Tzeng et al. (2014) and Long et al. (2015). MMD is among the most frequently used distance measures in transfer learning. It is a framework for determining whether two samples come from the same distribution, and it was first proposed for the two-sample test problem, to compare the difference between two data distributions (Borgwardt et al. 2006).

Transfer learning offers a way to improve the performance of a neural network by applying the knowledge (in the form of parameters) acquired on previous tasks with sufficient training data to a new task with a smaller training data set. On the one hand, Zhang et al. (2017) have shown that transfer learning can achieve performance commensurate with large-scale CNNs using only a small set of training data. On the other hand, transfer learning is possible and promising because the convolutional layers of a convolutional neural network trained on a large data set extract general features of the inputs, while the fully connected layers provide more specific features, as described by Cao et al. (2018). The limitations of neural networks in the field of cutting are well known: it is often very difficult to obtain large amounts of cutting data under actual machining conditions. In this paper, cutting force data is used as a research example in an effort to reduce the dependence of neural networks on cutting data by means of deep transfer learning. Following the definition of transfer learning, two cutting force data sets are used, one derived from simulation and one from the corresponding experiments. The simulation data represent the source domain, while the experimental data represent the target domain. Methods from the field of transfer learning are used to train a neural network with different numbers of experimental samples; the network is then applied to the experimental samples and compared with an ordinary network trained only on experimental data. The results show that, in most cases, the transfer network is better than the ordinary network.

Experiment and model

According to the transfer learning theory, there is a difference between the source and the target domain data, but at the same time there is a certain correlation. Finite element simulation is often regarded as the prediction and verification means of experiments. Although there may be some deviation between the simulation results and the actual results, this does not prevent the simulation results from retaining a certain reference and guidance significance. At present, as a commonly used research method, the effect of finite element simulation has been confirmed in many studies, such as the research of Zhang et al. (2014) and Wang et al. (2018). Therefore, there may be some differences between the cutting force numerical value, as obtained by simulation and by experiment, under the same machining conditions, but there is still a certain correlation between them. Under this premise, based on the relevant methods and theories of transfer learning, a method for establishing a transfer neural network for cutting force data prediction is proposed.

Experimental and data processing

A solid carbide end mill with a diameter of 12 mm is used in both the simulations and the cutting experiments. The workpiece material is aluminum alloy 2A14 and the machining method is end milling. Aluminum alloy is easy to machine, generates little heat during cutting and keeps the tool temperature low, which limits tool wear. Therefore, the cutting parameters can be selected within a wide range. The simulations are carried out in Third Wave AdvantEdge. The variable parameters and their ranges are listed in Table 1. The goal of both groups of experiments is to extract the cutting force value on each axis for each set of cutting parameters. The simulation experiment includes 467 sets of parameters, while the cutting experiment includes 300 sets. The cutting experiments were performed on a vertical CNC machining centre (DAEWOO ACE-V500); the experimental setup is shown in Fig. 3. The milling force signals are recorded by a dynamometer (Kistler 9257B).

Table 1 Cutting parameter range

To minimize the simulation time, each simulation run covers only the cutting distance that corresponds to a tool rotation of 100°. The original cutting force data obtained from the simulations and the cutting experiments are all in the form of signals. Before the neural network model is developed, the original signals must be processed and the cutting force values extracted according to a unified standard. In the following, the X-axis cutting force signal for a spindle speed of 1500 r/min, feed per tooth of 0.15 mm, radial depth of 1.1 mm and axial depth of 1.6 mm is used as an example to describe the data processing flow.

The original cutting force signal obtained from the cutting experiment is illustrated in Fig. 1a. First, a low-pass filter is applied to the original signal. The filter frequency is calculated as \( 5 \times {\text{z}} \times {\text{n}} \div 60 \), where \( {\text{z}} \) is the number of tool teeth and \( {\text{n}} \) is the spindle speed in r/min. The filter frequency used here is 500. Because the cutting force signal fluctuates as the tool cuts in and out, about one-fifth of the total length is cut off from each end of the filtered signal, and the remaining part is segmented according to the time of one tool rotation. The maximum value within each segment is taken, and the average of these maxima is the X-axis cutting force value for the parameter set under study.
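The extraction steps above can be sketched as follows. This is only an illustration under stated assumptions, not the processing code used in the study: the function names, the sampling rate and the synthetic signal are hypothetical, the tooth count of 4 is inferred from the reported filter frequency of 500 at 1500 r/min, and the low-pass filter is assumed to have been applied already.

```python
def lowpass_cutoff(teeth, spindle_rpm, factor=5):
    """Filter frequency 5 * z * n / 60 from the text (n in r/min)."""
    return factor * teeth * spindle_rpm / 60.0


def experimental_force_value(filtered, sample_rate_hz, spindle_rpm, trim_fraction=0.2):
    """Trim both ends, split into tool revolutions, average the per-revolution maxima."""
    n = len(filtered)
    cut = int(n * trim_fraction)           # about one-fifth removed from each end
    steady = filtered[cut:n - cut]
    per_rev = int(round(sample_rate_hz * 60.0 / spindle_rpm))  # samples per revolution
    starts = range(0, len(steady) - per_rev + 1, per_rev)
    peaks = [max(steady[s:s + per_rev]) for s in starts]
    return sum(peaks) / len(peaks)


# Hypothetical filtered signal: one peak of 100 per revolution, with larger
# peaks (150) while the tool is still entering the workpiece.
SAMPLE_RATE_HZ = 1000
SPINDLE_RPM = 1500
signal = [(150.0 if i < 400 else 100.0) if i % 40 == 10 else 20.0
          for i in range(2000)]

cutoff = lowpass_cutoff(teeth=4, spindle_rpm=SPINDLE_RPM)
force_x = experimental_force_value(signal, SAMPLE_RATE_HZ, SPINDLE_RPM)
```

Trimming before segmenting keeps the entry transients out of the average, which is why the inflated peaks at the start of the synthetic signal do not affect the extracted value.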

Fig. 1 Signal processing flow of experimental cutting force

Figure 2a shows the state at the end of the simulation process. The complete simulation corresponds to a tool rotation of 100°, with two cutter teeth participating in the cutting. The resulting cutting force signal is illustrated in Fig. 2b. The original force signal is very disordered: its value fluctuates strongly, with peaks beyond −500, which is very different from the peak value of the cutting force signal obtained from the experiment. Consequently, the simulated and the experimental cutting force signals cannot be processed with the same method. Figure 2c shows the cutting force signal filtered with a filtering frequency of 8 * n, that is 12,000 as used here. Compared to the original signal, the filtered signal is greatly improved, but large numerical fluctuations remain. As shown in Fig. 2c, the trough in the circled area goes beyond −100, while the average value of the peak section is estimated to be about −85. If the peak value were selected directly, the regularity of the extracted cutting force values would be seriously weakened. Therefore, a tenth-degree polynomial is fitted to the filtered signal; the resulting curve is shown in Fig. 2d. The extrema of the two troughs are taken from the fitted curve, and their average is taken as the X-axis force value for that group of cutting parameters. The cutting force values extracted according to the above process are listed in Table 2.
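The polynomial step can be illustrated with a short NumPy sketch. It is a hedged example rather than the authors' implementation: the function name, the synthetic two-trough signal and the noise level are assumptions, and the filtered signal is taken as given.

```python
import numpy as np


def simulated_force_value(t, filtered, degree=10, n_troughs=2):
    """Fit a polynomial to the filtered signal and average its deepest troughs."""
    poly = np.polynomial.Polynomial.fit(t, filtered, degree)
    fitted = poly(t)
    inner = np.arange(1, len(fitted) - 1)
    # A trough is a point on the fitted curve lower than both of its neighbours.
    is_trough = (fitted[inner] < fitted[inner - 1]) & (fitted[inner] < fitted[inner + 1])
    troughs = np.sort(fitted[inner[is_trough]])
    return float(troughs[:n_troughs].mean())


# Hypothetical filtered signal: two tooth passes reaching about -85, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2000)
filtered = -85.0 * np.sin(2.0 * np.pi * t) ** 2 + rng.normal(0.0, 10.0, t.size)
force_x = simulated_force_value(t, filtered)
```

Reading the troughs from the fitted curve rather than from the raw samples makes the extracted value insensitive to individual noisy spikes, which is the regularity the text aims for.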

Fig. 2 Signal processing flow of simulation cutting force

Table 2 Some examples of data samples

The data processing flow and the results show that the cutting force values obtained by simulation differ from those obtained by experiment. This deviation is mainly due to the differences between the original signals and between the data processing methods. The experimental signal is measured by the dynamometer, while the simulation signal is computed by the simulation software. The original signal curves in Figs. 1 and 2 differ markedly in both amplitude and pattern of variation, so it is not appropriate to apply the same processing method to both. The simulation signal has to be filtered and fitted, and as a consequence the extracted simulation cutting force values are generally lower than the experimental ones.

Fig. 3 Set-up and measurements for milling experiments

In this study, the processing method of the simulation signal aims to ensure that the extracted value has as strong regularity as possible, rather than enhancing the consistency between the simulation and experimental cutting force values. In the related research of transfer learning (Pan and Yang 2009; Yosinski et al. 2014; Long et al. 2015), the process of transfer is to use the knowledge in the source domain to improve the performance of the prediction function in the target domain. The laws contained in the data are often important representatives of “knowledge”. Therefore, during data processing, more attention should be paid to the extracted simulation cutting force value, which has more robust regularity.

Modeling

Related work

The statistical test based on MMD proceeds as follows. Given samples from two distributions, a continuous function f on the sample space is chosen and the mean of f over each sample is computed. The difference between the two means is the mean discrepancy of the two distributions with respect to f. The function f that maximizes this mean discrepancy yields the MMD. Finally, the MMD is taken as the test statistic to determine whether the two distributions coincide: if the value is low enough, the two distributions are considered to be the same; otherwise they are considered different. The value also measures the degree of similarity between the two distributions. Let \( \{ {\text{X}}_{s}^{\left( i \right)} \}_{i = 1, \ldots ,n_{s} } \) and \( \{ {\text{X}}_{t}^{\left( j \right)} \}_{j = 1, \ldots ,n_{t} } \) be data vectors drawn from distributions Ds and Dt in the data space χ. The empirical estimate of MMD is:

$$ MMD_{e} \left( {{\text{X}}_{s} ,{\text{X}}_{t} } \right) = \left\| {\frac{1}{{n_{s} }}\mathop \sum \limits_{i = 1}^{{n_{s} }} \phi \left( {{\text{X}}_{s}^{\left( i \right)} } \right) - \frac{1}{{n_{t} }}\mathop \sum \limits_{j = 1}^{{n_{t} }} \phi \left( {{\text{X}}_{t}^{\left( j \right)} } \right)} \right\|_{{\mathcal{H}}} $$
(1)

where \( \phi \left( \cdot \right) \): χ → \( {\mathcal{H}} \) is the feature space map and \( {\mathcal{H}} \) is a reproducing kernel Hilbert space (RKHS). The most important property is that, for a suitable (characteristic) kernel, MMD = 0 if and only if the two distributions are identical (Gretton et al. 2012). By casting Eq. (1) into a vector–matrix multiplication form, the kernelized form of Eq. (1) is obtained:

$$ MMD_{e} \left( {{\text{X}}_{s} ,{\text{X}}_{t} } \right) = \left( {\frac{1}{{n_{s}^{2} }}\mathop \sum \limits_{i = 1}^{{n_{s} }} \mathop \sum \limits_{j = 1}^{{n_{s} }} \kappa \left( {{\text{X}}_{s}^{\left( i \right)} ,{\text{X}}_{s}^{\left( j \right)} } \right) + \frac{1}{{n_{t}^{2} }}\mathop \sum \limits_{i = 1}^{{n_{t} }} \mathop \sum \limits_{j = 1}^{{n_{t} }} \kappa \left( {{\text{X}}_{t}^{\left( i \right)} ,{\text{X}}_{t}^{\left( j \right)} } \right) - \frac{2}{{n_{s} n_{t} }}\mathop \sum \limits_{i = 1}^{{n_{s} }} \mathop \sum \limits_{j = 1}^{{n_{t} }} \kappa \left( {{\text{X}}_{s}^{\left( i \right)} ,{\text{X}}_{t}^{\left( j \right)} } \right)} \right)^{{\frac{1}{2}}} $$
(2)

where \( \left[ {{\text{K}}_{x \bullet \bullet } } \right]_{ij} = \kappa \left( {{\text{X}}_{ \bullet }^{\left( i \right)} ,{\text{X}}_{ \bullet }^{\left( j \right)} } \right) \) is the gram-matrix of all possible kernels in the data space.
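As a concrete illustration, the kernelized estimate can be computed directly from the three kernel averages. The sketch below is a minimal example under assumptions: it uses a Gaussian kernel with unit bandwidth (the text does not fix a kernel), the function names and sample values are hypothetical, and the square root is taken so that the result corresponds to the norm in Eq. (1).

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(us, vs, kernel):
    """Average kernel value over all pairs drawn from the two samples."""
    return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))


def empirical_mmd(source, target, kernel=gaussian_kernel):
    """Source-source term plus target-target term minus twice the cross term."""
    squared = (mean_kernel(source, source, kernel)
               + mean_kernel(target, target, kernel)
               - 2.0 * mean_kernel(source, target, kernel))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative round-off


# Hypothetical one-dimensional samples.
simulated = [[0.0], [1.0], [2.0]]
measured = [[5.0], [6.0], [7.0]]
same = empirical_mmd(simulated, simulated)
different = empirical_mmd(simulated, measured)
```

Identical samples give a value of zero, while samples drawn from clearly separated regions give a noticeably larger value, which is the behaviour the two-sample test relies on.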

Tzeng et al. (2014) add the MMD distance to the 7th layer of the AlexNet network to reduce the difference between the source and target domains. This method is called Deep Domain Confusion (DDC); the idea is to add an adaptation layer after the 7th layer of the original AlexNet network. The adaptation layer measures how well the network can distinguish the source domain from the target domain. If this discrimination ability is very poor, the features learned by the network are not sufficient to tell the two domains apart, which helps to establish a domain-insensitive feature representation.

Based on DDC, the Deep Adaptation Network (DAN) architecture (Long et al. 2015) was proposed, which addresses two limitations of DDC. First, DDC uses a single-kernel MMD, and a single fixed kernel may not be optimal. DAN replaces the single-kernel MMD with a multi-kernel MMD (MK-MMD) when calculating the distance between the source and target domains. MK-MMD, proposed in Gretton et al. (2012), constructs the total kernel from multiple kernels and provides a better effect than the single-kernel MMD. Second, DDC adapts only one layer of the network, while DAN adapts the last three layers and adds the calculated distribution distance to the loss function of the neural network, which can be written as:

$$ \mathop {\hbox{min} }\limits_{\Theta } \frac{1}{{n_{a} }}\mathop \sum \limits_{i = 1}^{{n_{a} }} J\left( {\theta \left( {{\text{X}}_{i}^{a} } \right),y_{i}^{a} } \right) + \lambda \mathop \sum \limits_{{l = l_{1} }}^{{l_{2} }} d_{k}^{2} \left( {D_{s}^{l} ,D_{t}^{l} } \right) $$
(3)

where λ is a penalty parameter greater than zero, and \( l_{1} \) and \( l_{2} \) are the layer indices between which the adaptation term is applied. In DAN, \( l_{1} \) and \( l_{2} \) are 6 and 8, respectively, indicating that the network is adapted from layer 6 to layer 8. \( J\left( \cdot \right) \) is the loss function, for which DAN uses the cross-entropy function. \( D_{ *}^{l} \) is the hidden representation at the l-th layer for the source and target examples, while \( d_{k}^{2} \left( {D_{s}^{l} ,D_{t}^{l} } \right) \) is the MK-MMD between the source and target domains, evaluated on the l-th layer representation.

Cutting force prediction model

In this section, the process of establishing neural network for cutting force data prediction is described, according to the aforementioned related methods and theories of transfer learning. In the established cutting force data set, each sample contains seven dimensions, including rotational speed, feed per tooth, axial depth of cut, radial depth of cut, as inputs; x-axis force, y-axis force and z-axis force, as outputs. The process of building the neural network is shown in Fig. 4.

Fig. 4 Transfer network establishment process. The input units of the network include rotational speed, feed/tooth, axial and radial depth of cut, while the output units include X, Y and Z axial forces

Since the network has few input and output units, a neural network structure with four hidden layers is adopted. First, a neural network is pre-trained on the simulation data and is used to predict data in the simulation domain. Subsequently, the hidden layers of the pre-trained network are used as the initial values of the target domain network, which is then trained on the experimental data. Experiments in Yosinski et al. (2014) have shown that transferring layers and then fine-tuning them can achieve better results than ordinary neural networks. In this case, the first four layers are transferred and fine-tuned. In addition, the MMD distance between the simulation and experimental data is included in the loss function of the neural network. The optimization objective of the whole network thus comprises the prediction error on the experimental data and the distribution discrepancy between the two domains, and can be expressed as:

$$ \mathop {\hbox{min} }\limits_{\Theta } \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\hat{y}_{i} - y_{i} } \right)^{2} + \lambda MKMMD_{e} \left( {{\text{X}}_{s} ,{\text{X}}_{t} } \right) $$
(4)

where \( \hat{y}_{i} \) and \( y_{i} \) represent the predicted value and the real value, respectively, and \( MKMMD_{e} \left( {{\text{X}}_{s} ,{\text{X}}_{t} } \right) \) is the multi-kernel MMD between the source and target domains. In this paper, the neural network model trained by this method is called the transfer model (transfer network). The control group is a traditional BP neural network trained only on experimental data, referred to here as the ordinary network, whose optimization objective is:

$$ \mathop {\hbox{min} }\limits_{\Theta } \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {\hat{y}_{i} - y_{i} } \right)^{2} $$
(5)
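The two objectives in Eqs. (4) and (5) differ only in the domain term, which the following sketch makes explicit. It is illustrative and rests on assumptions not stated in the text: the multi-kernel is taken as an equally weighted average of Gaussian kernels with hypothetical bandwidths, the value of λ is arbitrary, and the vectors passed as source and target are placeholder features.

```python
import math

BANDWIDTHS = (0.5, 1.0, 2.0)  # assumed kernel widths, equally weighted


def multi_kernel(x, y):
    """Average of several Gaussian kernels (a simple stand-in for MK-MMD's kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in BANDWIDTHS) / len(BANDWIDTHS)


def mk_mmd(source, target):
    """Empirical MMD between two samples under the averaged kernel."""
    def mean_k(us, vs):
        return sum(multi_kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    squared = mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)
    return math.sqrt(max(squared, 0.0))


def mse(predicted, measured):
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)


def ordinary_loss(predicted, measured):
    """Eq. (5): prediction error only."""
    return mse(predicted, measured)


def transfer_loss(predicted, measured, source, target, lam=0.5):
    """Eq. (4): prediction error plus lambda times the domain discrepancy."""
    return mse(predicted, measured) + lam * mk_mmd(source, target)


# Hypothetical values for illustration only.
predicted, measured = [10.0, 12.0], [11.0, 12.0]
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[3.0, 3.0], [4.0, 4.0]]
loss_ordinary = ordinary_loss(predicted, measured)
loss_transfer = transfer_loss(predicted, measured, source, target)
```

When the two domains coincide the extra term vanishes and both objectives agree, so λ only matters to the extent that the simulation and experimental data are distributed differently.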

This experiment establishes prediction models, considering different sample sizes, to explore the effect of transfer learning methods under these variations. The training process of transfer network and ordinary network, with “n” experimental samples as training sets, is as follows:

  (1) Establish a neural network with four hidden layers and initialize it randomly. Train it on a training set consisting of eighty percent of the simulation samples, and test its performance on the remaining simulation samples.

  (2) Construct a training set from “n” samples extracted from the experimental samples.

  (3) Take the network trained on simulation data as the initial value, train it on this training set and save the trained model.

  (4) Repeat the above steps for different values of “n”.

  (5) Use the validation set to evaluate all the models corresponding to each “n” value, and use the established model to evaluate the error rate on the test set.

Where n ∈ {5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200}. In the training process of transfer network, the extraction of experimental samples is not completely random.
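The pre-train and fine-tune sequence in steps (1) and (3) can be illustrated with a deliberately small stand-in. The sketch below replaces the four-hidden-layer network with a one-variable linear model and uses made-up source and target relationships, so it only demonstrates the warm-start idea; the control model starts from zero rather than from a random initialization, and the MMD term is omitted.

```python
def train(data, w, b, lr=0.05, epochs=20):
    """Full-batch gradient descent on the mean squared error of y = w * x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(data, w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical domains: "simulation" is plentiful but biased, "experiment" is scarce.
simulation = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(50)]
experiment_train = [(x, 2.2 * x + 1.3) for x in (0.5, 1.5, 2.5, 3.5, 4.5)]
experiment_test = [(x, 2.2 * x + 1.3) for x in (1.0, 2.0, 3.0, 4.0)]

# Step (1): pre-train on the source domain.
w_src, b_src = train(simulation, 0.0, 0.0, epochs=2000)
# Step (3): fine-tune from the pre-trained parameters on n = 5 target samples.
w_tr, b_tr = train(experiment_train, w_src, b_src)
# Control: same data, learning rate and iterations, but no pre-training.
w_ord, b_ord = train(experiment_train, 0.0, 0.0)

transfer_error = mse(experiment_test, w_tr, b_tr)
ordinary_error = mse(experiment_test, w_ord, b_ord)
```

Because the source relationship is close to the target one, the warm-started model begins nearer to the target solution and reaches a lower test error within the same training budget.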

In addition to the different optimization objectives, there are also differences in the training process, between the ordinary network and the transfer network. The training process of ordinary network is relatively simple, as the initial network structure will be trained directly according to experimental samples. In the process of training, the experimental samples and learning rate, iterations and other parameters used by the ordinary network and the transfer network are identical.

When n samples are used to build the training data set, the selection of samples follows fixed rules. The first two hundred experimental samples are used as the training set, while the last two hundred samples are divided into a validation set and a test set. Extracting “n” samples means selecting the first n samples of the training set, so all training samples are used when n is set to 200. Over the whole training process, the complete data set is randomly divided five times. The initial value of a network often affects its convergence result. To eliminate this possible influence on the experimental results, the transfer network and the ordinary network are each randomly initialized ten times; for the transfer network, initialization refers to the initialization of the pre-trained network. That is to say, 50 transfer models and 50 ordinary models are trained for each n value (five data divisions times ten initializations). The corresponding test set is then used to evaluate the prediction accuracy of each trained model.
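The bookkeeping of this protocol can be sketched as a generator of training runs. The code is a hedged illustration: the function names are hypothetical, the total of 300 samples is the figure reported for the cutting experiment, and splitting the held-out samples equally into validation and test sets is an assumption.

```python
import random

SAMPLE_SIZES = [5, 10] + list(range(20, 201, 10))  # 5, 10, 20, ..., 200


def split_indices(n_total, n_train=200, seed=0):
    """Shuffle sample indices, keep the first n_train for training, halve the rest."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    train, rest = idx[:n_train], idx[n_train:]
    half = len(rest) // 2
    return train, rest[:half], rest[half:]


def training_runs(n_total, n_divisions=5, n_inits=10):
    """Yield one record per (data division, initialization, n) combination."""
    for division in range(n_divisions):
        train, val, test = split_indices(n_total, seed=division)
        for init in range(n_inits):
            for n in SAMPLE_SIZES:
                yield {"division": division, "init": init, "n": n,
                       "train_idx": train[:n], "val_idx": val, "test_idx": test}


runs = list(training_runs(n_total=300))
runs_for_n50 = [r for r in runs if r["n"] == 50]
```

Each value of n is thus paired with five data divisions and ten initializations, and the training subset for a smaller n is always contained in the subset for a larger n within the same division.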

Results and discussion

Following the training process described in the previous section, the performance of the ordinary network and the transfer network is evaluated on a test set. For each number of experimental samples used in training, the models of each group are evaluated on the test set and the prediction error is averaged. The experimental results thus show the influence of the transfer learning method on the prediction accuracy of the neural network model. The comparison of the two models is illustrated in Figs. 5 and 6.

Fig. 5 X-axis force error. The chart is divided into three stages, according to the number of samples, where the numerical value corresponds to the average error of the model in the corresponding stage

Table 3 includes comparative results regarding the prediction accuracy of the transfer network and the ordinary network, in the range of different sample numbers.

Table 3 Performance advantages of transfer network in each stage

The experimental results illustrated in Figs. 5 and 6 show that the transfer network has different effects in different sample stages. When the number of training samples is less than or equal to 90, the transfer network exhibits an obvious performance advantage, with an average error rate 11.15% lower than that of the ordinary network; when the number of training samples is 100 or more, there is no significant difference between the performance of the transfer network and the ordinary network.

Fig. 6 Y-axis force error. The chart is divided into three stages, according to the number of samples, where the numerical value corresponds to the average error of the model in the corresponding stage

In the range of 5–90 samples, the average prediction error of the transfer network is lower than that of the ordinary network, so its performance is clearly improved. The improvement is most pronounced in the range of 5–20 samples, where the average error of the transfer network is 15.53% lower than that of the ordinary network. However, the error rate of the transfer network in this range is still high and may be unacceptable in actual application, so neither the transfer network nor the ordinary network can readily meet practical requirements with 5–20 samples. In the range of 30–90 samples, the average error of the transfer network is 7.38%, while the average error is 24.76% and 20.73%, respectively, which is a suitable range for transfer networks.

In the range of 100–200 samples, the difference between the prediction error of the transfer network and the ordinary network is less than 1%. This shows that, in this range, the transfer learning method has no effect on the performance of neural network.

In the whole range of samples, the prediction error of the transfer network is, in most cases, lower than that of the ordinary network, while its performance advantage basically decreases gradually and finally disappears. This shows that, in this experiment, the influence of transfer learning method on the performance of neural network will gradually decrease, as the number of samples increases. When the number of samples is higher than 100, the effect of transfer learning becomes very weak, while at a number of samples higher than 140, the method of transfer learning has no effect at all.

Based on the experimental data, with 20 samples or fewer the prediction errors of both the transfer network and the ordinary network are relatively high, and the prediction models established in this range are not suitable for actual application. With 90 samples or fewer, the transfer network performs significantly better than the ordinary network, and within this range the transfer learning method is most useful. In the range of 100–200 samples, the achievable prediction error of the model has basically reached its limit, and even if the training samples continue to increase, it is difficult to improve the performance of the model significantly. It is also worth noting that, although the transfer network exhibits different performance advantages in different sample stages, in no sample range did the transfer learning method have a negative impact on the performance of the model. When it is uncertain, before the model is built, whether the available number of samples is sufficient for a neural network to perform well, the transfer learning method is therefore the better choice.

Conclusion

In this paper, the transfer network is established by using simulation data as the source domain and experiment data as the target domain, combined with the related methods and theories of transfer learning. Based on the experimental results, the transfer network shows obvious advantages in performance, compared to the ordinary network, which proves that the idea and method of transfer learning can be applied to the field of cutting process, to solve some practical problems, such as the prediction of cutting force. This work mainly includes the following two contributions:

For the same number of training samples, the transfer network generally outperforms the ordinary network. Equivalently, the proposed transfer network requires fewer samples to provide acceptable results. To some extent, this reduces the amount of experimental data needed to predict cutting force with a neural network. This is also in line with one of the problems that transfer learning seeks to solve: the gap between data demand and data volume.

The effect of the transfer learning method gradually weakens as the number of samples increases. When the number of samples is sufficient, the performance of the transfer network is basically consistent with that of the ordinary network. For every sample number tested, the transfer learning method had no negative impact on the performance of the network. Therefore, when it is uncertain whether the number of samples is enough to train a neural network model with excellent performance, transfer learning is a sound option.