1 Introduction

Process monitoring is essential for improving the quality and efficiency of machining because the information obtained enables online control or offline optimization of process parameters. Process monitoring systems integrate professional software, acquisition systems, and sensors. Moreover, they are integrated with machine-tool controllers through communication protocols [1,2,3]. However, the cost and technology of software and industrial sensors limit their application in small- and medium-sized enterprises (SMEs). Therefore, there is an urgent need for low-cost, portable, and easy-to-operate monitoring methods. Smartphones have become essential devices in people’s personal and professional lives. Owing to their high-performance microprocessors, multiple embedded sensor types, and open-source software resources, smartphones are ideal tools for developing low-cost portable monitoring systems, especially for cost-sensitive SMEs [4, 5]. Technicians can use smartphone apps to collect signals, such as images, sounds, and acceleration data, and to process and analyze monitoring data. They can send data to the analysis device via the Web, Bluetooth, or instant messaging software (e.g., MATLAB Mobile, https://ww2.mathworks.cn/products/matlab-mobile.html) [6]. Staacks et al. developed a mobile app called Phyphox (https://phyphox.org/) to collect and analyze sensor data; it contains signal processing algorithms such as the fast Fourier transform (FFT) and time-frequency analysis methods.

Smartphones have been used for vibration monitoring and can obtain measurements with an accuracy close to that of professional vibration sensors [7, 8]. Therefore, smartphones have been used for structural vibration performance testing [7], rail transit vibration comfort tests [8], and bridge health monitoring [9, 10]. However, process monitoring requires industrial sensors with high sampling frequencies (>1 kHz) and large measurement ranges (>20 g), which are difficult to achieve with triaxial accelerometers on smartphones.

Recently, machine vision has been used to monitor the condition of machine tools. Remote technicians can observe the machining site using cameras installed inside or outside the machine [11, 12]. Deep learning–based text recognition techniques are used to obtain machine tool information from the human-machine interface (HMI), enabling automatic monitoring. Kim et al. [13] proposed a low-cost machine tool monitoring system based on a commercial camera and open-source software to monitor online machine operation data by gathering information from the HMI of CNC machine tools. Lee et al. [14] used a webcam to monitor the HMI of a five-axis machine and adjusted the spindle speed by using the machine’s control panel. Xing et al. [15] developed a data recording system for machine tool measurements based on a low-cost camera and an open-source computing platform to reduce human error, machining costs, and time. Compared with webcams, smartphones have advantages in terms of camera performance, data transmission, software resources, and computing performance. They are widely used in portable microscopic inspection [16] and tool-wear-type determination (https://www.sandvik.coromant.cn/zh-cn/knowledge/machining-calculators-apps/pages/tool-wear-analyzer.aspx). Structural vibrations can be analyzed from videos recorded by smartphones using motion alignment techniques in image processing. Gupta et al. [17] proposed a modal parameter identification method based on the high-speed photography function of a smartphone. André et al. [18] presented methods for performing advanced mechanical analyses using smartphones. Using recorded video data, they estimated the instantaneous angular velocity of a rotating fan and the natural frequency of a cantilever beam structure.

The acoustic emission characteristics in the air can reflect the machine and tool conditions. Moreover, acoustic acquisition systems can be installed far from the workpiece owing to sound propagation characteristics, thus providing a quick and easy method for diagnosing faults [19, 20]. Smartphones can be developed into portable acoustic acquisition and fault diagnosis devices owing to their low cost and simple operation. Akbari et al. [21] used a smartphone to collect sound signals from tool strikes and identify the modal parameters of the holder-tool system. Xu et al. [22] proposed a mobile device-based fault diagnosis method for extracting fault-related features using quasi-tachometer signal analysis and envelope order spectrum analysis. Huang et al. [23] proposed an acoustic fault diagnosis method based on time-frequency analysis and machine learning for automotive power seats using a smartphone as the acoustic signal acquisition instrument. The sound signal is a standard signal for process monitoring, especially for chatter detection. Chatter stability can be determined by extracting features related to the chatter component in the time, frequency, and time-frequency domains of the sound signal [20, 24,25,26,27]. However, the effectiveness of smartphones in chatter detection has not yet been explored. In contrast to traditional industrial microphones, technicians can use their smartphones both for process monitoring and for recording process parameters and experimental phenomena. Speech can be recognized as textual information using open-source AI models. However, the human voice overlaps with the sound generated by the cutting process in the frequency domain, which may lead to false positives in chatter detection.

The tool status is related to the cutting position [27]. For example, chatter can be effectively detected by relying only on acoustic signals; however, traditional industrial sensors and instruments cannot obtain the tool position information related to chatter occurrence without collecting controller information. Smartphones can simultaneously collect multiple types of monitoring information by integrating multiple sensors. The combined measurement results obtained from different sensors can help technicians analyze machining process anomalies more accurately. This study proposes a position-oriented mobile monitoring method that utilizes image and sound information simultaneously. The main contributions of this study are as follows:

  1) This study explored the effectiveness and extendibility of using smartphones for process monitoring at manufacturing sites. Compared with industrial sensors and acquisition instruments, the AI-based monitoring method can obtain machine operation information from the HMI without retrieving it from the machine tool controller, and it enables the operator to record experimental details easily.

  2) The use of proven open-source AI and signal processing algorithms reduces development cycles and costs, facilitating rapid deployment by SMEs and research institutions.

  3) Simultaneously acquired video and sound signals are used to enable position-oriented process monitoring. The image contains the actual tool movement or the tool position displayed on the HMI and can help establish an association between the sound signals and the tool position.

  4) Robotic milling and deep hole drilling experiments were conducted to verify the effectiveness of the proposed method at different manufacturing sites.

The remainder of this paper is organized as follows. Section 2 proposes a smartphone-based process monitoring scheme in which open-source toolkits and algorithms are used for optical character recognition (OCR), speech recognition, voice elimination, motion tracking, and tool status monitoring. In addition, the tool modal parameters are identified using operational modal analysis (OMA). The sound signal, with the human voice removed, is used to calculate the energy ratio and time-domain characteristics to help the operator determine the chatter and tool wear status. Section 3 verifies the effectiveness of the proposed method through robotic milling and deep-hole drilling experiments.

2 Materials and methods

This section presents a smartphone-based position-oriented machine-tool monitoring method using open-source software and models written in Python. Open-source AI techniques are used for image text recognition, speech recognition, human voice cancellation, and motion tracking. The time-domain characteristics and signal processing methods of sound signals are used for modal parameter identification, chatter detection, and tool status determination. Figure 1 illustrates the process monitoring scheme based on a recorded video of the HMI or machining area during the machining process. The images and sound signals were captured simultaneously. Because the image contains processing position information, the sound signal and tool position can be correlated effectively. This helps the technician better understand the machining status with respect to different positions and modify the cutting parameters in the corresponding area.

Fig. 1
figure 1

Scheme of smartphone-based process monitoring

2.1 Processing monitoring using AI methods

Various types of information, including the program position, spindle speed, and feed rate, can be extracted from the HMI using OCR technology. Subsequently, the sound signal is correlated with the identified program instructions over time. A mapping relationship was established between the sound signal and its image when using a smartphone to film the tool movement during machining. We can then locate the corresponding cutting area with respect to the monitoring anomaly. Tool position can be automatically identified via motion alignment for batch processing or long-term monitoring scenarios. The implementation method for each part is described as follows:

The image information in the video provides a visual record of the machine HMI and machining area. The machine HMI displays the necessary information about the machine's operational status, which can be monitored by taking videos with a smartphone. As shown in Fig. 1a, the video is converted into images, and the appropriate area is selected using image-processing techniques. Subsequently, the HMI information is extracted using the OCR method, which recognizes the original information based on training data related to the recognized characters. In this study, PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR), developed by Baidu, was chosen as the OCR algorithm because it is among the most accurate free and open-source OCR engines currently available. Note that when the characters on the operation panel are small or faint, fixing the smartphone next to the machine’s operation panel is recommended to improve the recognition accuracy.
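As a minimal illustration of the post-processing step, the sketch below parses recognized HMI strings into numeric fields. The string formats, field names, and confidence threshold are hypothetical; in practice, the input would come from PaddleOCR's per-line `(text, confidence)` output, and the patterns would depend on the specific controller's HMI layout.

```python
import re

def parse_hmi_lines(lines):
    """Parse OCR'd HMI strings into numeric fields.

    `lines` is a list of (text, confidence) pairs, as returned per line by a
    typical OCR engine. The string formats below (e.g. "S 4000", "F 1800",
    "X -12.500") are illustrative examples, not a specific machine's HMI.
    """
    fields = {}
    patterns = {
        "spindle_speed": r"S\s*([-+]?\d+(?:\.\d+)?)",
        "feed_rate":     r"F\s*([-+]?\d+(?:\.\d+)?)",
        "x_position":    r"X\s*([-+]?\d+(?:\.\d+)?)",
    }
    for text, conf in lines:
        if conf < 0.6:                 # discard low-confidence recognitions
            continue
        for name, pat in patterns.items():
            m = re.search(pat, text)
            if m:
                fields[name] = float(m.group(1))
    return fields

# Example with mock OCR output:
mock = [("S 4000", 0.98), ("F 1800", 0.95), ("X -12.500", 0.91), ("?", 0.30)]
print(parse_hmi_lines(mock))
```

Filtering by recognition confidence before parsing helps suppress the misreadings that occur when HMI characters are small or faint.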

The cutting tool status and position were recorded and analyzed by filming the machining area with a smartphone. The magnification adjustment function of the video recording allows the operator to observe the cutting area from afar. As shown in Fig. 1c, the tool position can be marked in a video or image using a target recognition algorithm. It should be noted that target tracking is not required in most cases; it can help the operator locate the tool position more visually when the phone is fixed to monitor batch processing. In this study, we used the CascadeClassifier in OpenCV (https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html) for target recognition, with Haar and local binary pattern (LBP) features as the recognition features. The images were converted to grayscale and resized to improve the training and recognition efficiency. Note that when the operator records a video with a handheld smartphone, camera shake can affect the recognition results. In addition, environmental factors such as lights, fixtures, and tools at the experimental site reduce the recognition accuracy.
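The grayscale conversion and downsizing steps can be sketched without OpenCV; with OpenCV installed, `cv2.cvtColor` and `cv2.resize` perform the same operations. The sketch below uses the standard ITU-R BT.601 luminance weights and a dummy frame; the sizes are illustrative.

```python
import numpy as np

def to_gray(rgb):
    """Convert an (H, W, 3) RGB image to grayscale using ITU-R BT.601 weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image (a simple stand-in for cv2.resize)."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h   # source row index for each output row
    cols = np.arange(out_w) * w // out_w   # source column index for each output column
    return img[rows][:, cols]

rgb = np.random.rand(480, 640, 3)          # a dummy video frame
gray = to_gray(rgb)                        # shape (480, 640)
small = resize_nn(gray, 120, 160)          # reduced for faster cascade detection
print(gray.shape, small.shape)
```

Reducing the frame size before cascade detection trades a small loss in localization accuracy for a substantial speedup, which matters for long recordings.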

The sound signal, a commonly used process monitoring signal, was captured simultaneously during video recording. In contrast to traditional industrial sensors and acquisition systems, technicians can collect sound signals while recording experimental details using smartphones. Therefore, speech recognition is required to eliminate the influence of human voices on process monitoring. Figure 1b shows the process of speech recognition and human voice separation using open-source software. Speech recognition is a widely used artificial intelligence (AI) technology. The open-source toolkit PaddleSpeech (https://github.com/PaddlePaddle/PaddleSpeech) and its lightweight speech recognition model were used for speech recognition in this study. Large recognition errors may occur because the model is trained without considering the noise present during machining. Because the frequency range containing the human voice (< 1000 Hz) overlaps with the signal frequencies of the cutting process, misclassification may occur if the human voice signal is not eliminated. This study used Spleeter (https://github.com/deezer/spleeter), an open-source separation library and pre-trained model developed by Deezer, to separate human voices from noise. Note that the signal after vocal elimination still contains residual human voice components, which may interfere with process monitoring. Therefore, we marked the locations of the vocal signal to help the technician make a better assessment. The human voice signal separated from the original signal cannot itself be used for speech recognition, mainly because the necessary information is removed during the separation process.
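Once the separation model has located the vocal segments, marking them in the monitoring signal is straightforward. The sketch below assumes the vocal intervals are already known (e.g., obtained by thresholding the energy of Spleeter's separated vocal stem); the interval values and sampling rate are illustrative.

```python
import numpy as np

def vocal_mask(n_samples, fs, vocal_intervals):
    """Boolean mask flagging samples that fall inside detected vocal intervals.

    vocal_intervals: list of (start_s, end_s) pairs in seconds, e.g. obtained
    by thresholding the energy of the separated vocal stem.
    """
    mask = np.zeros(n_samples, dtype=bool)
    for t0, t1 in vocal_intervals:
        mask[int(t0 * fs):int(t1 * fs)] = True
    return mask

fs = 44100
x = np.random.randn(5 * fs)                          # 5 s of monitoring signal
mask = vocal_mask(len(x), fs, [(1.0, 1.5), (3.2, 3.4)])
print(mask.sum() / fs)                               # flagged duration in seconds
```

The mask can then be overlaid on the chatter indicator curve so that the technician can discount detections that coincide with residual voice components.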

The video and audio signals captured synchronously using a smartphone are aligned in time, and the number of sound-signal points \(N_{\mathrm{s}}\) corresponding to \(N_{\mathrm{v}}\) video frames can be calculated using the following equation:

$$N_{\mathrm{s}}=\frac{N_{\mathrm{v}}\,{fs}_{\mathrm{s}}}{{fs}_{\mathrm{v}}},$$
(1)

where \(N_{\mathrm{v}}\) is the number of video frames, and \({fs}_{\mathrm{v}}\) and \({fs}_{\mathrm{s}}\) are the sampling frequencies of the video and sound signals, respectively.
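Equation (1) reduces to a one-line computation. The sketch below uses illustrative sampling rates (30 fps video, 44.1 kHz audio); actual values depend on the smartphone's recording settings.

```python
# Alignment of video frames with sound samples (Eq. 1).
fs_v = 30        # video frame rate (fps); example value
fs_s = 44100     # audio sampling frequency (Hz); example value

def frames_to_samples(n_frames, fs_v=fs_v, fs_s=fs_s):
    """Number of sound samples spanned by n_frames video frames: N_s = N_v*fs_s/fs_v."""
    return int(n_frames * fs_s / fs_v)

# Audio span of a 6-frame analysis window:
print(frames_to_samples(6))   # → 8820 samples, i.e. 0.2 s of audio
```

This mapping is what allows a detected anomaly in the sound signal to be traced back to the video frame, and hence the tool position, at which it occurred.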

2.2 Modal analysis and chatter detection using sound signal

Figure 2 describes the scheme of tool modal parameter identification and chatter detection. The operator determines the time and position of chatter occurrence based on the sound signals and images collected by the smartphone.

Fig. 2
figure 2

Scheme of tool modal parameters identification and chatter detection

2.2.1 Identification of tool modal parameters by OMA

Tool modal parameters can help the operator select appropriate cutting parameters to avoid chatter. However, the modal hammer test requires specialized acquisition instruments and analysis software. This paper proposes a tool modal identification method based on sound signals recorded by a smartphone, as shown in Fig. 2b. First, the tool tip is struck using a metal tool commonly available at the machining site, and the sound of the strike is captured using a smartphone. The sound signal contains information about the tool tip vibration; therefore, the natural frequency and damping ratio can be estimated using the OMA method. OMA is a modal parameter estimation method that uses only the output signal. Accordingly, the relationship between the output power spectral density (PSD) matrix \(\mathbf{S}_{yy}(\omega)\) of the sound signal and the frequency response function matrix \(\mathbf{H}\left(\omega \right)\) of the tool tip is obtained as follows:

$$\mathbf{S}_{yy}(\omega)=\mathbf{H}^{*}(\omega)\,\mathbf{S}_{xx}(\omega)\,\mathbf{H}{(\omega)}^{T},$$
(2)

where \({\mathbf{S}}_{xx}\left(\omega \right)\) is the PSD matrix of the excitation, which under the white-noise assumption is a diagonal matrix with constant real values. To reduce the computational effort, a modal decomposition based on the half-spectrum method was used, which is expressed as follows:

$$S_{yy}^{+}\left(\omega\right)=\sum_{r=1}^{N_{m}}\left(\frac{\Phi_{r}g_{r}^{T}}{j\omega-\lambda_{r}}+\frac{\Phi_{r}^{\ast}g_{r}^{H}}{j\omega-\lambda_{r}^{\ast}}\right),$$
(3)

where \(\Phi_r\), \({g}_{r}\), and \({\lambda }_{r}\) are the mode shape, operational factor, and stable pole, respectively. Finally, the modal parameters were identified using the poly-reference least-squares complex frequency-domain (p-LSCF) method; a more detailed explanation of the p-LSCF method is given in [28]. The least-squares method was used to solve the equations and determine the model parameters during the calculation. The natural frequency and damping ratio are calculated as follows:

$$\omega_{r}=\frac{|\lambda_{r}|}{2\pi},\qquad \xi_{r}=-\frac{\mathrm{Re}(\lambda_{r})}{|\lambda_{r}|}.$$
(4)
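Equation (4) maps each identified pole directly to a natural frequency and damping ratio. The sketch below applies it to a synthetic pole constructed for an assumed 800 Hz mode with 2% damping (illustrative values, not the tool of Section 3):

```python
import numpy as np

def pole_to_modal(lam):
    """Natural frequency (Hz) and damping ratio from a stable pole lambda (Eq. 4)."""
    wn = abs(lam)                       # undamped natural angular frequency (rad/s)
    return wn / (2 * np.pi), -lam.real / wn

# A synthetic pole for an 800 Hz mode with 2 % damping:
wn = 2 * np.pi * 800
lam = -0.02 * wn + 1j * wn * np.sqrt(1 - 0.02**2)
f_r, zeta_r = pole_to_modal(lam)
print(round(f_r, 1), round(zeta_r, 4))
```

In the full procedure, this conversion is applied to every pole retained by the stability criteria before the clustering step groups the physical modes.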

The physical and computational poles were distinguished according to the stability criteria. Subsequently, the modal parameters were automatically extracted using a clustering approach. The frequency response function estimation results obtained using the OMA method cannot be directly used for chatter stability prediction because of the lack of an input force signal. However, the operator can select an appropriate spindle speed based on the identification results to ensure that the spindle rotation and harmonic frequencies avoid the natural frequency of the tool tip. This is assessed using the remainder of the division of the selected tool tip natural frequency by the spindle rotation frequency, as shown in the following equation:

$$\mathrm{Inx}=\mathrm{mod}\left(\frac{\omega_{\mathrm{tool},r}}{\omega_{\mathrm{sp}}},\,1\right),$$
(5)

where \(\omega_{\mathrm{tool},r}\) is the rth-order natural frequency of the tool tip and \(\omega_{\mathrm{sp}}\) is the spindle rotation frequency. The threshold value is determined empirically; typically, the first three orders of the tool tip frequency are used to determine the spindle speed.
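The harmonic-avoidance check of Eq. (5) can be sketched as follows. The tool-tip frequencies, the spindle speed, and the 0.1 proximity threshold are all illustrative values (the paper states the threshold is determined empirically):

```python
import numpy as np

def harmonic_proximity(f_tool, n_rpm):
    """Fractional part of f_tool / f_sp (Eq. 5): values near 0 or 1 mean a
    spindle harmonic sits close to a tool-tip natural frequency."""
    f_sp = n_rpm / 60.0                      # spindle rotation frequency (Hz)
    return np.mod(np.asarray(f_tool) / f_sp, 1.0)

# First three tool-tip modes (example values) checked against 4000 rpm:
inx = harmonic_proximity([780.0, 1350.0, 2210.0], 4000)
risky = (inx < 0.1) | (inx > 0.9)            # 0.1 is an illustrative threshold
print(inx.round(3), risky)
```

A spindle speed flagged as risky would be adjusted until no harmonic falls close to the first few tool-tip natural frequencies.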

2.2.2 Chatter detection method

The sound signal is effective for detecting cutting chatter. Industrial microphones can be positioned away from the cutting area to collect sound signals, which are analyzed using specialized acquisition systems and software. In this study, a smartphone was used as the sound acquisition instrument. The sound signal during the cutting process was recorded next to the machine, and was composed of

$${x}_{\mathrm{raw}}={x}_{\mathrm{chatter}}+{x}_{\mathrm{period}}+{x}_{\mathrm{voice}}+{x}_{\mathrm{noise}},$$
(6)

where \(x_{\mathrm{period}}\) is the periodic component associated with spindle rotation and is the dominant component during stable cutting. \(x_{\mathrm{voice}}\) is the human voice, which has not been considered in other chatter detection studies. \(x_{\mathrm{chatter}}\) and \(x_{\mathrm{noise}}\) are the chatter and noise components of the signal, respectively. Chatter frequencies associated with the natural frequencies of the spindle-tool system appear in the milling signal when chatter occurs. Moreover, the energy at the chatter frequency increases dramatically as the chatter develops. The chatter detection procedure, based on the chatter mechanism, is illustrated in Fig. 2a.

First, the raw signal is filtered using a bandpass filter according to the modal parameters of the tool to reduce the effects of the high- and low-frequency components. The sampling frequency of the phone exceeded 40 kHz, whereas chatter generally occurs below 5 kHz. As mentioned above, the components near the natural frequencies of the tool tip are the focus of chatter detection. However, multiple frequencies are distributed around the central chatter frequency in the vibration spectrum owing to signal modulation. Moreover, multiple tool tip modes may cause multiple central chatter frequencies at certain spindle speeds. Hence, the cutoff frequencies of the bandpass filter are determined based on the lowest and highest modal natural frequencies of the tool tip as follows:

$$\left\{\begin{array}{l}\omega_{\mathrm{lp}}=\max\left(\omega_{\mathrm{tool},1}-3.5\,\omega_{\mathrm{sp}},\;\omega_{\mathrm{sp}}\right)\\\omega_{\mathrm{hp}}=\min\left(\omega_{\mathrm{tool,max}}+3.5\,\omega_{\mathrm{sp}},\;5000\right)\end{array}\right.,$$
(7)

where \(\omega_{\mathrm{lp}}\) is the lowest cutoff frequency of the passband, determined by the first-order natural frequency of the tool tip \(\omega_{\mathrm{tool},1}\) and the spindle rotation frequency \(\omega_{\mathrm{sp}}\). \(\omega_{\mathrm{hp}}\) is the highest cutoff frequency of the passband, and \(\omega_{\mathrm{tool,max}}\) is the highest-order tool tip natural frequency.
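The cutoff computation of Eq. (7) can be sketched directly; the tool frequencies and spindle speed below are example values. In practice, the returned cutoffs would then be passed to a filter design routine such as `scipy.signal.butter`.

```python
def passband_cutoffs(f_tool_min, f_tool_max, n_rpm, f_limit=5000.0):
    """Band-pass cutoff frequencies (Hz) from Eq. (7)."""
    f_sp = n_rpm / 60.0                              # spindle rotation frequency
    w_lp = max(f_tool_min - 3.5 * f_sp, f_sp)        # lower cutoff
    w_hp = min(f_tool_max + 3.5 * f_sp, f_limit)     # upper cutoff
    return w_lp, w_hp

# Example: tool modes between 780 Hz and 4800 Hz at 4000 rpm:
w_lp, w_hp = passband_cutoffs(780.0, 4800.0, 4000)
print(round(w_lp, 1), w_hp)
```

The ±3.5 ω_sp margin keeps the sidebands produced by signal modulation inside the passband, as discussed above.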

In this study, a matrix notch filter was used to suppress the periodic signals in the acquired signal to reduce interference with chatter detection [29]. A matrix notch filter is typically used to suppress specific frequencies in a signal and is employed to eliminate narrowband interference while keeping the broadband signal unchanged. The output signal filtered by the notch filter can be expressed as follows:

$$\mathbf{x}_{\mathrm{res}}=\mathbf{H}_{\omega_{s}}\,\mathbf{x}_{\mathrm{raw,tool}},$$
(8)

where \(\mathbf{x}_{\mathrm{raw,tool}}\) denotes the input signal obtained using the designed bandpass filter, \(\mathbf{x}_{\mathrm{res}}\) is the output signal, called the residual signal, and \(\mathbf{H}_{\omega_{s}}\) is the matrix notch filter characterized by the following constraints:

$$H_{\omega\mathrm s}V(\omega)=\left\{\begin{array}{ll}0&\omega=\omega_s\\V\left(\omega\right)&\omega\neq\omega_s\end{array},\right.$$
(9)

where the vector \(V\left(\omega \right)\) is defined as \(V(\omega )={[1,{e}^{j\omega },{e}^{j2\omega },\cdots ,{e}^{j(N-1)\omega }]}^{T}\), and the passband of the filter is \(P=[0,\omega_{s}-\varepsilon ]\cup [\omega_{s}+\varepsilon ,\pi ]\), where \(\omega_{s}\) is the notch frequency, which is related to the spindle rotation and harmonic frequencies, and \(\varepsilon\) is a small positive constant. A detailed design of the filter is presented in [29].
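The matrix notch filter design of [29] is beyond the scope of a short example. Its effect, however, can be approximated by a simpler substitute technique: least-squares projection of the signal onto sines and cosines at the spindle rotation frequency and its harmonics, then subtraction of the fit. The sketch below uses this substitute (not the filter of [29]) with illustrative signal parameters:

```python
import numpy as np

def remove_harmonics(x, fs, f_sp, n_harm=10):
    """Least-squares removal of sin/cos components at f_sp and its harmonics.

    A simple stand-in for a notch filter bank: project x onto the harmonic
    basis and subtract the fit, leaving the residual (chatter + noise) part.
    """
    t = np.arange(len(x)) / fs
    basis = []
    for k in range(1, n_harm + 1):
        basis.append(np.sin(2 * np.pi * k * f_sp * t))
        basis.append(np.cos(2 * np.pi * k * f_sp * t))
    A = np.column_stack(basis)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return x - A @ coef                        # residual signal x_res

fs, f_sp = 44100, 66.7                         # e.g. a 4000 rpm spindle
t = np.arange(fs) / fs                         # 1 s of synthetic signal
x = np.sin(2*np.pi*f_sp*t) + 0.3*np.sin(2*np.pi*820*t)   # periodic + "chatter"
res = remove_harmonics(x, fs, f_sp)
print(round(np.sum(res**2) / np.sum(x**2), 3))  # residual-to-raw energy ratio
```

On this synthetic signal, the dominant spindle-frequency component is removed while the 820 Hz "chatter" component, which lies outside the harmonic basis, survives in the residual.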

Subsequently, the raw signal \(x{(n)}_{\mathrm{raw}}\) is filtered using a bandpass filter with a passband of 25–5000 Hz, thereby eliminating the direct-current and high-frequency components to obtain the signal \(x{(n)}_{\mathrm{raw,bp}}\).

Finally, the ratio of the residual signal energy to the total signal energy is used as an indicator for chatter detection using:

$$\mathrm{ER} = \frac{{\sum }_{n=1}^{N}x{(n)}_{\mathrm{res}}^{2}}{{\sum }_{n=1}^{N}x{(n)}_{\mathrm{raw,bp}}^{2}}.$$
(10)

The threshold value for chatter occurrence is determined through cutting experiments. When the energy ratio of the sound signal exceeds this threshold, chatter is judged to have occurred. According to the results of the milling experiments, the chatter detection threshold in this study was set to 0.15. To improve the frequency resolution, six image frames and the corresponding sound signals were used to calculate the chatter indicator.
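The windowed indicator of Eq. (10) can be sketched as follows. The 30 fps frame rate is an assumed value, and the synthetic residual signal (a scaled copy of the raw signal) is constructed only to exercise the computation:

```python
import numpy as np

def chatter_indicator(x_raw_bp, x_res, fs_s, fs_v=30, n_frames=6):
    """Energy ratio ER (Eq. 10) per analysis window of n_frames video frames.

    x_raw_bp : band-pass-filtered raw signal; x_res : residual signal after
    the notch filter. fs_v = 30 fps is an assumed frame rate.
    """
    win = int(n_frames * fs_s / fs_v)            # samples per window (Eq. 1)
    n_win = len(x_raw_bp) // win
    er = np.empty(n_win)
    for i in range(n_win):
        s = slice(i * win, (i + 1) * win)
        er[i] = np.sum(x_res[s] ** 2) / np.sum(x_raw_bp[s] ** 2)
    return er

fs_s = 44100
rng = np.random.default_rng(0)
x_raw = rng.standard_normal(fs_s)                # 1 s of synthetic band-passed signal
x_res = 0.5 * x_raw                              # residual at 25 % of the raw energy
er = chatter_indicator(x_raw, x_res, fs_s)
print((er > 0.15).any())                         # 0.15: threshold from the milling tests
```

Because each window spans exactly six video frames, every ER value maps back to a short run of frames, and hence to a tool position, which is what makes the monitoring position-oriented.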

3 Experiments and results

3.1 Process monitoring in robotic milling

An idle operation test was conducted using the KUKA KR 160 robot to verify the effectiveness of HMI monitoring. Figure 3a shows the results of text recognition based on the OCR method using the selected HMI area. As shown in Fig. 3b, the smartphone was mounted on a mobile stand to record the robot HMI during robotic operation. A comparison between the toolpath fitted by the NC program and that fitted by the recognition values is shown in Fig. 3c. Text-recognition errors cause errors in some areas of the recognition curve. Because the operator must hold the operation panel in robotic milling, which is not conducive to photographing the HMI with a phone for text recognition, subsequent milling experiments were conducted using a smartphone to record a video of the machining area. A computer (Intel Core i7-6500U 2.5 GHz, 16 GB RAM) with Python software was used to process the signals.

Fig. 3
figure 3

Process monitoring based on the recorded HMI video: a image recognition procedure, b the fixed smartphone was used to record robot HMI, c comparison between the toolpath fitted by NC code and that fitted by measured values

Milling tests were performed using a KUKA KR500 MT-2 robot equipped with a mechanical spindle (Fig. 4). A Sandvik R390-022A20L-11L milling cutter, mounted with two inserts, was used in the experiments. The insert is T3 16M-KM H13A; the detailed tool parameters are listed in Table 1.

Fig. 4
figure 4

Robotic milling experiment setup

Table 1 Parameters of milling cutter and insert

The sound of a metal rod knocking on the tool tip was captured using a Redmi K40 phone; the raw sound signal of the knock and its PSD are shown in Fig. 5. To reduce the estimation error, data collected from five knocking tests were used for the OMA. The differences in the natural frequencies and damping ratios between the different directions of the tool can be neglected; therefore, only the sound signals recorded when tapping the tool in the x-direction were collected. Figure 6 shows the stabilization diagram of the OMA, in which the natural frequency and damping ratio were automatically calculated based on the stability pole selection criteria. The dominant mode is typically used for chatter prediction. However, it is difficult to determine the tool mode with the lowest modal stiffness owing to the lack of mass scaling in the OMA method. Therefore, this study chose the multi-order modes of the tool for chatter detection. The identification results for the tool are listed in Table 2.

Fig. 5
figure 5

Tool modal testing: a raw sound signal and b power spectral density of the sound signal

Fig. 6
figure 6

Stabilization diagram for modal identification

Table 2 Tool modal parameters

The feed direction and tool path are shown in Fig. 4. The milling test contained two toolpaths: slot milling and side milling. Milling experiments were conducted on an aluminum alloy; the experimental parameters are listed in Table 3. The spindle speed was 4000 rpm, and the feed per tooth was 0.225 mm. Note that the 5th-order tool natural frequency was chosen as the highest cutoff frequency \(\omega_{\mathrm{tool,max}}\) in Eq. (7) because of the low spindle speed.

Table 3 Cutting parameters and experimental settings

Figure 7a presents a comparison of the raw signal with the separated vocal signal in test no. 5. The raw sound signal contained three vocal signal regions, marked as V1, V2, and V3. The sound signal of V1 could be used to accurately identify the cutting parameters and corresponding textual content. By contrast, the sound signals in the V2 and V3 regions could not be recognized as statements with a clear meaning. It should be noted that spindle rotation and noise from the milling process can interfere with the recognition results, particularly when the operator speaks quickly. The failure of speech recognition is related to spindle rotation noise and unintelligible speech of the operator during the cutting process. Figure 7b shows the chatter index calculated from the sound signal, in which the spindle noise was removed according to the method proposed in [30]. Severe chatter occurred during the period 28.6–30.2 s.

Fig. 7
figure 7

Sound signal and chatter indicator curve in test no. 5. a Comparison of raw sound signal and human voice signal, the result of speech recognition; b chatter indicator curve

Figure 8 shows the procedure of the proposed method for test no. 2. A comparison between the residual and raw signals is shown in Fig. 8a. According to the chatter indicator curve shown in Fig. 8b, the chatter was most severe at 5.6 s (P1) and disappeared at 6.6 s (P2). Figure 8c presents the FFT diagram for different frequency ranges at 5.6 s. The chatter frequencies can be clearly observed and are related to the first-order natural frequency of the tool. The signal energy was mainly concentrated within 600–1000 Hz at that time. Figure 8d shows the tool positions at which the most severe chatter occurred and at which the chatter disappeared, as determined from the chatter indicator curve. Severe chatter occurred before the tool entered the corner and disappeared when the tool cut into the corner. There was no chatter during the side-milling process.

Fig. 8
figure 8

Chatter detection associated with the cutting position of the tool in test no. 2: a comparison between the raw signal and residual signal, b chatter indicator curve, c FFT at P1, and d tool position at P1 and P2

Figure 9 compares the chatter indicator curves of tests no. 1, no. 3, no. 4, and no. 6. The maximum stable axial depth of cut was 0.4 mm when the tool fed along the y-direction. In contrast, the chatter was mitigated when the tool fed along the x-direction. The feed directions in tests no. 4 and no. 6 were different. Comparing Fig. 9a and b, tool chatter occurred when the depth of cut was increased from 0.2 to 0.4 mm. Comparing Fig. 9c and d, severe chatter occurred in test no. 4 during the slot milling process, which means that the chatter component was the main component of the sound signal. A lower chatter level was observed in test no. 6 than in test no. 4. Severe chatter occurred during the period 7.8–8.2 s in test no. 6. However, it was impossible to determine the tool position when the chatter occurred because only the sound signal was recorded.

Fig. 9
figure 9

Chatter indicator curve in tests. a no. 1, b no. 3, c no. 4, and d no. 6

The corresponding tool positions were extracted from the video recordings according to the initiation and disappearance of chatter determined in Fig. 9c. Figure 10 indicates that chatter occurred when the tool started to cut into the workpiece and continued until the tool cut into the corner. In addition to the FFT diagram of the sound signal, the technician can detect chatter by observing the surface quality. Compared with the machined surface without chatter shown in Fig. 11b, the chatter marks on the surface of the workpiece used in test no. 4 can be clearly observed in Fig. 11a; they extend from the cut-in area to the corner. Hence, the chatter marks in Fig. 11a confirm the accuracy of the chatter positions shown in Fig. 10.

Fig. 10
figure 10

Tool positions captured from the video recorded during test no. 4: a tool position when chatter occurs and b tool position when chatter disappears

Fig. 11
figure 11

Experimental surface characteristic: a surface characteristic after test no. 5 and b surface characteristic after stable milling

The aforementioned experiments proved the effectiveness of the position-oriented chatter detection method proposed in this study. Fusing the image and sound signals can help the operator discover the tool position related to chatter occurrence.

3.2 Process monitoring in the deep hole drilling process

To verify the method proposed in this study, the machine HMI shown in Fig. 12 was filmed in a factory environment using a smartphone during the deep-hole drilling process. The machine could drill three holes simultaneously.

Fig. 12
figure 12

a Deep hole drilling machine and b schematic structure of the drill piece

The machining object was a composite sheet of Inconel 625 and FeCr alloy. The spindle speed was 95 rpm during the drilling of Inconel 625 and was increased to 500 rpm as the drilling depth increased. Figure 13 shows the identification process for the machine HMI, which displays the spindle speed, feed speed, tool position, axial thrust, spindle power, and other information for each spindle. The recognition failure rate for machining information displayed in a small font was high; therefore, this study mainly recognized the spindle power and thrust values. The spindle speed displayed on the HMI was determined based on the programmed spindle speed.

Fig. 13

Image recognition procedure of deep hole drilling machine HMI

Experienced technicians can assess the tool wear state from the sound and workpiece vibrations during cutting. Figure 14a shows the sound signal collected at the deep-hole drilling site, which contains the spindle sound, cutting sound, cooling-system sound, and human voices. Moreover, it is difficult to accurately isolate the cutting sound of each spindle when the three spindles operate simultaneously. In this study, the raw signal was processed using a bandpass filter with a passband of 25–1500 Hz, and the filtered sound energy was used as the reference data. Figure 14b compares the spindle speed and the signal-energy curve during the drilling process. Note that the signal-energy curve was calculated from the filtered signal after removing the human voice. The spindle speed had the greatest effect on the sound energy. Moreover, the sound-energy curve increased abruptly at 320 s, which may be related to the tool wear state.
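The band-pass filtering and energy calculation described above can be sketched as follows. For simplicity, an FFT-domain brick-wall filter (NumPy only) stands in for the study's unspecified filter design, and the energy is summed over non-overlapping one-second windows:

```python
import numpy as np

def bandpass_energy(x, fs, band=(25.0, 1500.0), win_s=1.0):
    """Band-pass the sound signal and return its short-time energy.

    An FFT-domain brick-wall filter stands in for the 25-1500 Hz band-pass
    filter described in the text; energy is then summed over
    non-overlapping windows of win_s seconds.
    """
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(f < band[0]) | (f > band[1])] = 0.0   # zero bins outside the passband
    y = np.fft.irfft(X, n=len(x))
    n = int(win_s * fs)
    return np.array([np.sum(y[i * n:(i + 1) * n] ** 2) for i in range(len(y) // n)])

# A 200 Hz tone lies inside the passband; a 5 Hz tone is rejected
fs = 8000
t = np.arange(2 * fs) / fs
e_pass = bandpass_energy(np.sin(2 * np.pi * 200 * t), fs)
e_stop = bandpass_energy(np.sin(2 * np.pi * 5 * t), fs)
print(e_pass[0] > 1000 * e_stop[0])  # True
```

In practice the voice-removal step (Spleeter) would precede this calculation, so that the energy curve tracks the machining sound rather than conversation near the machine.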

Fig. 14

Sound signal and spindle speed recognition results: a raw sound signal and human voice signal and b synchronization of spindle speed and sound signal energy

The thrust signals of the three spindles are compared in Fig. 15, where the z-directional thrust force is primarily related to the cutting edge. Based on the thrust signals, the thrust forces of the three spindles were higher when drilling the Inconel 625 sheet and lower after the cutters entered the FeCr alloy sheet. The thrust force of tool 3 was higher than that of tools 1 and 2 owing to the more severe wear detected in tool 3. The thrust force of tool 3 increased suddenly at 275 s, which may have been caused by chipping of the tool cutting edge.
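The sudden thrust increase of tool 3 is the kind of event a simple moving-average jump detector can flag automatically. The following sketch runs on synthetic data; the window length and ratio threshold are illustrative assumptions, not values from the study:

```python
import numpy as np

def detect_jump(signal, fs=1.0, win=5, ratio=1.5):
    """Return the time (s) of the first sample where the look-ahead window
    mean exceeds `ratio` times the preceding window mean, or None.

    A simple stand-in for spotting the abrupt thrust increase described
    in the text; win and ratio are illustrative tuning parameters.
    """
    for i in range(win, len(signal) - win):
        before = np.mean(signal[i - win:i])
        after = np.mean(signal[i:i + win])
        if before > 0 and after > ratio * before:
            return i / fs
    return None

# Synthetic thrust trace: steady level, then a step increase at sample 50
thrust = np.array([1.0] * 50 + [2.5] * 20)
print(detect_jump(thrust))  # 47.0 (the look-ahead window reacts a few samples before the step)
```

On the recognized thrust series, the same detector would place the tool 3 event near 275 s; in production use, a threshold calibrated on stable-cutting data would be preferable to a fixed ratio.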

Fig. 15

The z-directional thrust values of the three machine spindles during the drilling process

Figure 16 compares the power curves of the three spindles during the drilling process. Exhibiting a trend similar to that of the thrust signals, the spindle power was higher when machining Inconel 625 and gradually decreased to a steady value after the tools started cutting into the FeCr alloy sheet. The spindle power required by tool 3 was higher than that required by tools 1 and 2. However, unlike the thrust signal, the spindle power curve of tool 3 in Fig. 16 shows no abrupt change. The spindle power is primarily related to the radial cutting forces caused by the guide pad. Therefore, by comparing the thrust and spindle power curves of tool 3, it can be inferred that the cutting edge was severely worn or chipped while the guide pad remained in a normal wear condition. Tool 3 was replaced by the operator because of severe wear after machining, whereas tools 1 and 2 remained in use for drilling.
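The reasoning above (thrust reflects the cutting edge, spindle power mainly the radial load from the guide pad) can be condensed into a rule-of-thumb diagnosis. This is an illustrative sketch, not a validated classifier from the study:

```python
def diagnose_tool(thrust_jump: bool, power_jump: bool) -> str:
    """Rule-of-thumb drill diagnosis: the z-thrust mainly reflects the
    cutting edge, while the spindle power mainly reflects the radial
    load from the guide pad. Labels are illustrative only.
    """
    if thrust_jump and power_jump:
        return "cutting edge and guide pad both abnormal"
    if thrust_jump:
        return "cutting edge worn or chipped; guide pad in normal wear"
    if power_jump:
        return "guide pad abnormal"
    return "normal wear"

# Tool 3 in the study: abrupt thrust change, no abrupt power change
print(diagnose_tool(True, False))
```

The boolean inputs would come from jump detectors applied to the recognized thrust and power series; the rule simply encodes which component each signal implicates.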

Fig. 16

Spindle power curves during the drilling process

4 Conclusions

This study explored the effectiveness of using smartphones for position-oriented process monitoring. The proposed method uses videos of the HMI or machining area recorded during the machining process to determine the tool status based on AI and signal processing techniques. First, information on the tool position and machine tool status was obtained through HMI recognition based on PaddleOCR. Then, the open-source toolboxes PaddleSpeech and Spleeter were used for speech recognition and human-voice elimination on the simultaneously acquired sound signals. In addition, the sound signals were used for modal identification via operational modal analysis (OMA), and the residual sound signal, obtained after eliminating the human voice and periodic components, was used to calculate the chatter indicator.

Finally, robotic milling and deep hole drilling experiments were conducted to validate the effectiveness of process monitoring using a smartphone. The simultaneous acquisition and analysis of tool motion images and sound signals in robotic milling enable tool-position-oriented chatter detection; the chatter marks on the workpiece surface match the extracted tool positions. The status information of the robot and machine tool can be efficiently obtained from the HMI video using the proposed methods. For deep hole drilling machines, the spindle speed, thrust, spindle power, and sound signals are available from the HMI videos, and operators can use the monitoring data to visually analyze the wear status of different tools. In summary, smartphone-based process monitoring helps operators gain insight into the tool status during cutting without modifying the machine control cabinet.

In the future, a smartphone application will be developed based on the proposed method, which can be used for signal collection, data processing, and chatter detection.