
1 Introduction

Due to the complexity of actual production processes, uncertain factors such as equipment downtime, urgent orders, rework due to quality problems, and time adjustments are difficult to avoid. Job shop scheduling therefore often takes the form of complex dynamic scheduling, in which the job plan must be adjusted at any time as production conditions change. Traditional approaches to the dynamic scheduling problem simplify it, ignore disturbances and uncertainty, and transform the complex problem into a static scheduling problem. These methods must be redesigned for each new system state, and their models must be readjusted whenever the production environment changes. Because they do not account for disturbances to the system state, traditional methods cannot meet the needs of actual production. Moreover, in large-scale production workshops, order tasks are complex and involve a great number of resources, which increases the difficulty of the scheduling problem geometrically; for these large-scale and more complex scheduling problems, traditional methods are hard to apply. Studying the dynamic scheduling problem is therefore of great significance for actual production.

A scheduling rule is a priority allocation rule with low time complexity and high robustness, which makes it suitable for solving dynamic scheduling problems. In recent years, scheduling rules have frequently been used to solve job shop scheduling problems. Durasevic et al. [1] studied the applicability of different scheduling rules and identified the scheduling criteria suited to each rule by testing nine criteria and four job types. Kuck et al. [2] proposed an optimization method based on adaptive simulation to select appropriate scheduling rules for production control in complex manufacturing systems subject to equipment failure. Zhang et al. [3] proposed a semantics-based scheduling rule selection system that associates scheduling rules with optimization objectives through semantic similarity and semantic expressions, generating scheduling rule combinations for a given production target. Rolf et al. [4] presented a method for allocating scheduling rules in hybrid flow shop problems with sequence-dependent setup times. Lee et al. [5] proposed a sequential search method to set appropriate weight sets for scheduling rules, using decision trees and hierarchical clustering to improve search efficiency. Braune et al. [6] proposed a tree-based method for generating scheduling priority rules that makes job allocation and machine sequencing decisions through single and multiple trees.

Reinforcement learning is a class of methods that does not rely on samples and, compared with traditional intelligent algorithms, offers higher efficiency and generalization ability. Q-learning (QL), one of the main reinforcement learning methods, is model-free; it avoids the huge computational burden of large-scale scheduling and is applicable to dynamic scheduling problems. Research on reinforcement learning for scheduling problems continues to grow. Bouazza et al. [7] selected suitable equipment and process routes for dynamic scheduling by improving the state-action value table of the reinforcement learning algorithm. Shahrabi et al. [8] used QL to find appropriate parameters for problems with equipment failures and dynamically arriving workpieces. Shiue et al. [9] studied real-time scheduling (TRS) and proposed a real-time scheduling system using a multiple dispatching rules (MDR) mechanism to ensure that the knowledge base (KB) responds to changes in the workshop environment in real time. Wang [10] proposed an adaptive scheduling strategy that avoids the blind search of traditional methods through dynamic greedy search and realizes weighted iteration of the Q function by defining a state error, improving both the speed and the accuracy of learning. Qu et al. [11] proposed a multi-agent method for scheduling production systems covering multiple types of products, equipment, and labor, which adaptively updates production plans in real time. Chen et al. [12] proposed a flexible job shop scheduling method built around a genetic algorithm whose parameters are adjusted intelligently through reinforcement learning. Kardos et al. [13] proposed a new method that selects machines according to real-time information so as to reduce workpiece delays in production.

Most current research on dynamic scheduling is based on specific workshop scenarios, so the resulting scheduling schemes suit only those special environments and are not universal. Building on the research above, this paper proposes a multi-objective dynamic scheduling method based on QL. Through QL, the optimal scheduling strategy under dynamic disturbance is obtained, and real-time matching between the scheduling strategy and the production environment is realized.

2 Description of Dynamic Scheduling Problem

Job shop dynamic scheduling is a complex optimization problem that can be described as follows: n workpieces \(Q= \{Q_{1} {,}Q_{2} {,}...,Q_{n} \}\) are processed on m pieces of equipment \(M = \{M_{1} {,}M_{2} {,}...,M_{m} \}\). Each workpiece Qi has its own process route and processes, and each process corresponds to a set of optional equipment. At the same time, disturbance factors in actual production, such as equipment failure and urgent orders, must be considered. The goal of scheduling is to select suitable processing equipment for each workpiece under the various constraints and dynamic disturbances, to determine the processing sequence of the workpieces and their working times, and to continuously improve the scheduling indices through optimization until the expected requirements are met.

For generality, the dynamic scheduling problem in this paper is based on the following assumptions:

  • The workpieces arrive dynamically and their arrival times are random; material delivery times are ignored. The processing time of an operation includes its preparation time.

  • Each process of a workpiece corresponds to a set of optional equipment, and exactly one piece of equipment from this set is selected to complete the process.

  • A process cannot be interrupted once it has started.

  • Each piece of equipment can process only one workpiece at a time, and no other workpiece may preempt it once processing has started.

  • Each workpiece has a definite process route, and processing follows the order specified in that route; a process can start only after its predecessor has finished.

  • The processing time of an operation is independent of the process route.
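As a minimal sketch of the problem data under these assumptions, the structures below (the `Operation` and `Job` names are illustrative, not from the paper) capture the route, the optional equipment set per process, and the dynamic arrival and due dates:

```python
from dataclasses import dataclass, field

@dataclass
class Operation:
    # processing time per eligible machine: {machine_id: time};
    # times already include preparation time (assumption 1)
    times: dict[int, float]

@dataclass
class Job:
    arrival: float                 # random dynamic arrival time
    due: float                     # delivery date
    # fixed process route; operations must run in this order
    route: list[Operation] = field(default_factory=list)

# n jobs on m machines; each operation picks exactly one machine
# from its eligible set (assumption 2)
jobs = [
    Job(arrival=0.0, due=50.0,
        route=[Operation({0: 12.0, 1: 14.0}), Operation({2: 10.0})]),
    Job(arrival=5.0, due=60.0,
        route=[Operation({1: 18.0})]),
]
```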

To define the dynamic scheduling problem, the definitions of relevant parameters are shown in Table 1:

Table 1. Parameter definition of dynamic scheduling problem

For the dynamic scheduling problem, the constraints can be described as follows:

$$TQ_{sij} + X_{ijk} \times TQ_{ijk} \le TQ_{eij}$$
(1)
$$TQ_{eij} \le TQ_{si(j + 1)}$$
(2)
$$TQ_{{eiL_{i} }} \le C_{\max }$$
(3)
$$TQ_{sij} + TQ_{ijk} \le TQ_{shr} + \inf \cdot (1 - Y_{ijkhr} )$$
(4)
$$TQ_{eij} \le TQ_{si(j + 1)} + \inf \cdot (1 - Y_{ikhr(j + 1)} )$$
(5)
$$\sum\limits_{k = 1}^{{H_{ij} }} {X_{ijk} } = 1$$
(6)
$$\sum\limits_{i = 1}^{n} {\sum\limits_{k = 1}^{{L_{i} }} {Y_{ijkhr} } = X_{hrk} }$$
(7)
$$\sum\limits_{h = 1}^{n} {\sum\limits_{r = 1}^{{L_{h} }} {Y_{ijkhr} } = X_{ijk} }$$
(8)

In the above constraints, Eqs. 1 and 2 represent the process route constraints of a workpiece; Eq. 3 constrains the completion time of the final process; Eqs. 4 and 5 state that a piece of equipment can serve only one process at a time; Eq. 6 is the exclusivity constraint of a process (exactly one machine is chosen); Eqs. 7 and 8 represent the usability constraints of the equipment.
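A hedged sketch of a feasibility check mirroring these constraints may clarify their meaning; the schedule tuple format `(job, op_index, machine, start, end)` is an assumption for illustration, not the paper's notation:

```python
def feasible(schedule, c_max):
    """Check route order (Eqs. 1-2), horizon (Eq. 3), and
    one-process-per-machine (Eqs. 4-5) on a candidate schedule.
    schedule: list of (job, op_index, machine, start, end)."""
    by_job, by_machine = {}, {}
    for job, op, mach, s, e in schedule:
        if e > c_max or s > e:                 # Eq. 3 and sanity
            return False
        by_job.setdefault(job, []).append((op, s, e))
        by_machine.setdefault(mach, []).append((s, e))
    # Eqs. 1-2: operation j must finish before operation j+1 starts
    for ops in by_job.values():
        ops.sort()
        for (_, _, e1), (_, s2, _) in zip(ops, ops[1:]):
            if e1 > s2:
                return False
    # Eqs. 4-5: intervals on the same machine must not overlap
    for ivals in by_machine.values():
        ivals.sort()
        for (_, e1), (s2, _) in zip(ivals, ivals[1:]):
            if e1 > s2:
                return False
    return True
```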

Defining the goal of dynamic scheduling facilitates evaluation of the scheduling effect. Common scheduling performance measures include production efficiency indices, such as the maximum makespan; production stability indices, such as deviation measures; and economic indices, such as production cost and total processing energy consumption. This paper uses a synthesis of multiple indices as the evaluation criterion.

$$f_{1} = \min (\max (C_{i} ))$$
(9)
$$f_{2} = \min (\frac{1}{n}\sum\limits_{i = 1}^{n} {(C_{i} - G_{i} )} )$$
(10)
$$f_{3} = \min (\mathop {\max }\limits_{i = 1}^{n} (\max (C_{i} - T_{ei} )))$$
(11)

Equation 9 represents the maximum makespan objective, Eq. 10 the average flow time index, and Eq. 11 the maximum tardiness index. The final scheduling performance index is constructed by synthesizing these indicators, as shown in Eq. 12, where f is the objective function of the comprehensive optimization.

$$f = \min F(f_{1} ,f_{2} ,f_{3} )$$
(12)
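The three objectives can be sketched directly from Eqs. 9-11; note that the paper does not specify the synthesis function F of Eq. 12, so the weighted sum below (and its weights) is only one assumed choice:

```python
def objectives(completion, arrival, due):
    """completion, arrival, due: per-workpiece lists C_i, G_i, T_ei."""
    n = len(completion)
    f1 = max(completion)                                       # makespan, Eq. 9
    f2 = sum(c - a for c, a in zip(completion, arrival)) / n   # mean flow time, Eq. 10
    f3 = max(max(c - d, 0.0) for c, d in zip(completion, due)) # max tardiness, Eq. 11
    return f1, f2, f3

def f(completion, arrival, due, w=(0.5, 0.3, 0.2)):
    # assumed scalarization F(f1, f2, f3); weights are illustrative only
    return sum(wi * fi for wi, fi in zip(w, objectives(completion, arrival, due)))
```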

3 Dynamic Disturbance and Scheduling Rules

3.1 Analysis of Dynamic Disturbance Factors

The dynamic disturbance factors in actual production can be divided, according to their characteristics, into indirect and direct disturbances. Indirect disturbances, such as processing time deviations, poor material turnover, and declining equipment efficiency, affect the execution of scheduling only after they accumulate to a certain extent. Direct disturbances significantly interfere with scheduling and force the plan to be adjusted. They fall into two types: resource-related disturbances, such as equipment failure, operation interruption, personnel absence, and material shortage or delay; and workpiece-related disturbances, such as urgent order insertion, random workpiece arrival, task cancellation, delivery date adjustment, working-hour changes, and workpiece rework. Common dynamic disturbance factors are shown in Table 2.

Table 2. Classification of dynamic disturbance factors

This paper focuses on direct disturbances, specifically four typical factors: dynamic workpiece arrival, urgent orders, equipment maintenance, and workpiece rework. In actual production, workpieces arrive randomly; in this paper their arrival times are set to be uniformly distributed. Urgent orders are modeled by selecting a proportion R1 of orders and advancing their delivery dates. Equipment under maintenance is unavailable, and this disturbance is set through the maintenance time proportion R2. Quality problems are modeled by sending a certain number of workpieces, in proportion F, for rework. To reflect the differing severity of disturbances in actual production, different disturbance intensities are used.
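An illustrative generator for these four disturbances is sketched below; the proportions R1, R2, and F follow the paper, but the due-date advance factor, the maintenance-window placement, and the data format are assumptions:

```python
import random

def disturb(jobs, R1=0.1, R2=0.05, F=0.05, horizon=1000.0, m=10):
    """jobs: list of dicts with a 'due' key (assumed format)."""
    random.seed(42)  # fixed seed for reproducibility of the sketch
    # urgent orders: a proportion R1 of jobs gets its due date advanced
    for job in random.sample(jobs, int(R1 * len(jobs))):
        job["due"] *= 0.7                       # assumed advance factor
    # maintenance: each machine is unavailable for R2 of the horizon
    windows = {}
    for k in range(m):
        start = random.uniform(0, horizon * (1 - R2))
        windows[k] = (start, start + R2 * horizon)
    # rework: a proportion F of jobs must repeat processing
    rework = random.sample(range(len(jobs)), int(F * len(jobs)))
    return jobs, windows, rework
```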

3.2 Scheduling Rules

Scheduling rules compute workpiece priorities from quantities such as processing time, number of processes, and delivery date, and then select the next workpiece to be processed on idle equipment.

For dynamic scheduling problems, an evaluation method based on scheduling rules can be used. This approach is relatively easy to implement in an actual production environment and constitutes an efficient closed-loop control method, so it is suitable for real-time job shop scheduling. During scheduling, both the selection of equipment and the allocation of jobs must be considered. This paper analyzes scheduling rules for these two types of decisions, studies the performance differences of the rules under dynamic disturbance, and finally selects the rules with excellent performance. Common scheduling rules are shown in Table 3.
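A few of the classical rules of the kind listed in Table 3 can be sketched as priority keys (smaller value means higher priority); the queue-entry dictionary keys here are hypothetical:

```python
def spt(op):
    # Shortest Processing Time: prioritize the shortest operation
    return op["ptime"]

def edd(op):
    # Earliest Due Date: prioritize the earliest delivery date
    return op["due"]

def mst(op, now):
    # Minimum Slack Time: due date minus current time minus remaining work
    return op["due"] - now - op["remaining"]

def pick(queue, rule, now=0.0):
    """Select the highest-priority candidate for an idle machine."""
    key = (lambda o: rule(o, now)) if rule is mst else rule
    return min(queue, key=key)
```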

Table 3. Typical scheduling rules

4 Dynamic Scheduling Problem Solving Based on Reinforcement Learning

4.1 State Space Definition

The job shop scheduling process is mapped to a state space that expresses the system environment of reinforcement learning. The state definition must reflect the features and evolution of the scheduling environment and must be able to express different scenarios. In this paper, the state space is defined by feature vectors derived from the states of the equipment and the workpieces. It comprises five features, given below.

$$s_{k,1} = \frac{n_{k}}{n}$$
(13)
$$s_{k,2} = \frac{T_{k}^{a}}{T^{a}}$$
(14)
$$s_{k,3} = \frac{\sum\limits_{j = 1}^{L_{i}} TQ_{ijk}}{\sum\sum TQ_{ij}}$$
(15)
$$s_{k,4} = \frac{\sum\limits_{h = j + 1}^{L_{i}} TQ_{ih}}{\sum\limits_{j = 1}^{L_{i}} TQ_{ij}}$$
(16)
$$s_{k,5} = \frac{n_{k}^{d}}{n}$$
(17)

The above equations describe the current state of the system, where nk is the number of workpieces processed on equipment k at the current moment, \(T_{k}^{a}\) is the number of operations processed on equipment k, and \(n_{k}^{d}\) is the number of delayed workpieces processed on equipment k. Equation 13 reflects the distribution of processed workpieces across the equipment; Eq. 14 the distribution of operations on the equipment; Eq. 15 the proportion of processing time of the equipment; Eq. 16 the proportion of remaining time of work in process; and Eq. 17 the distribution of delayed workpieces.

The state values form the vector [s11, s12, …, s15, s21, s22, …, s25, s31, …, sm5]. A neural network maps this vector to a scalar state value within a fixed interval (such as [0, 100]), which serves as the criterion for distinguishing states of the scheduling environment.
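The five per-machine features of Eqs. 13-17 can be sketched as below; since the paper uses a neural network for the final aggregation, the scaled mean used here as `state_value` is only an assumed stand-in for illustration:

```python
def machine_state(n_k, n, ops_k, ops_total, busy_k, busy_total,
                  remaining_k, assigned_k, late_k):
    """Feature vector for one machine k (Eqs. 13-17)."""
    s1 = n_k / n                   # share of processed workpieces (Eq. 13)
    s2 = ops_k / ops_total         # share of operations (Eq. 14)
    s3 = busy_k / busy_total       # share of processing time (Eq. 15)
    s4 = remaining_k / assigned_k  # remaining-work ratio (Eq. 16)
    s5 = late_k / n                # share of delayed workpieces (Eq. 17)
    return [s1, s2, s3, s4, s5]

def state_value(features, lo=0.0, hi=100.0):
    # assumed stand-in for the paper's network: mean scaled to [lo, hi]
    return lo + (hi - lo) * sum(features) / len(features)
```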

4.2 Q-value Table

The action space of QL consists of the scheduling behaviors available in the current state. This paper selects seven typical scheduling rules as the action space. Combining these rules with the state values, the Q(s, a) table can be established; 11 states are used to construct it, as shown in Table 4.

Table 4. Q(s, a) table

State 0, in which no scheduling action has yet taken place, is the empty initial state, so its value is 0.
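A minimal Q-table skeleton under this design might look as follows: 11 states, 7 rule actions, and epsilon-greedy selection. The values of alpha, epsilon, and the discount factor follow the parameters given in Sect. 5 (with the paper's β used as the discount); everything else is an illustrative sketch:

```python
import random

N_STATES, N_ACTIONS = 11, 7
# Q(s, a) table initialized to zero; state 0 is the empty initial state
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def choose(state, eps=0.15):
    """Epsilon-greedy choice of a scheduling rule for the current state."""
    if random.random() < eps:                 # explore
        return random.randrange(N_ACTIONS)
    row = Q[state]                            # exploit best-known rule
    return row.index(max(row))

def update(s, a, r, s_next, alpha=0.05, gamma=0.9):
    """Standard one-step Q-learning update."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```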

4.3 Design of Reward Function

The reward function R must consider the performance indicators of the scheduling system and reflect the impact of action selection on the scheduling results. Its design must therefore capture not only the immediate reward of an action but also the cumulative impact on the production cycle, and it should be suitable for scheduling problems of any size. Since time-related production efficiency indicators depend on equipment utilization, the working state of the equipment can serve as the basis of the reward and punishment function. The equipment state is defined as follows.

$$\delta_{i} (t) = \left\{ {\begin{array}{*{20}l} { - 1,\;\text{equipment } i \text{ is idle at time } t} \hfill \\ {0,\;\;\;\,\text{equipment } i \text{ is working at time } t} \hfill \\ \end{array} } \right.$$

The reward and punishment function is as follows.

$$r_{k} = \frac{1}{m}\sum\limits_{i = 1}^{m} \int_{{t_{k - 1} }}^{{t_{k} }} {\delta_{i} (\tau )} \,d\tau$$
(18)

In Eq. 18, m is the number of machines and rk is the reward received on the transition from sk-1 to sk. The absolute value of rk equals the average idle time per machine during the transition between the two states. It follows that the production cycle is negatively correlated with the cumulative return.

$$\begin{gathered} R_{{\text{a}}} = \sum\limits_{k = 1}^{K} {r_{k} } = \frac{1}{m}\sum\limits_{k = 1}^{K} {\sum\limits_{i = 1}^{m} {\int_{{t_{k - 1} }}^{{t_{k} }} {\delta_{i} (\tau )} \,d\tau } } = \frac{1}{m}\sum\limits_{i = 1}^{m} {\int_{0}^{{C_{\max } }} {\delta_{i} (\tau )} \,d\tau } \hfill \\ = - \frac{1}{m}\sum\limits_{k = 1}^{m} {\left(C_{\max } - \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{{L_{i} }} {TQ_{ijk} } } \right)} = \frac{1}{m}\sum\limits_{k = 1}^{m} {\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{{L_{i} }} {TQ_{ijk} } } } - C_{\max } \hfill \\ \end{gathered}$$
(19)

In Eq. 19, Ra is the cumulative return; the smaller Cmax is, the greater Ra becomes.
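The per-transition reward of Eq. 18 can be sketched from each machine's busy intervals; the interval-list input format is an assumption for illustration:

```python
def reward(busy, t_prev, t_now, m):
    """Negative mean idle time over machines 0..m-1 between two decision
    points t_prev and t_now (Eq. 18).
    busy: {machine_id: [(start, end), ...]} busy intervals."""
    total_idle = 0.0
    for k in range(m):
        # time machine k actually worked inside [t_prev, t_now]
        worked = sum(min(e, t_now) - max(s, t_prev)
                     for s, e in busy.get(k, [])
                     if e > t_prev and s < t_now)
        total_idle += (t_now - t_prev) - worked
    return -total_idle / m
```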

5 Case Study

To verify the effectiveness of the method, a simulation analysis is carried out in MATLAB. The QL algorithm parameter settings are α = 0.05, β = 0.9, and ε = 0.15, and the initial state-action reward values are zero. In the experimental data, each workpiece has between 1 and 4 processes, there are 10 pieces of equipment in total, process processing times lie between 10 and 25, and the workpiece cache is 100. The processing times, arrival intervals, and required completion times all follow normal distributions. The dynamic disturbance parameter settings are shown in Table 5. The training data for QL are randomly generated by the system, with a total of 100,000 workpiece records (time-related data are in hours). In the test phase, seven sets of workpiece data (300, 600, 1000, 1500, 2000, 2500, and 3000 workpieces) are scheduled using both single scheduling rules and the QL algorithm of this paper.
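A hypothetical generator for data of this shape is sketched below (1-4 operations per workpiece, times in [10, 25], 10 machines); the mean and standard deviation of the arrival-interval and due-date distributions are assumptions, since the paper does not report them:

```python
import random

def make_jobs(n, m=10, seed=0):
    """Generate n workpieces in the style of the experiment in Sect. 5."""
    rng = random.Random(seed)
    jobs, t = [], 0.0
    for _ in range(n):
        t += max(0.0, rng.gauss(5.0, 1.5))        # assumed arrival interval
        ops = [{"machine": rng.randrange(m),
                "time": rng.uniform(10, 25)}       # process time in [10, 25]
               for _ in range(rng.randint(1, 4))]  # 1-4 processes per job
        # assumed due date: arrival + total work + normally distributed slack
        due = t + sum(o["time"] for o in ops) + max(0.0, rng.gauss(30, 10))
        jobs.append({"arrival": t, "due": due, "ops": ops})
    return jobs
```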

Table 5. Dynamic disturbance parameter setting
Table 6. Performance comparison of different rules and QL

Table 6 shows the makespan obtained by the different scheduling rules and by the QL algorithm of this paper. The advantage of QL is not obvious when the number of workpieces is small, but it becomes clear as the number of workpieces increases. Table 7 shows how frequently the system, through QL, selects the different rules under different delivery periods.

Table 7. Selection frequency of dispatching rules

There are clear differences in the selection frequency of the scheduling rules. MST, EDD, and LPT are used most frequently under a tight period; MST, LPT, EDD, and SPT under a medium period; and MST and SPT under a loose period. MST has the highest usage frequency under all three periods. The rules with higher frequencies contribute the most to the QL solution, while the remaining rules are used less often and contribute less.

Table 8 compares the time limits and completion of QL and MST under the three periods. QL achieves better results in actual time limit, tardiness, and earliness.

Table 8. Comparison of time limit and completion under three periods

Figure 1 shows the tardiness of QL and MST under different workpiece cache capacities for the tight and medium periods. The tardiness of both methods increases as the cache capacity grows, and tardiness under the tight period is worse than under the medium period. In addition, the QL curve rises more gently than the MST curve in both periods: in the medium period the MST curve crosses above the QL curve when the cache capacity reaches 70, while in the tight period it crosses when the capacity is only 30.

Fig. 1.
figure 1

Comparison of tardiness rate under different cache capacity

Figures 2, 3 and 4 show the change in overdue rate with the number of workpieces for QL and MST under different disturbance intensities and periods. Under all three periods, the overdue rate of both methods increases with disturbance intensity. When the number of workpieces grows from 200 to 1000, the overdue rate rises rapidly for every period and disturbance intensity, and the higher the intensity, the steeper the curve. After the number of workpieces exceeds 1000, the overdue rate drops slightly but remains high; over this range, the overdue rate of QL is generally lower than that of MST. After the number exceeds 2000, the overdue rates of both methods gradually decrease and stabilize, and the decrease is faster for QL under a tight period and high disturbance intensity. The comparison shows that QL performs better under tight periods and high disturbance intensities.

Fig. 2.
figure 2

Comparison of over time of different disturbance intensities under tight period

Fig. 3.
figure 3

Comparison of over time conditions of different disturbance intensities under medium period

Fig. 4.
figure 4

Comparison of over time conditions of different disturbance intensities under loose period

Figure 5 compares the maximum makespan of QL, GA, and PSO under different workpiece cache capacities in the medium period with 1000 workpieces. The GA population size is 20, the crossover and mutation parameters are 0.4 and 0.2, and the acceleration index is 1.5; the PSO swarm size is 40, the acceleration factor is 2, the inertia weight is 0.5, the maximum particle speed is 0.7, and the number of iterations is 100. Figure 5 shows that with 1000 workpieces, GA and PSO outperform QL in makespan optimization. However, Fig. 6 shows that the scheduling running time of GA and PSO is much higher than that of QL. At low cache capacities, GA takes about 12 times the running time of QL; its running time decreases slightly for cache capacities of 80-150 and rises again beyond 150. PSO performs worse, its running time increasing sharply once the cache capacity exceeds 40. The running time of QL stays around 2 s, depending only on the number of workpieces and not on the cache. Thus, although QL is slightly weaker than GA and PSO in makespan, it has a clear time-efficiency advantage for large numbers of workpieces and large workpiece caches.

Fig. 5.
figure 5

Comparison of makespan of three algorithms

Fig. 6.
figure 6

Comparison of running time of three algorithms

6 Conclusion

This paper studies the dynamic scheduling problem using reinforcement learning and scheduling rules. Typical production disturbances are classified and described through disturbance parameters, allowing the dynamic scheduling problem and its optimization objectives to be represented. On this basis, the state space, Q-value table, and reward function of the reinforcement learning scheduler are designed. The algorithm is validated through simulation analysis on example data. The case study shows that the proposed method is far superior to traditional intelligent algorithms in time efficiency and also achieves good dynamic scheduling performance, providing a new idea for large-scale job shop dynamic scheduling. For equipment selection, this paper adopts a processing-time-priority approach, which still lacks an overall analysis of processing time and load along the complete process route; further in-depth research can continue in this direction in the future.