1 Introduction

In recent years, important changes have taken place in the production paradigm associated with the Industry 4.0 concept [1]. This change has implied a greater penetration of digital technologies in productive processes, associated with growing connectivity through the Internet of Things (IoT) [2]. Increasing responsiveness is one of the main attributes of Industry 4.0 [1]. Responsiveness implies, roughly speaking, being able to quickly adjust the manufacturing system, in a non-rigid way, to the scenarios and situations that it must face [3]. This enables production capacity to be boosted in the face of changing scenarios, whether due to changes in demand [4] or to the occurrence of unexpected events in the process [5].

These Industry 4.0 features directly impact the decision-making processes associated with all levels of a manufacturing system, from resource and facility planning to shop floor control [6]. In particular, a quick and efficient response to disruptive events is of primary importance. There are two prerequisites: on one hand, the ability to capture and process, in real time, data related to the disruptions; on the other hand, the ability to adapt the system to the new, disrupted situation.

Rescheduling is a crucial aspect in the Industry 4.0 era [3]. Rescheduling relates to the decisions made when an already existing production plan is in execution [7]. Once the schedule has been generated, unforeseen events arise and modify the initial conditions, affecting the performance of the manufacturing system [8]. Rescheduling methods must therefore contemplate the modified scenario and propose actions that allow meeting the organization’s objectives [7]. Two key factors for rescheduling are the speed of the calculations and the quality of the information required to devise a corrective action, as well as the ability to provide high-quality, stable schedules.

Industry 4.0 affects these two aspects directly [9]. On one hand, the availability of more and better information improves the ability to analyze new scenarios; on the other hand, quality schedules can provide a competitive advantage [10]. Since information reaches shop supervisors more quickly and efficiently, control actions can also be quickly fed back to the shop floor. In this sense, Industry 4.0 enhances the potential for developing and implementing sophisticated rescheduling methods. Zhang et al. [3] highlighted that managing these dynamic environments in Industry 4.0 poses a challenge for shop-floor rescheduling mechanisms. Naturally, more sophisticated methods can be expected to perform better than simpler ones such as dispatching rules or right-shift schedules.

However, to take advantage of Industry 4.0 technologies, several issues remain unresolved. The first concerns the appropriate tools to model and optimize production plans carried out by cyber-physical systems. Most tools nowadays provide either (i) powerful tailor-made algorithms for unrealistic, simplified representations of production systems or (ii) accurate representations connected with limited simulation-based optimization methods, based mostly on priority or probabilistic decision rules. Most of these approaches consider the uncertainty inherent to rework events only through expected values or Monte Carlo methods. However, modelling rework events in such a manner can severely undermine system performance, since each rework event can differ significantly from the others; an event-driven approach therefore enables a more accurate and efficient management of disrupting events. The second question is the selection of appropriate optimization methods with rescheduling capabilities. It is well known that rescheduling involves decisions not only about maintaining the quality of the schedule but also about its stability. It is also clear that frequent recalculation of schedules leads to system instability, which affects the supply of materials and components, human resources and labor (e.g., shifts already committed), and the fulfilment of production orders [11].

In this paper, we study rework in job shop manufacturing systems with Industry 4.0 capabilities. We propose an integrated modeling and optimization approach able to handle rescheduling in real time following an event-driven logic. Rework is one of the many disruptions that affect the performance of manufacturing systems, and one of the least studied. Rework arises from many causes, including worker errors, quality problems in raw materials, incorrect machine settings, and tool breakage. An often overlooked issue is that rework generally entails additional reconditioning operations: for example, in the case of a machining defect, the workpiece may need additional machining operations (e.g., grinding); in the case of a paint quality problem, the paint must be scraped off before the workpiece is repainted; assembly errors require disassembly before the rework is performed.

In this paper, we use an integrated timed Petri nets [12, 13] modeling and optimization approach to (i) represent the manufacturing system with provisions for rework, (ii) calculate the “baseline” or offline schedule, and (iii) trigger the schedule regeneration. In turn, the rescheduling system proposed in this work lays the foundations for the design of an automated system for scheduling and rescheduling functionalities, with capabilities to process information in real-time, as well as to model and solve the problems that arise during production runs.

We aim to study both schedule quality (in terms of the relative percentage deviation, RPD) and stability under different scheduling/rescheduling methods and rescheduling policies, in manufacturing systems whose Industry 4.0 technologies can both detect the need for rework in real time and estimate, based on previous occurrences, the duration of the reconditioning operation. We propose a hybrid policy that triggers either a full reschedule or a more conservative modification of the current schedule. The choice of rescheduling method is based on a predefined threshold value related to the expected duration of the reconditioning operation. The specific goals of this paper are (1) to compare dedicated algorithms for offline scheduling vs. the more versatile dispatching rules with the above metrics and (2) to establish the right threshold value as a trade-off between RPD and stability. The contributions of this paper can be summarized as follows:

  • An integrated modeling and optimization approach for real-time event-driven rescheduling under rework/reconditioning events.

  • An extensive experimentation framework to compare the tested algorithms under the two metrics mentioned above.

  • A rescheduling method that provides a trade-off between RPD and schedule stability.

The remainder of the paper is organized as follows: Sect. 2 examines the background on the topic; Sect. 3 describes the proposed approach; Sect. 4 presents the computational tests, results, and analysis; finally, Sect. 5 concludes and suggests future work.

2 Background

2.1 Petri nets

The modeling approach used in this paper is Petri nets. Petri nets are an excellent tool for modeling and optimizing the manufacturing system, including the rework and reconditioning tasks. In general, Petri nets are well known for modeling and simulation, but Petri net-based scheduling algorithms have also proved powerful on a variety of manufacturing systems such as job shops, flexible manufacturing systems, and project scheduling [14, 15]. An important advantage of Petri nets over other modeling approaches is the integration of modeling with both simulation and optimization.

In this section, we give a brief introduction to the notation and definitions of Petri nets used in this paper. Formally speaking, an ordinary timed-place Petri net is a 5-tuple \(G=\{P, T, F,\tau , {M}_{0}\}\), where \(P\) is a set of places, \(T\) is a set of transitions, \(F\) is the flow relation between places and transitions (i.e., the arcs), \(\tau\) is the set of time delays associated with places, and \({M}_{0}\) is the initial marking. We model a job shop system following the concept of timed S3PR [12]. In S3PR nets, a number of concurrent serial processes (jobs) are modeled with an equal number of acyclic connected state machines.

Places

There is a unique initial place \({p}_{sj}\) and a unique end place \({p}_{ej}\) for each job type \(j\). These places represent the conditions “process instance ready to start” and “process instance finished,” respectively. \(M_0(p_{sj})\) represents the number of parts (instances) of job type \(j\). Places belonging to the serial processes are denoted “operational places” (PO) and can be timed, if representing the execution of an operation, or untimed, if representing a condition such as “part in buffer,” “ready,” “in queue,” or “finished.” As in Lee and DiCesare [16], a timed place \({p}_{ij}\) is associated with operation \({o}_{ij}\). In this paper, we assume that buffers of infinite capacity are located before each stage. Places representing the condition “job \(j\) at buffer prior to execution of operation \({o}_{ij}\)” are denoted \({b}_{ij}\). Resource capacity constraints are incorporated into the model via “resource places.” The set \({P}_{R}=\left\{{p}_{r} \mid r\in R\right\}\) contains all resource places. The subset \({P}_{R(ij)}=\{{p}_{r} \mid r\;\text{is used in operation}\;{o}_{ij}\}\) contains all resource places required by operation \({o}_{ij}\). Without loss of generality, the capacity of every resource \(r \in R\) is set to 1, i.e., there is only one unit of each resource type. All resource places are untimed.

Transitions \({t}_{s\left(ij\right)}\) and \({t}_{e\left(ij\right)}\) represent respectively the start and end of operation \({o}_{ij}\).

\(T=\{{t}_{s\left(ij\right)}\}\cup \{{t}_{e\left(ij\right)}\}\) such that \({o}_{ij}\in O\). The following construction rules hold in this paper:

•\({t}_{s\left(ij\right)}=\{{p}_{sj}\}\cup {P}_{R(ij)}\) if \({o}_{ij}\) is the first operation (i.e., \(i=1\)) of job type \(j\), and \(\{{b}_{ij}\}\cup {P}_{R(ij)}\) otherwise; \({t}_{s(ij)}\)• = \({\{p}_{ij}\}\);

•\({t}_{e\left(ij\right)}\) = \(\{{p}_{ij}\}\);

\({t}_{e\left(ij\right)}\)• = \(\{{b}_{(i+1)j}\}\cup {P}_{R(ij)}\) if \(i < |{O}_{j}|\) and \(\{{p}_{ej}\}\cup {P}_{R(ij)}\) otherwise.

The S3PR model of the production activities is defined here as the “plant” net, as in Mejía et al. [17]. Figure 1 illustrates a Petri net model of the plant for two jobs with two stages each: stage 1 of job 1 (resp. job 2) has operation \({o}_{11}\) (resp. \({o}_{12}\)), represented by place \({p}_{11}\) (resp. \({p}_{12}\)). Stage 2 of job 1 (resp. job 2) has operation \({o}_{21}\) (resp. \({o}_{22}\)), represented by place \({p}_{21}\) (resp. \({p}_{22}\)). Places \({b}_{21}\) and \({b}_{22}\) are buffer places between stage 1 and stage 2. Places \({M}_{1}\) to \({M}_{4}\) represent the availability of machines 1 to 4.

Fig. 1 Petri net model of two jobs with two operations each

Definition

A reconditioning/rework trajectory is a sequence of actions required to finish an operation that, for instance, failed inspection and needed reprocessing. The reconditioning/rework consists of two actions: the repair (reconditioning) and the reprocessing (rework).

For the purposes of this work, reconditioning/rework trajectories are modelled with a subnet consisting of a single timed place \({p}_{rw(ij)}\), denoted the reconditioning place, which models the execution of the reconditioning action. Such a reconditioning place is linked to one input uncontrollable transition \({t}_{d(ij)}\) and to one output controllable transition \({t}_{c(ij)}\). Transitions \({t}_{d(ij)}\) and \({t}_{c(ij)}\) represent respectively the start and end of the reconditioning action required by operation \({o}_{ij}\). We assume that operation \({o}_{ij}\) requires a set of resources \({R}_{ij} \subseteq R\) for its processing.

Rework subnets are created in real time and attached to the “plant” net (i.e., the manufacturing system) whenever a rework is triggered. In real life, a sensor detects a defect in a workpiece and triggers an alert to the shop controller, which makes a rescheduling decision depending on the duration of the corresponding reconditioning activity. The rework subnet is removed from the plant net whenever the reconditioning action is finished.

Definitions

A reconditioning subnet associated with the reconditioning of a single operation \({o}_{ij}\) is defined as:

\({\text{N}}_{rw\left(ij\right)}=({P}_{rw\left(ij\right)}, {T}_{rw\left(ij\right)}, {P}_{rw\left(ij\right)}\times {T}_{rw(ij)}, {T}_{rw(ij)}\times {P}_{rw(ij)}, {M}_{0rw\left(ij\right)}, {\tau }_{rw\left(ij\right)})\) in which:

\({P}_{rw\left(ij\right)}=\{{p}_{rw\left(ij\right)}\}\)

\({T}_{rw\left(ij\right)}=\{{t}_{d\left(ij\right)}, {t}_{c\left(ij\right)}\}\)

\({M}_{0rw\left(ij\right)}\) is the initial marking of place \({p}_{rw\left(ij\right)}\)

\({\tau }_{rw\left(ij\right)}\) is the stochastic time delay of place \({p}_{rw\left(ij\right)}\).

The flow relations \(P_{rw(ij)}\times T_{rw(ij)},T_{rw(ij)}\times P_{rw(ij)}\) are the following:

•\({p}_{rw\left(ij\right)}=\left\{{t}_{d\left(ij\right)}\right\}\)

\({p}_{rw(ij)}\)• \(=\left\{{t}_{c\left(ij\right)}\right\}\)

•\({t}_{d(ij)}=\left\{{p}_{ij}\right\}\)

\({t}_{d(ij)}\)• \(=\left\{{p}_{rw(ij)}\right\}\cup {P}_{R(ij)}\)

•\({t}_{c(ij)}=\left\{{p}_{rw(ij)}\right\}\)

\({t}_{c(ij)}\)• \(=\left\{{b}_{ij}\right\}\)

See Fig. 2 for an example of a rework subnet. The plant, representing only production activities, is shown in Fig. 2a; the augmented net is shown in Fig. 2b. Places \({p}_{1}\) to \({p}_{4}\) are operational places and \({p}_{5}\) is a resource place representing the availability of resource \({r}_{1}\). Places \({p}_{1}\) and \({p}_{4}\) are respectively the start and end places; \({p}_{2}\) is a buffer place, and \({p}_{3}\) represents the execution of an operation. Transitions \({t}_{1}\) to \({t}_{3}\in T\) are controllable plant transitions; transitions \({t}_{d}\) and \({t}_{c}\) represent respectively the start and end of the reconditioning action. Transitions \({t}_{1}\) to \({t}_{3}\) and \({t}_{c}\) are controllable, whereas \({t}_{d}\) is uncontrollable. Consider the case where place \({p}_{3}\) is marked: both transitions \({t}_{3}\) and \({t}_{d}\) can fire. If \({t}_{d}\) fires, the system enters a disrupted state and can eventually return to a normal state when transition \({t}_{c}\) fires. When \({t}_{c}\) fires, the subnet is removed from the plant.

Fig. 2 Reconditioning/rework trajectories modeled with Petri nets. a Plant. b Example of reconditioning/recovery subnet

The augmented Petri net is \({\text{N}}_{A} = \left({\text{N}}, {\text{N}}_{rw\left(ij\right)}\right)\). This augmented net is live and bounded [18].

2.2 Job shop scheduling problem (JSSP)

A job shop is a manufacturing system in which a set \(M\) of machines, indexed by \(i\), processes a set \(J\) of jobs, indexed by \(j\). Each job consists of a set of operations that must be processed in a predefined order (route), and no preemption is allowed. In general, not all job routes are the same. The term \({o}_{ij}\) denotes the operation of job \(j\) on machine \(i\). The deterministic processing time of operation \({o}_{ij}\) is \({p}_{ij}\). The set of operations of job \(j\) is denoted \({O}_{j}\). Without loss of generality, in this work we set \(\left|{O}_{j}\right|=\left|M\right|=m\). The set of all operations is denoted \(O= \bigcup_j {O}_{j}\).

A job shop scheduling problem (JSSP) consists of calculating the start (or end) times of all operations \({o}_{ij}\in O\) in such a way that an optimization criterion is minimized. A schedule \(S\) is an array of operations \({o}_{ij}\), together with their corresponding start times \({s}_{ij}\), sorted in chronological order. The literature on the offline JSSP and its variants is vast.

In this paper, we studied the minimization of the total flow time (TFT), which is related to the total time jobs spend in the system. The TFT can be calculated following Eq. (1).

$$\text{Total }\text{flow time } ({\text{TFT}})=\sum\limits_{j=1}^{n}{C}_{j}$$
(1)

The term \({C}_{j}\) is the completion time of job \(j\) and \(n=\left|J\right|\) is the number of jobs. We define the relative percentage deviation (RPD) as in Katragjini et al. [11]:

$$RPD=\frac{TF{T}_{e}-TF{T}_{\mathrm{best}}}{TF{T}_{\mathrm{best}}}$$
(2)

where \(TF{T}_{e}\) (resp. \(TF{T}_{\mathrm{best}}\)) is the value of \(TFT\) of the executed schedule (resp. the best offline schedule among all algorithms).

In the rescheduling situation, we use the classical stability formula [19] as follows (Eq. 3):

$$\eta =\frac{\sum_{{o}_{ij}\in O}\left|{C}_{ij}-{C}_{ij}^{{\prime}}\right|}{\sum_{{o}_{ij}\in O}{p}_{ij}}$$
(3)

where \({C}_{ij}\) (resp. \({C}_{ij}^{{\prime}}\)) is the completion time of operation \({o}_{ij}\) in the original schedule (resp. the executed schedule).

2.3 Rescheduling

Rescheduling is a topic that has been widely studied over the years. Efforts have been made to classify the environments, the rescheduling policies, the schedule-generation strategies, and the types of disruptions. We only provide a brief review of the topic here; interested readers are referred to the paper of Vieira et al. [7].

Regarding the environment, it can be classified as static or dynamic: in static environments, all jobs are known from the beginning, whereas in dynamic environments, jobs arrive continuously to the system. Schedule generation refers to whether the schedule is calculated with provisions for disruptions (i.e., robust schedules) or not (i.e., nominal schedules). In the first case, the schedule is expected to absorb most perturbations and is rarely recalculated; with a nominal schedule, rescheduling actions take place in one way or another when disruptions occur.

When disruptions occur, the three most common repair strategies are full schedule regeneration, partial rescheduling, and right-shift scheduling [20]. In schedule regeneration, the entire schedule is recalculated, considering those operations that have not yet started their execution. This strategy produces the best results in terms of the objective function but requires more computational effort and produces instability. Partial rescheduling attempts to reschedule only a subset of operations. One of the better-known methods for partial rescheduling is AOR (affected operations rescheduling), in which only the operations affected by the breakdown (those whose start times change) are rescheduled. The right-shift strategy consists of postponing the start of the affected operations by the duration of the breakdown. This is the easiest to implement and produces the highest stability, but schedule quality can be compromised [20].

In terms of policies, these can be periodic, event-driven, or hybrid. In periodic rescheduling, schedule regeneration is performed at regular intervals, whereas in event-driven approaches, the schedule is regenerated whenever disruptions occur. Hybrid rescheduling falls in between [21]. One of the questions is when to launch a full or a partial reschedule. A common approach is to trigger a full rescheduling method under certain conditions, on a case-by-case basis. For example, Pfeiffer et al. [22] used a simulation-based approach, taking into account machine breakdowns and stochastic processing times, and evaluated their solution in terms of stability and efficiency. They proposed a heuristic method for periodic rescheduling and a triggering mechanism that activates whenever the cumulative difference between the planned and simulated completion times exceeds a threshold. Later, Bidot et al. [23] investigated stochastic processing times and machine breakdowns. In that work, a schedule is generated offline and launched; a controller monitors the schedule execution and activates full, partial, or progressive rescheduling processes by checking case-specific performance conditions.

Lately, the research body has focused on schedules that are modified whenever disruptions occur. For example, Dong and Jang [21] proposed new rescheduling methods based on AOR and compared their results with the classical AOR methodologies. He and Sun [24] combined robust schedules with rescheduling in flexible job shops and compared their results with other strategies in terms of stability and robustness. Katragjini et al. [11] combined stability measures with efficiency measures (i.e., makespan) in a single objective function in a flow shop manufacturing system and compared several metaheuristic algorithms. Ahmadi et al. [19] proposed a multi-objective model for flexible job shops and studied the performance of several classical algorithms (e.g., NSGA-II, NRGA). More recently, Larsen and Pranzo [25] developed a general rescheduling framework with different types of disruptions, in which a controller triggers a rescheduling algorithm whenever a deterministic or stochastic rescheduling condition is met; they evaluated their solution approach in a job shop setting. Gao et al. [26] proposed a Jaya algorithm for a flexible job shop problem with machine recovery, minimizing instability and makespan in the rescheduling phase. Nie et al. [27] proposed an event-driven rescheduling approach based on a genetic algorithm for flexible job shop scheduling problems subject to machine breakdowns. In that work, the authors focused on minimizing makespan in the scheduling phase, while in the rescheduling phase the schedule’s performance is measured by the weighted sum of RPD and stability.

Regarding Industry 4.0, Framinan et al. [28] investigated the impact of real-time information on the performance of rescheduling policies. Their results suggest that real-time information can indeed improve the performance of the manufacturing system, but only under low-variability conditions. Technological advances have also enabled addressing scheduling problems with real-time information through cloud architectures, as in Zhang et al. [29], where a multi-objective game-theoretic approach is developed. Similarly, Carlucci et al. [30] and Goodarzi et al. [31] implemented game-theoretic approaches.

In other works, Ghaleb et al. [32] proposed a real-time rescheduling system considering job shop problems and stochastic processing times. In their work, the arrival of new jobs and machine breakdowns trigger the rescheduling process. The authors evaluated several rescheduling policies, such as continuous and event-driven rescheduling.

In general, most works aim to balance performance with stability measures, adopting different strategies to launch either a full or a partial reschedule. Our approach differs in that it exploits real-time information to decide when to perform a full or a partial reschedule. Moreover, game-theoretic approaches involve agents with individual goals, which is not the case in our paper: we have a single central controller with two distinct goals.

2.4 Rework

Rework in manufacturing is one of the many disruptions that affect performance, and it is one of the least studied. The literature on rework includes aspects such as costs [33], lot sizing [34], and quality [35, 36]. Rework in manufacturing systems has been addressed from the perspective of simulation-based dispatching rules [37, 38] and with classical heuristic and metaheuristic approaches; examples of the latter are cases of single machine scheduling [39], parallel machines [40,41,42,43], flow shops [44,45,46], and batch facilities [47].

Table 1 summarizes the findings on rework. In this table, papers are classified regarding scenario modelling (i.e., how rework events are considered): whether reworks are modelled with estimated parameters or are event-driven. The event-driven approach refers to those cases where the rework event data become known only once production has started; it is then mandatory to make decisions on the fly with some rescheduling strategy. As seen in Table 1, few papers consider the event-driven approach, and those that do study single-machine and parallel-machine settings [41, 48]. Thus, to the best of our knowledge, there are no event-driven approaches in the literature for rescheduling problems with rework in more complex manufacturing configurations such as job shops, a problem that occurs in real life, for instance, in automotive manufacturing settings.

Table 1 Review of papers on reworks and scheduling

It can be seen from Table 1 that most works aim to optimize classical objective functions, although in terms of expected values. Only the paper of Liu and Zhou [41] addressed the topic of rework and stability, although on single-stage manufacturing systems.

To the best of our knowledge, no work has addressed the problem of rescheduling with rework and reconditioning in job shop systems. The literature on scheduling with rework estimates schedule performance based on expected values, calculated analytically or with simulation, and considers the situation in which a job is processed on a machine, inspected outside the machine, and immediately returned to the queue. These works do not consider reconditioning and/or stability. The works on rescheduling consider mostly machine breakdowns and other disruptions, but not rework.

3 Proposed approach

3.1 Assumptions

We consider the case in which a reconditioning action is needed prior to the rework itself, as in the earlier example of paint that must be scraped off (the reconditioning) before the workpiece is repainted (the rework). The following assumptions hold in this paper:

  • All job operations may eventually need rework. All operations that need rework require reconditioning actions.

  • The corresponding reconditioning times are exponentially distributed. The mean of the distribution is denoted the mean time to reconditioning (MTTR).

  • The rework probability of all operations is the same. This probability is denoted the percentage of reworked operations (PRO).

  • The rework time of a job operation equals the original processing time.

  • The time required for the reconditioning is known as soon as the rework/reconditioning request is triggered. In practice, this requires Industry 4.0 technologies able to detect the defect and estimate the reconditioning time in real time.

3.2 Event-driven implementation

In this work, we study via simulation how the event-driven system reacts to disruptions that involve reconditioning and rework operations. This approach differs from previous works on rework, in which the performance indicators are calculated from expected values rather than from the actual schedule execution.

The first step is to produce an “offline” schedule. In this paper, we generate schedules with two methods. The first is dispatching rules, which are widely used and produce acceptable performance in terms of solution quality. The second is a modified BAS (Beam A* Search) algorithm, originally presented in Mejía and Montoya [59] and later improved in Mejía and Lefebvre [12] and in Rossit et al. [60]. BAS is a graph search algorithm for different types of scheduling problems. This algorithm is well suited to our Petri net model and performs well in comparison with other generic job shop scheduling algorithms. Like all graph search methods, BAS starts with a node (a marking in the Petri net) and progressively expands its child nodes until a final goal (the final marking) is found. To avoid “state explosion,” the number of nodes at each stage is limited to a predefined value (the “beam”). In BAS, the criterion for expansion is “best first,” which implies an evaluation function for prioritizing expansion. We refer the reader to Mejía and Lefebvre [12] for more details of the BAS algorithm.

The pseudocode of the event-driven implementation is as follows:

Inputs

An instance of a JSSP and its equivalent Petri net, the MTTR, and the PRO values.

Outputs

The calculation of the executed total flow time and stability indicators.

  1. Calculate an initial schedule array (\(S\)).

  2. Execute the schedule until a reconditioning/rework request is issued for some operation \({o}_{ij}^{*}\) or until all jobs are completed. A reconditioning/rework request occurs with probability PRO.

  3. If all jobs are completed, then terminate with success.

  4. Else (some job operation \({o}_{ij}^{*}\) needs rework):

     (a) Generate \(\tau ({o}_{ij}^{*})\), an exponentially distributed reconditioning time of operation \({o}_{ij}^{*}\) with mean MTTR.

     (b) Rebuild the schedule taking into account that job operation \({o}_{ij}^{*}\) will be available for scheduling after \(\tau ({o}_{ij}^{*})\) time units, and go to step 2. The new current schedule array (\(S\)) contains all job operations that have not started their processing.

This simulation of the event-driven process is illustrated in Fig. 3.

Fig. 3 Simulation process diagram

3.3 Proposed rescheduling policies

In this paper, we study a hybrid policy that triggers either a full schedule regeneration or our adaptation of the right-shift policy depending on the duration of the reconditioning.

Definition

Let \({U}_{t}\) be the set of operations that have not started their execution at time \(t\) and that can start the earliest.

The “full regeneration” (FR) policy reschedules at time \(t\) all operations in \({U}_{t}\), either with dispatching rules or with the BAS algorithm, depending on how the initial schedule was computed. Clearly, the FR method must be consistent with the generation method of the original offline schedule. The complexity of FR depends on the selected algorithm.

Our adaptation of the right-shift policy consists of rebuilding the schedule while maintaining the original sequence as much as possible. We denote this policy “keep sequence” (KS). The following pseudocode illustrates the KS approach (a code sketch follows the steps below). Suppose that a reconditioning/rework request is issued at time \(t\) (i.e., some job operation needs to be reworked).

Input

The current schedule array \(S\) and \({S}^{{\prime}}\) = null;

Output

A valid schedule \({S}^{{\prime}}\) that contains the start times of all operations that have not yet started.

  1. Determine the set \({U}_{t}\) at time \(t\).

  2. Among all operations in \({U}_{t}\), find the operation that appears first in \(S\), say operation \({o}_{ij}^{{\prime}}\). Append \({o}_{ij}^{{\prime}}\) and its expected start time \({s}_{ij}^{{\prime}}\) to \({S}^{{\prime}}\) and remove it from \(S\).

  3. Repeat steps 1 and 2 until the set \({U}_{t}\) is empty.

Let \(n\) and \(m\) be the number of jobs and machines, respectively. Computing \({U}_{t}\) and finding \({o}_{ij}^{{\prime}}\) both take \(mn\) steps in the worst case; therefore, the complexity of KS is \(O\left(mn\right)\).

The selection of the rescheduling method (FR vs. KS) depends on whether the reconditioning time exceeds a certain threshold. The rationale is that if the reconditioning time is short (i.e., below the threshold), there is no need to reschedule all operations and the KS routine is preferred; for above-threshold reconditioning times, full schedule regeneration is selected. In the extreme cases, if the threshold is very small (\(\sim\)0), the schedule will always be fully regenerated (i.e., the FR policy); if the threshold is very large (\(\sim\)∞), the KS routine will always be invoked. The main goal of this paper is to compare dedicated algorithms for offline scheduling vs. dispatching rules in a dynamic scenario. The computational results section presents the behavior of the system under varying threshold values.

4 Experimental runs

In this section, we present the results of the computational experiments. We evaluated the performance of the BAS algorithm and of three dispatching rules: shortest processing time (SPT), longest processing time (LPT), and most work remaining (MWR). These three dispatching rules were selected for their capability to optimize the two criteria considered in this study. We studied 100 instances, grouped into ten different instance sizes, that is, ten different combinations of \(J\) × \(M\), namely, 10 × 5, 10 × 10, 15 × 5, 15 × 10, 15 × 15, 20 × 5, 20 × 10, 20 × 20, 30 × 5, and 30 × 10. The MTTR values were set as a percentage (0.5%, 1%, or 1.5%) of the total work time of each particular instance, where the total work time is defined as the sum of the processing times of all job operations. The PRO values were set to 2%, 5%, and 10% of the total number of operations. These values are consistent with the suggestions of Subramaniam et al. [20]. Threshold values were set between 0 and 500% of the MTTR, in 50% steps [60].

The BAS algorithm and the three abovementioned dispatching rules were tested on 10 sets of instances of different sizes, each set with 10 instances, under all combinations of MTTR (3 values), PRO (3 values), and threshold (11 values). The number of replicates for each combination was set to 100. Thus, 100 × 10 × 10 × 3 × 3 × 11 = 990,000 experiments were run.

The analysis of results is presented as follows: first, we globally compare the performance of the rescheduling methods; then, the analysis focuses on comparing the dispatching rules and the BAS algorithm for different threshold values in terms of stability and RPD. Finally, we analyze the effect of instance sizes, MTTRs, and PROs on the performance measures, discussing the main insights that our study provides.

4.1 Evaluation of the rescheduling methods

In this section, we present a sample of the results obtained with the different scheduling methods and with different levels of the external factors, namely MTTR and PRO. Due to the very large number of experiments, only illustrative results are presented; full results are provided in the supplementary material.

The charts of Fig. 4 show the performance of the tested algorithms in terms of RPD (averaged on all instances) vs. threshold. PRO values are identified with lines of different colors, while MTTR values are identified by markers. The charts of Fig. 5 represent the stability values (averaged on all instances) vs. threshold, following a format analogous to that of Fig. 4.

Fig. 4 Average RPD values vs. threshold

Fig. 5 Average stability values vs. threshold

We can observe that the BAS algorithm is the best performer in terms of RPD, but it is also the worst in terms of stability. The opposite is true for the MWR rule. The performance of all algorithms is significantly affected by the threshold values. However, we could not observe a single trend: for both BAS and SPT, as the threshold value increases, the RPD increases and the stability measure decreases (i.e., improves), as expected. For the LPT and MWR rules, however, the pattern is the opposite: both RPD and stability improve with higher threshold values. For the lowest value of PRO (i.e., PRO = 2), the RPDs of a given algorithm are virtually the same for all values of MTTR at the same threshold. For greater values of PRO, the differences in RPD across MTTR values become more noticeable. Also, for PRO = 2 and PRO = 5, the average RPD of the BAS algorithm is clearly better than those of the dispatching rules for all threshold and MTTR values. However, as PRO increases, the differences between BAS and the second-best performer (SPT) become smaller. For example, for PRO = 2, MTTR = 0.5, and threshold = 0, the average RPD of BAS is around 3.5%, whereas that of SPT is around 7.0%; at the other extreme, for PRO = 10, MTTR = 1.5, and threshold = 500, the average RPD of BAS is around 49.0%, whereas that of SPT is around 51.0%.

Changes in RPD and stability are negligible for threshold values higher than 250%, for all algorithms. Threshold values have a more noticeable impact on the BAS algorithm than on the dispatching rules, especially in terms of stability. It can be seen in Figs. 4 and 5 that, for the BAS algorithm, the slopes of the lines are steeper at lower threshold values in comparison with the dispatching rules. Generally speaking, the BAS algorithm is more sensitive to changes in the threshold than the dispatching rules. This characteristic can be useful for the decision maker to establish the desired trade-off between schedule quality (measured by the objective function) and stability. On the other hand, dispatching rules are very robust to changes in the threshold, and therefore the KS policy can be adopted, as it is the least demanding in computational terms.

4.2 Comparison of the algorithms’ performance

In this section, we compare the algorithms, averaging over all values of MTTR and PRO and varying the threshold. Figures 6 and 7 illustrate the average RPD and stability, respectively. In each figure, the four algorithms (BAS, SPT, LPT, and MWR) are shown with different colored lines. The values on the charts represent the global average of all experiments for each threshold value.

Fig. 6 Average RPD for all values of MTTR and PRO

Fig. 7 Average stability %

Figure 6 shows that BAS clearly outperforms the MWR, SPT, and LPT dispatching rules in terms of the average RPD for all threshold values. As the threshold increases, the RPD increases for BAS and SPT, while it remains more or less constant for MWR and LPT. Notice that BAS and SPT present a similar pattern; this can be explained by the fact that BAS uses SPT as its evaluation function.

Regarding stability, Fig. 7 shows that the MWR dispatching rule is the algorithm that, on average, leads to the most stable schedules. On the other hand, the BAS algorithm, on average, leads to the most unstable schedules. Nevertheless, at higher threshold values, the difference between BAS and the dispatching rules shrinks. As the threshold increases, the stability with MWR and SPT does not seem to vary much, while it tends to decrease (improve) with BAS and LPT.

Another relevant feature to analyze is the effect of instance size on the performance of the compared algorithms in terms of RPD and stability. The following findings show that the relation between the number of jobs and the number of machines, i.e., \(\mathrm{ratio}=\frac{\left|J\right|}{\left|M\right|}\), has an important influence on the performance of all algorithms in terms of RPD and stability. The ratio can be seen as an indicator of congestion, as more jobs are pushed into the system. Figure 8 illustrates the average relative percentage difference (ARPD) of each algorithm with respect to the best performer (BAS) vs. threshold:

$$ARPD=\frac{{\overline{RPD} }_{{algo}}- {\overline{RPD} }_{{best}}}{{\overline{RPD} }_{{best}}}\times 100\%$$

\({\overline{RPD} }_{{algo}}\) and \({\overline{RPD} }_{{best}}\) are respectively the average relative percentage deviations of the algorithm and of the best performer.

Figure 8 shows a selection of the instance sizes, particularly those with 10 machines (10 × 10, 15 × 10, 20 × 10, and 30 × 10), with ratios of 1, 1.5, 2, and 3, respectively. These results suggest that more congested systems (generally speaking, those with far more jobs than machines) would benefit most from better algorithms in the case of rescheduling with rework.

Fig. 8 % ARPD

In Fig. 8, we can observe that the ARPD values are larger in the LPT and MWR charts (ranging from 10 to 30%) than in the SPT chart (between 3 and 5%). Moreover, in the 30 × 10, 20 × 10, and 15 × 10 curves, with ratios of 3, 2, and 1.5 respectively, the ARPD values are larger than for the instances with a ratio of 1 (10 × 10).

Figure 9 presents the average stability percentage difference (ASPD) between each algorithm and the best performer (in this case, the MWR dispatching rule):

Fig. 9 % stability difference: SPT, LPT, and BAS vs. MWR

$$ASPD=\frac{{\bar{\eta }}_{{algo}}- {\bar{\eta }}_{{best}}}{{\bar{\eta }}_{{best}}}\times 100\%$$

\({\bar{\eta }}_{{algo}}\) and \({\bar{\eta }}_{{best}}\) are respectively the average stability values of the selected algorithm and of the best performer.

The results show that the ratio also impacts the performance of the algorithms. Notice that the pattern is similar to that of the ARPD curves, although the best performer is now the MWR rule. In addition, as the threshold increases, the stability obtained with BAS and SPT improves. This improvement is much more pronounced for the cases with larger job/machine ratios. The sensitivity to the ratio vs. the threshold, in terms of stability, is clearly seen when comparing the performance of BAS and SPT on the 30 × 10 (ratio = 3) curves and on the 10 × 10 (ratio = 1) curves.

4.3 Problem features assessment: PRO and MTTR values

Finally, in this section, we present a detailed analysis of the effect of the PRO and MTTR factors on the performance of the BAS algorithm. First, the impact of PRO on the BAS algorithm is analyzed with fixed threshold and MTTR values. Figures 10 and 11 show the RPD and stability values of the BAS algorithm for different values of PRO. In these figures, the number of machines is fixed at 10, with different job/machine ratios. The threshold is fixed at 0% in the left panels of Figs. 10 and 11 and at 500% in the right panels. The MTTR considered for the cases depicted in Figs. 10 and 11 is 0.5%.

Fig. 10 BAS algorithm RPD impact depending on the number of machines

Fig. 11 BAS algorithm stability impact depending on the number of machines

In general terms, it can be noticed that both the RPD and the stability measure increase roughly linearly as the PRO values increase. Another observation from Figs. 10 and 11 is that the RPD and the stability tend to deteriorate as the number of jobs increases (for the same number of machines). Both figures represent instances with 10 machines but different numbers of jobs; comparing the charts, the larger the number of jobs (i.e., the larger the \(\mathrm{ratio}\)), the larger the increase in the RPD and stability values. This behavior is more pronounced for stability than for RPD. These findings also hold for the other MTTR values (1.0% and 1.5%), which are omitted for the sake of brevity.

Finally, we studied the effect of the MTTR on stability and RPD, considering the different PRO values and instance sizes, as shown in Figs. 12 and 13. This analysis is carried out for the BAS algorithm with a fixed threshold value, but it can be extended to the other algorithms and threshold values. From Figs. 12 and 13, it is possible to notice that, in general terms, for each MTTR, both RPD and stability tend to deteriorate as the PRO increases. To identify other behavior patterns, we analyzed each instance size according to the number of jobs and the \({\text{ratio}}\). As an example, Figs. 12 and 13 show, respectively, the RPD and stability behavior for instances with 15 jobs (15 × 5, 15 × 10, and 15 × 15). In particular, we can observe that, for instances with larger \({\text{ratio}}\) values and lower PRO values, the RPD and stability do not vary much across the different MTTR values. Nevertheless, as the \({\text{ratio}}\) decreases, the RPD and stability tend to deteriorate as the MTTR values increase.

Fig. 12 BAS algorithm RPD depending on the PROs and instance sizes

Fig. 13 BAS algorithm stability depending on the PROs and instance sizes

5 Conclusions and future work

In this work, we studied job shop rescheduling under rework and reconditioning. We used an event-driven scheduling method embedded in a Petri net formalism for all the algorithms. The results were analyzed in terms of RPD and stability, showing that the BAS algorithm in combination with high threshold values achieves the best results. On the other hand, simple dispatching rules are more stable than the more sophisticated BAS. The results also indicate that the threshold has a significant impact on algorithm performance, with higher thresholds resulting in better stability but worse RPD. The overall conclusion is that, among the studied methods, no algorithm fully outperforms the others in both criteria. However, if the PRO and MTTR values can be reduced through better manufacturing methods and quality control, the use of a better scheduler such as BAS is justified. Moreover, the experimentation justifies the dispatching rule selection: MWR is the best on the stability criterion alone, while SPT is the best for total flow time alone. The BAS algorithm, however, has the best performance on the two criteria together, yielding reschedules that represent a compromise between them.

We also analyzed the influence of instance size on the performance of the system. An important factor that describes the behavior of the system is the ratio between the number of jobs and the number of machines. For higher ratio values (i.e., more jobs for a fixed number of machines), stability was worse.

In terms of potential implementations, Industry 4.0 technologies and machine learning algorithms are a must. The system needs to quickly detect the need for rework and to estimate the reconditioning time so as to establish the “best” reactive strategy. Estimating the reconditioning time requires machine learning to provide better estimates. As future research lines, we aim to extend the analysis to study the impact of real-time information and of other events such as machine breakdowns.

Game-theoretic approaches can also be included in further research; for example, an interesting direction would be handling customer orders. The Petri net approach is also well suited to handle communication protocols and can be combined with the manufacturing system model.