1 Introduction

University timetabling was one of the first problems to motivate research on timetabling (Cole 1964). The resulting timetables determine the agendas of many academics to a considerable extent. A course timetable, for example, is strongly constrained by the (limited number of) lecture, tutorial, seminar, and lab rooms, their respective properties, the number, size, and hierarchical structure of student groups, and the availability of the lecturers. Increasing flexibility to combine different modules results in extra constraints on the students’ opportunity to attend all the lectures and exams. Constructing timetables manually thus becomes more complex; moreover, it is a tedious and repetitive process. The quality of the final timetable affects lecturers and students alike, but it can also affect the administrative and technical services on a campus. If, for instance, all lectures scheduled before the lunch break finish at the same time, the student restaurant will be flooded by hungry students. Exams that are not spread out sufficiently for all students can have a serious impact on the number of students that pass their exams.

University timetabling falls into two categories (Schaerf 1999b): course and examination timetabling. In university course timetabling, a distinction is made between post-enrolment and curriculum-based course timetabling; these were also the two course timetabling tracks of the second International Timetabling Competition (McCollum et al. 2010). Post-enrolment course timetabling differs from curriculum-based course timetabling in that the timetable is constructed after the students have chosen their courses. In the curriculum-based case, the students have to take a set of courses that belong to a certain curriculum. Many research papers address different problems or instances, ranging from timetabling competitions (Kostuch 2005; Müller 2008) and (simplified) real world benchmark problems (Carter and Laporte 1996) to very complicated problems from practice that involve room and equipment requirements, flexible student groups, lecturer groups, flexible lecture durations, non-weekly repeating lab sessions, etc. (Beligiannis et al. 2008; Dammak et al. 2008; De Causmaecker et al. 2009; Di Gaspero and Schaerf 2008; Schaerf 1999a; van den Broek et al. 2009).

Examination timetabling is known to the research community as the problem of assigning exams to time slots and rooms in such a way that (1) no students are taking more than one exam at the same time, (2) the study intervals between exams are sufficiently long for each student, and (3) the capacity of rooms is respected. For this problem, competition instances (McCollum et al. 2010) as well as benchmark problems are available. The benchmark problems have been derived from several real world problems, e.g. at the universities of Toronto, Nottingham, Melbourne, etc. (see Qu et al. 2009 for a comprehensive overview).

The problem we introduce in this paper is the examination timetabling problem at the Engineering Department of KAHO Sint-Lieven; we refer to it as the KAHO problem in the remainder of the paper. The problem under investigation is new: it differs from known examination timetabling problems in that it addresses both oral and written exams. We tackle this problem with a hyperheuristic approach and compare the resulting timetable with the manual results that were generated in parallel. For the first semester of the academic year 2005–2006, we managed to schedule the exams on weekdays only, whereas the manual planner had not been able to avoid scheduling some exams on Saturdays. The performance of the hyperheuristic algorithm is tested on both the Toronto benchmark instances and the examination timetabling track of the second International Timetabling Competition (ITC 2007).

The contribution of this paper is threefold: (1) the introduction of a complex examination timetabling problem from practice, (2) a tournament-based hyperheuristic approach incorporating (among other things) the recently introduced late acceptance strategy as a move acceptance criterion, and (3) a performance analysis of the solution approach, including a comparison with the best approaches to the benchmark problems in the recent literature. The claim of this paper is that the hyperheuristic approach can successfully address a complex problem from practice as well as benchmark instances from the scientific literature. There is no one-to-one match between the three problem descriptions; the differences between the models are taken care of by the low-level heuristics.

The related literature is reviewed in Sect. 2, whilst a detailed problem description and a model are presented in Sect. 3. We elaborate on the differences between the problem from practice and two thoroughly studied benchmark problems for examination timetabling. Section 4 introduces the components of the hyperheuristic approach. In Sect. 5, we discuss experimental results for the KAHO examination timetabling problem and the two benchmark problems. We conclude the paper and point out some directions for future research in Sect. 6.

2 Related literature

2.1 Examination timetabling

In general, the literature on examination timetabling can be divided into two categories: one concentrates on solving examination timetabling problems from practice, whilst the other focuses on improving benchmark results. Benchmark instances give an unbiased comparison of the search algorithms’ performance, on the condition that

  • some validation tool that impartially checks the correctness of the obtained results exists,

  • and a tool that eliminates the hardware dependency exists. The latter can be achieved, for instance, by providing a software tool that determines the hardware specifications of the researcher’s computer and calculates the available computation time to obtain a solution for that specific problem (cf., for example, the First International Nurse Rostering Competition 2010; McCollum et al. 2010).

The most studied examination timetabling benchmark is probably the one introduced by Carter et al. (1996). This benchmark consists of 13 simplified real world examination timetabling problems from different universities around the world. In contrast to examination timetabling problems from practice, these benchmark instances do not consider rooms and lecturers: it is assumed that rooms have infinite size. These instances are commonly known as the uncapacitated Toronto benchmarks. Until the introduction of the examination timetabling benchmarks of the Second International Timetabling Competition in 2007, the Toronto data sets were the standard test set for examination timetabling. The objective of the problem is twofold: constructing clash-free examination timetables for every student, and spreading out the exams as much as possible, such that students have sufficient study time. If only the no-clash constraint is taken into account, the problem reduces to a graph colouring problem (Burke et al. 2004b): exams correspond to nodes and edges correspond to pairs of exams that have at least one common student. Assigning exams to time slots corresponds to assigning colours to the nodes, such that connected nodes have different colours. Minimizing the number of time slots whilst satisfying the clash-free constraint corresponds to minimizing the number of colours needed to colour all the nodes (or exams). Carter et al. (1996) introduced five basic sorting criteria for assigning exams to the time slots, namely: largest degree, saturation degree, largest weighted degree, largest enrolment, and random ordering.

Different versions of the Toronto instances are available. The instances used in this paper correspond to version I in Qu et al. (2009), which is the most commonly applied version in the literature. The specific properties of the 13 Toronto benchmark instances are summarized in Table 1. The values in the conflict density column (fifth column of Table 1) are the ratios of the total number of non-zero elements to the total number of elements of the ‘conflict matrix’. The conflict matrix contains the number of common students for every pair of exams. The conflict density value gives an indication of the number of conflicts between the individual exams. In the last column, the minimum number of time slots needed to obtain a feasible solution is presented.
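To make this concrete, the following Java sketch builds the conflict matrix from hypothetical enrolment data (one set of exam identifiers per student) and computes the conflict density exactly as defined above. The class and method names are illustrative; they are not taken from the paper.

```java
import java.util.List;
import java.util.Set;

// Sketch: conflict matrix and conflict density. conflicts[i][j] counts the
// students taking both exam i and exam j; the density is the fraction of
// non-zero matrix elements.
public class ConflictDensity {

    static int[][] conflictMatrix(int numExams, List<Set<Integer>> studentExams) {
        int[][] conflicts = new int[numExams][numExams];
        for (Set<Integer> exams : studentExams) {
            for (int e1 : exams) {
                for (int e2 : exams) {
                    if (e1 != e2) {
                        conflicts[e1][e2]++;   // symmetric by construction
                    }
                }
            }
        }
        return conflicts;
    }

    static double conflictDensity(int[][] conflicts) {
        int nonZero = 0;
        int n = conflicts.length;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (conflicts[i][j] != 0) {
                    nonZero++;
                }
            }
        }
        return (double) nonZero / (n * n);
    }

    public static void main(String[] args) {
        // Hypothetical data: three students, four exams.
        List<Set<Integer>> studentExams = List.of(
                Set.of(0, 1), Set.of(1, 2), Set.of(0, 2, 3));
        int[][] conflicts = conflictMatrix(4, studentExams);
        System.out.println("conflict density = " + conflictDensity(conflicts));
    }
}
```

The non-zero off-diagonal elements of this matrix are exactly the edges of the graph colouring formulation discussed above.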

Table 1 Problem size of the 13 data sets of the Toronto benchmark (version I). The conflict density in the fifth column is the ratio of the number of non-zero elements in the conflict matrix to the total number of conflict matrix elements. In the last column, the minimum number of time slots of a feasible solution is presented

The ‘Benchmark Data Sets in Exam Timetabling’ website (Qu 2010) provides a collection of several examination timetabling benchmark instances and a validator to evaluate solutions of the Toronto benchmark data sets. Initiatives like these are welcome since they provide a means to check the correctness of the objective function, cf. the discussion on measurability and reproducibility in Schaerf and Di Gaspero (2007).

Although numerous papers have been published on the uncapacitated Toronto benchmarks (we refer to Qu et al. 2009 for an overview), we only discuss those papers that consider the problem in a similar way. Thompson and Dowsland (1998) investigate the characteristics of a robust simulated annealing algorithm. It turns out that next to the cooling schedule, the way the neighbourhoods are constructed and sampled is also important. The Kempe chain-based neighbourhoods (which will be discussed in Sect. 4.1) outperform the other neighbourhoods. Thompson and Dowsland tackle the examination timetabling problem in two stages: the problem is first solved to feasibility, and in the second stage the quality of the obtained solution is improved without making the solution infeasible. The two stages apply different neighbourhoods; the second stage only employs neighbourhoods that do not violate any hard constraints. We will take these findings into account. Kendall and Hussin (2005a) apply a tabu search-based hyperheuristic approach in which the low-level heuristics execute small changes to the current solution. Low-level heuristics like these are called ‘perturbative’. The heuristic selection criterion applies a tabu list for selecting a non-tabu heuristic in the next iteration; the tabu list length is equal to the number of heuristics. Although they do not improve on the best results in the literature, Kendall and Hussin demonstrate that it is possible to find sufficiently good solutions with a very general approach that can easily be applied to other problems. A hyperheuristic approach to university timetabling problems was also applied in Qu and Burke (2009). For tackling the Toronto benchmarks, they employed a graph-based hyperheuristic framework in which high-level heuristics such as steepest descent, iterated local search, tabu search, and variable neighbourhood search are applied for searching sequences of low-level graph colouring heuristics, such as largest (weighted) degree, largest enrolment, colour degree, and saturation degree. The approach reported in Qu and Burke (2009) builds on the authors’ experiences in Burke et al. (2007), in which tabu search is applied as a high-level heuristic. In both articles (Burke et al. 2007; Qu and Burke 2009), two search spaces are investigated: heuristics on the one hand, and solutions on the other hand. In Burke et al. (2010), a hybrid variable neighbourhood search approach to the Toronto benchmarks is described. The variable neighbourhood search is hybridized with a genetic algorithm, which is used for configuring the neighbourhoods. The reported method improves on one of the data sets of the Toronto benchmarks. Burke and Bykov (2008) introduce the late acceptance strategy and apply it to Carter’s examination timetabling benchmarks; for some of the instances, new best results are obtained. We will incorporate late acceptance among other acceptance criteria in our hyperheuristic approach.

The capacitated instances of the Toronto benchmarks were introduced in Burke et al. (1996) and further studied and improved in Abdullah et al. (2007a, 2007b), Di Gaspero and Schaerf (2001), Caramia et al. (2001), Merlot et al. (2003). We refer to Qu et al. (2009) for a review.

Since their introduction, the instances of the examination timetabling track of the Second International Timetabling Competition (McCollum et al. 2010) have gained increasing popularity. In contrast to Carter’s Toronto benchmarks, they include more constraint types, which makes them more interesting and realistic. For instance, the room capacity constraint is now taken into account. The hard constraints are:

  • a student cannot take more than one exam per time slot;

  • an exam cannot be split over several rooms;

  • the room’s capacity cannot be exceeded;

  • some exams require rooms with special properties;

  • the duration of the time slot to which an exam is assigned should be greater than or equal to the duration of the exam;

  • some exams should be scheduled before, after, at the same time as, or at different times than other exams.

The following soft constraints should be taken into account:

  • two exams taken by the same student should not be scheduled on the same day or in two consecutive time slots;

  • exams should be spread out as much as possible;

  • exams with different durations should not be assigned to the same room;

  • large exams should be scheduled early in the timetable;

  • some of the time slots in the examination timetable should be avoided;

  • some of the rooms should be avoided for examination.

The problem size of the 12 data sets is presented in Table 2. For more detailed information about the examination timetabling track of the Second International Timetabling Competition, we refer to McCollum et al. (2007, 2010).

Table 2 Problem size of the 12 data sets of the examination timetabling track of the Second International Timetabling Competition. The conflict density column indicates the number of conflicts between the exams. It is the ratio of the number of non-zero elements in the conflict matrix to the total number of conflict matrix elements

The five finalists of the examination track of the competition have applied different approaches, such as:

  • a constraint solver, which in the first stage constructs a complete feasible solution, followed by a hill climbing and a great deluge approach for improving the solution (Müller 2008);

  • a Greedy Randomized Adaptive Search Procedure, followed by simulated annealing, and integer programming (Gogos et al. 2008);

  • a constraint satisfaction solver combined with tabu search and iterated local search (Atsuta et al. 2008);

  • Drools Solver (De Smet 2008), which is a combination of tabu search and the JBoss Drools rules engine used for the evaluation, and

  • a heuristic method that is inspired by cell biology (Pillay 2008).

After the end of the competition, it was shown that feasible solutions could be obtained for all 12 instances (McCollum et al. 2009). McCollum et al. (2009) also improve on six out of the 12 benchmark instances. Their approach consists of two parts: a feasible solution is constructed first, and afterwards the solution is further improved with an extended great deluge algorithm, to which a reheating mechanism is added for escaping from local optima.

The runner-up of the competition, Gogos et al. (to appear), published a follow-up paper in which they improve on the approach applied in the competition. The method consists of two phases: a feasible solution is constructed, which is then further improved. The improvement phase consists of several parts:

  • first a steepest descent local search is applied until a local optimum is reached;

  • simulated annealing is applied for escaping from the local optimum;

  • when no further improvement is obtained, the problem is divided into smaller problems, which are solved with integer programming;

  • finally, if the previous stage cannot improve the current solution anymore, the solution is shaken.

Twenty percent of the computation time is spent on the construction phase, and the remainder is spent on the improvement phase. For some of the instances, the best results in the literature are obtained.

Besides the benchmarks, considerable attention has been paid to examination timetabling problems from practice (Ayob et al. 2007; Carter and Laporte 1996; Dimopoulou and Miliotis 2001; Hansen and Vidal 1995; Kendall and Hussin 2005b; Lim et al. 2000; Thompson and Dowsland 1996).

2.2 Hyperheuristics for examination timetabling

Hyperheuristics are high level search and optimization methods (Burke et al. 2003). Instead of operating directly on a set of candidate solutions, as is the case for meta-heuristics, they operate on a set of (meta-)heuristics. These low-level heuristics can be either perturbative (changing only small parts of the solution) or constructive (building a solution from scratch). The motivation behind hyperheuristics is to provide a general framework for tackling a wide range of problems. Ideally, the researcher should only need to adapt the low-level heuristics when solving a problem different from the current one. Another objective of hyperheuristics research is to generate synergy between the heuristics involved in the search, making use of their strengths, avoiding their weaknesses, and taking advantage of their combined capabilities.

Several papers applying hyperheuristics to examination timetabling problems have appeared in the last decade; some of them are discussed in Sect. 2.1. Pillay and Banzhaf (2009) apply a constructive hyperheuristic to the Toronto benchmarks. The hyperheuristic chooses a combination of low-level heuristics for constructing a solution. Instead of applying the low-level heuristics sequentially, they are combined hierarchically and simultaneously. The low-level heuristics correspond to the five basic graph colouring criteria introduced by Carter et al. (1996). Without any improvement phase, the algorithm generates good quality results. For some data instances of the Toronto benchmarks, the results are better than those obtained by similar graph-based hyperheuristic approaches. Graph-based hyperheuristics for the Toronto benchmarks are also employed in Burke et al. (2005, 2007), Qu and Burke (2005, 2009). These four papers apply the five basic graph colouring criteria (Carter et al. 1996) as low-level heuristics for constructing a solution, with different high-level heuristics, such as tabu search, variable neighbourhood search, iterated local search, and steepest descent. Kendall and Hussin (2005b) describe the examination timetabling problem at the MARA university in Malaysia. They apply a tabu search-based hyperheuristic approach for selecting the low-level heuristics. The low-level heuristics combine graph colouring heuristics with perturbative heuristics that swap or move random exams. Kendall and Hussin (2005a) apply the same approach to the Toronto benchmarks.

3 Problem description

3.1 The KAHO problem

Unlike the examples discussed in the literature, the KAHO Sint-Lieven examination timetabling problem combines both written and oral exams. Written exams of the same subject should be scheduled in the same time slot for all the students involved, whereas for oral exams a lecturer can examine at most 20 students in the same time slot. An oral and a written exam cannot be assigned to the same room at the same time. Different written exams can be organized in the same room simultaneously, provided the room capacity is not exceeded.

At KAHO Sint-Lieven, the examination capacity of a room ranges from one third to one half of its normal capacity for lectures. The actual ratio depends on the physical properties of the room.

At the beginning of the academic year, students are automatically enrolled for the exams that correspond to their curriculum. This causes, especially in the first year, a high percentage of students not showing up, and hence a poor usage of the available rooms: exams are assigned to rooms that are larger than needed.

An exam takes at most half a day (from 8h30 till 12h30 or from 14h00 till 18h00) and should only be scheduled on workdays. Students should have at least 3 half days of study time between two consecutive exams. A typical examination period takes 4 weeks of 5 days each, resulting in 40 available time slots.

We solve the data set for the first semester of the academic year 2005–2006. This was the last semester in which exams were scheduled on Saturdays. The manual planner actually needed 24 days (or 48 time slots) for scheduling the exams. Apart from satisfying the above constraints, the aim is to eliminate the exams on Saturdays as well.

The problem includes:

  • 336 exams,

  • 135 student groups,

  • 71 rooms of varying size and

  • 40 time slots of 4 hours each.

Note that we do not take the assignment of invigilators into account. In practice, for oral exams the examiner acts as invigilator at the same time, whilst for written exams teaching assistants invigilate. We do plan to incorporate the assignment of invigilators in future work.

3.2 Constraints and objective

In this section, we identify the different constraints of the KAHO Sint-Lieven examination timetabling problem and distinguish between hard and soft constraints. A solution containing hard constraint violations cannot be executed in practice; a solution satisfying all hard constraints is called a feasible solution. Soft constraints need not be satisfied, but satisfying them improves the quality of the solution. The goal is to search for feasible solutions that satisfy the soft constraints to the highest possible extent.

The hard constraints imply:

  • Students cannot take two exams at the same time.

  • The number of students assigned to a room cannot exceed its capacity.

  • All exams should be scheduled within the planning horizon of four weeks.

Although not all the following rules are considered soft constraints at KAHO Sint-Lieven, we rank them as soft since violating them does not make the solution ‘physically infeasible’. All soft constraints get weight 1.

  • All written exams of the same subject should be scheduled in the same time slot.

  • All oral exams should be scheduled such that the maximum number of examinees per time slot is at most 20.

  • Lecturers who are responsible for oral exams cannot examine more than one student group at the same time.

  • Oral and written exams should not be merged into the same room.

  • Students should have at least three slots of study time between two consecutive exams. This corresponds to the spreading out constraint (Corne et al. 1994; Ross et al. 1994).

The objective of the problem is to find a feasible solution that minimizes the weighted sum of the violations of the soft constraints.

3.3 Mathematical formulation

In examination timetabling, $E$ exams need to be scheduled into $T$ time slots and $R$ rooms, taking into account the capacity of the rooms and the restriction that students cannot take more than one exam at the same time, among other constraints. Every student group takes a number of exams. In the examination timetabling problem at hand, the exams depend on the student group’s curriculum, which is similar to the situation in the curriculum-based university course timetabling case.

The decision variables are

$$x_{rte} \in \{0, 1\} $$
(1)

with $r=1,\dots,R$; $t=1,\dots,T$; $e=1,\dots,E$, and where $x_{rte} = 1$ if exam $e$ is assigned to room $r$ in time slot $t$. An exam $e$ is characterized by a subset of the set of all student groups taking the same course, and by a lecturer $l$ ($l=1,\dots,L$, with $L$ the number of lecturers). A student group is denoted by $s$ ($s=1,\dots,S$, with $S$ the number of student groups).

  • Let $S_{rt}$ be the total number of students assigned to room $r$ at time slot $t$.

  • Let $W$ denote the set of all written exams and $O$ the set of all oral exams. Since no exam can be both written and oral, $W \cap O = \emptyset$.

  • Let $N_e$ denote the number of student groups taking the same exam $e$ of the same course with the same lecturer.

  • Let $\mathit{CAP}_r$ be the capacity of room $r$.

  • Let $A$ denote the conflict matrix. This matrix represents the exams that cannot be scheduled in the same time slot (‘conflicting exams’) because they have students in common. It is a symmetric $E \times E$ matrix: the diagonal elements are 0, whilst the off-diagonal elements correspond to the number of students that the involved exams have in common.

∀ rooms r, time slots t, and exams e:

$$\sum_{m=1}^E A_{me}x_{rtm} \leq M(1-x_{rte}) $$
(2)

with $M$ an arbitrarily large number.

Equation (2) expresses that no student group can take more than one exam in the same time slot.

$$\forall e, r, t:S_{rt} x_{rte} \leq{\mathit{CAP}}_r $$
(3)

Equation (3) expresses that the number of students assigned to a room cannot exceed the room’s capacity.

$$\forall e: \sum_{r=1}^R \sum_{t=1}^T x_{rte} \geq 1 $$
(4)

Equation (4) expresses that all exams are scheduled.

Solutions satisfying (2)–(4) are feasible. In order to find feasible solutions that satisfy as many soft constraints as possible, the following objective function (5) is introduced.

$$\min\ \sum_{e \in W} \sum_{t=1}^T P_{1,te} + \sum_{e \in O} P_{2,e} + P_3 + P_4 $$
(5)

with $P_1$, $P_2$, $P_3$, and $P_4$ the penalties resulting from the violations of the soft constraints. These penalties are expressed below.

$$\forall e \in W, \forall t:P_{1,te} = \Bigg|\sum_{r=1}^R x_{rte} - N_e \Bigg| $$
(6)

Equation (6) generates a penalty when the written exams of the same subject are not assigned to one single time slot.

$$\forall e \in O:P_{2,e} = \Bigg| \sum_{r=1}^R \sum_{t=1}^T x_{rte} - N_e \Bigg| $$
(7)

Equation (7) requires that the number of oral exams of the same course should equal the number of student groups taking the course.

$$P_3 = \sum_{r=1}^R \sum_{t=1}^T \biggl(\sum_{e \in O} x_{rte}\biggr) \biggl(\sum_{e \in W} x_{rte}\biggr) $$
(8)

Equation (8) expresses that an oral and a written exam cannot be assigned to the same room in the same time slot.

$$P_4 = \sum_{s=1}^S \sum_{\substack{e,e' \in E_s\\ e \neq e'}} \sum_{r=1}^R \sum_{r'=1}^R \sum_{\substack{t,t'\\ 1 \leq |t-t'| \leq 3}} x_{rte}\, x_{r't'e'} $$
(9)

with $E_s$ the set of exams taken by student group $s$.

Equation (9) requires that for each student group, at least 3 time slots of study time should be assigned between two exams.

3.4 Benchmarks versus KAHO problem

In Table 3 we summarize the differences and similarities between the KAHO problem and the two best known examination timetabling benchmarks in the literature: the Toronto benchmark and the ITC 2007 data sets. Note that the KAHO problem differs from the other problems in the considered examination types and in the smallest examinee unit. Since the Toronto and ITC 2007 instances only consider written exams, they do not impose special constraints on the maximum number of students that can take an exam at the same time. Contrary to the two benchmark problems, the KAHO problem is a curriculum-based examination timetabling problem: exams at KAHO are organized per student group following the same curriculum, whilst in the Toronto and ITC 2007 benchmark instances, every student may have an individual examination timetable. In the KAHO case, all exams take 4 hours, whereas in the ITC 2007 case, exams may have different durations. Also, the exam order is important in the ITC 2007 case: large exams should preferably be scheduled in the beginning of the period. The KAHO and ITC 2007 cases take the (limited) room size into account, whilst the Toronto instances consider infinite room sizes. In fact, except for the lack of oral examinations and the notion of curricula, the ITC 2007 model can be considered more general than the KAHO model: it allows for exams with different durations, and it prefers exams to be scheduled early in the planning period. If it were extended with curriculum-based exams and with the notion of oral exams, our model would fit perfectly into the ITC model.

Table 3 Differences and similarities between the Toronto and ITC 2007 benchmark and the KAHO Sint-Lieven examination timetabling problem

4 Solution methods

In this section we introduce the components of the hyperheuristic approach. Section 4.1 introduces the low-level heuristics. The construction of the initial solutions for the three examination timetabling problems is discussed in Sect. 4.2. In Sect. 4.3 we present the four acceptance criteria and the heuristic selection method.

4.1 Low-level heuristics

The hyperheuristics framework is built on top of a local search framework, for which an appropriate solution representation is generated. We mimic the timetabling process of a human planner as closely as possible. The rows of a two-dimensional matrix correspond to the rooms and the columns to the time slots; the number of time slots corresponds to the length of the examination period. A room-time slot combination can hold several exams, which can individually be moved to another room-time slot combination. The number of exams in the same room-time slot combination is limited by the room’s capacity.
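As a concrete illustration, the following minimal Java sketch shows this representation, assuming exams are identified by integers; the class and its members are our own naming, not the paper’s code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the solution representation: rows are rooms, columns are time
// slots, and every cell holds the exams currently assigned to that
// room-time slot combination.
public class Timetable {
    private final List<Integer>[][] cells;   // cells[room][slot] -> exam ids
    private final int[] roomCapacity;

    @SuppressWarnings("unchecked")
    public Timetable(int rooms, int slots, int[] roomCapacity) {
        this.roomCapacity = roomCapacity;
        this.cells = new List[rooms][slots];
        for (int r = 0; r < rooms; r++) {
            for (int t = 0; t < slots; t++) {
                cells[r][t] = new ArrayList<>();
            }
        }
    }

    // The basic move of the capacitated heuristics (see below): relocate a
    // single exam to another room-time slot combination.
    public void move(int exam, int fromRoom, int fromSlot, int toRoom, int toSlot) {
        cells[fromRoom][fromSlot].remove(Integer.valueOf(exam));
        cells[toRoom][toSlot].add(exam);
    }

    public List<Integer> examsAt(int room, int slot) {
        return Collections.unmodifiableList(cells[room][slot]);
    }

    public int capacityOf(int room) {
        return roomCapacity[room];
    }
}
```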

In order to overcome the under-utilization of rooms for the KAHO problem (Sect. 3.1), we introduce a show up factor: this factor represents the percentage of enrolled students that are likely to attend the exam. Throughout this paper we assume that 95% of the students show up. This estimate is thought to be safe even for first-year students.

We distinguish between the heuristics that were developed for solving the capacitated KAHO and ITC 2007 examination problems, and good heuristics from the literature (Sect. 2) on the uncapacitated Toronto examination timetabling problem. We refer to the first set as the capacitated heuristics and to the latter as the uncapacitated heuristics:

  • All the capacitated heuristics rely on one basic move: relocating an exam to another room-time slot combination. In case the capacity of the destination room is insufficient, the content of both matrix elements is swapped; otherwise, the exam is simply moved from the original to the destination position. The heuristics restrict the directions in which the content can be moved. The four heuristics that both capacitated examination timetabling problems have in common are:

    • Cap 1: This heuristic restricts the moves to random room-time slot combinations. The destination room and time slot need to be different from the original room and time slot.

    • Cap 2: This heuristic only allows moves in the same time slot to rooms that have enough excess capacity to accommodate the students taking the exam.

    • Cap 3: A randomly chosen exam is moved to a randomly chosen room, within the original time slot.

    • Cap 4: A randomly chosen exam is moved to a randomly chosen time slot, whilst maintaining the original room.

    The ITC 2007 examination timetabling problem has some extra soft constraints. Therefore we constructed an additional set of low-level heuristics that address these constraints in particular:

    • ITC 1: A randomly chosen exam is moved to another time slot in the same room thereby avoiding time slots that introduce an extra period penalty.

    • ITC 2: A randomly chosen exam is moved to another room in the same time slot, avoiding rooms that introduce an extra room penalty.

    • ITC 3: A randomly chosen large exam is moved to a time slot in the beginning of the examination period.

    None of the above heuristics takes special precautions to maintain the feasibility of a solution.

  • Concerning the uncapacitated Toronto examination timetabling problem, we apply the Kempe chain based heuristics. These low-level heuristics perturb feasible solutions to the uncapacitated examination timetabling problem without making them infeasible. Consider a partial solution corresponding to the left-hand side of Fig. 1. If we move exam $e_1$ to time slot $t_j$, the solution becomes infeasible, since exam $e_1$ has students in common with exams $e_6$, $e_7$, and $e_8$, which are assigned to time slot $t_j$. To overcome this, exams $e_6$, $e_7$, and $e_8$ should be moved to time slot $t_i$. This process is repeated until all the exams that have students in common are assigned to different time slots. The result is depicted on the right-hand side of Fig. 1. The Kempe chain heuristic Uncap 1 can thus be seen as moving an exam to another time slot whilst maintaining feasibility by repairing any infeasibilities that may have been introduced (a code sketch is given after Fig. 1). The swap-two-time slots heuristic Uncap 2, introduced in Burke and Bykov (2008), swaps all exams of two randomly chosen time slots. This heuristic maintains feasibility since the swap includes all exams assigned to the two time slots.

    Fig. 1

    An example of the Kempe chain heuristic. A move of exam $e_1$ in time slot $t_i$ to another time slot $t_j$ requires repair moves to maintain feasibility. Exams $e_3$ and $e_4$ have to be moved to time slot $t_j$ and exams $e_6$, $e_7$, and $e_8$ have to be moved to time slot $t_i$
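The following Java sketch illustrates the Uncap 1 Kempe chain move, under the assumption that the uncapacitated solution is represented as a list of exam sets, one per time slot, together with the conflict matrix of Sect. 2.1. It is our illustration, not the paper’s implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the Kempe chain move (Uncap 1): an exam is moved from time slot
// ti to time slot tj, and every clash this creates is repaired by moving
// the conflicting exams in the opposite direction, transitively, so that
// the solution stays clash-free.
public class KempeChain {

    // slots.get(t) holds the exams assigned to time slot t;
    // conflicts[i][j] > 0 means exams i and j have students in common.
    static void kempeMove(List<Set<Integer>> slots, int[][] conflicts,
                          int exam, int ti, int tj) {
        Set<Integer> moveToJ = new HashSet<>();   // exams leaving ti for tj
        Set<Integer> moveToI = new HashSet<>();   // exams leaving tj for ti
        Deque<int[]> frontier = new ArrayDeque<>();
        moveToJ.add(exam);
        frontier.push(new int[]{exam, tj});       // exam will land in slot tj

        while (!frontier.isEmpty()) {
            int[] cur = frontier.pop();
            int e = cur[0], dest = cur[1];
            int src = (dest == tj) ? ti : tj;
            Set<Integer> opposite = (dest == tj) ? moveToI : moveToJ;
            for (int f : slots.get(dest)) {
                // f clashes with the incoming exam e and is not yet moving
                if (conflicts[e][f] > 0 && opposite.add(f)) {
                    frontier.push(new int[]{f, src});
                }
            }
        }
        slots.get(ti).removeAll(moveToJ);
        slots.get(tj).removeAll(moveToI);
        slots.get(ti).addAll(moveToI);
        slots.get(tj).addAll(moveToJ);
    }

    public static void main(String[] args) {
        // Two time slots; exams 0 and 2 share students, so moving exam 0
        // from slot 0 to slot 1 forces exam 2 back to slot 0.
        List<Set<Integer>> slots = new ArrayList<>();
        slots.add(new HashSet<>(Set.of(0, 1)));
        slots.add(new HashSet<>(Set.of(2)));
        int[][] conflicts = {{0, 0, 1}, {0, 0, 0}, {1, 0, 0}};
        kempeMove(slots, conflicts, 0, 0, 1);
        System.out.println(slots);   // slot 0 now holds {1, 2}, slot 1 holds {0}
    }
}
```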

4.2 Constructing initial solutions

The KAHO, Toronto, and ITC 2007 examination timetabling data sets are all tackled in the same way. The approaches only differ in the applied low-level heuristics and in the way the initial solutions are constructed.

  • For the KAHO problem, we do not spend much effort on constructing a good quality initial solution: all the exams are randomly assigned to free room-time slot combinations.

  • The same random initialization procedure was applied to the Toronto instances, but too many iterations were required to obtain a feasible solution for most of the problem instances. For this reason, we decided to make use of the properties of the conflict matrix (Sect. 2) to construct a feasible solution. The initialization phase is described in Algorithm 1; a code sketch is given after this list. Starting from the conflict matrix, we assign exams to time slots such that no student takes two exams at the same time. The procedure assigns exams in order of decreasing number of conflicts, which corresponds to the largest degree node colouring heuristic (Carter et al. 1996). This is comparable to what many human planners do: they first assign the exams that have many constraints, so that the exams with fewer constraints are planned in the end. Some noise is added by randomly assigning exams that cause no conflicts with previously assigned exams to one of the non-conflicting time slots. If no feasible solution can be found, the conflicting exams, together with the randomly assigned non-conflicting exams, are removed from the solution. After that, the removed exams are reassigned following the same procedure. This process is repeated until a feasible solution is found or until the total number of attempts exceeds a threshold value. In the latter case, the (infeasible) solution with the lowest number of unassigned exams becomes the new solution, and the unassigned exams are randomly assigned to time slots. The Cap i (i=1,…,4) heuristics are then applied to obtain a feasible solution through perturbations. Although these heuristics do not guarantee that the solution becomes feasible, the experiments always resulted in a feasible solution within a short amount of computation time.

    Algorithm 1

    Pseudocode for initializing the solution of Toronto instances

  • In the ITC 2007 examination timetabling problem, the exams are ranked in decreasing order according to the number of students taking the exam. Exams are assigned to a room such that both the capacity and duration constraints are satisfied. This process is repeated for the first T exams. The remaining exams are assigned to room-time slot combinations as follows:

    • First, the algorithm checks whether the exam can be assigned to one of the occupied room-time slot combinations. If there is sufficient capacity left and the exam’s duration is not greater than the duration of the time slot, the exam is assigned to the first appropriate room-time slot combination that is encountered;

    • If none of the occupied room-time slot combinations has sufficient capacity to accommodate the exam, the exam is randomly assigned to an empty room, taking into account that the room can accommodate the number of students taking the exam. In addition, the duration of the exam should be less than or equal to the duration of the corresponding time slot.

    The construction of the initial solution does not guarantee feasibility, since we only take two of the five hard constraints into account. Algorithm 2 presents the initialization phase; a second code sketch after this list illustrates it.

    Algorithm 2

    Pseudocode for the initialization of the solution of ITC 2007 exam instances
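Since Algorithm 1 is only available as a figure, the following hedged Java sketch reproduces the construction described above for the Toronto instances: exams are sorted by decreasing conflict count (largest degree) and placed in randomly chosen non-conflicting time slots. The retry and threshold bookkeeping of the full algorithm is omitted, and all names are ours.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of the largest degree initialization for the Toronto instances
// (cf. Algorithm 1). Returns slotOf[e] = time slot of exam e, or null if
// some exam cannot be placed (the full algorithm would then retry).
public class TorontoInit {

    static int[] initialise(int[][] conflicts, int numSlots, Random rng) {
        int n = conflicts.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Largest degree first: the most conflicting exams are placed early.
        Arrays.sort(order, Comparator.comparingInt(e -> -degree(conflicts, e)));

        int[] slotOf = new int[n];
        Arrays.fill(slotOf, -1);
        for (int e : order) {
            List<Integer> feasible = new ArrayList<>();
            for (int t = 0; t < numSlots; t++) {
                if (noClash(conflicts, slotOf, e, t)) feasible.add(t);
            }
            if (feasible.isEmpty()) return null;   // caller would retry here
            // Picking randomly among the non-conflicting slots adds the
            // 'noise' mentioned in the text.
            slotOf[e] = feasible.get(rng.nextInt(feasible.size()));
        }
        return slotOf;
    }

    static int degree(int[][] conflicts, int e) {
        int d = 0;
        for (int c : conflicts[e]) if (c > 0) d++;
        return d;
    }

    static boolean noClash(int[][] conflicts, int[] slotOf, int e, int t) {
        for (int f = 0; f < slotOf.length; f++) {
            if (slotOf[f] == t && conflicts[e][f] > 0) return false;
        }
        return true;
    }
}
```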
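An analogous hedged sketch of the ITC 2007 initialization (cf. Algorithm 2): exams are ranked by decreasing enrolment, and each exam is placed in the first occupied room-time slot combination with enough remaining capacity and a sufficiently long period, or otherwise in a random empty room that fits. The special treatment of the first T exams is elided, and the types are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of the ITC 2007 initialization: only the room capacity and period
// duration constraints are respected, so the result is not necessarily
// feasible, as noted in the text.
public class ItcInit {

    static class Exam {
        int id, students, duration;
        Exam(int id, int students, int duration) {
            this.id = id; this.students = students; this.duration = duration;
        }
    }

    static void place(List<Exam> exams, int[] roomCap, int[] slotDuration,
                      List<Integer>[][] cells, int[][] usedSeats, Random rng) {
        exams.sort(Comparator.comparingInt((Exam e) -> -e.students));
        for (Exam e : exams) {
            if (!tryOccupied(e, roomCap, slotDuration, cells, usedSeats)) {
                placeInEmptyRoom(e, roomCap, slotDuration, cells, usedSeats, rng);
            }
        }
    }

    static boolean tryOccupied(Exam e, int[] roomCap, int[] slotDuration,
                               List<Integer>[][] cells, int[][] usedSeats) {
        for (int r = 0; r < roomCap.length; r++) {
            for (int t = 0; t < slotDuration.length; t++) {
                if (!cells[r][t].isEmpty()
                        && usedSeats[r][t] + e.students <= roomCap[r]
                        && e.duration <= slotDuration[t]) {
                    cells[r][t].add(e.id);
                    usedSeats[r][t] += e.students;
                    return true;
                }
            }
        }
        return false;
    }

    static void placeInEmptyRoom(Exam e, int[] roomCap, int[] slotDuration,
                                 List<Integer>[][] cells, int[][] usedSeats,
                                 Random rng) {
        for (int attempt = 0; attempt < 100000; attempt++) { // assumes a fit exists
            int r = rng.nextInt(roomCap.length);
            int t = rng.nextInt(slotDuration.length);
            if (cells[r][t].isEmpty() && e.students <= roomCap[r]
                    && e.duration <= slotDuration[t]) {
                cells[r][t].add(e.id);
                usedSeats[r][t] += e.students;
                return;
            }
        }
    }
}
```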

4.3 Hyperheuristics framework: move acceptance criteria, heuristic selection method and tournament factor

A typical hyperheuristics framework includes a heuristic selection mechanism and a move acceptance criterion. The former selects a heuristic to be applied to a single candidate solution; the latter determines whether the resulting candidate solution is accepted or rejected. Several well-known meta-heuristic methods have been deployed within hyperheuristics to serve as selection methods or as move acceptance criteria.

The hyperheuristic framework applied in this paper is influenced by the work of Özcan et al. (2008) and Bilgin et al. (2007). We have extended this hyperheuristic framework with a tournament factor. At each iteration, the selected heuristic generates a predefined number of random moves; this number is the tournament factor. The move leading to the candidate solution with the lowest fitness value is selected. Evaluation of a solution by the fitness function results in a value that reflects the solution quality: the lower the value, the better the solution. If the best move is accepted by the acceptance criterion, the current solution is modified by executing that move.
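The following Java sketch shows the tournament mechanism; the Move and LowLevelHeuristic interfaces are illustrative stand-ins for the framework’s actual types, which the paper does not publish.

```java
import java.util.function.DoubleSupplier;

// Illustrative sketch of the tournament factor: the selected low-level
// heuristic proposes 'tournamentFactor' random moves, and the move yielding
// the lowest fitness value is handed to the move acceptance criterion.
interface Move { void apply(); void undo(); }
interface LowLevelHeuristic { Move randomMove(); }

class Tournament {
    static Move bestOf(LowLevelHeuristic heuristic, int tournamentFactor,
                       DoubleSupplier fitness) {
        Move best = null;
        double bestFitness = Double.POSITIVE_INFINITY;
        for (int i = 0; i < tournamentFactor; i++) {
            Move move = heuristic.randomMove();
            move.apply();                      // evaluate the candidate solution
            double f = fitness.getAsDouble();  // lower is better
            move.undo();                       // restore the current solution
            if (f < bestFitness) { bestFitness = f; best = move; }
        }
        return best;  // the acceptance criterion decides whether to execute it
    }
}
```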

This design enables application to a wide range of problems. The tournament-based hyperheuristic framework has been successfully applied to patient admission scheduling (Bilgin et al. 2010). This problem considers patients that need to be assigned to hospital beds, taking into account medical and personal constraints. The goal is to assign patients to the room type of their choice, whilst at the same time satisfying their medical needs (Demeester et al. 2010).

We apply the ‘simple random’ heuristic selection method, which randomly chooses a heuristic from a fixed list at each iteration. This is actually the simplest heuristic selection method possible.

We consider the following four move acceptance criteria for the experiments (a combined code sketch follows this list):

  • ‘improving or equal’, abbreviated as IE, which only accepts moves that do not worsen the solution;

  • simulated annealing, abbreviated as SA, is represented in Algorithm 3. In contrast to Burke et al. (2004a), in which the authors report spending considerable time tuning the problem specific parameters in their simulated annealing algorithm, the acceptance criterion considered here does not contain any parameters that need to be tuned.

    Algorithm 3

    Pseudocode of the simulated annealing acceptance criterion

  • great deluge, abbreviated as GD, is presented in Algorithm 4.

    Algorithm 4

    Pseudocode of the great deluge acceptance criterion

  • an adapted version of the late acceptance strategy (Burke and Bykov 2008), abbreviated as LA. Similar to tabu search, the late acceptance strategy makes use of a list of a given length L. The acceptance list does not contain properties of tabu moves, but the fitness values of the previous L candidate solutions. Rather than comparing a candidate solution with the current solution, it is compared with the oldest value in the list. When the candidate solution’s fitness value is less than the oldest value in the list, the candidate solution is accepted and its fitness value replaces the oldest value in the list. The late acceptance variant that we introduce in this paper first compares the candidate solution’s fitness value with the current solution’s fitness value, which amounts to a (limited) steepest descent. It is not a steepest descent in the strict sense, since only a small part of the neighbourhood is explored. In case the candidate solution is worse than the current solution, the original late acceptance strategy is applied. Algorithm 5 summarizes the steepest descent late acceptance strategy.

    Algorithm 5

    Pseudocode of the steepest descent late acceptance strategy acceptance criterion
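Since Algorithms 3 to 5 are only available as figures, the sketch below summarizes the four acceptance criteria in Java. The ‘improving or equal’ and steepest descent late acceptance rules follow the text directly; the simulated annealing temperature schedule and the great deluge level update are simplified assumptions of ours and do not reproduce the paper’s exact parameter-free formulations.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

interface Acceptance { boolean accept(double candidate, double current); }

class ImprovingOrEqual implements Acceptance {           // IE
    public boolean accept(double cand, double cur) { return cand <= cur; }
}

class SimulatedAnnealing implements Acceptance {         // SA, schedule assumed
    private final Random rng = new Random();
    private final long start = System.currentTimeMillis();
    private final long budgetMs;
    SimulatedAnnealing(long budgetMs) { this.budgetMs = budgetMs; }
    public boolean accept(double cand, double cur) {
        if (cand <= cur) return true;
        // Temperature shrinks with the remaining fraction of the time budget.
        double remaining = Math.max(1e-9,
                1.0 - (System.currentTimeMillis() - start) / (double) budgetMs);
        return rng.nextDouble() < Math.exp((cur - cand) / remaining);
    }
}

class GreatDeluge implements Acceptance {                // GD, decay assumed
    private double level;
    private final double decay;
    GreatDeluge(double initialLevel, double decay) {
        this.level = initialLevel; this.decay = decay;
    }
    public boolean accept(double cand, double cur) {
        boolean ok = cand <= level || cand <= cur;
        level -= decay;                                  // the 'water level' drops
        return ok;
    }
}

class SteepestDescentLateAcceptance implements Acceptance {  // LA-SD
    private final Deque<Double> list = new ArrayDeque<>();
    SteepestDescentLateAcceptance(int length, double initialFitness) {
        for (int i = 0; i < length; i++) list.addLast(initialFitness);
    }
    public boolean accept(double cand, double cur) {
        // First compare with the current solution (the limited steepest
        // descent part), then with the value from L iterations ago.
        boolean ok = cand <= cur || cand < list.peekFirst();
        list.removeFirst();
        list.addLast(ok ? cand : cur);   // record the new current fitness
        return ok;
    }
}
```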

The hyperheuristic presented in this article resembles the approach discussed in Bai et al. (2007), Bai and Kendall (2005). Bai et al. (2007) apply a simulated annealing hyperheuristic to problems from nurse rostering, university course timetabling, and bin packing. Their heuristic selection mechanism starts with random selection and gradually learns which low-level heuristics perform better than the others. As the search proceeds, the better performing low-level heuristics gradually increase their chances of being selected. Experiments show that this method leads to better solutions than some bespoke meta-heuristics. The approach presented in the current paper differs from Bai et al. (2007) in the move acceptance criteria and in the use of the tournament factor.

5 Experiments

5.1 Settings

Every hyperheuristic variant is executed ten times on each problem instance. Throughout this section, all the quantitative comparisons are assessed using the Student t-test method with a 95% confidence level.

The applications were implemented in Java. All experiments were performed on Intel Core2Duo (3 GHz) PCs running Windows XP Professional SP3, with a Java 1.6 JRE configured to run in server mode with a heap size of 128 MB.

The stopping criteria differ between problems:

  • one hour for the KAHO problem;

  • for the Toronto benchmarks, the stopping criteria range from 1 hour to 12 hours;

  • for the ITC 2007 data sets, the stopping criterion depends on the performance of the computer on which the algorithms are executed. In order to guarantee a fair comparison between the different approaches, the competition organizers released a benchmark program that, depending on the contestant’s computer performance, determines the available computation time to find a solution. In our case the stopping criterion was fixed at 300 seconds.

We also compare the effect of the original late acceptance strategy (Burke and Bykov 2008) and the late acceptance strategy combined with the steepest descent algorithm. From the experiments it is clear that the combination of the late acceptance strategy and steepest descent leads to better results for the KAHO examination problem. The experiments also show that the length of the acceptance list plays a crucial part in the success of the algorithm. The performance difference between the two late acceptance strategies on the Toronto and ITC 2007 data sets is less clear-cut: for some of the instances, the original late acceptance strategy outperforms the adapted version, and vice versa.

5.2 The KAHO problem

Once a random initial solution is generated, the hyperheuristic adapts the solution with perturbative heuristics (Sect. 4.1) until a feasible solution is obtained. By attributing a high weight (100 in this case) to the violations of the hard constraints in the fitness function, the search is biased towards finding a feasible solution whilst trying to improve the quality. The weights corresponding to the soft constraints have all been set equal to 1. Since none of the low-level heuristics remove exams from the timetable, the third hard constraint of the KAHO problem (Sect. 3.2) is always satisfied.
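A one-line Java sketch of this weighted fitness, with placeholder violation counts; the weight of 100 is the value stated above, whilst the class and method names are ours.

```java
// Sketch: fitness for the KAHO problem. Hard constraint violations are
// weighted 100 and soft constraint violations 1, biasing the search
// towards feasibility whilst still improving quality.
public class KahoFitness {
    static final int HARD_WEIGHT = 100;
    static final int SOFT_WEIGHT = 1;

    static int fitness(int hardViolations, int softViolations) {
        return HARD_WEIGHT * hardViolations + SOFT_WEIGHT * softViolations;
    }
}
```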

Table 4 presents the experimental results for the KAHO case. For the hyperheuristic with the simulated annealing acceptance criterion and tournament factor 4 (SA-4), we experimented with different types of low-level heuristics. In the first experiment, we applied only the Cap i (i=1,…,4) heuristics, whilst in the second experiment the four Cap i and the two Uncap i heuristics were applied. From Table 4 it is clear that applying only the Cap i heuristics leads to better results. Based on these findings, we decided to employ only the Cap i heuristics in the further experiments. The steepest descent variant of the late acceptance strategy outperforms the original late acceptance strategy for each of the tournament factors in a statistically significant way. The difference between the two best performing approaches, i.e. the simulated annealing acceptance criterion with tournament factor 4 and the combination of the late acceptance strategy and steepest descent with tournament factor 8 (LA-SD-8), is not statistically significant.

Table 4 Experimental results of the five algorithms applied to the KAHO problem. Each algorithm was run 10 times, with a computation time of 1 hour. Numbers in italic correspond to the best value obtained with the acceptance criterion

Several tests were performed to determine the ideal list length for the late acceptance criterion. In Fig. 2, the average fitness over 10 runs is plotted against the list length for the two late acceptance strategies. The best list length for both strategies is 10000. Compared to the other move acceptance criteria (i.e. simulated annealing, ‘improving or equal’, and great deluge), the dependence of both late acceptance criteria on this parameter is a disadvantage: the length of the acceptance list clearly influences the algorithm’s performance on the KAHO problem. The experiments also show that the late acceptance strategy combined with steepest descent is less dependent on the list length than the original late acceptance approach.

Fig. 2

Average result after ten runs for different values of the list length (tournament factor 16)

The experiments show that all algorithms, except the late acceptance strategy, generate examination timetables satisfying all the hard and soft constraints. The timetable is organized in 40 time slots, instead of the 48 time slots that were needed by the manual planner.

5.3 Toronto benchmark problems

The constructed feasible initial solution (Sect. 4.2) is locally perturbed by applying only the two heuristics Uncap 1 and Uncap 2. This corresponds to the findings of Thompson and Dowsland (1998), who also solve the examination timetabling problem in two distinct phases (Sect. 2). The Cap i heuristics are not used here since they do not maintain feasibility. Preliminary experiments have shown that once a feasible solution becomes infeasible, it is very hard to make it feasible again using only the Cap i heuristics. The Toronto benchmark data sets only consider spreading out individual students’ exams to the highest possible extent and avoiding student clashes. The Uncap i heuristics appear to be perfectly suited for realizing this:

  • after execution of one of the heuristics the no-clash hard constraint is still satisfied;

  • the heuristics manage to spread out the exams at the same time.

Although these two heuristics maintain feasibility (Sect. 4.1), they are not able to turn an infeasible solution into a feasible one. Therefore, we have opted for constructing a feasible solution in a preceding phase and then improving it further in the improvement phase.

In Tables 5 and 6, the results (validated with the validator from Qu 2010) for the different move acceptance criteria are presented for each of the 13 instances of the Toronto benchmark problems. Each algorithm finishes after one hour of computation and is executed 10 times. For every algorithm, identified by its move acceptance criterion and tournament factor, we present the average fitness, the minimum fitness, and the standard deviation. The best fitness for every instance is indicated in italic.

Table 5 Average and minimum fitness and standard deviation for each of the Toronto instances after 10 runs. The acceptance list length for both acceptance strategies is 500 for all experiments. The stopping criterion is one hour
Table 6 Average and minimum fitness and standard deviation for each of the Toronto instances. The acceptance list length for both acceptance strategies is 500 for all experiments. The stopping criterion is 1 hour

It is important to mention that the parameter setting of the hyperheuristic approach is the same for all the data instances. For instance, the same SA-4 algorithm is applied to every data instance.

The algorithms that performed best in a statistically significant manner are indicated in Table 7. In the last column of Table 7, the corresponding p-value is given. A p-value of 1 means that the first algorithm in the second column performs best according to the average fitness value. The other p-values are calculated against the best performing algorithm. In general, the best performing combination of move acceptance criterion and tournament factor is SA-4. Only for the data instances pur93 and sta83 do other move acceptance criteria perform better.

Table 7 Statistically significant best performing algorithms and corresponding p-values for Tables 5 and 6

In Table 8, we present the results of experiments with longer execution times (ranging from 1 up to 12 hours) of the SA-4 hyperheuristic. In all cases longer running times lead to better results.

Table 8 Average and minimum fitness and standard deviation for the SA-4 algorithm

In Table 9, we compare the hyperheuristics results with the Toronto benchmark results in the literature. We limit the selected results from the literature to those that have been confirmed accurate by Qu et al. (2009). For seven out of the 13 data instances, we obtain better or equal results.

Table 9 Selection of the best results from the literature compared with the best obtained values from the SA-4 hyperheuristics approach (12 hours of computation time)

From Table 9, we can conclude that for the smaller instances, longer running times lead to results of better or equal quality compared to the results from the literature. However, for the rye92 and pur93 data instances, our results are considerably worse than the best results from the literature.

Compared to Kendall and Hussin (2005a), who deploy a more elaborate heuristic selection method and use running times up to four hours, we obtain better results in general, even after one hour of computation. This can possibly be attributed in part to the use of other heuristics and different move acceptance criteria.

In Table 10 both late acceptance strategies are compared with Student t-tests. For the larger data instances (car91, car92, pur93, rye92 and uta92), the adapted version of the late acceptance strategy performs better than the original late acceptance strategy in a statistically significant way. On the other hand, the original late acceptance strategy performs significantly better for the following data instances: ear83, hec92, tre92, ute92 and yor83. For the three remaining data instances, kfu93, lse91 and sta83, the difference between the performance of the two late acceptance strategies is not statistically significant.

Table 10 An X indicates which method performs better in a statistically significant manner. The comparisons are conducted on the results of Tables 5 and 6

5.4 The ITC 2007 examination timetabling problem

Starting from the initial (not necessarily feasible) solution, the hyperheuristic for the ITC 2007 problem first tries to find a feasible solution, and then a feasible solution that satisfies the soft constraints as much as possible. The first part, the search for a feasible solution, is further divided into three stages:

  • In a first stage, starting from the constructed initial solution, the algorithm tries to obtain a solution that satisfies the hard constraints concerning the room occupancy, the period utilization and room related issues (some of the exams can only be assigned to rooms that do not hold any other exams). The period utilization and room occupancy constraints are already satisfied due to the construction of the initial solution.

  • In a second stage, which starts from a solution that satisfies the three hard constraints discussed above, the evaluation of the period related constraint is added to the fitness function. This stage stops when a solution is found that satisfies all four constraints.

  • In the final stage, the last hard constraint (two conflicting exams cannot be assigned to the same time slot) is added to the fitness function. The search finishes when a feasible solution is found.

Once a feasible solution is found, the second part of the search begins: the soft constraints are added to the fitness function and the search starts from the feasible solution obtained in the first part. In order to avoid violations of the hard constraints, the weights corresponding to the hard constraints in the fitness function are increased to 1000, whilst the weights of the soft constraints get the value 1.
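A brief Java sketch of this staged scheme, assuming each hard constraint is available as a violation-counting function; the types and names are illustrative, not the paper’s code.

```java
import java.util.List;
import java.util.function.ToIntFunction;

// Sketch: hard constraint terms are switched on stage by stage. Stage one
// starts with the first three constraints (room occupancy, period
// utilization, room related); a further term is added whenever the current
// stage is solved, until all five hard constraints are active.
class StagedFitness<S> {
    private final List<ToIntFunction<S>> hardTerms;  // violation counters
    private int active = 3;

    StagedFitness(List<ToIntFunction<S>> hardTerms) {
        this.hardTerms = hardTerms;
    }

    int violations(S solution) {
        int v = 0;
        for (int i = 0; i < active; i++) {
            v += hardTerms.get(i).applyAsInt(solution);
        }
        return v;
    }

    void maybeAdvance(S solution) {
        if (violations(solution) == 0 && active < hardTerms.size()) {
            active++;
        }
    }
}
```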

Similarly to the Toronto approach, the ITC 2007 data sets are solved in two phases. In both the feasibility and the improvement phase, we have experimented with hyperheuristics with different move acceptance criteria (see Table 11). Solving the problem in phases is based on the good results obtained by combining different methods in the ITC 2007 literature (Gogos et al. to appear; Müller 2008). Based on the good results obtained with tournament factor 4 in Sects. 5.2 and 5.3, we focus on the same tournament factor value. An overview of the eleven hyperheuristics is presented in Table 12. Some of the hyperheuristics only differ in the amount of computation time that is spent on finding a feasible solution. Hyperheuristics SR-4-HH2, SR-4-HH4, SR-4-HH6, SR-4-HH7, and SR-4-HH9 can spend at most 150 seconds (50% of the available computation time) in the first (feasibility) stage. If no feasible solution is found in this period, the algorithm automatically switches to the improvement phase, whilst at the same time still trying to make the solution feasible. The other hyperheuristics only switch to the improvement phase when a feasible solution is found. From the obtained results and the t-tests (Table 13), no hyperheuristic performs significantly better than the others over all instances. In order to find the overall best performing hyperheuristic, we have ranked the algorithms according to their average and best performance (Table 14). The algorithm that finds the best solution for an instance gets the lowest value, whilst the worst performing algorithm gets the highest value. We have also applied the same ranking mechanism to the average values obtained after 10 runs for each instance. The final overall ranking is based on the average of the rankings over all instances; the algorithm with the overall lowest rank can be considered the best performing algorithm. If we only consider the average fitness values for each instance and for each algorithm, we find that the best performing algorithm is SR-4-HH4. On the other hand, if we consider the minimum fitness values, we find that SR-4-HH11 is the best performing algorithm. These hyperheuristic variants have in common that simulated annealing is the final move acceptance criterion; it is safe to say that simulated annealing performs well in these experiments. Hyperheuristic SR-4-HH11 finds the most feasible solutions (95 out of 120 runs, Table 11), whilst SR-4-HH2 finds the best solutions. For two data sets (instance 4 and instance 11), we do not find any feasible solution. Although we do not improve on the best solutions in the ITC 2007 literature, our solutions are competitive with the finalists’ results (Table 15).

Table 11 Average, minimum, standard deviation and the number of feasible solutions for the 12 data sets obtained by eleven hyperheuristics
Table 12 Description of the 11 move acceptance criteria. The hyperheuristics differ in the move acceptance criteria used in the constructive and the improvement phase and in the available computation time in these two phases
Table 13 The Student t-test p-values for the different hyperheuristics. The algorithms are compared to the best performing algorithm (indicated with value 1). Algorithms that are not significantly worse are indicated in bold
Table 14 Ranking of the hyperheuristic algorithms according to their average and best performance. The lower the rank, the better the algorithm’s performance
Table 15 Comparison of the solutions obtained with the best performing algorithm on average (SR-4-HH4) with the best solutions from the literature. The best solutions are indicated in bold and italics. Note that (Gogos et al. to appear) do not report on the 4 last instances

Our results are in line with the experiments of Bilgin et al. (2007). These authors have conducted numerous experiments with different combinations of move acceptance criteria and heuristic selection methods on several benchmark instances. Their conclusion was that no single combination of acceptance criterion and selection method dominates all other combinations on all benchmarks. In their experiments, ‘improving or equal’ resulted in the best average performance, whilst ‘choice function’ was on average slightly better than the other heuristic selection mechanisms. The choice function heuristic selection method has some built-in memory of how well previously applied heuristics performed: the better they have performed, the higher their selection probability in the next iterations. It is very hard to explain why one move acceptance criterion performs better than another on a particular data instance. In future research, we will take the performance of the different heuristics into account in a heuristic selection mechanism with a built-in learning mechanism.

We would like to conclude this section with some general advice. The competition organizers released a validation tool for verifying the obtained solutions in order to guarantee correctness. Initiatives like this are highly appreciated: providing a validation tool overcomes problems with changing data sets and wrong objective functions, as reported in Qu et al. (2009). Also, the CPU performance tool that was released for the International Timetabling Competition allows a fair comparison among different algorithms, since the allowed computation time is adapted to the performance of the researcher’s hardware. Some of the papers discussing the Toronto benchmarks give computation times, whereas others do not. Even if computation times are presented, it is still a hard task to compare the approaches, since they were executed on different platforms.

6 Conclusions and future work

We have proposed a hyperheuristic approach to examination timetabling problems that successfully tackles a curriculum-based problem from practice as well as two post-enrolment benchmark problems from the literature.

For the examination timetabling problem from practice, the hyperheuristic generates better solutions, in terms of the number of time slots, than those obtained by the manual planner. Moreover, due to the problem’s size and complexity, it takes the manual planner a few weeks to generate a timetable, whereas the hyperheuristic succeeds in substantially improving on the manual schedule in about one hour on an average desktop computer.

We improved on seven out of thirteen data instances from the Toronto benchmarks. Although the improvement of the benchmark instances was not our primary concern, it shows that focusing on problems from practice can result in competitive algorithms. Concerning the ITC 2007 data sets, our results are competitive, although we could not improve on the best results in the literature.

Since the problems are not completely equivalent, different low-level heuristics had to be applied. Note that these low-level heuristics are rather specific: spreading out exams with the Uncap i heuristics, or moving exams into an appropriate room with the Cap i heuristics. The Uncap i heuristics could not beat the Cap i heuristics on the KAHO problem, nor were the Cap i heuristics able to solve the Toronto benchmark instances.

We used an extremely simple heuristic selection method, i.e. one that randomly selects a heuristic from a fixed list of heuristics. In general, it obtains better results than some of the approaches that employ more ‘intelligent’ heuristic selection methods. This raises the question whether further improvement can be obtained by merely looking at more sophisticated heuristic selection methods, or by concentrating on smarter low-level heuristics. There may be a trade-off in that smarter heuristics are required to reveal the power of a more intelligent heuristic selection mechanism.

Also, we have applied the late acceptance strategy, and a variant of it, as a move acceptance criterion in the hyperheuristics framework. For the KAHO problem, the late acceptance variant combined with steepest descent performs as well as the simulated annealing move acceptance criterion.