1 Introduction

In the AI community, it is well known that algorithm parameters have to be tuned to achieve peak performance. Since manual parameter tuning is a tedious and error-prone task, several methods have been proposed in recent years to automatically optimize the parameter configurations of arbitrary algorithms [1,2,3,4,5]. This led to performance improvements in many AI domains, such as propositional satisfiability solving [6], AI planning [7], the traveling salesperson problem [8], set covering [9], mixed-integer programming [10], hyper-parameter optimization of machine learning algorithms [11] and architecture search for deep neural networks [12]. These studies focus either on proposing more efficient automated parameter optimization methods or on improving the performance of a particular algorithm (the so-called target algorithm).

To determine a well-performing parameter configuration of a given algorithm on a given instance set, algorithm configuration procedures (in short configurators) have to collect a lot of empirical performance data. This entails running the algorithm at hand with different parameter configurations on several problem instances to measure its performance. Only the best-performing configuration is returned in the end, and the collected performance data is typically not used further, although it was very expensive to collect.

In this paper, we reuse this data to further analyze all parts involved in the configuration process. This includes the target algorithm, its parameters, the used instances as well as the configurator. Hence, by using our methods in combination with a configurator, users obtain not only a well-performing parameter configuration but also several additional insights.

A potential use case is that algorithm developers have implemented a new algorithm and want to study it empirically and thoroughly as part of a publication. As a first step, they run a configurator to optimize their algorithm's parameters on a given instance set to achieve peak performance. As the next step, our automatic analysis tool CAVE (Configuration Assessment, Visualization and Evaluation) generates figures that can be used directly in a publication.

In CAVE, we build on existing methods to analyze algorithms and instances, ranging from traditional visualization approaches (such as scatter plots) over exploratory data analysis (used, e.g., for algorithm selection [13, 14]) to recent parameter importance analysis methods [15,16,17,18]. We combine all these methods into a comprehensive analysis tool tailored to algorithm development and configuration studies, and propose two new approaches to complement their insights. Specifically, the contributions of our paper are:

  1. We give an overview of different approaches to analyze a target algorithm, a given instance set and the configurator's behavior based on collected empirical data. We use these for an exemplary analysis of the SAT solver spear [19] on a benchmark of SAT-encoded software verification instances.

  2. We propose a new qualitative analysis of configurator footprints based on a recently proposed similarity metric of configurations [20].

  3. We propose a new parameter importance analysis by studying the impact on performance when changing one parameter at a time, thus exploring the immediate neighborhood of the best found configuration.

  4. We provide a ready-to-use toolkit, called CAVE (Footnote 1), for such analyses, which can be directly used in combination with the configurator SMAC [3].

  5. We show the value of our tool and the need for such comprehensive analyses by verifying two common assumptions for algorithm configuration:

     (a) Parameter importance changes depending on the instance set at hand.

     (b) Local and global parameter importance analyses do not necessarily agree with each other and hence complement each other.

2 Related Work

Empirical evaluation of algorithms is as old as computer science. One of the first systematic approaches for ensuring reproducibility and gaining insights when comparing a set of algorithms is the PAVER service [21, 22]. The tool is primarily tailored to mixed-integer programming solvers and provides tables and visualizations of the performance of algorithms. In contrast to PAVER, our tool also considers the parameters of an algorithm and is designed to be used with arbitrary algorithms.

In the context of algorithm selection [23], a lot of performance data of different algorithms is collected, which can be used to analyze algorithms and instance sets. An example of an exploratory data analysis for algorithm selection is part of the algorithm selection library ASlib [13] that provides some simple performance and distribution tables and corresponding plots, e.g., scatter and box plots.

The first system that included an automatic analysis of algorithm performance in the context of algorithm configuration was the high-performance algorithm laboratory HAL [24]. Its main purpose was to help algorithm developers to apply automated algorithm design methods (such as algorithm configuration and selection) in a uniform framework by handling all interfaces, data storage, running experiments and some basic aggregation and visualizations of results, such as scatter and performance distribution plots. In contrast to HAL, CAVE focuses on the analysis part and provides a far more extensive analysis.

The tool SpySMAC [25] followed a similar approach as HAL but was specifically tailored to the needs of the propositional satisfiability (SAT) community. It provided an easy-to-use interface to run the configurator SMAC [3] for optimizing parameter configurations of a SAT solver. Our approach is inspired by SpySMAC, but focuses on the analysis part and extends it substantially. Furthermore, our new approach is no longer specific to SAT and SMAC; it can be applied to any algorithm and configurator.

In the context of black-box optimization and in particular for hyper-parameter optimization of machine learning algorithms, Golovin et al. [26] proposed Google Vizier. Similar to HAL and SpySMAC, it is a service to run optimization benchmarks and also provides some visualizations, such as performance-over-time of the target function or parallel coordinate visualizations [27]. Since Google Vizier focuses on black-box optimization, it does not have a concept of instances, which are an integral part of algorithm configuration.

Lloyd et al. [28] proposed automatically constructed natural language reports of regression models, giving rise to the automatic statistician tool. Although we share the goal of providing automatically constructed reports that help users get more out of their data, we do not aim at natural language reports but leave the interpretation of the results to the users.

3 Generation of Algorithm Configuration Data

In this section, we describe the general work-flow of generating algorithm configuration data (see Fig. 1), which will be the input for CAVE's analyses in the next section. The typical inputs of configurators are (Footnote 2):

Fig. 1. Work-flow of algorithm configuration (AC) and analysis.

  • a target algorithm \(A\) with a description of the parameter configuration space \(\mathbf {\Theta }\), i.e., all possible parameter configurations,

  • an instance set \(\varPi \) drawn from a distribution \(\mathcal {D}_\varPi \) over all possible instances (Footnote 3),

  • a cost metric \(c: \mathbf {\Theta }\times \varPi \rightarrow \mathbb {R}\) to be optimized (e.g., the runtime of an algorithm, the quality of a plan for an AI planning problem instance or the error of a machine learning algorithm).

A configurator’s task is to find \(\mathbf {\theta }^* \in \mathbf {\Theta }\) such that the cost c of running \(A\) using \(\mathbf {\theta }^*\) across \(\varPi \) is optimized, i.e., \(\mathbf {\theta }^* \in {{\,\mathrm{arg\,min}\,}}_{\mathbf {\theta }\in \mathbf {\Theta }} \sum _{\pi \in \varPi } c(\mathbf {\theta }, \pi )\). Since a configurator iteratively improves the currently best configuration \(\hat{\mathbf {\theta }}\) (called incumbent), its trajectory includes the incumbent \(\hat{\mathbf {\theta }}_i\) at each time point \(t_i\).

In this process, the configurator follows a strategy to explore the joint space of configurations and instances, e.g., local search [1], genetic algorithms [2] or model-based search [3], and it collects empirical cost data along the way. Internally, a configurator keeps track of the set \(\{(\mathbf {\theta }_j,\pi _j,c_j)\}_{j=1}^N\) of all evaluated configurations \(\mathbf {\theta }_j\) on instances \(\pi _j\) with the corresponding costs \(c_j\), called the runhistory. The runhistory is an optional output, but it is crucial for CAVE's analyses. If several runs of a configurator were performed, CAVE can use all resulting outputs as input.
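To make the structure of this data concrete, the following is a minimal sketch of how a runhistory and trajectory could be represented in Python. The type and field names are our own illustration and do not correspond to SMAC's or CAVE's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical, simplified data structures; names are illustrative only.
Configuration = Dict[str, Union[int, float, str]]  # parameter name -> value

@dataclass
class RunRecord:
    config: Configuration  # evaluated configuration theta_j
    instance: str          # instance pi_j the configuration was run on
    cost: float            # observed cost c_j (e.g., runtime in seconds)

@dataclass
class Trajectory:
    times: List[float]               # time point t_i of each incumbent change
    incumbents: List[Configuration]  # incumbent configuration at t_i

def aggregate_cost(runhistory: List[RunRecord], config: Configuration) -> float:
    """Mean cost of a configuration over the instances it was evaluated on."""
    costs = [r.cost for r in runhistory if r.config == config]
    return sum(costs) / len(costs) if costs else float("inf")
```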

To guide the search of configurators, a recent approach is to use empirical performance models (EPMs) \(\hat{c}: \mathbf {\Theta }\times \varPi \rightarrow \mathbb {R}\) that use the observed empirical data from the runhistory to predict the cost of new configuration-instance pairs [29], e.g., using random forests [30]. Based on these predictions, a configurator can decide whether to explore new regions of the configuration space or to exploit knowledge of presumably well-performing regions. Model-based configurators [3, 4] can return these EPMs directly, and for model-free configurators [1, 5], such a model can be learned afterwards from the returned runhistories. To this end, an optional input to configurators is a set of instance features providing numerical descriptions of the instances at hand [29, 31]. Hutter et al. [29] showed that predictions of new cost data are fairly accurate if enough training data is provided. Further, Eggensperger et al. [32] showed that EPMs trained on runhistory data from configurator runs are good surrogates for real target algorithm runs. Thus, we also use EPMs trained on the union of all runhistories for our analyses, e.g., to impute missing cost data of configurations that were evaluated only on some but not all instances.
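As an illustration of the EPM idea, the sketch below fits a random forest on numerically encoded configuration-instance pairs using scikit-learn. It is a simplified stand-in, assuming configurations have already been encoded as numerical vectors; it is not the random forest variant used by SMAC or CAVE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_epm(X_config: np.ndarray, X_features: np.ndarray, costs: np.ndarray):
    """Fit a simple EPM c_hat(theta, pi) on observed runhistory data.

    X_config   : (N, d_theta) numerically encoded configurations
    X_features : (N, d_pi)    instance features of the paired instances
    costs      : (N,)         observed costs (runtimes are often log-transformed)
    """
    X = np.hstack([X_config, X_features])
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, np.log10(costs))  # predict on a log scale, common for runtimes
    return model

def predict_cost(model, config_vec: np.ndarray, features: np.ndarray) -> float:
    """Predict the cost of a new configuration-instance pair."""
    x = np.hstack([config_vec, features]).reshape(1, -1)
    return float(10 ** model.predict(x)[0])
```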

Our tool CAVE analyzes all this data as described in the next section. Thereby, we use an extended version of the output format defined for the second edition of the algorithm configuration library AClib [33], such that CAVE can in principle be used with any configurator. Right now, we have a ready-to-use implementation in combination with the configurator SMAC [3].

4 Analyzing Algorithm Configuration Data

In this section, we give a brief overview of all components of our analysis report generated based on the trajectory, runhistory data and EPMs described in the last section. A detailed description of the individual elements of CAVE can be found in the online appendix (Footnote 4). As a running example, we show figures for studying the SAT solver spear [19] on SAT instances encoding software verification problems based on three 2-day SMAC runs (Footnote 5). In addition to the data generated by SMAC, we validated all incumbent configurations to decrease the uncertainty of our EPM in the important regions of the configuration space.

4.1 Performance Analysis

The performance analysis of CAVE mainly supports different ways of analyzing the final incumbent and the performance of the algorithm's default parameter configuration (the results obtained by downloading the algorithm and running it with its default parameter settings). In particular, the performance analysis part of CAVE consists of a tabular analysis providing aggregated performance values across all instances, scatter plots showing default performance vs. optimized performance for each instance (Fig. 2a), empirical cumulative performance distribution (eCDF) plots across the instance set (Fig. 2b) and algorithm footprint plots [14] (Fig. 3a).
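For illustration, a minimal sketch of how such scatter and eCDF plots could be produced with matplotlib is given below; timeout handling, axis limits and the styling of CAVE's actual plots are omitted, and the function names are our own.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatter_default_vs_incumbent(default_costs, incumbent_costs, cutoff=300.0):
    """Per-instance default vs. optimized runtimes on log-log axes."""
    fig, ax = plt.subplots()
    ax.scatter(default_costs, incumbent_costs, alpha=0.6)
    lo = min(min(default_costs), min(incumbent_costs))
    ax.plot([lo, cutoff], [lo, cutoff], "k--", linewidth=1)  # diagonal: equal performance
    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlabel("default runtime [s]"); ax.set_ylabel("optimized runtime [s]")
    return fig

def ecdf_plot(costs_per_config: dict):
    """Empirical CDF of runtimes across the instance set, one curve per configuration."""
    fig, ax = plt.subplots()
    for label, costs in costs_per_config.items():
        xs = np.sort(np.asarray(costs))
        ys = np.arange(1, len(xs) + 1) / len(xs)
        ax.step(xs, ys, where="post", label=label)
    ax.set_xscale("log")
    ax.set_xlabel("runtime [s]"); ax.set_ylabel("fraction of solved instances")
    ax.legend()
    return fig
```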

Fig. 2. Comparison of the empirical performance of the default and final incumbent.

What have we learned about spear? Figure 2 shows the scatter plot and the eCDF for spear on the software verification instances. From these plots, we learn that the performance of spear was not improved on all instances, but on many of them, with large speedups on some. The optimized configuration solved all instances in at most 20 seconds, while the default led to many timeouts. Based on the eCDF plot, the optimized configuration solved all instances, whereas the default configuration solved only \(80\%\) of them. Furthermore, for cutoff times of spear larger than 0.8 seconds, the optimized configuration performed better; the blue curve is consistently above the red curve. The algorithm footprint plot (Fig. 3a) shows that the incumbent performed well on different types of instances. Compared to the scatter plot, we expected to see more red points in the footprint plot (instances where the default outperforms the incumbent). Looking more deeply into the data revealed that several of these points overlap because the instance features are missing for these instances.

4.2 Instance and Instance Feature Analysis

If instances are characterized by instance features (as done for model-based configurators), these features can be studied to better understand an instance set. To obtain instance features for the SAT instances at hand, we used the instance feature generator that accompanies the algorithm selection tool SATzilla [29, 34]. In particular, the feature analysis part of CAVE consists of box and violin plots for each instance feature, clustering plots based on a PCA projection into a 2-dimensional feature space, correlation heatmaps for each pair of features (Fig. 3b), and feature importance plots based on greedy forward selection [15].
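The sketch below illustrates the idea behind the clustering and correlation plots, assuming a plain feature matrix, standard PCA, k-means with a hand-picked number of clusters and a Pearson correlation matrix; CAVE's actual choices (e.g., how the number of clusters is selected) may differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def feature_space_overview(feats: np.ndarray, n_clusters: int = 3):
    """Project instance features to 2D via PCA, cluster instances and show feature correlations."""
    X = StandardScaler().fit_transform(feats)        # feats: (n_instances, n_features)
    X2 = PCA(n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X2)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="viridis", s=15)
    ax1.set_title("PCA projection with k-means clusters")

    corr = np.corrcoef(X, rowvar=False)              # feature-feature correlations
    im = ax2.imshow(corr, vmin=-1, vmax=1, cmap="RdBu")
    ax2.set_title("feature correlation matrix")
    fig.colorbar(im, ax=ax2)
    return fig
```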

Fig. 3. (a) Green dots indicate instances on which the incumbent configuration performed well, i.e., at most \(5\%\) worse than the oracle performance of default and incumbent. All other instances are plotted as red dots, indicating that the default configuration performed well. Instances might be mapped closely together in the reduced 2D space. (b) The correlation matrix of all features.

What have we learned about the software verification instances? Based on these plots, we learned that there are at least three instance clusters mixed together; knowing the source of these instances (also reflected in the instance names), we can verify that software verification problems of four different software tools were encoded in this instance set and that the clustering approximately recovered these sources (merging two sources into one cluster). Since the PCA plot and the footprint plot (Fig. 3a) indicate that the instance set is heterogeneous [35], using algorithm configuration on the instances of each individual software tool, or per-instance algorithm configuration [36, 37], could improve the performance of spear even further. From the feature correlation plot, we see that roughly half of the features are highly correlated and some of the features could potentially be dropped.

4.3 Configurator Behavior

Besides insights into the algorithm and instances at hand, the trajectory and the runhistory returned by a configurator also allow for insights into how the configurator tried to find a well-performing configuration. This may lead to insights into how the optimization process could be adjusted. In particular, the configurator behavior analysis consists of a plot showing the performance of the target algorithm over the time spent by the configurator (Fig. 4a) and parallel coordinate plots showing the interactions between parameter settings and the performance of the target algorithm [26] (Fig. 4c).
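For illustration, a minimal sketch of a cost-over-time plot is given below, assuming the incumbent costs and quartile bounds over multiple configurator runs have already been estimated (e.g., via the EPM); the function name and inputs are our own.

```python
import matplotlib.pyplot as plt

def cost_over_time(times, mean_costs, lower, upper):
    """Estimated incumbent cost over configuration time, with a quantile band.

    times       : incumbent change times of the configurator (seconds)
    mean_costs  : estimated cost of the incumbent at each time point
    lower/upper : e.g., first/third quartiles over multiple configurator runs
    """
    fig, ax = plt.subplots()
    ax.step(times, mean_costs, where="post", color="tab:blue")
    ax.fill_between(times, lower, upper, step="post", alpha=0.3, color="tab:blue")
    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlabel("configuration budget [s]")
    ax.set_ylabel("estimated cost of incumbent")
    return fig
```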

Fig. 4. (a) depicts the predicted performance over time. The blue area gives the first and third quartiles over three configurator runs. (b) Dots represent sampled configurations, where the size reflects the number of times a configuration was evaluated. The background shows the predicted performance in the 2D space obtained via MDS. Incumbents are plotted as red squares, the default as an orange triangle and the final incumbent as a red inverted triangle. Configurations sampled from an acquisition function are marked with an X; all other configurations were sampled uniformly at random. (c) Parallel coordinate plot for the three most important parameters with a subsampled set of 500 configurations. The best-performing configuration has the brightest shade of red, whereas the darkest shade of blue depicts the worst observed configuration.

Configurator Footprint. As a novel approach, we propose in this paper to study how a configurator iteratively sampled parameter configurations, i.e., all configurations in the runhistory; see Algorithm 1 and, for an example, Fig. 4b. The approach is based on a similarity metric for parameter configurations (Footnote 6), which is used in a multi-dimensional scaling (MDS) [38] based on the SMACOF algorithm [39] to obtain a non-linear mapping into two dimensions [20]. We extend this analysis by highlighting the incumbent configurations in the trajectory and by scaling the dots (parameter configurations) with respect to the number of instances they were evaluated on. For racing-based configurators, this corresponds to how well a configuration performed compared to the current best found configuration. Furthermore, we use an EPM in the 2D space to highlight promising parts of the configuration space. Finally, the figures in the HTML report have a mouse-over effect that shows the parameter configuration corresponding to each dot in the figure.

Algorithm 1.
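A minimal sketch of the embedding step is given below, using scikit-learn's SMACOF-based MDS on a precomputed configuration distance matrix; the actual similarity metric of [20] and CAVE's exact scaling of dot sizes are not reproduced here.

```python
import numpy as np
from sklearn.manifold import MDS

def configurator_footprint(dist_matrix: np.ndarray, n_evals: np.ndarray):
    """Embed configurations into 2D from a pairwise dissimilarity matrix.

    dist_matrix : (n_configs, n_configs) pairwise configuration distances,
                  e.g., a Hamming/Gower-style metric over parameter values
    n_evals     : number of instances each configuration was evaluated on,
                  later used to scale the dot sizes in the plot
    """
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(dist_matrix)  # SMACOF-based non-linear embedding
    sizes = 10 + 40 * n_evals / n_evals.max()
    return coords, sizes
```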

What have we learned about the configurator and spear? From the cost-over-time plot (Fig. 4a), we learned that SMAC already converged after 40,000 seconds and that investing further time did not improve spear's performance. From the configurator footprint (Fig. 4b), we see that SMAC covered most parts of the space (because every second parameter configuration evaluated by SMAC is a random configuration) but also focused on promising areas. The fraction of good configurations is fairly large, which also explains why SMAC was able to find a well-performing configuration early on. The parallel coordinate plot (Fig. 4c) reveals that sp-var-dec-heur should be set to 16 for peak performance; however, the runtime also depends on other parameters, such as sp-phase-dec-heur.

4.4 Parameter Importance

Besides obtaining a well-performing parameter configuration, the most frequently asked question by developers is which parameters are important to achieve better performance. This question is addressed in the field of parameter importance. Existing approaches for parameter importance analysis (used in AC) include greedy forward selection [15], ablation analysis [17, 18] and functional ANOVA [16].
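To make the contrast with our local analysis below concrete, the following is a rough sketch of the greedy ablation idea [17, 18], here driven by an EPM-based cost predictor; `epm_predict` is an assumed callable and the code is our own simplification, not the reference implementation.

```python
def ablation_path(epm_predict, default: dict, incumbent: dict):
    """Greedy ablation path from the default to the incumbent configuration.

    In each round, flip the single parameter (from its current value to the
    incumbent's value) that yields the lowest predicted cost, then fix it.
    epm_predict: callable mapping a configuration dict to a predicted cost,
                 marginalized over the instance set.
    """
    current = dict(default)
    remaining = [p for p in incumbent if incumbent[p] != default.get(p)]
    path = [(None, epm_predict(current))]
    while remaining:
        candidates = []
        for p in remaining:
            cand = dict(current)
            cand[p] = incumbent[p]           # flip only parameter p
            candidates.append((epm_predict(cand), p, cand))
        cost, best_p, current = min(candidates, key=lambda t: t[0])
        remaining.remove(best_p)
        path.append((best_p, cost))
    return path  # ordered list of (flipped parameter, predicted cost)
```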

Fig. 5. (a) Relative parameter importance values for the most important parameters of each method. The parameters are ordered by fANOVA's importance values. (b) Exemplary LPI plot. The shaded area is the model uncertainty.

Local Parameter Importance (LPI). In addition to the existing approaches, we propose a new parameter importance analysis approach, which we dub LPI. It is inspired by the human strategy of looking for further improved parameter configurations, or of understanding the importance of parameter changes, in the neighborhood of a given parameter configuration. For example, most users are interested in understanding which parameters of an optimized parameter configuration are crucial for the achieved performance.

Using an EPM, we study performance changes of a configuration along each parameter. To quantify the importance of a parameter p at a configuration \(\mathbf {\theta }\), we compute the variance of the predicted cost obtained by varying its value \(\mathbf {\theta }_p\) and normalize it by the sum of these variances over all parameters. Given the parameter space of the target algorithm \(\mathbf {\Theta }\), a set of parameters P and an EPM \(\hat{c}:\mathbf {\Theta }\rightarrow \mathbb {R}\) marginalized over \(\varPi \), the local importance of parameter \(p\in P\) with domain \(\mathbf {\Theta }_p\) is given by the fraction of the variance caused by p over the sum of all variances:

$$\begin{aligned} LPI(p \mid \theta ) = \frac{\mathrm {Var}_{v\in \mathbf {\Theta }_p}\hat{c}(\mathbf {\theta }\left[ \mathbf {\theta }_p=v\right] )}{\sum _{p'\in P}\mathrm {Var}_{w\in \mathbf {\Theta }_{p'}}\hat{c}(\mathbf {\theta }\left[ \mathbf {\theta }_{p'}=w\right] )} \end{aligned}$$

Compared to an ablation analysis, our new analysis is even more local, since it solely focuses on a single parameter configuration. Nevertheless, it is also similar to fANOVA, since we quantify the importance of a parameter by studying the variance of performance changes. However, in contrast to fANOVA, LPI does not marginalize over all other parameter settings.
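A rough sketch of how LPI could be computed with a trained EPM is given below; `epm_predict` and the discretization of continuous parameter domains into finite grids are assumptions of this illustration, not part of CAVE's API.

```python
import numpy as np

def local_parameter_importance(epm_predict, incumbent: dict, param_domains: dict) -> dict:
    """LPI of each parameter in the neighborhood of a single configuration.

    epm_predict   : callable mapping a configuration dict to a predicted cost,
                    already marginalized over the instance set
    incumbent     : the configuration theta to analyze
    param_domains : parameter name -> iterable of candidate values (for
                    continuous parameters, e.g., a grid over the range)
    """
    variances = {}
    for p, domain in param_domains.items():
        costs = []
        for v in domain:
            neighbor = dict(incumbent)
            neighbor[p] = v                  # change only parameter p
            costs.append(epm_predict(neighbor))
        variances[p] = float(np.var(costs))
    total = sum(variances.values())
    return {p: (var / total if total > 0 else 0.0) for p, var in variances.items()}
```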

What have we learned about spear? Figure 5a gives an example of LPI-estimated parameter importance and contrasts it with fANOVA and ablation results. The ablation analysis reveals that the most important parameter change was setting sp-var-dec-heur to 16 instead of its default of 0. Only a few of the 25 other parameters were important for the performance improvement. Overall, fANOVA and LPI agree with this assessment. However, LPI and ablation assign a much larger importance to sp-var-dec-heur than fANOVA does; this is in part due to the fact that fANOVA (in contrast to LPI and ablation) considers higher-order interaction effects as important, but also in part due to fANOVA being a global method and LPI and ablation being local methods.

5 Exemplary Study of Parameter Importance using CAVE

In this section we will show that CAVE enables comprehensive studies. We will study two assumptions in the field of algorithm configuration regarding parameter importance:

Q1: Does the set of important parameters change depending on the instance set? If not, parameter importance of a given algorithm only has to be studied once, and warmstarting methods for configurators [40] should perform quite well in general. Otherwise, parameter importance studies would be required again on each new instance set, and warmstarting methods should perform quite poorly.

Q2: Do local and global parameter importance approaches agree on the set of important parameters? The common assumption is that local and global parameter importance analyses are complementary: a globally important parameter may not be important in a local analysis since it might already be set to a well-suited value in a specific configuration.

Setup: To study these two questions, we ran (capped) fANOVA [16] (Footnote 7), ablation analysis [18] and our newly proposed LPI analysis through CAVE for different algorithms from different domains on several instance sets; see Table 1. All these benchmarks are part of the algorithm configuration library [33] (Footnote 8). We note that clasp and probSAT were the respective winners on these instance sets in the Configurable SAT Solver Challenge 2014 [6]. To collect the cost data for training the required EPMs, we ran SMAC (v3 0.7.1) (Footnote 9) ten times on each of the benchmarks using a compute cluster with nodes equipped with two Intel Xeon E5-2630v4 CPUs and 128 GB memory, running CentOS 7.

Table 1. Algorithm Configuration Library benchmarks, with #P the number of parameters of each algorithm.

Metric: For both questions, we are interested in the sets of important parameters. To this end, we define a parameter to be important if it is responsible for at least \(5\%\) cost improvement (ablation) or explains at least \(5\%\) of the variance in the cost space (fANOVA, LPI). To compare the sets of important parameters for two instance sets (Q1) or for ablation, fANOVA and LPI (Q2), we report the ratio of the size of the intersection to the size of the union of the two sets. For example, if the two sets are disjoint, this metric returns \(0\%\); if they are identical, the score is \(100\%\). In Tables 2 & 3, we show the averaged results for each solver across all instance sets; for Q1 we averaged over all pairs of instance sets and for Q2 we averaged over all instance sets.
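As a small illustration of this overlap metric, a sketch is given below; the parameter names in the usage comment are only examples.

```python
def importance_overlap(important_a: set, important_b: set) -> float:
    """Percentage overlap: |intersection| / |union| of two sets of important parameters."""
    if not important_a and not important_b:
        return 100.0  # two empty sets are treated as identical
    return 100.0 * len(important_a & important_b) / len(important_a | important_b)

# Example: one shared parameter out of two in total -> 50.0
# importance_overlap({"sp-var-dec-heur"}, {"sp-var-dec-heur", "sp-phase-dec-heur"})
```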

Table 2. Q1: Comparison of fANOVA/ablation/LPI results across different instance sets. The values show the percentage of how often a method determined the same parameters to be important on a pair of instance sets.
Table 3. Q2: Comparison of results obtained with different importance methods, on the same instance sets. The values show the percentage of how often two methods agreed on the set of most important parameters.

Q1: Parameter Importance across different Instance Sets. As shown in the right part of Table 2, the overlap of important parameters on pairs of instance sets is often quite small. Hence, whether a parameter is important or not depends on the instance set. Surprisingly, the results of ablation and fANOVA are similar in this respect. The results also indicate that for some algorithms (e.g., probSAT), a subset of the important parameters is constant across all considered instance sets. This supports the results on warmstarting of configurators [40], but also shows that warmstarting will potentially fail for some algorithms, e.g., clasp-(RAND), CPLEX and SATenstein.

Q2: Comparison of Local and Global Parameter Importance. As shown in Table 3, fANOVA and ablation do not agree on the set of important parameters for most algorithms. Only for lpg and clasp-(RAND) do both parameter importance approaches return some overlapping parameters, i.e., more than a third of the parameters are on average important according to both approaches. LPI results tend to agree more with ablation results, but there is also some overlap with capped fANOVA. Thus, local and global parameter importance analyses are not redundant and indeed provide different views on the importance of parameters.

6 Discussion and Conclusion

Algorithm configurators generate plenty of data that is full of potential to learn more about the algorithm or instance set at hand as well as the configurator itself. However, this potential so far remains largely untapped. CAVE provides users with the opportunity to broaden their understanding of the algorithm they want to inspect, by automatically generating comprehensive reports as well as insightful figures. We also introduced two new analysis approaches: configurator footprints and local parameter importance analysis.

We demonstrated the usefulness of such an automatic tool by using it to verify the assumption that local and global parameter importance are complementary and to demonstrate that important parameters depend on the examined set of instances.

CAVE could be further extended in many ways. In particular, we plan to analyze instance sets for their homogeneity [35]; to this end, CAVE could recommend that users apply per-instance algorithm configuration methods [36, 45] instead of conventional configurators if the instance set is strongly heterogeneous. We also plan to improve the uncertainty estimates of CAVE's EPM by replacing the random forest models with quantile regression forests [46], as shown by Eggensperger et al. [32].