
1 Introduction

This report describes the results of the 2024 edition of SV-COMP and extends the series of previous competition reports. We also list important processes and rules, and give insights into some aspects of the competition. The 13th Competition on Software Verification (https://sv-comp.sosy-lab.org/2024) is again the largest comparative evaluation ever in this area. The objectives of the competition were discussed earlier (objectives 1-4 [22]) and extended over the years (objectives 5-6 [23]):

  1. provide an overview of the state of the art in software-verification technology and increase visibility of the most recent software verifiers,

  2. establish a repository of software-verification tasks that is publicly available for free use as standard benchmark suite for evaluating verification software,

  3. establish standards that make it possible to compare different verification tools, including a property language and formats for the results,

  4. accelerate the transfer of new verification technology to industrial practice by identifying the strengths of the various verifiers on a diverse set of tasks,

  5. educate PhD students and others on performing reproducible benchmarking, packaging tools, and running robust and accurate research experiments,

  6. provide research teams that do not have sufficient computing resources with the opportunity to obtain experimental results on large benchmark sets, and

  7. conserve tools for formal methods for later reuse by using a standardized format to announce archives (via DOIs), default options, contacts, competition participations, and other meta data in a central repository.

The SV-COMP 2020 report [23] discusses the achievements of the SV-COMP competition so far with respect to these objectives.

Related Competitions. SV-COMP is one of many competitions that measure the progress of research in the area of formal methods [15]. Competitions can lead to fair and accurate comparative evaluations because of the involvement of the developing teams. The competitions most related to SV-COMP are RERS [80], VerifyThis [65], Test-Comp [28], and TermCOMP [73]. A previous report [23] provides a more detailed discussion.

Quick Summary of Changes. While we try to keep the setup of the competition stable, there are always improvements and developments. For the 2024 edition, the following changes were made:

  • New verification tasks were added; the number of tasks in the C category increased from 23 805 in 2023 to 30 300 in 2024.

  • Tool archives are now uploaded to Zenodo, instead of GitLab, and the meta data about the tools are hosted and maintained in the Repository for Formal-Methods Tools (https://gitlab.com/sosy-lab/benchmarking/fm-tools).

  • The improved witness format version 2.0 [7] (which is based on YAML instead of GraphML) was used for the first time (a minimal sketch of such a witness is shown after this list).

  • The scoring schema for the witness validators [44] was changed based on the 2023 community meeting in Paris.
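
To give a flavor of the new format, the following is a minimal, purely illustrative sketch of a correctness witness in the YAML-based format 2.0; the field names and values shown here are illustrative only and may be incomplete or deviate in detail, the authoritative definition of the format is given in [7].

    # Illustrative correctness witness in format 2.0 (field names may be incomplete; see [7])
    - entry_type: invariant_set
      metadata:
        format_version: "2.0"
        producer:
          name: "ExampleVerifier"      # hypothetical tool name
          version: "1.0"
      content:
        - invariant:
            type: loop_invariant
            location:
              file_name: "example.c"   # hypothetical input program
              line: 12
            value: "x >= 0"            # invariant expressed as a C expression
            format: c_expression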

2 Organization, Definitions, Formats, and Rules

Procedure. The overall organization of the competition did not change in comparison to the earlier editions [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27]. SV-COMP is an open competition (also known as comparative evaluation), where all verification tasks are known before the submission of the participating verifiers, which is necessary due to the complexity of the C language. The procedure is partitioned into the benchmark submission phase, the training phase, and the evaluation phase. The participants received the results of their verifier continuously via e-mail (for preruns and the final competition run), and the results were publicly announced on the competition web site after the teams inspected them.

Competition Jury. Traditionally, the competition jury consists of the chair and one member of each participating team; the team-representing members circulate every year after the candidate-submission deadline. This committee reviews the competition contribution papers and helps the organizer with resolving any disputes that might occur (cf. competition report of SV-COMP 2013 [17]). The tasks of the jury were described in more detail in the report of SV-COMP 2022 [26]. The team representatives of the competition jury are listed in Table 5.

Scoring Schema and Ranking. The scoring schema of SV-COMP 2024 was the same as for SV-COMP 2021. Table 1 provides an overview and Fig. 1 visually illustrates the score assignment for the reachability property as an example. As before, the rank of a verifier was decided based on the sum of points (normalized for meta categories). In case of a tie, the rank was decided based on success run time, which is the total CPU time over all verification tasks for which the verifier reported a correct verification result. Opt-out from Categories and Score Normalization for Meta Categories was done as described previously [17, page 597].

Table 1. Scoring schema for SV-COMP 2024 (unchanged from 2021 [24])
Fig. 1. Visualization of the scoring schema for the reachability property (unchanged from 2021 [24])
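
As a hedged illustration of how the schema translates into points, assume the values commonly used in recent editions (+2 for a confirmed correct proof, +1 for a confirmed correct alarm, -16 for a false alarm, -32 for a wrong proof); the authoritative values are those in Table 1. A verifier that correctly and confirmedly solves 100 tasks with expected verdict true and 50 tasks with expected verdict false, but produces one false alarm and one wrong proof, obtains

$$ 100 \cdot 2 + 50 \cdot 1 - 16 - 32 = 202 \text{ points}. $$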

License Requirements. Starting in 2018, SV-COMP has required that the verifier be publicly available for download and have a license that

  (i) allows reproduction and evaluation by anybody (incl. results publication),

  (ii) does not restrict the usage of the verifier output (log files, witnesses), and

  (iii) allows (re-)distribution of the unmodified verifier archive via SV-COMP repositories and archives.

Task-Definition Format 2.0. SV-COMP 2024 used the task-definition format in version 2.0. More details can be found in the report for Test-Comp 2021 [25].
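
For illustration, a task definition in format 2.0 could look as follows; the file names, expected verdict, and data model are hypothetical, and the authoritative definition of the format is given in the Test-Comp 2021 report [25].

    # Hypothetical task definition in format 2.0 (illustrative values only)
    format_version: '2.0'
    input_files: 'example.c'                  # program to be verified
    properties:
      - property_file: ../properties/unreach-call.prp
        expected_verdict: false               # the program is expected to violate the property
    options:
      language: C
      data_model: ILP32                       # data model assumed for this task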

Fig. 2. Category structure for SV-COMP 2024 (changed from 2023)

Properties. Please see the 2015 competition report [19] for the definition of the properties and the property format. All specifications used in SV-COMP 2024 are available in the directory c/properties/ of the benchmark repository.
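
As an example, the reachability property used in category ReachSafety (file unreach-call.prp) specifies that no call to the function reach_error is reachable from the initial states of main:

    CHECK( init(main()), LTL(G ! call(reach_error())) )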

Categories. The community significantly extended the benchmark set for SV-COMP 2024. The (updated) category structure of SV-COMP 2024 is shown in Fig. 2. We refer to the previous reports for a description and mention only the changes here: Compared to SV-COMP 2023, we added two new sub-categories ReachSafety-Hardness and ReachSafety-Fuzzle to main category ReachSafety. We restructured main category SoftwareSystems as follows: We removed sub-categories SoftwareSystems-BusyBox-ReachSafety, SoftwareSystems-BusyBox-MemSafety, and SoftwareSystems-OpenBSD-MemSafety, and added sub-categories SoftwareSystems-coreutils-MemSafety, SoftwareSystems-coreutils-NoOverflows, SoftwareSystems-Other-ReachSafety, and SoftwareSystems-Other-MemSafety. The categories are also listed in Tables 8, 9, and 10, and described in detail on the competition web site (https://sv-comp.sosy-lab.org/2024/benchmarks.php).

Fig. 3. Benchmarking components of SV-COMP and the competition’s execution flow (same as for SV-COMP 2020, except that we now download the tool archives from Zenodo instead of GitLab)

Table 2. Publicly available components for reproducing SV-COMP 2024
Table 3. Artifacts published for SV-COMP 2024

Reproducibility. SV-COMP results must be reproducible, and consequently, all major components are maintained in public version-control repositories. An overview of the components is provided in Fig. 3, and the details are given in Table 2. We refer to the SV-COMP 2016 report [20] for a description of all components of the SV-COMP organization. The competition artifacts are published at Zenodo (see Table 3) to guarantee their long-term availability and immutability.

Table 4. Validation: Witness validators and witness linter

Competition Workflow. The workflow of the competition is described in the report for Test-Comp 2021 [25] (SV-COMP and Test-Comp use a similar workflow). For a description of how to reproduce single verification runs and a trouble-shooting guide, we refer to the 2022 report [26, Sect. 3].

Table 5. Verification: Participating verifiers with tool references and representing jury members; separate symbols mark first-time participants, hors-concours participants, and verifiers that were not able to qualify

3 Participating Verifiers and Validators

The participating verification systems are listed in Table 5. The table contains the verifier name (with hyperlink), references to papers that describe the systems, the representing jury member and the affiliation. The listing is also available on the competition web site at https://sv-comp.sosy-lab.org/2024/systems.php. Table 6 lists the algorithms and techniques that are used by the verification tools, and Table 7 gives an overview of commonly used solver libraries and frameworks.

Validation of Verification Results. The validation of the verification results was done by 17 validation tools (16 proper witness validators and one witness linter for syntax checks), which are listed in Table 4, including references to the literature. The 16 witness validators were evaluated based on all verification witnesses that were produced in the verification track of the competition.

Hors-Concours Participation. As in previous years, we also included verifiers in the evaluation that did not actively compete or that should not appear in the rankings for various reasons (e.g., meta verifiers based on other competing tools, or tools for which the submitting teams were not sure whether they show the full potential of the tool). These participations are called hors concours, as they cannot participate in rankings and cannot “win” the competition. Those verifiers are marked as ‘hors concours’ in Table 5 and other tables, and their names are annotated with a dedicated symbol.


4 Results of the Verification Track

The results of the competition represent the state of the art of what can be achieved with fully automatic software-verification tools on the given benchmark set. We report the effectiveness (number of verification tasks that can be solved and correctness of the results, as accumulated in the score) and the efficiency (resource consumption in terms of CPU time). The results are presented in the same way as in previous years, such that improvements compared to earlier editions are easy to identify. The results presented in this report were inspected and approved by the participating teams.

Quantitative Results. Tables 8 and 9 present the quantitative overview of all tools and all categories. Due to the large number of tools, we need to split the presentation into two tables, one for the verifiers that participate in the rankings (Table 8) and one for the hors-concours verifiers (Table 9). The head row mentions the category, the maximal score for the category, and the number of verification tasks. The verification tasks consist of tasks with expected verdict True, expected verdict False, and tasks that are Void (tasks that were excluded from scoring by the jury). The tools are listed in alphabetical order; every table row lists the scores of one verifier. We indicate the top three candidates by formatting their scores in bold face and in a larger font size. An empty table cell means that the verifier opted out of the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic; Deagle was disqualified by the jury, with details on the web site). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site (https://sv-comp.sosy-lab.org/2024/results) and in the results artifact (see Table 3).

Table 10 reports the top three verifiers for each category. The run time (column ‘CPU Time’) refers to successfully solved verification tasks (column ‘Solved Tasks’). We also report the number of tasks for which no witness validator was able to confirm the result (column ‘Unconf. Tasks’). The columns ‘False Alarms’ and ‘Wrong Proofs’ report the number of verification tasks for which the verifier reported wrong results, i.e., reporting a counterexample when the property holds (incorrect False) and claiming that the program fulfills the property although it actually contains a bug (incorrect True), respectively.

Fig. 4. Quantile functions for category C-Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by correct verification runs below a certain run time (y-coordinate). More details were given previously [17]. A logarithmic scale is used for the time range from 1 s to 1000 s, and a linear scale is used for the time range between 0 s and 1 s.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [17, 40] because these visualizations make it easier to understand the results of the comparative evaluation. The results archive (see Table 3) and the web site (https://sv-comp.sosy-lab.org/2024/results) include such a plot for each (sub-)category. As an example, we show the plot for category C-Overall (all verification tasks) in Figure 4. A total of 16 verifiers participated in category C-Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [17]). A more detailed discussion of score-based quantile plots, including examples of what insights one can obtain from the plots, is provided in previous competition reports [17, 20].

The winner of the competition, UAutomizer, achieves the best cumulative score (the graph for UAutomizer spans the widest range from its left end to its right end; the graph starts left of \(x=0\) because the verifier produced 7 wrong proofs and 4 false alarms and therefore received some negative points). Other verifiers whose graphs start at a negative cumulative score likewise produced wrong results.
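
As a minimal sketch of how such a quantile function can be computed from the raw results (illustrative only, not the competition’s actual evaluation scripts): the penalties of wrong results are accumulated as a negative offset, the positively scored correct results are sorted by CPU time, and each of them adds its score to the x-coordinate at its run time as y-coordinate.

    # Minimal sketch of computing a score-based quantile function for one verifier;
    # illustrative only, not the scripts used by the competition.
    def quantile_points(results):
        """results: list of (score, cpu_time_in_seconds) pairs of one verifier.
        Returns (accumulated_score, run_time) points of the quantile function."""
        offset = sum(score for score, _ in results if score < 0)   # penalties shift the graph left of x = 0
        correct = sorted(((s, t) for s, t in results if s > 0), key=lambda p: p[1])
        points, x = [], offset
        for score, cpu_time in correct:
            x += score                     # accumulated score (x-coordinate)
            points.append((x, cpu_time))   # run time below which this score is obtained (y-coordinate)
        return points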

Fig. 5. Number of evaluated verifiers for each year (first-time participants on top)


New Verifiers. To acknowledge the verification systems that participated for the first or second time in SV-COMP, Table 11 lists the new verifiers (in SV-COMP 2023 or SV-COMP 2024). Figure 5 shows the growing interest in the competition over the years.

Computing Resources. The CPU-time and memory limits were the same as in the previous competitions [20] (15 GB of memory and 15 min of CPU time), but we reduced the number of processing units per run from 8 to 4. This has the disadvantage that the measurements are less precise due to resources shared within a machine, but it roughly doubles the throughput. This change was necessary because of the ever-increasing number of participating systems and the continuously growing benchmark set. Witness validation was again limited to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation witnesses and 15 min of CPU time for correctness witnesses. The machines for running the experiments are part of a compute cluster at the SoSy-Lab at LMU that consists of 168 machines, each with one Intel Xeon E3-1230 v5 CPU with 8 processing units, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86_64-linux, Ubuntu 22.04 with Linux kernel 5.15). We used BenchExec [40] to measure and control computing resources (CPU time, memory) and VCloud to distribute, install, run, and clean up verification runs, and to collect the results. The values for CPU time are accumulated over all cores of the CPU.

To give an impression of the overall computation work, we report some statistics: One complete verification execution of the competition consisted of 787 779 verification runs (each verifier on each verification task of the selected categories, according to the opt-outs), consuming 2 104 days of CPU time (without validation). This is almost double the CPU time spent on the previous edition of SV-COMP. Witness-based result validation required 13.6 million validation runs in 21 243 run sets (each validator on each verification task for categories with witness validation, and for each verifier), consuming 2 290 days of CPU time. Each tool was executed several times, in order to make sure that no installation issues occurred during the execution.

5 Results of the Witness-Validation Track

The validation of verification results, in particular of verification witnesses, is becoming more and more important for several reasons: verification witnesses justify a verification result and help to understand and interpret it, they serve as exchange objects for intermediate results, and they make it possible to use imprecise verification techniques (e.g., via machine learning). A case study on the quality of the results of witness validators [44] suggested that validators for verification results should also undergo a periodic comparative evaluation and proposed a scoring schema for witness-validation results. SV-COMP 2024 evaluated a total of 17 validators on 100 998 correctness and 71 577 violation witnesses in format 1.0, and 45 614 correctness and 27 561 violation witnesses in format 2.0. Figure 6 shows the growing importance of evaluating witness validators.

Fig. 6. Number of witness validators for each year (first-time participants on top)

Fig. 7. Scoring schema for evaluation of validators; \(p = -16\) for SV-COMP 2024; figure adopted from [44]; changed scores compared to 2023 are highlighted in red


Scoring Schema for Validation Track. The score of a validator in a sub-category is computed as

$$ \mathit{score} = \left( \frac{p_{\text{correct}}}{|\text{correct}|} + \frac{p_{\text{wrong}}}{|\text{wrong}|} \right) \cdot \frac{|\text{correct}| + |\text{wrong}|}{2} $$

where the points \(p_{\text{correct}}\) and \(p_{\text{wrong}}\) are determined according to the schema in Figure 7 and then normalized using the normalization schema that SV-COMP uses for meta categories [17, page 597] (note that, compared to last year [27, page 513], the factor q was removed from the formula because it is no longer necessary to give a higher weight to wrong witnesses). Witnesses that do not agree with the expected verification verdict are classified as wrong. Witnesses that agree with the expected verification verdict can still be wrong, for example, if a violation witness describes a wrong path to the violation, or a correctness witness contains an invariant that does not hold. Therefore, we use the information from the majority of the validators: a witness that agrees with the expected verification result is classified as correct if at least 75 % of the true/false results from validators confirm the result, and as wrong if at least 75 % of the true/false results from validators refute the result (and there must be at least 2 true/false results). Further details are given in the proposal [44]. This schema assigns to each base category from the verification track a meta category that consists of two sub-categories, one with the correct and one with the wrong witnesses.
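
As a hypothetical numerical example: a validator that earns \(p_{\text{correct}} = 120\) points on \(|\text{correct}| = 80\) correct witnesses and \(p_{\text{wrong}} = -10\) points on \(|\text{wrong}| = 20\) wrong witnesses in a sub-category obtains

$$ \left( \frac{120}{80} + \frac{-10}{20} \right) \cdot \frac{80 + 20}{2} = (1.5 - 0.5) \cdot 50 = 50 \text{ points}, $$

that is, both groups of witnesses contribute with equal weight, independently of their relative sizes (before the normalization for meta categories is applied).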

Tables 12, 13, and 14 show the rankings of the validators. Violation witnesses in format version 2.0 were not yet ranked, because the jury decided that in SV-COMP 2024, this is a demonstration track. The score results for all validators and all categories are available on the SV-COMP web site and in the artifact [30]. Wrong proofs in Tables 12 and 13 are claims of a validator that the program is correct according to the invariants in a given correctness witness although the program contains a bug (the validator confirms a wrong correctness witness). False alarms in Table 14 are claims of a validator that the program contains a bug described by a given violation witness although the program is correct (the validator confirms a wrong violation witness).

The adoption rate of the new witness format version 2.0 is discussed in the article that defines the format [7]. Tables 12 and 13 show that some categories are still supported by fewer than three validators (marked as ‘missing validators’ for categories ConcurrencySafety and Termination).


6 Conclusion

The 13th edition of the Competition on Software Verification (SV-COMP 2024) again compared automatic tools for software verification and the validation of the produced verification witnesses. SV-COMP again had a record number of 59 participating verification systems (incl. 7 new verifiers and 19 hors-concours participants; see Fig. 5 for the participation numbers and Table 5 for the details). Furthermore, the validation track compared 17 validation tools; the validation tools were assessed in a manner similar to the verification track, using a community-agreed scoring schema. The number of verification tasks in SV-COMP 2024 was significantly increased, to 30 300 in the C category. Table 10 shows that the top verification tools have an extremely low number of wrong results. However, there are still wrong results, and validation of the verification results is absolutely necessary. We hope that this overview and the competition lead to a broader adoption of software verification, and in particular, that more and better validation tools are developed in the near future.

Data-Availability Statement. The verification tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version-control repositories, as specified in Table 2. For easy access, the results are also presented online on the competition web site https://sv-comp.sosy-lab.org/2024. The main results of last year’s competition were reproduced in an independent reproduction study [72].

Funding Statement. Some participants of this competition were funded in part by the Deutsche Forschungsgemeinschaft (DFG) — 378803395 (ConVeY).