1 Introduction

To support the development of Ethereum smart contracts (SCs) and to analyze SCs that have been deployed, more than 140 tools had been released by mid-2021 [11], and new tools keep appearing. The sheer number of tools makes it difficult to choose an appropriate one for a particular use case. Moreover, it is difficult to assess the effectiveness of the many methods proposed and to judge the relevance of various extensions. Tool comparisons can facilitate the selection process. However, many tool surveys are based on academic publications that focus on the methods employed by the tools, or on whitepapers of the tools themselves. For a thorough quality assessment of the tools, it is necessary to also install and systematically test them – preferably with an appropriate ground truth set of SCs.

Given the scarce availability of an appropriate ground truth, tool developers adopted the practice of comparing their tool to previous ones, often with the somewhat biased intention of demonstrating its superiority in a particular respect. This approach is justified by the need for an evaluation despite the lack of an established ground truth. However, there are major concerns about the validity of such evaluations.

Undetermined Quality of Tools: Since the quality of the tools serving as the point of reference is itself unknown, such a comparison provides only relative information.

Dependence Between Tools: When a new tool builds on tools published earlier, there is a tendency to compare it to exactly those tools in order to show the improvements. With the quality of the base tool(s) not clearly determined, the relative quality assessment remains vague.

Ground Truth (GT). In our context, a ground truth for a particular program property is a set of smart contracts (given as source or bytecode) together with assessments that state for each contract whether it satisfies the property or not. As the term truth suggests, these assessments are supposed to be definitive and reliable. To foster trust in the ground truth, it may be accompanied by a specification of the process by which the assessments were obtained (e.g. by expert evaluation) or by objective arguments for the assessments (e.g. by specifying program inputs that solicit behavior satisfying the property, or by showing that such inputs do not exist).

Goals and Approach. The primary goal of this work is to compile a unified and consolidated ground truth of SCs with manually labeled properties, starting from GT sets that are publicly available and documented. Ultimately, we aim at a uniformly structured collection of contracts with verified properties that harnesses the individual efforts that have been invested into the original datasets.

Unification. From related work, we collect benchmarks containing GT data. We extract information on the corresponding contracts (like address, source code, bytecode, location of the issue) as well as classifications (properties tested, assessments) and introduce a unique reference for every entry in the original dataset. We clean the data by repairing obvious mishaps and complete it using our database of source codes and chain data.

Consolidation. To consolidate the datasets, we introduce four attributes per contract: the address (with chain and creation block) if the contract has been deployed, as well as unique fingerprints of the source code, the deployment and the deployed bytecode. Based on these attributes, we determine and eliminate discrepancies within the individual datasets. Then, we map the classifications to a common frame of reference, the SWC classes and the DASP scheme. Relying again on the attributes, we determine overlaps between datasets, detect disagreements, and examine their cause.
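Conceptually, consolidation attaches a small identity record to each entry. The following minimal sketch uses illustrative field names; it is not the actual schema of the published dataset.

```python
# Sketch of the per-contract attributes used for consolidation.
# Field names are illustrative, not the schema of the published dataset.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContractKey:
    chain: Optional[str] = None           # e.g. "main", "ropsten"
    address: Optional[str] = None         # 0x-prefixed hex address, if deployed
    creation_block: Optional[int] = None  # block of the deployment transaction
    source_fp: Optional[str] = None       # fingerprint of the Solidity source
    deployment_fp: Optional[str] = None   # fingerprint of the deployment bytecode
    runtime_fp: Optional[str] = None      # fingerprint of the deployed (runtime) bytecode
```

The fingerprints are defined in Sect. 4.2; matching based on these attributes is discussed in Sect. 5.1.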

Quality Assessment. Based on the taxonomy by Bosu et al. [2] for assessing data quality in software engineering, we assess the included GT sets with regard to the three aspects accuracy, relevance, and provenance. For accuracy, we consider incompleteness, redundancy, and inconsistency; for relevance, we consider heterogeneity, amount of data, and timeliness; for provenance, we consider accessibility and trustworthiness.

2 Definition of Terms

To discuss the data, we use the following terms.

Property, Weakness, Vulnerability: Most contract properties addressed in datasets constitute program weaknesses, with a few exceptions like honeypots. In software engineering at large, vulnerabilities are weaknesses that can be actually exploited, while blockchain literature tends to use the two terms synonymously. Throughout the paper, we prefer the term weakness, and use property for general statements.

Judgment: If a property holds, the corresponding judgment is “positive”. If a property does not hold, the judgment is “negative”. If the assessment is inconclusive or does not make sense, the judgment is “not available” (n/a).

Assessment: a triple consisting of a contract, a single property, and a judgment of the latter in the context of the former.

Entry: smallest unit of a dataset according to its authors. Depending on the structure of the dataset, an entry consists of a single assessment or of multiple assessments pertaining to the same contract. We use the term mainly to relate to the original publication accompanying the dataset.

Contradiction: a group of two or more assessments for the same contract and property, but with conflicting judgments.

Duplicates: multiple assessments for the same contract and property with identical judgments.
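For illustration, the terms above can be encoded as follows; the names are ours and do not correspond to a schema used by the original datasets.

```python
# Illustrative encoding of the terms defined in this section.
from enum import Enum
from typing import NamedTuple

class Judgment(Enum):
    POSITIVE = "positive"   # the property holds
    NEGATIVE = "negative"   # the property does not hold
    NA = "n/a"              # inconclusive or not meaningful

class Assessment(NamedTuple):
    contract: str           # reference to the contract (e.g. address or source fingerprint)
    prop: str               # the assessed property, e.g. a weakness class
    judgment: Judgment
```

In this representation, a contradiction is a group of Assessment values sharing contract and prop but differing in judgment, while duplicates share all three components.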

3 Benchmark Sets with Ground Truth Data

In this section, we specify the selection of the benchmark sets and give an overview of the contents of the included sets.

3.1 Selection of GT Sets

From the systematic literature review [11], where Rameder et al. identified benchmark sets of smart contracts for the quality assessment of approaches to weakness or vulnerability detection, we extracted all references that contain a ground truth. Moreover, for the years 2021 and 2022, we searched for further GT sets.

Inclusion Criteria. We include all sets that provide a ground truth by either manually checking the contracts or by generating them via deliberate and systematic bug injection.

Exclusion Criteria. We omit sets that reuse the samples of other sets without contributing assessments of their own. Moreover, we exclude sets that have been assessed automatically, e.g. by combining the results of selected vulnerability detection tools via majority voting. While they may constitute interesting test data, they do not qualify as a ground truth.

3.2 Structure of the Included Sets

Table 1 lists the datasets that we selected as the basis of our work. They differ regarding the number of assessments per entry, the identification of contracts, the way assessments are specified, and the information provided per contract.

Identification: Usually, contracts are given either by a file with the Solidity source or by a chain address. Only one dataset specifies just an internal identifier, which in most cases contains an address.

Assessments: The majority of datasets provides the assessments in a structured form as csv, json, xlsx or ods files. Five datasets encode the weakness and partly also the judgment in the filepath or use prose.

Contract Information: The datasets may provide chain addresses, Solidity sources from Etherscan or elsewhere, deployment and/or runtime bytecodes.

Crafted and Wild Sets. Depending on the provenance of the contracts, we divide the datasets into two groups. The wild group comprises eight collections of contracts that have been deployed either on the main or a test chain; hence, they all provide chain addresses or source code from Etherscan. The crafted sets contain at least some contracts that have not been deployed to a public chain. One set has been obtained from the SWC registry, where it illustrates the SWC taxonomy. Two sets, JiuZhou and SBcurated, are related to tool evaluations. The set NotSoSmartContracts is intended for educational purposes, and the set SolidiFI was generated from Solidity sources by injecting seven types of bugs.

3.3 Summary of Assessments in the Included Sets

Table 1 gives an overview of the assessments in the sets. The first column contains a reference to the publication presenting the set, while the second one gives the number of entries. The subsequent columns quantify the assessments, specifying the total number as well as a breakdown by judgment type. The column for ignored assessments indicates the number of duplicate or contradicting assessments, as discussed in Sect. 5.2.

Table 1. Included GT Sets.

To compare the weaknesses covered by the sets, we map the individual assessments to the taxonomy provided by the SWC registry (Table 2). Section 5.3 discusses the mapping in detail. Properties not represented in the SWC registry remain unmapped, leading to unmapped assessments. The last two columns of Table 1 give the number of weaknesses as defined by the set and the number of covered SWC classes. When the number of weaknesses is larger than the number of SWC classes covered, it either means that there are unmapped assessments or that several weaknesses are mapped to the same SWC class.

4 Unified Ground Truth

In this section, we describe the process of merging the selected sets into a unified ground truth. We extract relevant data items, assign unique identifiers to the entries, repair mishaps, normalize the data to obtain a common format, add missing information from other data sources and investigate data variability.

4.1 Extracting Data from the Original Sets

For each repository selected (Sect. 3), we identify the parts pertaining to a ground truth, and use a Python script to extract relevant items. At a minimum, we need information to identify a contract, a property, and a corresponding judgment.

Most sets have not been designed for automated processing. They contain inconsistencies, errors, and information only intelligible to humans. We encountered numerous invalid Ethereum addresses, inconsistent spellings, invalid data formats, and wrong information (like bytecode not corresponding to the given source code). For the sake of transparency, we left the original sets unchanged and integrated the fixes into the Python scripts.
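As an example of the kind of fix integrated into the extraction scripts, the following hypothetical helper normalizes and validates Ethereum addresses found in the original sets; it is a sketch, not the actual code.

```python
import re

# Hypothetical helper: trims whitespace, lower-cases hex digits, adds a missing
# 0x prefix, and rejects strings that are not 20-byte hex addresses.
ADDRESS_RE = re.compile(r"0x[0-9a-f]{40}")

def normalize_address(raw: str):
    """Return a normalized address, or None if the string is not a valid address."""
    candidate = raw.strip().lower()
    if not candidate.startswith("0x"):
        candidate = "0x" + candidate
    return candidate if ADDRESS_RE.fullmatch(candidate) else None
```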

4.2 Completing the Data

To identify duplicate or contradicting assessments, and to arrive at a consolidated ground truth usable in different scenarios, each contract should be given by its source, deployment and runtime code as well as by its chain address (if deployed). Most repositories contain only some of this information. With the help of data from Ethereum’s main chain and Etherscan’s repository of source code, we were able to complete most missing data.

Contracts with Addresses: We query the respective chain for the bytecodes, and Etherscan for the source code (if available).

Contracts with Source Code: We use the fingerprint of the source code to look it up in an internal database. If there is a match, we retrieve the deployment address and proceed as above. Otherwise, the source code can be compiled to obtain the corresponding bytecode. Given the variability of compilation, this step is unlikely to yield bytecode matching code obtained elsewhere, and is thus of limited use when searching for duplicates.

Contracts with Bytecode: The contracts considered here all come with an address or some source code. However, to cross-check and to confirm guesses about the chain, we use fingerprints of any provided bytecode to look up public deployments. Moreover, we extract the runtime code from given deployment code.
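A minimal sketch of these lookups, assuming the web3.py library (v6) and the public Etherscan API, could look as follows; the RPC endpoint and the API key are placeholders.

```python
# Sketch of the completion step; RPC_URL and ETHERSCAN_KEY are placeholders.
import requests
from web3 import Web3

RPC_URL = "https://example-node.invalid"   # placeholder JSON-RPC endpoint
ETHERSCAN_KEY = "YOUR_API_KEY"             # placeholder API key
w3 = Web3(Web3.HTTPProvider(RPC_URL))

def fetch_runtime_code(address: str) -> bytes:
    """Deployed (runtime) bytecode as stored on chain; empty if not a contract."""
    return bytes(w3.eth.get_code(Web3.to_checksum_address(address)))

def fetch_source(address: str) -> str:
    """Verified Solidity source from Etherscan, or '' if none is available."""
    resp = requests.get("https://api.etherscan.io/api", params={
        "module": "contract", "action": "getsourcecode",
        "address": address, "apikey": ETHERSCAN_KEY,
    }, timeout=30)
    result = resp.json().get("result", [])
    return result[0].get("SourceCode", "") if result else ""
```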

Fingerprints. To detect identical contracts, we use fingerprints of the code. For source code, we eliminate comments and white space before computing the MD5 hash. A second type of fingerprint additionally eliminates pragma solidity statements prior to hashing. For bytecodes, we replace metadata sections inserted by the Solidity compiler with zeros before computing the MD5 hash.
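The fingerprint computation can be sketched as follows. The comment stripping is deliberately naive (it ignores comment markers inside string literals), and only the metadata blob at the very end of the bytecode is handled; the actual scripts may differ.

```python
import hashlib
import re

def source_fingerprint(source: str, drop_pragma: bool = False) -> str:
    """MD5 of Solidity source after removing comments and white space.
    With drop_pragma=True, 'pragma solidity' directives are removed as well."""
    text = re.sub(r"//.*?$|/\*.*?\*/", "", source, flags=re.S | re.M)  # naive comment removal
    if drop_pragma:
        text = re.sub(r"pragma\s+solidity[^;]*;", "", text)
    text = re.sub(r"\s+", "", text)                                    # remove white space
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def bytecode_fingerprint(code: bytes) -> str:
    """MD5 of bytecode with the trailing Solidity metadata section zeroed out.
    The last two bytes encode the length of the CBOR-encoded metadata."""
    if len(code) >= 2:
        meta_len = int.from_bytes(code[-2:], "big") + 2
        if meta_len <= len(code):
            code = code[:-meta_len] + b"\x00" * meta_len
    return hashlib.md5(code).hexdigest()
```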

4.3 Variability in the Unified GT Set

We portray the variability with regard to the contract language (Solidity or EVM bytecode) as well as the range and distribution of Solidity versions and time of deployment.

Contract Identification. We need some reference to a contract, be it an address or a source file. Figure 1 depicts the number of entries in the unified GT set, for which we have an address, a source, both, or neither.

Fig. 1. Addresses (orange and yellow) and Solidity source files (yellow and green) in the entries of the unified GT set. (Color figure online)

In the unified set, there are 4 859 entries in total, of which 4 559 (93.8 %) come with a Solidity source and 3 970 (81.7 %) with a deployment address. While 3 693 (76.0 %) entries are associated with both an address and a source, there are 866 (17.8 %) entries with a Solidity source only, and 277 (5.7 %) for which a source file is neither provided nor retrievable. The latter concerns 28 entries in the set EverEvolvingGame, 131 in Zeus, and 118 in eThor. Moreover, 23 entries (all in Zeus) indicate neither an address nor a source file, but refer to a Solidity file without providing it.

Chains. The entries with an address refer to 2 731 unique addresses, the majority (2 461) from the main chain, 268 from Ropsten, and one from Rinkeby. One address, in the Zeus set, could not be located on any public chain.

Solidity Versions. Solidity, the main programming language for smart contracts on Ethereum and beyond, has evolved with several breaking changes so far. In the included sets, we predominantly see versions 0.4.x, as depicted in the left part of Fig. 2. While the versions 0.4.x were current throughout 2017 and up to early 2018, versions 0.8.x started in December 2020 and are still current as of mid-2023. The highest Solidity version in the GT sets is v0.6.4.

Fig. 2. Distribution of Solidity versions in the included GT sets.

Fig. 3. Distribution of contract deployments/addresses in the included GT sets on a timeline (in million blocks).

Deployment Blocks and Forks. To put the addresses into a temporal context, we count the deployments in bins of 100 000 blocks and depict them in Fig. 3 on a timeline of blocks (ticks per million blocks).

The latest block in the GT sets is 8 M, while by the end of 2022, the main chain was beyond block 16.3 M. The deployment block also indicates which EVM opcodes (introduced by a regular fork) were available. This information may be critical if a detection tool was developed before a particular opcode was introduced.
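For illustration, the binning behind Fig. 3 amounts to the following small sketch; deployment_blocks is assumed to be the list of deployment block numbers.

```python
from collections import Counter

def bin_deployments(deployment_blocks):
    """Count deployments per bin of 100 000 blocks (bin index = block // 100 000)."""
    return Counter(block // 100_000 for block in deployment_blocks)
```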

5 Consolidated Ground Truth

In this section, we describe the consolidation of the unified GT set. It consists of (i) identifying entries pertaining to the same contract, (ii) marking conflicts within sets, (iii) mapping all assessments to a common taxonomy, (iv) determining the overlaps between the included sets, and (v) analyzing disagreements between the sets.

5.1 Matching Contracts

To detect assessments referring to the same contract, we match the address and the fingerprints of the codes (cf. Sect. 4.2) according to the following considerations (sketched in code after the list):

  • Same address and chain means same contract, since none of the contracts in the sets was deployed via CREATE2.

  • Most assessments are based on the Solidity source code. As the source usually specifies the admissible compiler versions (except when the missing directive is actually the weakness), the semantics of the program is fixed. So, if two source codes have identical fingerprints and the names of the contracts under consideration are the same, the assessments refer to the same contract.

  • Assessments referring to deployment bytecodes with the same fingerprint can be considered as assessing the same contract, unless the checked property is tied to Solidity (like inheritance issues). For the SWC classes, Table 2 indicates the visibility of the weakness by a checkmark in the last column.

  • Assessments referring to runtime codes with the same fingerprint are comparable only if the checked property is guaranteed to be detectable in this part of the code. Typically, this holds for weaknesses related to the contract being called by an adversary. For the SWC classes, Table 2 indicates the visibility of a weakness in the runtime code by a non-parenthesized checkmark in the last column. A checkmark in parentheses indicates that the weakness may occur in the constructor and thus is not necessarily detectable in the runtime code.
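The rules above can be sketched as a predicate over pairs of entries. The field names extend the identity record sketched earlier with a contract name, and the two helper sets stand in for the last column of Table 2; their content is illustrative only.

```python
# Hypothetical stand-ins for the last column of Table 2 (illustrative content only).
SOURCE_ONLY = {"SWC-119"}        # e.g. weaknesses only visible in the Solidity source
RUNTIME_VISIBLE = {"SWC-107"}    # e.g. weaknesses guaranteed to show in the runtime code

def source_only(prop: str) -> bool:
    return prop in SOURCE_ONLY

def runtime_visible(prop: str) -> bool:
    return prop in RUNTIME_VISIBLE

def same_contract(a, b, prop: str) -> bool:
    """Do the entries a and b refer to the same contract for the purpose of property prop?"""
    if a.address and a.address == b.address and a.chain == b.chain:
        return True                                   # same deployment
    if a.source_fp and a.source_fp == b.source_fp and a.contract_name == b.contract_name:
        return True                                   # identical source, same contract name
    if a.deployment_fp and a.deployment_fp == b.deployment_fp and not source_only(prop):
        return True                                   # identical deployment bytecode
    if a.runtime_fp and a.runtime_fp == b.runtime_fp and runtime_visible(prop):
        return True                                   # weakness detectable in runtime code
    return False
```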

Table 2. Coverage of SWC Classes in the Consolidated GT Set.

5.2 Assessments Excluded from the Consolidated Set

For obvious reasons, we ignore assessments where the judgment is n/a or the object of the assessment is ill-defined. The first condition affects 397 assessments, mostly from the set CodeSmells. The second one eliminates 153 assessments from the Zeus set, as some contract identifiers do not allow us to extract a valid chain address, and the set does not provide further information.

It is well known that contracts like wallets or tokens have been deployed identically numerous times. Often, this fact is not taken into account when collecting contract samples, such that the same contract may end up in a set multiple times, albeit under different addresses. Therefore, we check the sets for multiple assessments of the same code, to find contradictions and duplicates.

Surprisingly, the Zeus set contains 18 contradictions already on the level of its own identifiers (meaning that the same identifier is listed multiple times, with diverging assessments) and 30 more when applying the criteria laid out in the last section. Moreover, we find 103 conflicts in the set CodeSmells, 6 in Doublade, and 3 in JiuZhou. These assessments are excluded from the consolidated set.

For duplicates, all but one assessment are redundant and can be ignored. We find duplicates in almost every set (the number in parentheses gives the ignored assessments): CodeSmells (853), ContractFuzzer (4), Doublade (34), eThor (6), EthRacer (5), EverEvolvingGame (52), NPChecker (31), SBcurated (16), SolidiFI (7), SWCregistry (1), and Zeus (3009).
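The per-set cleaning can be sketched as follows, reusing the assessment representation from Sect. 2: groups with conflicting judgments are excluded entirely, while groups of duplicates keep a single representative.

```python
from collections import defaultdict

def clean_set(assessments):
    """Split the assessments of one GT set into kept assessments,
    excluded contradictory ones, and ignored duplicates."""
    groups = defaultdict(list)
    for a in assessments:
        groups[(a.contract, a.prop)].append(a.judgment)
    kept, contradictions, duplicates = [], 0, 0
    for (contract, prop), judgments in groups.items():
        if len(set(judgments)) > 1:
            contradictions += len(judgments)     # conflicting judgments: exclude all
        else:
            duplicates += len(judgments) - 1     # identical judgments: keep one
            kept.append((contract, prop, judgments[0]))
    return kept, contradictions, duplicates
```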

For a summary of the exclusions, see the column ‘ignored’ in Table 1.

5.3 Mapping of Individual Assessments to a Common Taxonomy

To compare assessments in different sets, we map the properties of each set to classes of a suitable taxonomy. The SWC registry provides such a widely used taxonomy with 37 weakness classes. Each has a numeric identifier, a title, a CWE parent and some code samples.
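As an illustration, a fragment of such a mapping could look as follows; the weakness names on the left are illustrative, while the SWC identifiers and titles are taken from the registry.

```python
# Illustrative fragment of a mapping from set-specific weakness names to SWC classes.
WEAKNESS_TO_SWC = {
    "reentrancy":           "SWC-107",  # Reentrancy
    "unchecked_call":       "SWC-104",  # Unchecked Call Return Value
    "integer_overflow":     "SWC-101",  # Integer Overflow and Underflow
    "tx_origin":            "SWC-115",  # Authorization through tx.origin
    "timestamp_dependence": "SWC-116",  # Block values as a proxy for time
    "honeypot":             None,       # not represented in the SWC registry
}

def map_to_swc(weakness: str):
    return WEAKNESS_TO_SWC.get(weakness)   # None means 'unmapped'
```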

Coverage of SWC Classes. Table 2 shows how well the SWC classes are covered by positive and negative assessments, and how many sets contribute assessments to the class. Popular weaknesses with seven or more contributing GT sets are marked in gray. At the bottom, we add the classes 995–999 to account for weaknesses missing from the SWC registry.

Even after combining all GT sets into a unified ground truth, the coverage of the SWC classes remains highly uneven. This can be attributed to the intention behind most benchmark sets: to support the testing of tools for automated vulnerability detection, and such tools aim for “interesting” weaknesses.

Comparison of Weaknesses. It is intrinsically difficult to compare weaknesses across GT sets due to (i) vague or missing definitions of weaknesses, (ii) the unclear relationship between definitions, (iii) the ambiguous mapping of a weakness to a corresponding class, and (iv) heterogeneous criteria for structuring weaknesses, mixing cause and effect or different levels of the protocol/software stack. The definitions provided by GT authors are rarely a perfect match for a taxonomy. Therefore, when comparing weaknesses via a taxonomy, we have to check disagreements manually to distinguish mismatches of definitions from contradicting assessments.

5.4 Overlaps

To find disagreements between the cleaned GT sets, we first determine their overlap. For each pair of sets, Table 3 gives the number of non-ignored assessments that map to the same SWC class. The diagonal shows the total number of mapped assessments per GT set. The upper-left block relates the wild sets, while the lower-right block concerns the crafted ones. As is to be expected, there is more overlap within the wild group than within the crafted one or between the groups. SBcurated mixes crafted and wild contracts, with some crafted ones taken from the SWCregistry. Of 20 498 cleaned assessments, 18 409 appear in only one set, while 2 089 occur in two or more.
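The overlap computation can be sketched as follows, representing each mapped assessment by the pair of contract key and SWC class; this is a simplification of the actual bookkeeping, and mapped is assumed to hold these pairs per GT set.

```python
from itertools import combinations

def overlaps(mapped: dict):
    """For each pair of GT sets, count the (contract, SWC class) pairs present in both."""
    table = {}
    for s1, s2 in combinations(sorted(mapped), 2):
        table[(s1, s2)] = len(set(mapped[s1]) & set(mapped[s2]))
    return table
```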

Table 3. Overlap of Mapped Assessments in the Consolidated GT Set.

5.5 Disagreements and Errors

In Table 3, overlaps with disagreements are marked gray. Of the 2 098 overlapping assessments, 458 disagree with at least one other, involving eight GT sets and six SWC classes (Table 4).

The disagreements constitute an interesting area of investigation. While some disagreements are due to diverging definitions of weaknesses that were mapped to the same SWC class, quite a few turn out to be inconsistencies under the authors’ original definitions. Table 4 summarizes the results of our manual evaluation. For each affected set, it gives the total number of assessments that disagree with an assessment of another set as well as a breakdown by SWC class. A table entry is marked red if our evaluation revealed assessment errors, giving also the number of such errors.

Since reentrancy is the most popular weakness, it appears in 12 GT sets and gives rise to most overlaps: of the 2 098 overlapping assessments, 1 480 pertain to reentrancy (SWC 107). Thus, it is not surprising that we observe the highest number of disagreements (182) and errors (13) for reentrancy.

Table 4. Number of Disagreements in the Unified GT Set, with the Errors.
Table 5. Number of Manually Checked Assessments, with the Errors.

With 42 disagreements, SWC 104 is second. However, we identified only three errors, with the other disagreements resulting from diverging definitions. While SWC 113 shows no errors, half of the disagreements for SWC 114, 120 and 997 are errors.

To gain further insights into the quality of the assessments, we randomly select 80 assessments from the consolidated ground truth, in order to manually check them. Table 5 shows, for each GT set and SWC class, the number of checked assessments as well as the number of errors.

6 Discussion

6.1 Data Quality

To assess the data quality of the GT sets along the dimensions proposed by Bosu et al. [2], we define scores for each criterion as specified in Table 6. The resulting overview of the data quality is shown in Table 7.

Table 6. Criteria for Data Quality Assessment.
Table 7. Data Quality of Ground Truth Sets.

Accuracy. All sets provide a minimum of data, but we had to complete the data of about two thirds of the sets. Redundancy exists in many wild GT sets, to varying degrees. The main concern regards inconsistencies – the key aspect of a GT – which we encountered in six wild GT sets. We improved the data quality (i) by data completion, (ii) by eliminating redundant and contradictory assessments within sets, and (iii) by resolving disagreements between sets. Thus, we could increase the accuracy in the consolidated GT set in all aspects. However, random inspections revealed further inconsistencies; the overall accuracy would benefit from further checks.

Relevance. The GT sets mostly lack heterogeneity, often provide only a small amount of data, and above all lack recent data. By merging 13 GT sets, we could improve the amount of data (the number of positive and especially negative samples) and some aspects of heterogeneity, like the number of weaknesses covered. However, there is still a bias towards a small range of Solidity versions, deployments between 2016 and 2018, source code that happens to have been published on Etherscan, and popular vulnerabilities.

6.2 Related Work

Each of the thirteen original datasets can be regarded as distantly related work; see Sect. 3 for a description. Concerning the construction of a unified GT set, we only find AutoMESC [14]. In this work, Soud et al. choose five source code datasets [4, 7, 12, 17, 18] that address 10 vulnerabilities detectable by one or more of the tools HoneyBadger, Mythril, Maian, Osiris, Slither, SmartCheck, and Solhint. Their inclusion criteria are: recent (up to three years old), public, Ethereum-related, and accompanied by a publication or GitHub repository; they exclude commercial and competition datasets as well as sets providing just one sample per vulnerability. For unification, they use a file ID per contract (without checking for non-obvious duplicates). For consolidation (identifying duplicates, mapping the assessments to a common taxonomy, and resolving contradictory assessments), they discard the original classification and replace it with a simple majority vote of the seven selected tools, provided those claim to detect the weakness (after mapping the tool findings to a common taxonomy). They claim that there is neither redundancy nor inconsistency in the five included datasets.

6.3 Challenges in Identifying Weaknesses

Ambiguous Definitions of Weaknesses. Hardly any weakness possesses a commonly accepted, precise definition. As a consequence, seemingly contradictory assessments of a contract by different datasets may actually result from applying subtly different definitions.

Weakness vs. Vulnerability. There is no agreement among dataset authors whether to aim for exploitable or potential issues.

Intended Purpose. The verdict on whether a weakness is considered a vulnerability also depends on the purpose of a contract. An apparent weakness may be actually the intended behavior of the contract (e.g. a faucet that “leaks” Ether).

Contracts in Isolation. The included datasets consider single-contract weaknesses only (discounting the attack contract). However, vulnerabilities may be the result of several interacting contracts. A single contract may not provide sufficient context to be classified as vulnerable on its own.

6.4 Reservations About Majority Voting

Due to the scarcity of GT data, some authors resort to pseudo-GT data. They run several vulnerability detection tools on selected contracts and obtain the judgment by comparing the number of positive results to a threshold. This approach is debatable for the following reasons.

Weakness vs. Vulnerability. Most tools detect code patterns that indicate a weakness, regardless of whether it can actually be exploited. Hence, false positives (and, to a lesser extent, false negatives) are the norm rather than the exception. Thus, a majority vote may turn individual false positives into a positive assessment.

Tool Genealogy. Tools form families by being derived from common ancestors (like Oyente), by implementing the same approach (like symbolic execution, taint analysis, or fuzzing), or by relying on the same basic components (like GigaHorse, Rattle, Z3, or Soufflé). Related tools may misjudge a contract in a similar way and outnumber tools with the correct result.

Diverging Definitions of Weaknesses. Even if labeled the same, the weaknesses detected by any two tools are not quite the same. Rather, we are faced with tools voting on a weakness that is more or less similar to what they can detect.

7 Conclusion

Publicly available ground truth data for smart contract weaknesses is scarce, but much needed. Our consolidated ground truth is an appreciation of the commendable efforts by others and hopefully renders the included GT sets more usable to the community. The consolidated ground truth described in this paper is available from http://github.com/gsalzer/cgt. For an extended version of this paper, see [1].

Future Work. Granularity. This unified and consolidated GT set is constructed on the contract level. Information on the location of weaknesses within the contracts, like the line number in the source or the offset in the bytecode, is available only for two small datasets and was omitted here. Severity Level. Assigning a severity level to a weakness would further improve the GT set, but is a difficult topic on its own. Updates. Keeping the GT set up to date is important; we invite everyone to contribute by adding GT collections, taxonomies, levels of granularity or severity, proofs, and exploits.